Upgrading from 25.07 to 25.07.1 causes a fatal trap 12 on boot.
-
@stephenw10
Thank you for the hints. I’ll test what I can do and report back as soon as possible.
-
@stephenw10 said in Upgrading from 25.07 to 25.07.1 causes a fatal trap 12 on boot.:
If you can replicate that try running bt at the db> prompt after it crashes to get the backtrace.
It just hangs there, no input possible.
@stephenw10 said in Upgrading from 25.07 to 25.07.1 causes a fatal trap 12 on boot.:
and run: memmap
-
@stephenw10 said in Upgrading from 25.07 to 25.07.1 causes a fatal trap 12 on boot.:
Are you able to get the full output from a UEFI boot leading up to the failure?
Not yet. I think capturing is possible, but I haven't tried yet.
@stephenw10 said in Upgrading from 25.07 to 25.07.1 causes a fatal trap 12 on boot.:
Are you able to try upgrading from 24.11 to 25.07.1 directly?
I’m not sure here, maybe I still have an old boot environment left, then theoretically it’s possible, but it would require my presence, so it can only be done later after work.
-
Ok thanks. I'll relay that to the devs, make sure it looks rational.
-
If you're able to get the full boot output try doing so whilst booting verbose. So
boot -v
at the loader prompt. That should give us more.
-
@stephenw10 said in Upgrading from 25.07 to 25.07.1 causes a fatal trap 12 on boot.:
If you're able to get the full boot output try doing so whilst booting verbose. So boot -v at the loader prompt. That should give us more.
pfsense_capture.zip Not very good, but I don't think I can do better.
Will see if I can find 24.11...
-
No luck, after upgrading from 24.11 the symptoms remain the same.
-
I think we should be able to see something there. Let's see....
-
Hmm unfortunately it's obscuring some of the most useful output.
Are you able to boot with acpi disabled entirely?
set hint.acpi.0.disabled=1
at the loader prompt.
-
@stephenw10 said in Upgrading from 25.07 to 25.07.1 causes a fatal trap 12 on boot.:
set hint.acpi.0.disabled=1
Is it the same as the 'ACPI off' option in the loader prompt under boot options?
I think I can provide the full UEFI boot output using the Netgate installer boot or even the console output — I’ll try something over the weekend.
On the other side, what changes were made in the bootloader between 25.07 and 25.07.1?
-
@w0w "ACPI off" is probably the same.
The difference between the 25.07 and 25.07.1 loader is the expansion of the memory area used to load the kernel and modules; the kernel has grown and some systems (such as those with many devices) rely on more memory to be reserved by the loader to be able to boot the kernel. We can't revert this expansion because it actually affects a substantial number of systems.
My guess as to what's happening on your system is a BIOS firmware bug: the faulting address is not reported as reserved for ACPI system use, and it coincidentally falls where the kernel was loaded. Because it's kernel code memory, the pages are marked read-only, so a page fault occurred when the ACPI driver tried to write to it.
If disabling ACPI doesn't work, then another thing to try is telling the loader to add slop space, which is a memory range to add on top of the expanded space. Go into the loader prompt and issue the command below to tell it to increase it further to 256MB, so that the kernel code doesn't overlap with what ACPI is trying to access:
staging_slop 268435456
boot -v
Unfortunately,
staging_slop
is a command, so it can't be added to loader.conf.
Once pfSense is booted (even if in 24.11), can you collect the ACPI tables to help us see what resources the BIOS accesses or owns?
acpidump -dt | gzip -c > acpi.asl.gz
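For anyone wondering where the 268435456 in the staging_slop command comes from: it is the 256MB mentioned above expressed in bytes. A minimal sketch to reproduce the number from any POSIX shell (the variable names are only illustrative):

```shell
#!/bin/sh
# staging_slop is given a size in bytes; 256MB here means 256 * 1024 * 1024.
slop_mib=256
slop_bytes=$((slop_mib * 1024 * 1024))
# Print the command as it would be typed at the loader prompt.
echo "staging_slop ${slop_bytes}"
```

Changing slop_mib gives the byte value for a different amount of slop, should 256MB turn out not to be enough on another board.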
-
@ldangpfng
Thank you for support!
8c2764e4-d3ec-4872-b293-5d8d26535d1a-acpi.asl.gz
I hope this helps. This is from a 25.07.1 system booted with the recommended slop.
-
There it is, a memory mapped PCI config space but the firmware has not marked the memory as reserved. Instead it's still in the Conventional Memory block, which means a loader or kernel could incorrectly allocate memory into that space.
Scope (_SB.PCI0.HEC2)
{
    Name (H2BR, 0xBFF01000)    <---
    Name (H2ST, 0x0B)
    OperationRegion (NMFS, PCI_Config, 0x40, 0x04)
    Field (NMFS, DWordAcc, NoLock, Preserve)
    {
        , 30,
        DMEN, 1,
        NMEN, 1
    }
This is a firmware bug; it should be reported to the vendor so they can mark that region correctly in the memory map. Maybe they have a BIOS update available.
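To make the overlap concrete, here is a rough sketch of the check being described: does the H2BR address from the ASL fall inside a range that the memory map reports as conventional (usable) memory? The range start and page count below are hypothetical placeholders, not values from this system's memmap output; substitute the real line from your own memmap.

```shell
#!/bin/sh
# Address taken from the ASL excerpt (H2BR).
addr=$((0xBFF01000))
# HYPOTHETICAL conventional-memory range: replace with a real memmap entry.
start=$((0x00100000))
pages=$((0xBFF00))              # number of 4 KiB pages in the range
end=$((start + pages * 4096))   # exclusive end of the range

if [ "$addr" -ge "$start" ] && [ "$addr" -lt "$end" ]; then
    result="inside"
else
    result="outside"
fi
printf '0x%X is %s 0x%X-0x%X\n' "$addr" "$result" "$start" "$((end - 1))"
```

If the address lands inside a conventional range, nothing stops the loader or kernel from placing code or data there, which is the situation described above.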
-
Thank you for your time and help.
FYI, the workaround is to add
exec="staging_slop 268435456"
to loader.conf.
I do not think ASRock will fix anything; this is slightly old hardware anyway.
-
Ah so that worked for you? Nice!
-
@stephenw10 said in Upgrading from 25.07 to 25.07.1 causes a fatal trap 12 on boot.:
Ah so that worked for you?
Yes, and loader.conf.local works too; I just forgot that I have the Filer package
-
@w0w said in Upgrading from 25.07 to 25.07.1 causes a fatal trap 12 on boot.:
I just forgot that I have the Filer package
Ha, I've done that.
-
So I investigated a bit deeper and found that I can eliminate this problem by disabling the “Above 4G decoding” option in the BIOS. However, I now recall why it was enabled in the first place — pfSense refused to boot without it in some earlier version.
Anyway, it looks like
exec="staging_slop 268435456"
fixes the issue in both cases. But after discussing it further, I realized that this only works around the bug at the boot stage, not at the OS level. The suggested workaround is to also disable the use of the ACPI MCFG table and fall back to classic PCI config access by adding:
hw.pci.mcfg=0
There are other possible methods, such as manually patching the ACPI tables and loading them from disk, but this seems too complicated for little real benefit.
All of this explains why I occasionally had unexplained fatal traps on this motherboard, with a similar access range…
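Putting the two settings from this post together, a small sketch that writes them to a draft file for review. The draft filename is made up for illustration, and placing both lines in loader.conf.local is an assumption based on the earlier report that loader.conf.local works; nothing under /boot is touched by this snippet.

```shell
#!/bin/sh
# Draft file in the current directory (hypothetical name); review before copying anywhere.
draft="./loader.conf.local.draft"
cat > "$draft" <<'EOF'
# Extra staging slop (256MB in bytes) requested from the loader
exec="staging_slop 268435456"
# Disable MCFG-based (memory-mapped) PCI config access
hw.pci.mcfg=0
EOF
cat "$draft"
```

After checking the contents, the two settings can be merged into /boot/loader.conf.local by hand. Once rebooted, kenv hw.pci.mcfg should show whether the loader actually passed the tunable to the kernel.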
-
Vendors have firmware bugs similar to this quite often, and usually you'd see the OS failing suddenly for no good reason. Falling back to PCI I/O access for the config space will probably only allow the system to boot through; any time something allocates memory in that region, it won't actually be going to DRAM, leading to data corruption.
If you enable above 4GB decoding (and it successfully boots), there could be a change to the ACPI tables that the firmware generates.
In any case, we'll need to find a way to let FreeBSD know not to use that memory region for allocations.