Panic booting 2.6.0 on Jetway NF692G6-420
-
"The NF692G6-420 motherboard features the Intel Apollo Lake Pentium N4200 quad core processor and six Gigabit Ethernet LAN Ports [...]".
The upgrade from 2.5.2 to 2.6.0 went fine until it failed to return from the automatic reboot. When I restarted it by hand, it ran into a kernel panic:
Configuring loopback interface...done.
Configuring LAGG interfaces...done.
Configuring VLAN interfaces...done.
Configuring WAN interface...done.
Configuring LAN interface...done.
Configuring SYNC interface...done.
Configuring CARP settings...done.
MCA: Bank 0, Status 0xb200000010000400
MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x506c9, APIC ID 4
MCA: CPU 2 UNCOR PCC internal timer error
timeout stopping cpus
panic: Unrecoverable machine check exception
cpuid = 2
time = 1645961035
KDB: enter: panic
[ thread pid 24487 tid 100543 ]
Stopped at kdb_enter+0x37: movq $0,0x28f4676(%rip)
db:0:kdb.enter.default> textdump set
textdump set
db:0:kdb.enter.default> capture on
db:0:kdb.enter.default> run lockinfo
db:1:lockinfo> show locks
No such command; use "help" to list available commands
db:1:lockinfo> show alllocks
No such command; use "help" to list available commands
db:1:lockinfo> show lockedvnods
Locked vnodes
db:0:kdb.enter.default> show pcpu
cpuid = 2
dynamic pcpu = 0xfffffe007f0b9200
curthread = 0xfffff8000669a740: pid 24487 tid 100543 "rtsold"
curpcb = 0xfffff8000669ace0
fpcurthread = 0xfffff8000669a740: pid 24487 "rtsold"
idlethread = 0xfffff80005323000: tid 100005 "idle: cpu2"
curpmap = 0xfffff8004456f138
tssp = 0xffffffff83719870
commontssp = 0xffffffff83719870
rsp0 = 0xfffffe004d993bc0
kcr3 = 0xffffffffffffffff
ucr3 = 0xffffffffffffffff
scr3 = 0x0
gs32p = 0xffffffff83720088
ldt = 0xffffffff837200c8
tss = 0xffffffff837200b8
tlb gen = 1665
curvnet = 0
db:0:kdb.enter.default> bt
Tracing pid 24487 tid 100543 td 0xfffff8000669a740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe0002420e50
vpanic() at vpanic+0x197/frame 0xfffffe0002420ea0
panic() at panic+0x43/frame 0xfffffe0002420f00
mca_intr() at mca_intr+0x9b/frame 0xfffffe0002420f20
mchk_calltrap() at mchk_calltrap+0x8/frame 0xfffffe0002420f20
--- trap 0x1c, rip = 0xffffffff80ddc9a2, rsp = 0xfffffe004d993750, rbp = 0xfffffe004d993760 ---
lock_delay() at lock_delay+0x32/frame 0xfffffe004d993760
__rw_wlock_hard() at __rw_wlock_hard+0x188/frame 0xfffffe004d993810
pmap_remove_pages() at pmap_remove_pages+0x676/frame 0xfffffe004d993910
vmspace_exit() at vmspace_exit+0x9e/frame 0xfffffe004d993950
exit1() at exit1+0x55b/frame 0xfffffe004d9939b0
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe004d9939c0
amd64_syscall() at amd64_syscall+0x387/frame 0xfffffe004d993af0
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe004d993af0
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x8003ac5fa, rsp = 0x7fffffffeba8, rbp = 0x7fffffffebc0 ---
db:0:kdb.enter.default> ps
[... mountains of stack traces ...]
Tracing command zpool-zroot pid 31 tid 100203 td 0xfffff8000625d000
sched_switch() at sched_switch+0x630/frame 0xfffffe004d77b9a0
mi_switch() at mi_switch+0xd4/frame 0xfffffe004d77b9d0
sleepq_wait() at sleepq_wait+0x2c/frame 0xfffffe004d77ba00
_sleep() at _sleep+0x253/frame 0xfffffe004d77ba80
taskqueue_thread_loop() at taskqueue_thread_loop+0xe9/frame 0xfffffe004d77bab0
fork_exit(
This is where the output stopped.
After the automatic restart there was no activity on the serial console before I pulled the plug. I reinstalled 2.5.2 and attempted the upgrade again; this time, the console output just stopped after "Configuring CARP settings...". After returning to 2.5.2 again, it runs fine.
I can find no reports of panics on this hardware, either mainboard or CPU, under either pfSense or FreeBSD.
If I started on an extensive testing project (try upgrading an identical second system, try installing FreeBSD 12.3 instead of pfSense, try installing pfSense 2.6.0 instead of upgrading, ...) would the results be of any help to anyone to either tell me my hardware is bad, or find a way to work around the problem?
Thanks for any suggestions.
-
An MCA error is almost always a hardware problem, so if you're seeing it consistently in 2.6 and not at all in 2.5.2, it's probably some device that's not enabled in 2.5.2.
Steve
-
$ cat mce.log
MCA: Bank 0, Status 0xb200000010000400
MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x506c9, APIC ID 4
MCA: CPU 2 UNCOR PCC internal timer error
$ mcelog --no-dmi --ascii --file mce.log
mcelog: Family 6 Model 92 CPU: only decoding architectural errors
mcelog: Family 6 Model 92 CPU: only decoding architectural errors
Hardware event. This is not a software error.
CPU 2 BANK 0
MCG status:MCIP
STATUS b200000010000400 MCGSTATUS 4
MCGCAP c07 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 92 Step 9
tl;dr: Hardware issue.
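For anyone who wants to cross-check the status word by hand, here is a minimal Python sketch. It assumes the architectural IA32_MCi_STATUS layout described in the Intel SDM (flag bits 63 down to 57, model-specific code in bits 31:16, MCA error code in bits 15:0). The function name decode_mci_status is made up for illustration and is not part of mcelog or FreeBSD.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM layout).
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds extra info
    58: "ADDRV",  # ADDR register holds an address
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flag names and error codes."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


# Status value taken from the panic output in this thread.
decoded = decode_mci_status(0xB200000010000400)
print(decoded["flags"])
print(hex(decoded["mca_error_code"]))
```

The set flags are VAL, UC, EN and PCC, and the MCA error code comes out as 0x400, which is the architectural encoding for an internal timer error. That lines up with the kernel's "UNCOR PCC internal timer error" line.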
Might be something in the EFI/BIOS, an EFI or BIOS update might help, maybe switching between EFI and legacy booting, but that's just a guess.
Jetway is not known for quality hardware, though, so it's also possible it's an actual hardware problem with that CPU.
-
@jimp Actually, I cannot complain about the Jetway hardware I have in use (not a lot, but double digits). This is the first time I have had any significant issue with any of it.
Even if the actual error is caused by something in the hardware, because 2.5.2 runs perfectly fine the suggestion upthread that it may be exposed by a kernel change makes sense to me.
It looks like there is a BIOS update available, I will try that soon.
I have not booted into the 2.6.0 installer at all yet; if that works, perhaps it will give me a clue. Most likely I will end up bisecting the kernel, which will be so much fun ...
-
On occasion a newer base OS will utilize some new feature of the hardware and uncover a latent problem as well. So even if it is related to the newer base OS that doesn't necessarily rule out a hardware problem, though it may be a specific hardware device or function that wasn't touched in the old version.
-
There is more wrong here than just a "simple" hardware issue that is exposed by pfSense 2.6.
- FreeBSD 12.3 and 13.0 install fine as well, and I can bring up the network and do some basic testing without any indication of trouble.
- pfSense 2.6.0 installs (from scratch) without problems, and reboots and runs without the network connected, but when I plug in the WAN link it freezes after no more than ten seconds.
- Same for the SYNC link while pfsync is enabled at the other end.
However, I also cannot get it to send or receive anything on the LAN interface(s). The original configuration was with an LACP LAGG over igb0/igb1. I reduced it to a static LAGG, then to a single interface, and consistently only saw outgoing traffic on both the firewall and the switch respectively. Neither side received anything from the other, and I tried every combination and several cables, of course.
I am now back on 2.5.2 with the original configuration, and everything is working just as before.
A second NF692G6-420 behaves the same insofar as it panics on the first reboot after installation. Experimenting any further seems pointless.
Conclusion: If I want to use any pfSense after 2.5.2, I need different hardware. How nice.
-
@chrullrich said in Panic booting 2.6.0 on Jetway NF692G6-420:
There is more wrong here than just a "simple" hardware issue that is exposed by pfSense 2.6.
- FreeBSD 12.3 and 13.0 install fine as well, and I can bring up the network and do some basic testing without any indication of trouble.
Just curious -- which FreeBSD? Did you try STABLE or RELEASE? pfSense is now using the STABLE branch, and it is different than the same version number in RELEASE. pfSense 2.5.2 was FreeBSD 12.2 STABLE. The 2.6.0 pfSense is based on FreeBSD 12.3 STABLE.
So a fair test would need to be done on the STABLE branch for FreeBSD. Just mentioning this because some folks grab RELEASE and don't realize that STABLE can be quite different when it comes to drivers (and bugs).
So with all that said, it is true that pfSense runs on a "customized" FreeBSD, so there are some changes. If you see different behavior between FreeBSD 12.3 STABLE and pfSense, then it might point to a pfSense issue (or still might be the particular patch level between the 12.3 STABLE you test on versus what pfSense 2.6.0 is built on).
-
Does it make any difference which NIC you have assigned as WAN?
Do all 6 NICs use the igb(4) driver?
Steve
-
OK, I think I figured it out, and this is embarrassing. Short version: The ACPI OS selection was on Windows, and it works much better when set to Linux, although I'm not completely sure that fixed the panics. It fixed something, though.
Long version:
The BIOS on the NF692G6 has the usual ACPI OS selection, which (of course) defaults to Windows. The other options available are Linux and MSDOS, and since FreeBSD is neither Linux nor MSDOS, I figured I might as well leave it at the default. Big mistake.
I set up a test lab with a single WAN instead of two and a single LAN instead of ~10. From the start, I saw an entirely different problem than before: Rather than panicking or just freezing once they received CARP or pfsync traffic, each individual network interface stopped working when it saw the first TCP packet (or possibly anything but ICMP). I could literally ping forever without trouble, but as soon as I tried to get to the web configurator, the ping responses immediately stopped (and the browser timed out). As usual, this did not reproduce on vanilla FreeBSD (12.3, 13-RELEASE, 14-CURRENT). OPNsense 22.1 (with 13.0-RELEASE) did the same, and 21.7 (12.1) did not.
Then I noticed the OS option again, set it to Linux, and the new problem went away like the wind. If only I had not already replaced the hardware with nicer (read: much pricier) things.
-
@stephenw10 To answer your questions: No, it makes no difference which is the WAN, and yes, the six interfaces on this board are igb0 through igb5.
-
Hmm, that sounds like it could be some hardware offloading in the NIC that isn't implemented the way the NIC reports it. You might try comparing the output between 2.5.2 and 2.6.0 of:
ifconfig -vvvm igb0
TCP Segmentation Offloading should be disabled by default.
It's possible the BIOS reports different capabilities there to Windows. That's not something I've seen on any other hardware though.
Steve
-
@stephenw10 Looks identical to me. This is with the two versions on the two NF692G6s, and the one with 2.6.0 had link. I didn't notice until I started comparing them, by which time it was too late.
pfSense 2.6.0, ACPI OS = "Intel Linux":
[2.6.0-RELEASE][root@pfSense.home.arpa]/root: ifconfig -vvvm igb0
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e100bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6,TXCSUM_IPV6>
        capabilities=f53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:30:18:09:13:75
        inet6 fe80::230:18ff:fe09:1375%igb0 prefixlen 64 scopeid 0x1
        inet 0.0.0.0 netmask 0xff000000 broadcast 255.255.255.255
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        supported media:
                media autoselect
                media 1000baseT
                media 1000baseT mediaopt full-duplex
                media 100baseTX mediaopt full-duplex
                media 100baseTX
                media 10baseT/UTP mediaopt full-duplex
                media 10baseT/UTP
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
pfSense 2.6.0, ACPI OS = "Windows":
[2.6.0-RELEASE][root@pfSense.home.arpa]/root: ifconfig -vvvm igb0
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e100bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6,TXCSUM_IPV6>
        capabilities=f53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:30:18:09:13:75
        inet6 fe80::230:18ff:fe09:1375%igb0 prefixlen 64 scopeid 0x1
        inet 0.0.0.0 netmask 0xff000000 broadcast 255.255.255.255
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        supported media:
                media autoselect
                media 1000baseT
                media 1000baseT mediaopt full-duplex
                media 100baseTX mediaopt full-duplex
                media 100baseTX
                media 10baseT/UTP mediaopt full-duplex
                media 10baseT/UTP
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
pfSense 2.5.2, ACPI OS = "Intel Linux":
[2.5.2-RELEASE][root@pfSense.home.arpa]/root: ifconfig -vvvm igb0
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e100bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6,TXCSUM_IPV6>
        capabilities=f53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:30:18:09:12:df
        inet6 fe80::230:18ff:fe09:12df%igb0 prefixlen 64 scopeid 0x1
        media: Ethernet autoselect
        status: no carrier
        supported media:
                media autoselect
                media 1000baseT
                media 1000baseT mediaopt full-duplex
                media 100baseTX mediaopt full-duplex
                media 100baseTX
                media 10baseT/UTP mediaopt full-duplex
                media 10baseT/UTP
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
pfSense 2.5.2, ACPI OS = "Windows", after it spontaneously rebooted once at "Configuring LAN interface...", with no additional output on the serial console, on the first boot after changing the OS option:
[2.5.2-RELEASE][root@pfSense.home.arpa]/root: ifconfig -vvvm igb0
igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=e100bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6,TXCSUM_IPV6>
        capabilities=f53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6>
        ether 00:30:18:09:12:df
        inet6 fe80::230:18ff:fe09:12df%igb0 prefixlen 64 scopeid 0x1
        media: Ethernet autoselect
        status: no carrier
        supported media:
                media autoselect
                media 1000baseT
                media 1000baseT mediaopt full-duplex
                media 100baseTX mediaopt full-duplex
                media 100baseTX
                media 10baseT/UTP mediaopt full-duplex
                media 10baseT/UTP
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
-
Mmm, I agree, it looks to be configured the same in all cases.
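
If eyeballing those long flag lists gets tedious, something like the following Python sketch can do the comparison. It is only an illustration: the helper name flag_sets and the regular expression are my own, and the sample string is the options/capabilities pair copied from the igb0 output above. It pulls out the enabled options and the advertised capabilities, then lists what the NIC supports but has switched off (which is where TSO should show up if it is disabled as expected).

```python
import re

# Match "options=..." and "capabilities=..." but skip the "nd6 options=..." line.
FLAG_LINE = re.compile(r"(?<!nd6 )\b(options|capabilities)=[0-9a-f]+<([^>]*)>")


def flag_sets(ifconfig_text: str) -> dict:
    """Return {'options': {...}, 'capabilities': {...}} from ifconfig output."""
    return {key: set(flags.split(",")) for key, flags in FLAG_LINE.findall(ifconfig_text)}


# Sample copied from the igb0 output posted in this thread.
sample = (
    "options=e100bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,"
    "VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6,TXCSUM_IPV6> "
    "capabilities=f53fbb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,"
    "VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_UCAST,WOL_MCAST,WOL_MAGIC,VLAN_HWFILTER,"
    "VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6> "
    "nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>"
)

sets = flag_sets(sample)
# Capabilities the hardware advertises that are not currently enabled.
supported_but_off = sets["capabilities"] - sets["options"]
print(sorted(supported_but_off))
```

Running flag_sets on the saved output from two boots and comparing the resulting dictionaries with == would give a yes/no answer instead of a visual diff. For the sample above, TSO4, TSO6 and LRO land in the "supported but off" set.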