Kernel Panic on Temporarily Disable CARP with ixgbe driver
-
Hi,
I've set up a couple of new pfSense boxes and am hitting a kernel panic every time I click the Temporarily Disable CARP button on the primary while it's under load. I'm running iperf3 on two machines (one on the LAN side, the other on the WAN side) with a flow of approximately 1.8 Gbps in each direction. The panic doesn't occur when I'm not load testing, and it seems to be related to the ix driver.
Hardware is an HP DL360 Gen9 with an HP-561T (2-port, Intel X540-T2) 10 Gbps PCI NIC and an HP 331i (4-port, Broadcom 5719 chip) onboard 1 Gbps NIC. The Intel ports (ix driver) are configured on the WAN side as a dual-port LACP LAGG, and the Broadcom ports (bge driver) are configured on the LAN side as a dual-port LACP LAGG. I'm running the latest BIOS and NIC firmware from HP.
BIOS configuration: legacy BIOS mode, x2APIC disabled, power profile and regulator in OS Control Mode with powerd running (it crashes with or without powerd running, and with or without a BIOS-managed power profile).
The only sysctls I've changed are:
- Set kern.ipc.nmbclusters to 131072
- Set kern.ipc.nmbjumbop to 524288
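For a sense of scale, here is a rough calculation of the memory ceiling those two limits imply. It is only a sketch: the cluster sizes (2048 bytes per standard mbuf cluster, 4096 bytes per page-size jumbo cluster) are the usual amd64 values and are an assumption, not something stated in this thread. The values are upper limits, not memory that is allocated up front.

```python
# Cluster sizes assumed for amd64 (not stated in the thread).
MCLBYTES = 2048        # bytes per standard mbuf cluster
MJUMPAGESIZE = 4096    # bytes per page-size jumbo cluster

# Limits taken from the list above, paired with the assumed cluster size.
limits = {
    "kern.ipc.nmbclusters": (131072, MCLBYTES),
    "kern.ipc.nmbjumbop": (524288, MJUMPAGESIZE),
}


def ceiling_mib(count, size_bytes):
    """Upper bound on the memory a cluster zone could consume, in MiB."""
    return count * size_bytes // (1024 * 1024)


for name, (count, size_bytes) in limits.items():
    print(f"{name}: up to {ceiling_mib(count, size_bytes)} MiB")
```

Under those assumptions the jumbo limit is by far the larger of the two, so it is the one to compare against the RAM installed in the box.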
I've submitted a crash report today at approx. 12:30 EST (UTC-4) from WAN IP of 216.220.x.x (running in my lab, may appear as 207.34.x.x).
Update: tried with the latest 2.3-BETA from today, same problem. I also found this thread, which shows a similar error: https://forum.pfsense.org/index.php?topic=55433.0
Can anyone help me fix this? Thanks!
Fatal trap 12: page fault while in kernel mode
cpuid = 2; Fatal trap 12: page fault while in kernel mode
apic id = 04
cpuid = 4; apic id = 08
fault virtual address = 0x378
fault virtual address = 0x378
fault code = supervisor read data, page not present
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80abf3d9
stack pointer = 0x28:0xfffffe00003cb740
instruction pointer = 0x20:0xffffffff80abf3d9
stack pointer = 0x28:0xfffffe00003df740
frame pointer = 0x28:0xfffffe00003cb7d0
frame pointer = 0x28:0xfffffe00003df7d0
code segment = base 0x0, limit 0xfffff, type 0x1b
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, processor eflags = IOPL = 0
interrupt enabled, resume, IOPL = 0
current process = 12 (irq267: ix0:que 2)
c[ thread pid 12 tid 100047 ]
Stopped at __rw_rlock+0x1c9: movl 0x378(%r14),%eax
db:0:kdb.enter.default> show pcpu
cpuid = 3
dynamic pcpu = 0xfffffe00d7127780
curthread = 0xfffff80003649000: pid 12 "irq275: ix1:que 3"
curpcb = 0xfffffe0061266b80
fpcurthread = none
idlethread = 0xfffff80003388000: tid 100006 "idle: cpu3"
curpmap = 0xffffffff82182058
tssp = 0xffffffff8219d148
commontssp = 0xffffffff8219d148
rsp0 = 0xfffffe0061266b80
gs32p = 0xffffffff8219eba0
ldt = 0xffffffff8219ebe0
tss = 0xffffffff8219ebd0
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100063 td 0xfffff80003649000
__rw_rlock() at __rw_rlock+0x1c9/frame 0xfffffe00612667d0
carp_forus() at carp_forus+0x49/frame 0xfffffe0061266800
ether_nh_input() at ether_nh_input+0x2cc/frame 0xfffffe0061266860
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe00612668d0
ixgbe_rxeof() at ixgbe_rxeof+0x618/frame 0xfffffe0061266990
ixgbe_msix_que() at ixgbe_msix_que+0xbe/frame 0xfffffe00612669e0
intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame 0xfffffe0061266a20
ithread_loop() at ithread_loop+0x96/frame 0xfffffe0061266a70
fork_exit() at fork_exit+0x9a/frame 0xfffffe0061266ab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0061266ab0
--- trap 0, rip = 0, rsp = 0xfffffe0061266b70, rbp = 0 ---
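One detail in the trap output is worth pulling out: the fault virtual address (0x378) is the same as the displacement in the faulting instruction, movl 0x378(%r14),%eax. The sketch below just does that subtraction, assuming ordinary x86-64 base-plus-displacement addressing; the helper name is made up for illustration.

```python
import re

# Values copied from the trap output above.
fault_address = 0x378
faulting_instruction = "movl 0x378(%r14),%eax"


def base_register_value(fault_addr, instruction):
    """Return (register, value) for a disp(%reg) memory operand.

    effective address = register value + displacement, so
    register value = fault address - displacement.
    """
    match = re.search(r"(0x[0-9a-fA-F]+)\((%\w+)\)", instruction)
    if match is None:
        raise ValueError("no disp(%reg) operand found")
    displacement = int(match.group(1), 16)
    register = match.group(2)
    return register, fault_addr - displacement


register, value = base_register_value(fault_address, faulting_instruction)
print(f"{register} = {value:#x}")
```

A base register of zero would mean the code read a field at a fixed offset from a NULL pointer. That is consistent with an object that was torn down or never set up, but on its own it does not say which object or why.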
-
…
fault code = supervisor read data, page not present
...
Could be a use-after-free.
pfSense 2.2.6 is based on FreeBSD 10.1-RELEASE-p25, which has mainly security fixes from the 10 branch but is missing some other fixes.
https://svnweb.freebsd.org/base?view=revision&revision=277625 should fix it.
Could one of the pfSense folks check whether that's in their tree?
-
Could be a use-after-free.
pfSense 2.2.6 is based on FreeBSD 10.1-RELEASE-p25, which has mainly security fixes from the 10 branch but is missing some other fixes.
https://svnweb.freebsd.org/base?view=revision&revision=277625 should fix it.
Yes, that's what I thought. However, that patch already appears to be in the devel branch (https://github.com/pfsense/FreeBSD-src/commit/f72184af7f1b19f99893f951a64a22f22ec344ba). I tried a beta build of 2.3 last week and hit the same problem. Are the beta snapshots built from the devel branch?
-
I guess it is.
Can you SSH into the pfSense box and run "uname -a" in the shell?
-
Can you SSH into the pfSense box and run "uname -a" in the shell?
FreeBSD <redacted> 10.2-STABLE FreeBSD 10.2-STABLE #317 58b7eab(devel): Fri Jan 15 04:28:46 CST 2016 root@pfs23-amd64-builder:/usr/home/pfsense/pfsense/tmp/obj/usr/home/pfsense/pfsense/tmp/FreeBSD-src/sys/pfSense amd64
-
https://github.com/pfsense/FreeBSD-src/commit/f72184af7f1b19f99893f951a64a22f22ec344ba#diff-2a75ab8f3cf1e4838de5abd9c14a1870
The fix seems to be in there, if that's the tree the beta is built from.
-
Yeah, it looks like it is. The commit hash in uname (58b7eab) is from the devel branch.
As I mentioned earlier, this sounds similar to https://forum.pfsense.org/index.php?topic=55433.0, which wasn't actually solved, just worked around by using a different NIC. I've traced the code back from carp_forus(), which attempts to grab the lock (https://github.com/pfsense/FreeBSD-src/blob/945ed01c4bae06169f63978e43029c04d4abd731/sys/netinet/ip_carp.c#L1126), but once I get back into the ixgbe driver it gets too complicated for me, and I haven't managed to find what may be freeing the ifp pointer.
-
I should add that ether_input() does check that ifp isn't a NULL pointer (https://github.com/pfsense/FreeBSD-src/blob/5aba7ffcfb97d9b6f4ce464de77b02ad4d7b8ad3/sys/net/if_ethersubr.c#L628), but maybe there's a race condition here where something else is clearing it afterwards.
-
Did you try
hw.pci.enable_msix=0
btw?
-
Did you try
hw.pci.enable_msix=0
Yep, that worked! What's the impact of disabling MSI-X though? Would be nice to not have to disable MSI-X and get to the bottom of the bug.
-
MSI-X is an extension to MSI which, AFAIK, uses a separate capability structure and offers more interrupt vectors:
https://en.wikipedia.org/wiki/Message_Signaled_Interrupts
Now that it works without MSI-X, you could try a different slot for the ixgbe card (HP should have a best-practice document for that).
You could also try updating the server's BIOS. Maybe the MSI-X setup is somewhat borked.
-
Now that it works without MSI-X, you could try a different slot for the ixgbe card (HP should have a best-practice document for that).
You could also try updating the server's BIOS. Maybe the MSI-X setup is somewhat borked.
Server BIOS is up to date, running the latest release from HP which came out last month. I'll see if I can try a different slot. On a somewhat related note, I had to disable x2APIC in the BIOS for the machine to boot. Not sure if that's a BIOS or FreeBSD issue.