pfSense crashes after update to 2.2.6
-
I have an identical pair of Dell PowerEdge 610 firewalls that were recently upgraded to version 2.2.6-RELEASE (amd64).
They are configured to synchronise with pfsync, with all traffic directed to CARP-managed VIPs, which are held on the primary firewall by default.
For a couple of days after the upgrade, both firewalls were stable. Then the secondary firewall, the one carrying a negligible amount of traffic, started crashing 3-4 times a day.
The primary firewall, which carries all the traffic (>100Mb/s), has remained stable.
They each have a built-in 4-port Broadcom interface:
bce0: <QLogic NetXtreme II BCM5709 1000Base-T (C0)> mem 0xd6000000-0xd7ffffff irq 36 at device 0.0 on pci1
and two Intel PCI cards (4 ports each):
em0:
They each have the following lines in /boot/loader.conf, as recommended for these interfaces:
kern.ipc.nmbclusters="1048576"
hw.bce.tso_enable=0
hw.pci.enable_msix=0
They have been stable on older versions of pfSense for the last 3 years.
The crash dumps show two slightly different crashes, and each seems to be linked to one of the NIC drivers, bce and em. The excerpts that led me to think this are given below: each shows the trap summary from the end of the dump, which gives the nature of the crash, followed by the backtrace containing the frame that matches the trap's frame pointer.
This is extracted from one of the crash dumps associated with the em driver:
Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 22
instruction pointer = 0x20:0xffffffff80b2ee60
stack pointer = 0x28:0xfffffe009f7dd680
frame pointer = 0x28:0xfffffe009f7dd6a0
code segment = base 0x0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (em0 que)
FreeBSD 10.1-RELEASE-p25 #0 c39b63e(releng/10.1)-dirty: Mon Dec 21 15:20:13 CST 2015
root@pfs22-amd64-builder:/usr/obj.RELENG_2_2.amd64/usr/pfSensesrc/src.RELENG_2_2/sys/pfSense_SMP.10
...
db:0:kdb.enter.default> bt
Tracing pid 0 tid 100064 td 0xfffff800038b5920
m_freem() at m_freem+0x20/frame 0xfffffe009f7dd6a0
carp_input_c() at carp_input_c+0x24b/frame 0xfffffe009f7dd7a0
ip_input() at ip_input+0x118/frame 0xfffffe009f7dd7f0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe009f7dd860
ether_demux() at ether_demux+0x149/frame 0xfffffe009f7dd890
ether_nh_input() at ether_nh_input+0x347/frame 0xfffffe009f7dd8f0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe009f7dd960
ether_demux() at ether_demux+0xa5/frame 0xfffffe009f7dd990
ether_nh_input() at ether_nh_input+0x347/frame 0xfffffe009f7dd9f0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe009f7dda60
em_rxeof() at em_rxeof+0x40a/frame 0xfffffe009f7ddaf0
em_handle_que() at em_handle_que+0x41/frame 0xfffffe009f7ddb30
taskqueue_run_locked() at taskqueue_run_locked+0xe5/frame 0xfffffe009f7ddb80
taskqueue_thread_loop() at taskqueue_thread_loop+0xa8/frame 0xfffffe009f7ddbb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe009f7ddbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe009f7ddbf0
--- trap 0, rip = 0, rsp = 0xfffffe009f7ddcb0, rbp = 0 ---
This is extracted from one of the crash dumps associated with the bce driver:
Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 20
instruction pointer = 0x20:0xffffffff80b30a53
stack pointer = 0x28:0xfffffe009f7d0a60
frame pointer = 0x28:0xfffffe009f7d0a90
code segment = base 0x0, limit 0xfffff, type 0x1b
             = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq259: bce3)
...
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100063 td 0xfffff800037a2000
m_cat() at m_cat+0x13/frame 0xfffffe009f7d0a90
bce_intr() at bce_intr+0x4f9/frame 0xfffffe009f7d0b20
intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame 0xfffffe009f7d0b60
ithread_loop() at ithread_loop+0x96/frame 0xfffffe009f7d0bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe009f7d0bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe009f7d0bf0
--- trap 0, rip = 0, rsp = 0xfffffe009f7d0cb0, rbp = 0 ---
db:0:kdb.enter.default> ps
pid ppid pgrp uid state wmesg wchan cmd
...
12 0 0 0 RL (threaded) [intr]
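Matching the trap's frame pointer against the backtrace frames can also be done mechanically when there are many dumps to compare. The following is only a minimal Python sketch, not a pfSense or FreeBSD tool: the function names and the regular expressions are assumptions based purely on the excerpt format shown in this post, and the sample string is a shortened copy of the em excerpt.

```python
import re

# Matches the trap summary's "frame pointer = <segment>:<address>" field.
FRAME_POINTER_RE = re.compile(r"frame pointer\s*=\s*0x[0-9a-f]+:(0x[0-9a-f]+)")
# Matches one backtrace entry, e.g. "m_freem() at m_freem+0x20/frame 0x...".
BT_ENTRY_RE = re.compile(r"(\w+)\(\) at \w+\+0x[0-9a-f]+/frame (0x[0-9a-f]+)")


def call_chain(dump_text):
    """Return the backtrace function names, innermost frame first."""
    return [func for func, _frame in BT_ENTRY_RE.findall(dump_text)]


def faulting_function(dump_text):
    """Return the function whose frame address equals the trap's frame pointer."""
    trap = FRAME_POINTER_RE.search(dump_text)
    if trap is None:
        return None
    for func, frame in BT_ENTRY_RE.findall(dump_text):
        if frame == trap.group(1):
            return func
    return None


# Shortened copy of the em excerpt; works on flattened or multi-line text.
em_dump = (
    "frame pointer = 0x28:0xfffffe009f7dd6a0 "
    "m_freem() at m_freem+0x20/frame 0xfffffe009f7dd6a0 "
    "carp_input_c() at carp_input_c+0x24b/frame 0xfffffe009f7dd7a0"
)

print(faulting_function(em_dump))  # prints m_freem
print(call_chain(em_dump))
```

Running this over each saved dump and grouping by the first few entries of `call_chain` would show whether there really are only two distinct crash paths.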
The iDRAC console on the crashing server shows that all of its health checks are good.
I am puzzled as to what could have changed to cause this sort of failure only on the idle server.
Any ideas?
-
Using limiters? Can't combine pfsync and limiters in 2.2.x and newer.
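To check whether any firewall rule references a limiter, a backup of config.xml can be scanned off-box. This is a rough Python sketch under an assumption about the config layout: that limiter references on rules are stored as `<dnpipe>` and `<pdnpipe>` elements inside each `<filter><rule>` entry. Verify that against your own file before relying on it. The function name and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: these child elements of a rule name the in/out limiters.
LIMITER_TAGS = ("dnpipe", "pdnpipe")


def rules_using_limiters(config_xml):
    """Return (rule description, tag, limiter name) for each limiter reference."""
    root = ET.fromstring(config_xml)
    hits = []
    for rule in root.findall("./filter/rule"):
        descr = (rule.findtext("descr") or "").strip() or "(no description)"
        for tag in LIMITER_TAGS:
            value = (rule.findtext(tag) or "").strip()
            if value:
                hits.append((descr, tag, value))
    return hits


# Hypothetical, heavily trimmed config used only to exercise the function.
sample = """<pfsense>
  <filter>
    <rule><descr>LAN out</descr><dnpipe>LimitIn</dnpipe><pdnpipe>LimitOut</pdnpipe></rule>
    <rule><descr>Allow DNS</descr></rule>
  </filter>
</pfsense>"""

for descr, tag, name in rules_using_limiters(sample):
    print(f"{descr}: {tag} -> {name}")
```

An empty result on the real config would suggest no rule is using a limiter, assuming the layout above holds.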
-
Thanks very much cmb. I do have a limiter configured. I'll disable it and report back.