pfSense crashes after update to 2.2.6

dogvoiceman

I have an identical pair of Dell PowerEdge 610 firewalls that were recently upgraded to version 2.2.6-RELEASE (amd64).

They are configured to synchronise with pfsync, and all traffic is directed to CARP-managed VIPs, which are held on the primary firewall by default.

For a couple of days after the upgrade, both firewalls were stable. Then the secondary firewall, the one carrying a negligible amount of traffic, started crashing 3-4 times a day.
The primary firewall, which carries all the traffic (>100Mb/s), has remained stable.

They each have a built-in 4-port Broadcom interface:


bce0: <QLogic NetXtreme II BCM5709 1000Base-T (C0)> mem 0xd6000000-0xd7ffffff irq 36 at device 0.0 on pci1

and two Intel PCI cards (4 ports each):


em0:

Each has the following lines in /boot/loader.conf, as recommended for these interfaces:


kern.ipc.nmbclusters="1048576"
hw.bce.tso_enable=0
hw.pci.enable_msix=0

They have worked fine and stably with older versions of pfSense for the last 3 years.

The crash dumps show two slightly different crashes, one apparently linked to each of the NIC drivers, bce and em. The excerpts that led me to think this are given below: each shows the trap summary at the end of the dump, describing the nature of the crash, followed by the backtrace containing the frame that matches the trap's frame pointer.

This is extracted from one of the crash dumps associated with the em driver:


Fatal trap 9: general protection fault while in kernel mode
cpuid = 1; apic id = 22
instruction pointer	= 0x20:0xffffffff80b2ee60
stack pointer	        = 0x28:0xfffffe009f7dd680
frame pointer	        = 0x28:0xfffffe009f7dd6a0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 0 (em0 que)
FreeBSD 10.1-RELEASE-p25 #0 c39b63e(releng/10.1)-dirty: Mon Dec 21 15:20:13 CST 2015
    root@pfs22-amd64-builder:/usr/obj.RELENG_2_2.amd64/usr/pfSensesrc/src.RELENG_2_2/sys/pfSense_SMP.10

...

db:0:kdb.enter.default>  bt
Tracing pid 0 tid 100064 td 0xfffff800038b5920
m_freem() at m_freem+0x20/frame 0xfffffe009f7dd6a0
carp_input_c() at carp_input_c+0x24b/frame 0xfffffe009f7dd7a0
ip_input() at ip_input+0x118/frame 0xfffffe009f7dd7f0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe009f7dd860
ether_demux() at ether_demux+0x149/frame 0xfffffe009f7dd890
ether_nh_input() at ether_nh_input+0x347/frame 0xfffffe009f7dd8f0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe009f7dd960
ether_demux() at ether_demux+0xa5/frame 0xfffffe009f7dd990
ether_nh_input() at ether_nh_input+0x347/frame 0xfffffe009f7dd9f0
netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe009f7dda60
em_rxeof() at em_rxeof+0x40a/frame 0xfffffe009f7ddaf0
em_handle_que() at em_handle_que+0x41/frame 0xfffffe009f7ddb30
taskqueue_run_locked() at taskqueue_run_locked+0xe5/frame 0xfffffe009f7ddb80
taskqueue_thread_loop() at taskqueue_thread_loop+0xa8/frame 0xfffffe009f7ddbb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe009f7ddbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe009f7ddbf0
--- trap 0, rip = 0, rsp = 0xfffffe009f7ddcb0, rbp = 0 ---

This is extracted from one of the crash dumps associated with the bce driver:


Fatal trap 9: general protection fault while in kernel mode
cpuid = 0; apic id = 20
instruction pointer     = 0x20:0xffffffff80b30a53
stack pointer           = 0x28:0xfffffe009f7d0a60
frame pointer           = 0x28:0xfffffe009f7d0a90
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 12 (irq259: bce3)

...

db:0:kdb.enter.default>  bt
Tracing pid 12 tid 100063 td 0xfffff800037a2000
m_cat() at m_cat+0x13/frame 0xfffffe009f7d0a90
bce_intr() at bce_intr+0x4f9/frame 0xfffffe009f7d0b20
intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame 0xfffffe009f7d0b60
ithread_loop() at ithread_loop+0x96/frame 0xfffffe009f7d0bb0
fork_exit() at fork_exit+0x9a/frame 0xfffffe009f7d0bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe009f7d0bf0
--- trap 0, rip = 0, rsp = 0xfffffe009f7d0cb0, rbp = 0 ---
db:0:kdb.enter.default>  ps
  pid  ppid  pgrp   uid   state   wmesg         wchan        cmd
...
   12     0     0     0  RL      (threaded)                  [intr]
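
For anyone who wants to repeat the frame matching on their own dumps, here is a minimal Python sketch. It assumes the dump text has the same layout as the excerpts above (a "frame pointer" line, a "current process" line, and ddb backtrace lines ending in "/frame 0x..."). The function name `faulting_frame` and the regular expressions are illustrative only and are not part of pfSense or FreeBSD.

```python
import re

# Patterns for the trap summary and ddb backtrace lines, as they appear
# in the excerpts above. Other dump formats may need different patterns.
TRAP_FP = re.compile(r"^frame pointer\s*=\s*0x[0-9a-f]+:(0x[0-9a-f]+)", re.M)
PROC = re.compile(r"^current process\s*=\s*\d+ \((.+)\)", re.M)
BT_LINE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame (0x[0-9a-f]+)", re.M)


def faulting_frame(dump_text):
    """Return (function, current process) for the first backtrace frame
    whose address equals the trap's frame pointer, or None if no match."""
    fp = TRAP_FP.search(dump_text)
    if fp is None:
        return None
    proc = PROC.search(dump_text)
    proc_name = proc.group(1) if proc else None
    for frame in BT_LINE.finditer(dump_text):
        if frame.group(2) == fp.group(1):
            return (frame.group(1), proc_name)
    return None


# A few lines copied from the em excerpt above.
EM_EXCERPT = """\
frame pointer           = 0x28:0xfffffe009f7dd6a0
current process         = 0 (em0 que)
m_freem() at m_freem+0x20/frame 0xfffffe009f7dd6a0
carp_input_c() at carp_input_c+0x24b/frame 0xfffffe009f7dd7a0
"""

print(faulting_frame(EM_EXCERPT))  # ('m_freem', 'em0 que')
```

Run against the bce excerpt, the same function would report m_cat() in the "irq259: bce3" interrupt thread. Both faults land in mbuf handling routines rather than in driver-specific code.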

The iDRAC console on the crashing server shows that all of its health checks are good.

I am puzzled as to what could have changed to cause this sort of failure only on the idle server.

Any ideas?

cmb

Using limiters? Can't combine pfsync and limiters in 2.2.x and newer.

dogvoiceman

Thanks very much cmb. I do have a limiter configured. I'll disable it and report back.
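
A rough way to confirm whether any firewall rule still references a limiter is to scan a backup of config.xml. The Python sketch below rests on assumptions about the config layout: that limiters are stored as `<queue>` entries under `<dnshaper>`, and that rules reference them through `<dnpipe>` and `<pdnpipe>` tags. Check both against your own config file. The function name `limiter_usage` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def limiter_usage(config_xml):
    """Return (limiter names, [(rule description, [referenced limiters])]).

    Assumes limiters are <dnshaper>/<queue> entries and that rules point
    at them via <dnpipe>/<pdnpipe>; verify this against your own config.
    """
    root = ET.fromstring(config_xml)
    limiters = [q.findtext("name", default="")
                for q in root.findall("./dnshaper/queue")]
    rules = []
    for rule in root.findall("./filter/rule"):
        pipes = [rule.findtext(tag) for tag in ("dnpipe", "pdnpipe")]
        pipes = [p for p in pipes if p]  # drop missing or empty tags
        if pipes:
            rules.append((rule.findtext("descr", default=""), pipes))
    return limiters, rules


# Hypothetical, heavily trimmed config used only to exercise the function.
SAMPLE = """\
<pfsense>
  <dnshaper><queue><name>LimitDown</name></queue></dnshaper>
  <filter>
    <rule><descr>LAN out</descr><dnpipe>LimitDown</dnpipe></rule>
    <rule><descr>Plain rule</descr></rule>
  </filter>
</pfsense>
"""

limiters, rules = limiter_usage(SAMPLE)
print(limiters)  # ['LimitDown']
print(rules)     # [('LAN out', ['LimitDown'])]
```

An empty second list would suggest that no rule is pushing traffic through a limiter, although the GUI remains the authoritative place to check.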