pfSense crashes after update to 2.2.6



  • I have an identical pair of Dell PowerEdge 610 firewalls that were recently upgraded to pfSense 2.2.6-RELEASE (amd64).

    They are configured to synchronise with pfsync. All traffic is directed to VIPs managed by CARP, which are held on the primary firewall by default.

    For a couple of days after the upgrade, both firewalls were stable. Then the secondary firewall, which carries a negligible amount of traffic, started crashing 3-4 times a day.
    The primary firewall, which carries all the traffic (>100Mb/s), has remained stable.

    They each have a built-in 4-port Broadcom interface:

    
    bce0: <QLogic NetXtreme II BCM5709 1000Base-T (C0)> mem 0xd6000000-0xd7ffffff irq 36 at device 0.0 on pci1
    

    and two Intel PCI cards (4-ports each):

    
    em0: 
    

    They each have the following lines in /boot/loader.conf, as recommended for these interfaces.

    
    kern.ipc.nmbclusters="1048576"
    hw.bce.tso_enable=0
    hw.pci.enable_msix=0
    
    

    They have run stably on older versions of pfSense for the last 3 years.

    The crash dumps show two slightly different crashes, one apparently linked to each of the NIC drivers, bce and em. The excerpts below are what led me to think this: each shows the trap summary from the end of the dump, followed by the backtrace whose top frame matches the trap's frame pointer.

    This is extracted from one of the crash dumps associated with the em driver:

    
    Fatal trap 9: general protection fault while in kernel mode
    cpuid = 1; apic id = 22
    instruction pointer	= 0x20:0xffffffff80b2ee60
    stack pointer	        = 0x28:0xfffffe009f7dd680
    frame pointer	        = 0x28:0xfffffe009f7dd6a0
    code segment		= base 0x0, limit 0xfffff, type 0x1b
    			= DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags	= interrupt enabled, resume, IOPL = 0
    current process		= 0 (em0 que)
    version.txt: FreeBSD 10.1-RELEASE-p25 #0 c39b63e(releng/10.1)-dirty: Mon Dec 21 15:20:13 CST 2015
        root@pfs22-amd64-builder:/usr/obj.RELENG_2_2.amd64/usr/pfSensesrc/src.RELENG_2_2/sys/pfSense_SMP.10
    
    ...
    
    db:0:kdb.enter.default>  bt
    Tracing pid 0 tid 100064 td 0xfffff800038b5920
    m_freem() at m_freem+0x20/frame 0xfffffe009f7dd6a0
    carp_input_c() at carp_input_c+0x24b/frame 0xfffffe009f7dd7a0
    ip_input() at ip_input+0x118/frame 0xfffffe009f7dd7f0
    netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe009f7dd860
    ether_demux() at ether_demux+0x149/frame 0xfffffe009f7dd890
    ether_nh_input() at ether_nh_input+0x347/frame 0xfffffe009f7dd8f0
    netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe009f7dd960
    ether_demux() at ether_demux+0xa5/frame 0xfffffe009f7dd990
    ether_nh_input() at ether_nh_input+0x347/frame 0xfffffe009f7dd9f0
    netisr_dispatch_src() at netisr_dispatch_src+0x62/frame 0xfffffe009f7dda60
    em_rxeof() at em_rxeof+0x40a/frame 0xfffffe009f7ddaf0
    em_handle_que() at em_handle_que+0x41/frame 0xfffffe009f7ddb30
    taskqueue_run_locked() at taskqueue_run_locked+0xe5/frame 0xfffffe009f7ddb80
    taskqueue_thread_loop() at taskqueue_thread_loop+0xa8/frame 0xfffffe009f7ddbb0
    fork_exit() at fork_exit+0x9a/frame 0xfffffe009f7ddbf0
    fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe009f7ddbf0
    --- trap 0, rip = 0, rsp = 0xfffffe009f7ddcb0, rbp = 0 ---
    
    

    This is extracted from one of the crash dumps associated with the bce driver:

    
    Fatal trap 9: general protection fault while in kernel mode
    cpuid = 0; apic id = 20
    instruction pointer     = 0x20:0xffffffff80b30a53
    stack pointer           = 0x28:0xfffffe009f7d0a60
    frame pointer           = 0x28:0xfffffe009f7d0a90
    code segment            = base 0x0, limit 0xfffff, type 0x1b
                            = DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags        = interrupt enabled, resume, IOPL = 0
    current process         = 12 (irq259: bce3)
    
    ...
    
    db:0:kdb.enter.default>  bt
    Tracing pid 12 tid 100063 td 0xfffff800037a2000
    m_cat() at m_cat+0x13/frame 0xfffffe009f7d0a90
    bce_intr() at bce_intr+0x4f9/frame 0xfffffe009f7d0b20
    intr_event_execute_handlers() at intr_event_execute_handlers+0xab/frame 0xfffffe009f7d0b60
    ithread_loop() at ithread_loop+0x96/frame 0xfffffe009f7d0bb0
    fork_exit() at fork_exit+0x9a/frame 0xfffffe009f7d0bf0
    fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe009f7d0bf0
    --- trap 0, rip = 0, rsp = 0xfffffe009f7d0cb0, rbp = 0 ---
    db:0:kdb.enter.default>  ps
      pid  ppid  pgrp   uid   state   wmesg         wchan        cmd
    ...
       12     0     0     0  RL      (threaded)                  [intr]
    
    
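To compare many dumps quickly, the two backtraces above can be summarised with a small script. This is only a rough sketch: the function names `parse_frames` and `summarize`, the driver prefixes, and the shortened sample traces are illustrative assumptions, and the regular expression only covers the `name() at name+0x.../frame 0x...` layout seen in these excerpts.

```python
import re

# Matches one ddb "bt" frame as it appears in the excerpts above, e.g.
#   m_freem() at m_freem+0x20/frame 0xfffffe009f7dd6a0
FRAME_RE = re.compile(r"^\s*(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+")


def parse_frames(bt_text):
    """Return the function names of a backtrace, innermost frame first."""
    frames = []
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append(match.group(1))
    return frames


def summarize(bt_text, driver_prefixes=("em_", "bce_")):
    """Report the faulting function, the first NIC driver frame, and CARP use.

    The driver prefixes are an assumption based on the two drivers
    mentioned in this post.
    """
    frames = parse_frames(bt_text)
    return {
        "faulting": frames[0] if frames else None,
        "driver_entry": next(
            (f for f in frames if f.startswith(driver_prefixes)), None
        ),
        "via_carp": any(f.startswith("carp_") for f in frames),
    }


# Shortened copies of the two traces quoted above (some frames omitted).
EM_BT = """
m_freem() at m_freem+0x20/frame 0xfffffe009f7dd6a0
carp_input_c() at carp_input_c+0x24b/frame 0xfffffe009f7dd7a0
ip_input() at ip_input+0x118/frame 0xfffffe009f7dd7f0
em_rxeof() at em_rxeof+0x40a/frame 0xfffffe009f7ddaf0
em_handle_que() at em_handle_que+0x41/frame 0xfffffe009f7ddb30
"""

BCE_BT = """
m_cat() at m_cat+0x13/frame 0xfffffe009f7d0a90
bce_intr() at bce_intr+0x4f9/frame 0xfffffe009f7d0b20
ithread_loop() at ithread_loop+0x96/frame 0xfffffe009f7d0bb0
"""

if __name__ == "__main__":
    print(summarize(EM_BT))
    print(summarize(BCE_BT))
```

Both traces fault in an mbuf routine (`m_freem`, `m_cat`) rather than in driver-specific code, and the em trace passes through `carp_input_c` on the way.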

    The iDRAC console on the crashing server shows that all of its health checks are good.

    I am puzzled as to what could have changed to cause this sort of failure only on the idle server.

    Any ideas?



  • Using limiters? Can't combine pfsync and limiters in 2.2.x and newer.
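One way to check whether any limiter is in use is to scan a backup of config.xml. The sketch below assumes that limiters are stored under `<dnshaper><queue>` and that firewall rules reference them through `<dnpipe>` and `<pdnpipe>` elements; verify those tag names against your own config before relying on it. The helper names and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed config.xml used only to exercise the helpers.
SAMPLE_CONFIG = """
<pfsense>
  <dnshaper>
    <queue><name>LimitDown</name></queue>
    <queue><name>LimitUp</name></queue>
  </dnshaper>
  <filter>
    <rule>
      <descr>Guest LAN out</descr>
      <dnpipe>LimitUp</dnpipe>
      <pdnpipe>LimitDown</pdnpipe>
    </rule>
    <rule>
      <descr>Allow DNS</descr>
    </rule>
  </filter>
</pfsense>
"""


def defined_limiters(config_xml):
    """Names of limiters defined in the config (assumed <dnshaper><queue>)."""
    root = ET.fromstring(config_xml)
    return [q.findtext("name") for q in root.findall("./dnshaper/queue")]


def rules_using_limiters(config_xml):
    """Firewall rules that reference a limiter via <dnpipe> or <pdnpipe>."""
    root = ET.fromstring(config_xml)
    hits = []
    for rule in root.findall("./filter/rule"):
        pipes = [rule.findtext(tag) for tag in ("dnpipe", "pdnpipe")]
        pipes = [p for p in pipes if p]  # drop missing or empty elements
        if pipes:
            hits.append((rule.findtext("descr") or "(no description)", pipes))
    return hits


if __name__ == "__main__":
    print(defined_limiters(SAMPLE_CONFIG))
    print(rules_using_limiters(SAMPLE_CONFIG))
```

To run it against a real system, read the exported config file into a string and pass it to the two helpers. Any rule it reports would need its limiter removed before testing again.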



  • Thanks very much cmb. I do have a limiter configured. I'll disable it and report back.