Fatal error every other day

  • First of all, thank you for all your effort; that a corporation like mine can rely on your products for two separate solutions is pretty awesome!
    (Sorry if it's incorrect to use the exclamation mark as the message icon. I just found it appropriate for a crash  :) )

    Now, to

    My setup:

    I have a couple of pfSense boxes running on two Dell PowerEdge R210 II servers (with iDRAC). Each has a virtual bridged interface between WAN and LAN and functions as a bridged firewall. They are redundantly configured via STP, so that the connection to the secondary firewall is cut whenever the primary firewall is responding with BPDU packets.

    _igb0-3 (the bridged interfaces):
    Intel(R) PRO/1000 Network Connection version - 2.4.0
    Using MSIX interrupts with 5 vectors

    Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz
    Current: 3100 MHz, Max: 3101 MHz
    4 CPUs: 1 package(s) x 4 core(s)_

    And my build:
    2.2.6-RELEASE (amd64)
    built on Mon Dec 21 14:50:08 CST 2015
    FreeBSD 10.1-RELEASE-p25

    The problem:

    Every other or third day, the primary firewall crashes, failing over to the secondary. I have attached a text-file with a dump.
    I take note of the following message, even though I am not 100% sure of how I should interpret it:

    Fatal trap 12: page fault while in kernel mode
    cpuid = 2; apic id = 04
    fault virtual address    = 0x1d
    fault code        = supervisor read data, page not present
    instruction pointer    = 0x20:0xffffffff80b904b7
    stack pointer            = 0x28:0xfffffe001a3d06c0
    frame pointer            = 0x28:0xfffffe001a3d0740
    code segment        = base 0x0, limit 0xfffff, type 0x1b
                = DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags    = interrupt enabled, resume, IOPL = 0
    current process        = 12 (irq276: igb2:que 2)
    version.txt: FreeBSD 10.1-RELEASE-p25 #0 c39b63e(releng/10.1)-dirty: Mon Dec 21 15:20:13 CST 2015
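
    To make my reading of the trap message a bit more concrete, here is a minimal Python sketch that pulls the "key = value" fields out of the text above. The names `parse_trap` and `summarize` are my own and not part of any pfSense or FreeBSD tool. The "near NULL" check is only my assumption about how to interpret it: a fault address as low as 0x1d lies inside the first 4096-byte page, which usually suggests a read through a NULL pointer plus a small offset, but I am not certain that is what happened here.

```python
import re

# Excerpt of the trap message quoted above.
TRAP_TEXT = """\
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address    = 0x1d
fault code        = supervisor read data, page not present
instruction pointer    = 0x20:0xffffffff80b904b7
current process        = 12 (irq276: igb2:que 2)
"""

PAGE_SIZE = 4096  # base page size on amd64


def parse_trap(text):
    """Return the 'key = value' lines of a trap message as a dict."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([a-z][a-z ]*?)\s*=\s*(.+?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def summarize(fields):
    """Extract the fault address and the thread that was running."""
    addr = int(fields["fault virtual address"], 16)
    proc = re.match(r"(\d+) \((.+)\)", fields["current process"])
    return {
        "fault_address": addr,
        # Assumption: an address inside page zero hints at a NULL dereference.
        "near_null": addr < PAGE_SIZE,
        "thread": proc.group(2) if proc else None,
    }


summary = summarize(parse_trap(TRAP_TEXT))
print(summary)
```

    The thread name it reports is the interrupt thread for queue 2 of igb2, which is why I am looking at that interface in particular.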


    I have monitored traffic on the inside (LAN) interfaces of the firewalls, and you can see two attached images of our primary and secondary firewalls.
    On the graphs, "outbound" means outbound from the firewall via the LAN-interface, i.e. from WAN to LAN.

    Firstly, I have attached an image of what I believe to be a precursor:

    Normally, I expect equal amounts of traffic on both firewalls, as they function as bridges and simply pass on all packets (firewalled, of course). Packets are blocked by STP on a later switch on the WAN side. On the "precursor" graphs, we see a sudden spike in traffic on only the primary firewall, after which traffic flows unevenly. The spike is around 200 Mbit/s, which is also observed in other "precursors".
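
    If it helps anyone reproduce the comparison, this is roughly how the "precursor" could be flagged automatically: compare the throughput samples of the two firewalls and report the points where they differ by more than some threshold. This is only a sketch. The function name `find_divergence`, the 100 Mbit/s threshold and the sample values are all hypothetical and are not taken from my graphs.

```python
def find_divergence(primary, secondary, threshold_mbit=100.0):
    """Return the sample indices where the two series differ by more than the threshold."""
    return [
        i
        for i, (p, s) in enumerate(zip(primary, secondary))
        if abs(p - s) > threshold_mbit
    ]


# Hypothetical samples in Mbit/s, NOT real measurements from my firewalls.
primary = [40.0, 42.0, 240.0, 90.0, 41.0]
secondary = [40.0, 41.0, 43.0, 60.0, 40.0]

flagged = find_divergence(primary, secondary)
print(flagged)
```

    With these made-up numbers only the third sample is flagged, since it is the only one where the primary carries far more traffic than the secondary.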

    Next, I have attached an image of the actual crash:

    About an hour or two later, everything looks fine, except that the primary firewall just "disappears" on the graphs all of a sudden. This is because of the kernel crash.

    Now, I do not know whether the spikes and the crashes are even related; they may not be. I just found it odd, especially since this abnormality has been observed more than once. See the file "another-crash".


    Since the crash report says "current process        = 12 (irq276: igb2:que 2)", I have wondered whether our TCP queue length is insufficient on the WAN interface (igb2), and whether a queue that grows too large triggers the crash. The queue is set to a default of 1000, which can be turned up in case of heavy load. This guy (https://forum.pfsense.org/index.php?topic=68919.0) has done something similar, although he doesn't experience crashes as we do.

    I would love any feedback on this, as it is hard for me to troubleshoot this.
    Remember, I am not sure my "precursor"-observations are even relevant. It just seems odd.

    Cheers! :)

