Kernel crash - nmbufs?



  • We're periodically (reasonably regularly) seeing kernel panics on a pfSense 2.2.3 setup running as a transparent bridge, with bge Broadcom drivers (Dell PowerEdge server).

    Can post the full crash message, but it ends with

    ….
    <118>Bootup complete
    [zone: mbuf] kern.ipc.nmbufs limit reached
    [zone: mbuf] kern.ipc.nmbufs limit reached
    [zone: mbuf] kern.ipc.nmbufs limit reached
    [zone: mbuf] kern.ipc.nmbufs limit reached

    Fatal trap 12: page fault while in kernel mode
    cpuid = 4; apic id = 04
    fault virtual address              = 0x1d
    fault code                              = supervisor read data, page not present
    instruction pointer = 0x20:0xffffffff80b90647
    stack pointer                  = 0x28:0xfffffe001e1b56f0
    frame pointer                = 0x28:0xfffffe001e1b5770
    code segment                        = base 0x0, limit 0xfffff, type 0x1b
                                                    = DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags  = interrupt enabled, resume, IOPL = 0
    current process                    = 12 (irq16: bge0 bge2+)
    FreeBSD 10.1-RELEASE-p13 #0 c77d1b2(releng/10.1)-dirty: Tue Jun 23 17:00:47 CDT 2015
        root@pfs22-amd64-builder:/usr/obj.amd64/usr/pfSensesrc/src/sys/pfSense_SMP.10

    The NICs stop passing traffic while it recovers  (which it almost always does).

    We've made the config changes per https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Broadcom_bge.284.29_Cards
    but it still seems to be occurring.

    From what I read there, it looks like it's bge0 and bge2 that are failing - yet bge2 isn't even wired up / configured, and bge0 isn't a member of the bridge, so it handles very little traffic.

    Any further thoughts beyond the ones from the Tuning article? Even if it's "replace the NIC with Intel model XYZ", we're open to suggestions.
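Before swapping hardware, it may be worth confirming whether the mbuf zone is actually being exhausted. A minimal sketch using stock FreeBSD tools (pfSense 2.2 is FreeBSD 10.1 underneath); run these from the pfSense shell:

```shell
# Show current mbuf usage and, crucially, the denial counters --
# non-zero "denied" numbers mean allocations are actually failing.
netstat -m | grep -E 'mbufs in use|denied'

# Show the configured zone limits (stock FreeBSD sysctls; pfSense
# exposes them under System > Advanced > System Tunables).
sysctl kern.ipc.nmbufs kern.ipc.nmbclusters
```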



  • What have you changed the mbuf sizes to?



  • It's set to 1,000,000 now, and we're still experiencing the issue. (The unit has 16GB RAM in it, so it should be able to handle that.)

    The dashboard panel and the RRD graphs for MBUF usage show it sitting idle at 1% usage - so unless it's an instantaneous spike, it doesn't look like we're actually reaching that cap, and it's a red herring to some degree.

    Can anybody clarify what the bge2+ section means? We're not actually using interface bge2 - instead bge0, bge4, and bge5… so seeing 2+ seems odd?
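Since the RRD graphs average over whole minutes, a sub-second spike could easily hide behind that 1% figure. A hypothetical watcher script (assuming FreeBSD's `netstat -m` output format; the log path is just an example) that samples once a second would catch it:

```shell
#!/bin/sh
# Log the "mbufs in use" counters from netstat -m once per second with
# a timestamp, so a transient spike shows up even when the RRD
# average stays near 1%. Stop with Ctrl-C.
while :; do
    usage=$(netstat -m | awk '/mbufs in use/ { print $1 }')
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$usage"
    sleep 1
done >> /var/log/mbuf-watch.log   # example path, pick your own
```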



  • Crash log attached…

    [pfsense crash.txt](/public/imported_attachments/1/pfsense crash.txt)



  • on a pfSense 2.2.3 setup as a transparent bridge,

    Can you briefly explain what is in front of the pfSense and what is behind it?
    As an example:
    Internet --- ISP --- modem --- Cisco Router --- pfSense --- LAN Switch --- LAN



  • Internet --- ISP link (colo'd kit) --- pfSense as bridge --- LAN switch --- LAN

    There are two interfaces making up the bridge, and an extra interface on a management network.



  • pfSense as bridge

    Is bridging the ports together a hard requirement for you, or would you also try routing,
    so you can get closer to ruling out the bridge as the source of the problem?



  • Can you replace the hardware or the physical NICs?

    If the kernel is panicking, something really bad is happening. My quick guess is failing hardware, and I'd recommend testing on new or replacement hardware.



  • Bridge setup is a definite requirement. We've got very similar hardware doing NAT / routing as well, and that's toddling along quite happily by itself.

    Can replace the NICs without a problem - do any users have strong recommendations? This is production grade, requiring 1Gb RJ45 connectivity…
    Looking through the tuning stuff, it seems like a lot of Broadcom and Intel cards may have similar problems with nmbufs.

    Looks like it might be bge0 or bge2+ that's failing (though I still don't get the 2+ bit). There's a PCI card in there as well as the onboard (i.e. daughter card), so trying to ID which one is causing the issue could be fun!



  • Looking through the tuning stuff,

    It's not a must, more of a can-be-done. Each CPU core opens one queue per LAN port!
    So an 8-core CPU opens 8 queues for a single LAN port, and that can get really tricky
    if there isn't enough buffer space, so raising the mbuf limit is a real gain for many of us.

    seems like a lot of Broadcom

    This is all driver-related. The better the driver support, the better your pfSense
    will work with the LAN ports, for sure. At the moment you'll run really well with
    Intel cards! An Intel dual- or quad-port server adapter - i210, i350, or i354 - would
    be the best of the older and newer ones.

    and Intel cards may have similar probs with nmbufs.

    Once again, this is a problem with the FreeBSD kernel address space, which has grown
    historically up to today. To free up much of that kernel space, we now all have the
    chance to bump up the mbuf size, which can be done easily by adding some RAM to the
    pfSense box, along with the other tuning steps named on the page you linked above.



  • What is kern.ipc.nmbufs set to on your system? Run:

    sysctl kern.ipc.nmbufs
    

    to see.



  • kern.ipc.nmbufs: 1,019,445
    (for a little while, pre-reboot, it was set to >1mill in the tunables.)

    We haven't actually had it panic in > 30 hrs now, which is the longest it's gone without any interruption in about 2 weeks…



  • @jasperdillon:

    kern.ipc.nmbufs: 1,019,445
    (for a little while, pre-reboot, it was set to >1mill in the tunables.)

    We haven't actually had it panic in > 30 hrs now, which is the longest it's gone without any interruption in about 2 weeks…

    Perhaps you should tell us some hardware specs for the pfSense box itself - CPU,
    cores, and SSD/HDD - to help bring more stability to the entire pfSense box.



  • @jasperdillon:

    kern.ipc.nmbufs: 1,019,445
    (for a little while, pre-reboot, it was set to >1mill in the tunables.)

    Ok, that's fine - maybe those logs were from before that change was applied. Just wanted to make sure, since nmbclusters is usually what gets set, that it didn't somehow get set differently.



  • Just to put some closure on this - it looks like the problem has just 'gone away'.
    Changing it to 1 million (but not over) certainly helped, but didn't resolve it completely.

    Nothing has changed in the pfSense config since, but it's just not occurring anymore…



  • Probably well worthwhile to update to 2.2.5.

    In your case there may be a small "risk" in that you don't really know what "fixed" your issue, but the stability of 2.2.5 over older releases is worth it in my mind.

