Kernel crash - nmbufs?



  • We're periodically (reasonably regularly) seeing kernel panics on a pfSense 2.2.3 setup running as a transparent bridge, with bge Broadcom drivers (Dell PowerEdge server).

    Can post the full crash message, but it ends with

    ….
    <118>Bootup complete
    [zone: mbuf] kern.ipc.nmbufs limit reached
    [zone: mbuf] kern.ipc.nmbufs limit reached
    [zone: mbuf] kern.ipc.nmbufs limit reached
    [zone: mbuf] kern.ipc.nmbufs limit reached

    Fatal trap 12: page fault while in kernel mode
    cpuid = 4; apic id = 04
    fault virtual address              = 0x1d
    fault code                              = supervisor read data, page not present
    instruction pointer = 0x20:0xffffffff80b90647
    stack pointer                  = 0x28:0xfffffe001e1b56f0
    frame pointer                = 0x28:0xfffffe001e1b5770
    code segment                        = base 0x0, limit 0xfffff, type 0x1b
                                                    = DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags  = interrupt enabled, resume, IOPL = 0
    current process                    = 12 (irq16: bge0 bge2+)
    FreeBSD 10.1-RELEASE-p13 #0 c77d1b2(releng/10.1)-dirty: Tue Jun 23 17:00:47 CDT 2015
        root@pfs22-amd64-builder:/usr/obj.amd64/usr/pfSensesrc/src/sys/pfSense_SMP.10

    The NICs stop passing traffic while it recovers  (which it almost always does).

    We've made the config changes per https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Broadcom_bge.284.29_Cards
    but it still seems to be occurring.

    From what I read there, it looks like it's bge0 and bge2 that are failing - yet bge2 isn't even wired up / configured, and bge0 isn't a member of the bridge, so it handles very little traffic.

    Any further thoughts beyond the ones from the Tuning article? Even if it's "replace the NIC with Intel model XYZ", we're open to suggestions.
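Before swapping hardware, it may be worth confirming whether the mbuf zone is actually being exhausted. A minimal sketch using stock FreeBSD tools (pfSense 2.2 is FreeBSD 10.1 underneath); run these from the pfSense shell:

```shell
# Show current mbuf usage and, crucially, the denial counters --
# non-zero "denied" numbers mean allocations are actually failing.
netstat -m | grep -E 'mbufs in use|denied'

# Show the configured zone limits (stock FreeBSD sysctls; pfSense
# exposes them under System > Advanced > System Tunables).
sysctl kern.ipc.nmbufs kern.ipc.nmbclusters
```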



  • What have you changed the mbuf sizes to?



  • It's set to 1,000,000 now, and we're still experiencing the issue. (The unit has 16GB RAM in it, so it should be able to handle that.)

    The dashboard panel and the RRD graphs for MBUF usage show it sitting idle at 1% usage - so unless it's an instantaneous spike, it doesn't look like we're actually reaching that cap, and it's a red herring to some degree.

    Can anybody clarify what the bge2+ section means? We're not actually using interface bge2 - instead bge0, bge4, and bge5… so seeing 2+ seems odd?
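Since the RRD graphs average over whole minutes, a sub-second spike could easily hide behind that 1% figure. A hypothetical watcher script (assuming FreeBSD's `netstat -m` output format; the log path is just an example) that samples once a second would catch it:

```shell
#!/bin/sh
# Log the "mbufs in use" counters from netstat -m once per second with
# a timestamp, so a transient spike shows up even when the RRD
# average stays near 1%. Stop with Ctrl-C.
while :; do
    usage=$(netstat -m | awk '/mbufs in use/ { print $1 }')
    printf '%s %s\n' "$(date '+%H:%M:%S')" "$usage"
    sleep 1
done >> /var/log/mbuf-watch.log   # example path, pick your own
```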



  • Crash log attached…

    [pfsense crash.txt](/public/imported_attachments/1/pfsense crash.txt)



  • on a pfSense 2.2.3 setup as a transparent bridge,

    Can you briefly explain what is in front of the pfSense and what is behind it?
    As an example:
    Internet --- ISP --- modem --- Cisco Router --- pfSense --- LAN Switch --- LAN



  • Internet --- ISP link (colo'd kit) --- pfSense as bridge --- LAN switch --- LAN

    There are two interfaces making up the bridge, and an extra interface on a management network.



  • pfSense as bridge

    Is bridging the ports together a hard requirement for you, or would you also try routing,
    so you can get closer to ruling out the bridge as the source of the problem?



  • Can you replace the hardware or the physical NICs?

    If the kernel is panicking, something really bad is happening. My quick guess is failing hardware, and I'd recommend testing on new or replacement hardware.



  • Bridge setup is a definite requirement. We've got very similar hardware doing NAT / routing as well, and that's toddling along quite happily by itself.

    Can replace the NICs without a problem - do any users have strong recommendations? This is production grade, requiring 1Gb RJ45 connectivity…
    Looking through the tuning stuff, it seems like a lot of Broadcom and Intel cards may have similar problems with nmbufs.

    Looks like it might be bge0 or bge2+ that's failing (though I still don't get the 2+ bit). There's a PCI card in there as well as the onboard (i.e. daughter card), so trying to ID which one is causing the issue could be fun!



  • Looking through the tuning stuff,

    It's not a must, more of a can-be-done. Each CPU core opens one queue per LAN port!
    So an 8-core CPU opens 8 queues for a single LAN port, and that can get really tricky
    if there isn't enough buffer space, so raising the mbuf limit is a real gain for many of us.

    seems like a lot of Broadcom

    This is all driver-related. The better the driver support, the better your pfSense
    will work with the LAN ports, for sure. At the moment you'll run really well with
    Intel cards! An Intel dual- or quad-port server adapter - i210, i350, or i354 - would
    be the best of the older and newer ones.

    and Intel cards may have similar probs with nmbufs.

    Once again, this is a problem with the FreeBSD kernel address space, which has grown
    historically up to today. To free up much of that kernel space, we now all have the
    chance to bump up the mbuf size, which can be done easily by adding some RAM to the
    pfSense box, along with the other tuning steps named on the page you linked above.



  • What is kern.ipc.nmbufs set to on your system? Run:

    sysctl kern.ipc.nmbufs
    

    to see.



  • kern.ipc.nmbufs: 1,019,445
    (for a little while, pre-reboot, it was set to >1mill in the tunables.)

    We haven't actually had it panic in > 30 hrs now, which is the longest it's gone without any interruption in about 2 weeks…



  • @jasperdillon:

    kern.ipc.nmbufs: 1,019,445
    (for a little while, pre-reboot, it was set to >1mill in the tunables.)

    We haven't actually had it panic in > 30 hrs now, which is the longest it's gone without any interruption in about 2 weeks…

    Perhaps you should tell us some hardware specs for the pfSense box itself - CPU,
    cores, and SSD/HDD - to help bring more stability to the entire pfSense box.



  • @jasperdillon:

    kern.ipc.nmbufs: 1,019,445
    (for a little while, pre-reboot, it was set to >1mill in the tunables.)

    Ok, that's fine - maybe those logs were from before that change was applied. Just wanted to make sure, since nmbclusters is usually what gets set, that it didn't somehow get set differently.



  • Just to put some closure on this - it looks like the problem has just 'gone away'.
    Changing it to 1 million (but not over) certainly helped, but didn't resolve it completely.

    Nothing has changed in the pfSense config since, but it's just not occurring anymore…



  • Probably well worthwhile to update to 2.2.5.

    In your case there may be a small "risk" in that you don't really know what "fixed" your issue, but the stability of 2.2.5 over older releases is worth it in my mind.

