Fatal error - regularly

JDK

This is a duplicate of post 115893, reposted to change the post's ownership.

First of all, thank you for all your effort; that a corporation like mine can rely on your products for two separate solutions is pretty awesome!

Now, to

My setup:

I have a couple of pfSense boxes running on two Dell blades (PowerEdge R210 II, with iDRACs). Each has a virtual bridged interface between WAN and LAN and functions as a bridging firewall. They are redundantly configured via STP, so that the connection to the secondary firewall is cut whenever the primary firewall is responding with BPDU packets.

Hardware:

igb0-3 (the bridged interfaces):
Intel(R) PRO/1000 Network Connection version - 2.4.0
Using MSIX interrupts with 5 vectors

Intel(R) Xeon(R) CPU E3-1220 V2 @ 3.10GHz
Current: 3100 MHz, Max: 3101 MHz
4 CPUs: 1 package(s) x 4 core(s)

And my build:
2.2.6-RELEASE (amd64)
built on Mon Dec 21 14:50:08 CST 2015
FreeBSD 10.1-RELEASE-p25

The problem:

Every other or third day, the primary firewall crashes, failing over to the secondary. I have attached a text-file with a dump.
I take note of the following message, even though I am not 100% sure of how I should interpret it:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0x1d
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80b904b7
stack pointer = 0x28:0xfffffe001a3d06c0
frame pointer = 0x28:0xfffffe001a3d0740
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq276: igb2:que 2)
FreeBSD 10.1-RELEASE-p25 #0 c39b63e(releng/10.1)-dirty: Mon Dec 21 15:20:13 CST 2015
root@pfs22-amd64-builder:/usr/obj.RELENG_2_2.amd64/usr/pfSensesrc/src.RELENG_2_2/sys/pfSense_SMP.10

Observations:

I have monitored traffic on the inside (LAN) interfaces of the firewalls, and you can see two attached images of our primary and secondary firewalls.
On the graphs, "outbound" means outbound from the firewall via the LAN-interface, i.e. from WAN to LAN.

First, I have attached an image of what I believe to be a precursor:

Normally, I expect equal amounts of traffic on both firewalls, as they function as bridges and simply pass on all packets (firewalled, of course). Packets are blocked by STP on a later switch on the WAN-side. On the "precursor-graphs", we see a sudden spike in traffic on only the primary firewall, after which traffic flows unevenly. The spike is around 200 Mbit, which is also observed in other "precursors".

Next, I have attached an image of the actual crash:

About an hour or two later, everything looks fine, except that the primary firewall just "disappears" on the graphs all of a sudden. This is because of the kernel crash.

Now I do not know if the spikes and the crashes are even related - they may not be. I just found it odd. Especially since this abnormality has been observed more than once. See the file "another-crash".

Diagnosis?:

Since the crash report says "current process = 12 (irq276: igb2:que 2)", I have given it some thought that it may be because our TCP queue length is insufficient on the WAN-interface (igb2), and that a queue too large triggers a crash. The queue is set to a default of 1000, which can be turned up in case of heavy load. This guy (https://forum.pfsense.org/index.php?topic=68919.0) has done something similar, although he doesn't experience crashes as we do.

I would love any feedback on this, as it is hard for me to troubleshoot this.
Remember, I am not sure my "precursor"-observations are even relevant. It just seems odd.

Cheers! :)

firewall-precursor.PNG_thumb

firewall-crash.PNG_thumb

another-crash.PNG_thumb
fw-1-panick.txt

heper

Putting up the entire crash report on pastebin (or similar) might provide more clues for some of the veteran members or developers.

JDK

Please see below pastebin:

http://pastebin.com/wsiTU46i

Thank you :)

heper

Try increasing your mbufs


[zone: mbuf] kern.ipc.nmbufs limit reached
[zone: mbuf] kern.ipc.nmbufs limit reached
[zone: mbuf] kern.ipc.nmbufs limit reached
[zone: mbuf] kern.ipc.nmbufs limit reached

https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#mbuf_.2F_nmbclusters
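To see which limit is actually being hit, and how often, the "limit reached" lines can be tallied from a saved copy of the crash report or dmesg output. This is only a minimal sketch in Python: the function name `count_limit_hits` is made up for illustration, and the pattern assumes the messages look exactly like the four lines quoted above. Note that the message names kern.ipc.nmbufs, which is a separate limit from kern.ipc.nmbclusters.

```python
import re
from collections import Counter

# Matches kernel messages of the form quoted in this thread, e.g.:
#   [zone: mbuf] kern.ipc.nmbufs limit reached
LIMIT_RE = re.compile(r"\[zone: (?P<zone>[^\]]+)\] (?P<sysctl>\S+) limit reached")


def count_limit_hits(log_text):
    """Count 'limit reached' messages, keyed by (zone, sysctl name).

    count_limit_hits is a hypothetical helper, not part of pfSense or FreeBSD.
    """
    hits = Counter()
    for line in log_text.splitlines():
        match = LIMIT_RE.search(line)
        if match:
            hits[(match.group("zone"), match.group("sysctl"))] += 1
    return hits


def format_report(hits):
    """Render the tally as one line per limit, most frequent first."""
    return "\n".join(
        f"{count:6d}  zone={zone}  limit={sysctl}"
        for (zone, sysctl), count in hits.most_common()
    )
```

Feeding it the text of the crash report (for example the contents of the pastebin) and printing `format_report(count_limit_hits(text))` shows at a glance whether the nmbufs limit or some other zone limit is the one being exhausted.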

JDK

Thank you very much for your reply :)

I tried doing the following changes to system tunables:

kern.ipc.nmbclusters="131072"

This is actually down from the 1,000,000 we had it at before, because BlueKobold (https://forum.pfsense.org/index.php?topic=107217.0) suggests that very large mbuf settings can cause stability issues.

Furthermore, I made changes to the bootloader, because we run the firewall on Dell blades, and Dell machines with Broadcom bce(x) interfaces have had stability problems related to mbuf size, TSO and MSI-X (https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#mbuf_.2F_nmbclusters):

kern.ipc.nmbclusters="131072"
hw.bce.tso_enable=0
hw.pci.enable_msix=0

Will keep you posted :)

JDK

This did not help.

I still get the same report: mbuf limit reached, even though I am nowhere near mbuf exhaustion (we're talking 7%) and I have ~3 gigs of RAM still available.

See attachments :)

![26-08-16 - mbuf.png](/public/imported_attachments/1/26-08-16 - mbuf.png)
![26-08-16 - memory.png](/public/imported_attachments/1/26-08-16 - memory.png)

divsys

Sorry, I'm not good enough to properly diagnose the dump log, but I have seen references to tunables affecting the igb interfaces you're using.

Perhaps some of the notes in https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards would be helpful?

Harvy66

Just taking a guess, but a seg fault complaining about insufficient memory when plenty is available sounds like a hardware error, or possibly a driver bug. If you have ECC memory, is there a way you can check for memory errors?

jimp

Add a tunable for kern.ipc.nmbufs=1000000 and see if that helps.

Also post the output of "netstat -m" just after a reboot and then after running a day or so.
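To make the two "netstat -m" snapshots easier to compare, something like the following rough Python sketch could be used. It assumes the usual FreeBSD summary lines ("... mbufs in use (current/cache/total)" and "... mbuf clusters in use (current/cache/total/max)"); the helper names `parse_netstat_m` and `growth` are invented for this example, and other lines of the output are simply ignored.

```python
import re

# Summary lines as printed by "netstat -m" on FreeBSD, for example:
#   1026/3474/4500 mbufs in use (current/cache/total)
#   1024/2268/3292/131072 mbuf clusters in use (current/cache/total/max)
# (the numbers above are illustrative, not taken from the firewall in question)
MBUFS_RE = re.compile(r"(\d+)/(\d+)/(\d+) mbufs in use \(current/cache/total\)")
CLUSTERS_RE = re.compile(
    r"(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use \(current/cache/total/max\)"
)


def parse_netstat_m(text):
    """Extract the mbuf and mbuf-cluster counters from "netstat -m" output."""
    result = {}
    match = MBUFS_RE.search(text)
    if match:
        current, cache, total = map(int, match.groups())
        result["mbufs"] = {"current": current, "cache": cache, "total": total}
    match = CLUSTERS_RE.search(text)
    if match:
        current, cache, total, maximum = map(int, match.groups())
        result["clusters"] = {
            "current": current,
            "cache": cache,
            "total": total,
            "max": maximum,
        }
    return result


def growth(before, after):
    """Change in 'total' for every counter present in both snapshots."""
    return {
        name: after[name]["total"] - before[name]["total"]
        for name in before
        if name in after
    }
```

Saving the output right after a reboot and again a day later, then calling `growth(parse_netstat_m(first), parse_netstat_m(second))`, shows whether plain mbufs keep climbing while the clusters stay flat.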