PfSense Crash, cannot find root cause. Help!!
-
Hey everyone, I am hoping to find the root cause of why our pfSense crashed. I have the crash dumps, and I have googled just about everything in them, but none of it seems to fit. I have looked into mbuf exhaustion; however, with 32 GB of RAM, I find that extremely hard to believe. All hardware checks are coming back good, and I have tested all the NICs and their ports and all of them are good. I have hit a wall and am hoping someone can find the cause of the crash from the dump. It doesn't look like a kernel panic, as there is no trace or anything.
Full textdump.tar.0 is attached: textdump.tar.0
info.0 file:
Dump header from device: /dev/da1p2
Architecture: amd64
Architecture Version: 1
Dump Length: 156160
Blocksize: 512
Dumptime: Mon Apr 29 16:08:00 2019
Hostname: pfSense.mydomain.net
Magic: FreeBSD Text Dump
Version String: FreeBSD 11.2-RELEASE-p3 #17 e6b497fa0a3(RELENG_2_4_4): Thu Sep 20 09:04:45 EDT 2018
root@buildbot3:/crossbuild/ce-244/obj/amd64/WvDslnYb/crossbuild/ce-244/pfSense/tmp/FreeBSD-src/sys/pfS
Panic String:
Dump Parity: 2586087224
Bounds: 0
Dump Status: good
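For anyone who wants to check these header fields programmatically, here is a minimal Python sketch. The names `parse_info` and `FIELDS` are made up for illustration and are not part of pfSense or FreeBSD. It assumes the info file holds one `Key: value` pair per line and simply skips lines it does not recognize.

```python
# Field names as they appear in the info.0 header posted above.
FIELDS = {
    "Dump header from device", "Architecture", "Architecture Version",
    "Dump Length", "Blocksize", "Dumptime", "Hostname", "Magic",
    "Version String", "Panic String", "Dump Parity", "Bounds", "Dump Status",
}


def parse_info(text):
    """Parse 'Key: value' lines of an info.N file into a dict."""
    header = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip() in FIELDS:
            header[key.strip()] = value.strip()
    return header


# A few of the lines from the header above, used as sample input.
sample = """\
Dump header from device: /dev/da1p2
Dumptime: Mon Apr 29 16:08:00 2019
Hostname: pfSense.mydomain.net
Panic String:
Dump Status: good
"""

info = parse_info(sample)
if not info["Panic String"]:
    print("No panic string recorded in the dump header")
```

An empty Panic String only says that no panic message was recorded in the header; it does not by itself explain the crash.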
Thank you!
-
Kernel panics are usually caused by misbehaving hardware. I'm not a FreeBSD tech but nobody else has replied yet. I may be totally off-base here.
When your crash happens, it seems to be servicing the NIC:
curthread = 0xfffff8000b9ec620: pid 12 "irq296: igb4:que 0" current process = 12 (irq296: igb4:que 0)
Also:
<7>sonewconn: pcb 0xfffff804073d21d0: Listen queue overflow: 193 already in queue awaiting acceptance (27 occurrences)
which might be fixed by adding kern.ipc.somaxconn=4096 in System - Advanced - System Tunables.
Read this and pay attention to the section on igb(4) cards. Try what is recommended re: setting kern.ipc.nmbclusters.
https://docs.netgate.com/pfsense/en/latest/hardware/tuning-and-troubleshooting-network-cards.html
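To get a feel for how often the listen queue overflowed, the sonewconn lines in the dump can be tallied. Below is a rough Python sketch; `summarize_overflows` is a hypothetical helper, and the pattern is based only on the message format quoted above. It assumes a line without an "(N occurrences)" suffix counts as one event.

```python
import re

# Pattern modeled on the sonewconn message quoted above.
PATTERN = re.compile(
    r"sonewconn: pcb (0x[0-9a-f]+): Listen queue overflow: "
    r"(\d+) already in queue awaiting acceptance"
    r"(?: \((\d+) occurrences\))?"
)


def summarize_overflows(log_text):
    """Return {pcb: (max_queue_depth, total_occurrences)}."""
    summary = {}
    for m in PATTERN.finditer(log_text):
        pcb, depth = m.group(1), int(m.group(2))
        # Assumption: no "(N occurrences)" suffix means a single event.
        count = int(m.group(3)) if m.group(3) else 1
        old_depth, old_count = summary.get(pcb, (0, 0))
        summary[pcb] = (max(old_depth, depth), old_count + count)
    return summary


# Two copies of the line from the dump, as sample input.
sample_log = (
    "<7>sonewconn: pcb 0xfffff804073d21d0: Listen queue overflow: "
    "193 already in queue awaiting acceptance (27 occurrences)\n"
) * 2

for pcb, (depth, total) in summarize_overflows(sample_log).items():
    print(pcb, "max depth", depth, "total overflows", total)
```

If one pcb address accounts for nearly all of the overflows, that points at a single listening socket rather than a system-wide shortage.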
-
Thank you. I read that, but the system had been running smoothly for over a year, so I thought that couldn't be it and stopped reading before getting to the cards. All my WANs are on the "bce" card (4 ports, 4 WANs) and my LAN is on the "igb" card (4 ports, 1 used for LAN).
So basically it looks like there was an mbuf overflow on the NIC(s), from what you can tell? Obviously something was happening, as this is repeated 50 times in the dump:
sonewconn: pcb 0xfffff804073d21d0: Listen queue overflow: 193 already in queue awaiting acceptance (27 occurrences)
So basically I just need to increase the memory allocation size for my NICs? The reason I find it hard to believe is the backup pfSense that is currently running: right now is about peak traffic, so it is under the most load, and it shows MBUF Usage: 3% (29136/1000000).
It never really moves from that 3% (I have yet to see it above 3%).
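As a quick sanity check on that figure, the percentage can be recomputed from the raw counts. This is only a sketch; `mbuf_usage` is a made-up helper that parses the "(used/limit)" part of the dashboard text quoted above.

```python
import re


def mbuf_usage(widget_text):
    """Extract (used, limit, percent) from text like '3% (29136/1000000)'."""
    m = re.search(r"\((\d+)/(\d+)\)", widget_text)
    if m is None:
        raise ValueError("no '(used/limit)' pair found")
    used, limit = int(m.group(1)), int(m.group(2))
    return used, limit, 100.0 * used / limit


used, limit, pct = mbuf_usage("MBUF Usage: 3% (29136/1000000)")
print(f"{used} of {limit} mbufs in use ({pct:.2f}%)")
```

Keep in mind this is a snapshot from the backup unit under normal load, so it may not reflect what the crashed unit saw at the time of the crash.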
The crash happened while the system was talking to the igb NIC driver. What it was doing I can't tell you. Those sonewconn errors might have nothing to do with it, or everything. I don't know that either. I'm just trying to give you suggestions and options. What you do is up to you.
I also noticed snort in your process list. While debugging this, you might want to temporarily disable any heavy packages like snort, suricata, or pfblocker just to rule them out. For example, there was an issue several months ago where a pfB list exceeded some threshold which started causing problems for people until they bumped a system tunable.
-
@KOM said in PfSense Crash, cannot find root cause. Help!!:
I don't know that either. I'm just trying to give you suggestions and options. What you do is up to you.
I understand completely, just trying to understand
For example, there was an issue several months ago where a pfB list exceeded some threshold which started causing problems for people until they bumped a system tunable.
Do you happen to know what this is (the tunable)?
I was running Snort, pfBlockerNG, SquidProxy and SquidGuard at the time of the crash. Since the crash, all of those services have been disabled. The only thing I can think of that would cause this is the OpenVAS vulnerability scan that was running on our networks. However, we have been hit with such scans from the outside, and this isn't the first time I have run the scan; it is run about once every 3 months. So this pfSense has gone through at least 4 internal scans, and I know our servers have been hit with the same scanners because I see them in Snort.
-
@scottys said in PfSense Crash, cannot find root cause. Help!!:
Do you happen to know what this is? (the tuneable).
It was actually the firewall table size, which is controlled via System - Advanced - Firewall & NAT - Firewall Maximum Table Entries. Default is 200000 and they recommend bumping it to 400000.
-
@KOM Looking at the description, I think this could be the culprit:
"Maximum number of table entries for systems such as aliases, sshguard, snort, etc, combined"
Since I did see some activity from sshguard (the OpenVAS scanning) and tens of thousands of Snort alerts, add pfBlockerNG country blocking and SquidGuard's list blocking, and I think it could easily hit 400,000 entries.
Besides bumping it up, do you know of some kind of maintenance I can do to ensure that the table stays under 400k (if that was the culprit of the crash)?
-
No, not really. There are several Zabbix packages, but I don't know if that metric is being tracked or not with the FreeBSD OS template.
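One rough way to keep an eye on that number by hand is to add up the entries in each pf table. The Python sketch below is only an illustration: `run_pfctl` and `count_table_entries` are hypothetical helpers. It assumes `pfctl -s Tables` prints one table name per line and `pfctl -t <table> -T show` prints one address per line. It needs root, and it does not look at tables inside anchors.

```python
import subprocess


def run_pfctl(args):
    """Run pfctl with the given arguments and return its stdout as text."""
    return subprocess.check_output(["pfctl", *args], universal_newlines=True)


def count_table_entries(run=run_pfctl):
    """Return {table_name: entry_count} for every table pfctl lists."""
    totals = {}
    for table in run(["-s", "Tables"]).split():
        shown = run(["-t", table, "-T", "show"])
        # Assumption: each non-blank output line is one table entry.
        totals[table] = sum(1 for line in shown.splitlines() if line.strip())
    return totals


def total_entries(totals):
    """Sum the per-table counts for comparison against the configured limit."""
    return sum(totals.values())
```

On the firewall itself, `total_entries(count_table_entries())` could be compared against the configured limit, for example from a cron job that sends an alert when the total gets close to it.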
-
bump just in case that isn't the issue and it is something else
@KOM Thank you for your help. I am in no way disregarding what you have told me. I am currently testing with our backup to ensure stability with the new tunables. You did say:
I'm not a FreeBSD tech but nobody else has replied yet. I may be totally off-base here
I just need to ensure that you are right on target
Thank you for all your help
-
@KOM said in PfSense Crash, cannot find root cause. Help!!:
Kernel panics are usually caused by misbehaving hardware. I'm not a FreeBSD tech but nobody else has replied yet. I may be totally off-base here.
When your crash happens, it seems to be servicing the NIC:
curthread = 0xfffff8000b9ec620: pid 12 "irq296: igb4:que 0" current process = 12 (irq296: igb4:que 0)
Also:
<7>sonewconn: pcb 0xfffff804073d21d0: Listen queue overflow: 193 already in queue awaiting acceptance (27 occurrences)
which might be fixed by adding kern.ipc.somaxconn=4096 in System - Advanced - System Tunables.
Read this and pay attention to the section on igb(4) cards. Try what is recommended re: setting kern.ipc.nmbclusters.
https://docs.netgate.com/pfsense/en/latest/hardware/tuning-and-troubleshooting-network-cards.html
Nothing really to add but I find it ironic that you say "I'm not a FreeBSD tech..." and then go on to troubleshoot the crash dump, suggest what appears to be a kernel change in the System Tunables, and give references. Then start talking about adjusting the Firewall State sizes. I kinda think that makes you "...a FreeBSD tech...", at least more than you think you are. :)
-
I try to help out where I can. Even though I've been here five years or so, I still remember the feeling of being new and posing a question into the void and getting no response. If I think I can even point them in the right direction, I'll reply. You might notice that this forum has very few unanswered posts. Not all issues can be resolved via the community forums, but I think we have a pretty high success rate and that helps the project's reputation & success.