pfSense becomes unresponsive occasionally (Alix 2d13, pfSense 2.2.2)
I’m having problems with occasional crashes of my pfSense box (nano install on an Alix 2d13, pfSense 2.2.2). This has happened with previous versions of pfSense although not as frequently as in the last months. Currently I have such events once or twice a week.
When this happens, the pfSense box becomes completely unresponsive: I cannot reach the web GUI, log in via SSH, or even ping the box. It does not restart on its own, so I have to power-cycle it to force a restart. I find no hint in the system log after the restart, because the log starts with the boot process.
I have never been able to trace these events to user behavior (specific web pages being browsed to, usage of specific other internet services, times of day, etc.).
I have only one package installed ("FTP Client Proxy"). I use several VLANs on the LAN and WAN sides, the IGMP proxy, no VPN, a simple traffic shaper setup, some NAT rules, captive portal, and of course some firewall rules. I have not been able to trace the events down to specific configuration elements or configuration changes.
I cannot trace the problem down to a specific pfSense version. I did a clean install when switching to version 2.2.2, and I had the problem before that, so it should not be caused by leftover configuration junk.
The system shows a memory usage of typically around or below 50%. I cannot find any suspicious system log entries.
I don’t even know where I should begin to search for the cause of the problem.
Does anyone have a hint on how to tackle this?
Did anybody out there have similar problems?
Can this be due to hardware problems and how can I possibly check this?
How can I get diagnostic data for such events?
Are any configuration options known to have caused such behavior?
To rule out hardware, first prepare a new CF card with a fresh, basic 2.2.2 install.
On the box as it currently runs, work by exclusion: disable one feature at a time, starting with traffic shaping.
Are you able to connect on the serial console?
What sort of traffic shaping are you using?
It happened again. This time I have some more information. First I extracted some log messages from a syslog server:
29.06.15 16:00:47 172.27.2.1 Unknown Critical [zone: mbuf] kern.ipc.nmbufs limit reached
29.06.15 16:01:58 172.27.2.1 Unknown Critical vr1_vlan8: unable to prepend VLAN header
This is probably incomplete (my syslog server is somehow broken …)
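Since the remote syslog entries are the only record that survives a hang, it can help to filter them automatically. The following is a minimal sketch that parses lines in the layout shown above. It assumes the fields are date (day.month.year), time, host, facility, severity, and message, which is the export format of this particular syslog server and may differ elsewhere. The function names are made up for illustration.

```python
import re
from datetime import datetime

# Assumed field order: date, time, host, facility, severity, message.
LINE_RE = re.compile(
    r"^(?P<date>\d{2}\.\d{2}\.\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<facility>\S+) (?P<severity>\S+) (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict for one exported syslog line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp = datetime.strptime(
        match.group("date") + " " + match.group("time"), "%d.%m.%y %H:%M:%S"
    )
    return {
        "timestamp": stamp,
        "host": match.group("host"),
        "facility": match.group("facility"),
        "severity": match.group("severity"),
        "message": match.group("message"),
    }


def critical_events(lines, host):
    """Collect the parsed entries with severity 'Critical' sent by the given host."""
    events = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry["host"] == host and entry["severity"] == "Critical":
            events.append(entry)
    return events
```

Calling `critical_events(open(path), "172.27.2.1")` on an exported log file would then list the last critical messages before each hang, where `path` is wherever the syslog server stores its export.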
I checked the RRD graphs and found that the mbufs are probably not the problem; I guess something else consumed the memory. I have attached some graphs. I increased the maximum number of mbufs, which is visible in the graph. The mbuf usage up to midnight before the crash looks normal.
Apparently my system breaks every 8 or 9 days. I wasn't aware how regularly these breakdowns actually occurred. The most interesting graph seems to be the memory graph, see below. I had the RRD backup set to daily, which is why there are periodic gaps in the graph. The "wired" memory had been increasing continuously before the crash.
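If wired memory really grows at a steady rate, a straight-line fit over a few readings gives a rough estimate of when it will hit a ceiling, which can be compared with the observed 8 to 9 day cycle. The sketch below uses an ordinary least-squares fit. The sample values and the limit in the usage note are hypothetical and would have to be replaced with readings taken from the memory graph.

```python
def fit_line(samples):
    """Least-squares fit y = slope * x + intercept over (x, y) pairs."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("all samples share the same x value")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_limit(samples, limit_mb):
    """Days after the last sample until the fitted line reaches limit_mb.

    samples is a list of (day, wired_mb) pairs. Returns None when the
    fitted trend is flat or falling, since no limit is ever reached then.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    last_day = samples[-1][0]
    return (limit_mb - intercept) / slope - last_day
```

For example, `days_until_limit([(0, 40), (1, 50), (2, 60), (3, 70)], 120)` fits a growth of 10 MB per day and reports that the made-up 120 MB limit is five days after the last reading. A result close to the real crash interval would support the suspicion of a steady leak.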
Has this been observed by anyone else before? What can be the cause for this / how can I analyze this?
As for the question regarding the kind of traffic shaping: this is a simple setup based on PRIQ for prioritization of VoIP traffic (3 queues only). I have NOT YET started the "problem solving exclusion route", simply due to lack of time. I also experienced this problem before I added traffic shaping, so I don't expect this to help much.
Running out of mbufs will definitely cause you to not be able to access the box. Increase those before you try anything else:
Do you have a WLAN card by chance? ath? And the shaper active on it??
I had the very same problem a couple of years ago, very hard to debug, but the problem was somehow related to the traffic shaper being active on the ath0 interface. You can check out that post here
I never found a proper solution, but at least identified the cause
I have a traffic shaping in place but not on a WLAN. My Alix has only the built in LAN ports.
I have now increased the maximum limit of mbufs. In the RRD graphs I have observed an absolutely stable and very low number of allocated mbufs at all times. I'm not expert enough to understand what influences mbuf usage, so it's difficult for me to trace an increase down to specific behavior of hosts in the network.
Because the RRD graphs stop displaying data at midnight before a crash of my pfSense box, I have not yet been able to observe the number of used mbufs shortly before a crash. I have now changed the RRD backup cycle to one hour. Maybe next time I can actually see an increase of used mbufs in the RRD graphs (unless it occurs only within minutes before a crash).
Are there known typical scenarios which cause used mbufs to increase dramatically?
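An alternative to waiting for the RRD backup would be to sample the mbuf counters more often and store them off the box. The following sketch parses the first two lines of `netstat -m` output as printed by FreeBSD. It assumes that output format, and it assumes the text is collected somewhere Python is available (for example over SSH from another host), since Python is not part of a stock pfSense install. The sample text in the usage note is illustrative, not taken from this box.

```python
import re

# Line formats as printed by FreeBSD's `netstat -m` (assumed here).
MBUF_RE = re.compile(
    r"^\s*(\d+)/(\d+)/(\d+) mbufs in use \(current/cache/total\)", re.M
)
CLUSTER_RE = re.compile(
    r"^\s*(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use "
    r"\(current/cache/total/max\)",
    re.M,
)


def parse_netstat_m(text):
    """Extract mbuf and mbuf cluster counters from `netstat -m` output."""
    stats = {}
    match = MBUF_RE.search(text)
    if match:
        current, cache, total = (int(v) for v in match.groups())
        stats["mbufs"] = {"current": current, "cache": cache, "total": total}
    match = CLUSTER_RE.search(text)
    if match:
        current, cache, total, maximum = (int(v) for v in match.groups())
        stats["clusters"] = {
            "current": current, "cache": cache, "total": total, "max": maximum,
        }
    return stats


def cluster_usage(stats):
    """Fraction of the cluster limit allocated (total / max), or None if unknown."""
    clusters = stats.get("clusters")
    if not clusters or clusters["max"] == 0:
        return None
    return clusters["total"] / clusters["max"]
```

Fed with text such as `"504/771/1275 mbufs in use (current/cache/total)\n500/300/800/3200 mbuf clusters in use (current/cache/total/max)"`, `cluster_usage(parse_netstat_m(text))` returns `0.25`. Logging that value every minute or so would show whether usage climbs only shortly before a hang.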
I rather suspect that something else is eating up memory which has been reserved for mbufs. In other words something else / another process has higher priority when requesting memory than the network processes / the mbufs reservation. Is this possible at all in FreeBSD?