Multiple pfSense firewalls deployed in a DC environment crash on the same day
-
We run dual pfSense firewalls at our data centres, configured for high availability with CARP automatic failover. Over the past few years we have had no stability issues with pfSense; primary-to-secondary failover works perfectly but is rarely needed.
Roughly 60 days ago (19/12/2018, between 04:34 and 04:54 GMT) we updated to patch 2.4.4-p1 and went through the usual process of disabling HA sync before the upgrade, then re-enabling it once the upgrade was complete on both the primary and secondary.
Yesterday (18/02/2019) the primary pfSense at DC1 crashed at 04:23 GMT; the secondary took over until the primary was back at 04:26 GMT (3 minutes later). Later that day the primary pfSense at DC2 crashed at 16:23 GMT; again the secondary took over until the primary was back at 16:28 GMT (5 minutes later).
There have been no recent changes, and no scheduled tasks were running at the time of either incident, but it initially appears that whatever triggered this is somehow time related.
Crash reports are attached; we've replaced sensitive information such as IP addresses and device names for security reasons.
0_1550574399630_DC1 Submitted Crash Report.txt
0_1550574428903_DC2 Submitted Crash Report.txt
Can anyone please help shed some light on the cause of these crashes?
-
The way Snort is crashing on both with signals 11, 10, and 4, plus the different backtraces, leads me to suspect a physical problem: hardware, power, cooling, or other environmental issues.
-
The firewalls have dual power supplies on separate feeds, and the iDRAC for these firewalls shows no issues with power, voltage, or temperature (cooling). We also monitor via SNMP, along with an entire rack of servers in the same location, and everything looks normal at both data centres.
When the first instance occurred I thought this could be hardware related, but when the second instance occurred precisely 12 hours later at a different data centre, with a different provider, miles away, after a long period of stability, that reduced the likelihood. We rarely have hardware failures; on average there may be one every two years.
We had issues with Snort occasionally stopping in the early days of using pfSense, more than 3 years ago, across all types of hardware. We installed the Service Watchdog package as a workaround, which resolved this by restarting Snort automatically if it stops, but we never got to the bottom of why it happens.
The only additional packages we use with pfsense are:
- pfBlockerNG
- Service Watchdog
- snort
Do you think this could be an issue with Snort and/or possibly a bottleneck on the hardware (NIC, hard disks)? The processor is more than capable and there's plenty of RAM, but I'm not sure how I/O intensive pfSense is on storage, or whether enterprise SSDs would be better suited for the job than 15k SAS.
It's strange that this occurred precisely 12 hours later in another location. Is it possible this could have been caused by carefully crafted packets that, when processed by pfSense and/or Snort, resulted in this kind of crash?
Any other thoughts?
-
It's possible, but unlikely, to be software.
Repeated crashes with multiple signals are almost always an indicator of a hardware issue.
-
This is very odd.
The secondary firewalls are identical to the primaries in hardware terms and have remained stable, with an uptime of > 62 days.
The primary firewalls had the same uptimes (> 61 days) and each crashed once, like clockwork, exactly 12 hours apart.
All had previously been stable prior to the 2.4.4-p1 upgrade for around 3 years.
This kind of behaviour seems very strange: if it were a hardware problem I would expect the crashes to repeat across all firewalls, not just the primaries, not just once, and not at a precise 12-hour interval.
We will continue to monitor.
-
No further problems thus far; it doesn't seem like a hardware issue.
The primary pfSense at both DCs have remained stable since the incident, and neither of the secondaries was affected, even though all use identical hardware.
-
Do NOT use Service Watchdog with Snort or with Suricata! Service Watchdog is not designed to work with applications like Snort and Suricata that run multiple copies of themselves (one per configured interface). Service Watchdog also does not understand that Snort and Suricata restart themselves after rules updates: it sees a process missing and immediately calls its shell script to restart it, while Snort or Suricata is already in the middle of restarting itself. This leads to all kinds of issues.
I am assuming you are using Service Watchdog with Snort, given the packages you list as installed.
-
We had been using Service Watchdog with Snort but have now removed it as per your recommendation.
We originally had issues with Snort crashing occasionally; it would never restart on its own but was otherwise working as expected. Service Watchdog was used as a workaround, but that was more than 3 years ago, and it's possible this is no longer an issue given the vast number of changes/updates since then.
We will continue to monitor.
-
@shaunjstokes said in Multiple pfSense firewalls deployed in a DC environment crash on the same day:
We had been using Service Watchdog with Snort but have now removed it as per your recommendation.
We originally had issues with Snort crashing occasionally; it would never restart on its own but was otherwise working as expected. Service Watchdog was used as a workaround, but that was more than 3 years ago, and it's possible this is no longer an issue given the vast number of changes/updates since then.
We will continue to monitor.
Okay. I misinterpreted your most recent post in the thread to mean there might still be problems (i.e. you had ruled out hardware but there could still be a software issue).
I looked into making the necessary changes to Service Watchdog so it could work with Snort and Suricata, but the required changes were quite extensive and the compatibility would likely help only a very small number of users, so I dropped the initiative.
-
There have been no hardware issues that we can find, so I believe software is the most likely cause. Given what you've said about Service Watchdog, it's possible it was in some way related to the crashes in this incident: if Service Watchdog tried to start Snort while Snort was updating, that could have caused a crash, which may explain some of what we see in the dumps. As updates happen at set intervals, it could also explain why the crashes were precisely 12 hours apart.
It's possible Service Watchdog is no longer needed if Snort is now stable, but we can't be sure, so we will just have to monitor. If Snort does still occasionally crash, what might be useful is an option within Service Watchdog for a 10 or even 20 minute delay, so an application has to be continuously down for that long before Service Watchdog initiates the start.
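The delay idea above amounts to "only restart after the service has been continuously down for a grace period". A minimal sketch of that logic, assuming a watchdog that polls periodically (the function name, variables, and timings here are illustrative, not Service Watchdog's actual code):

```shell
# Sketch of a "must be down for GRACE seconds" check for a polling watchdog.
# FIRST_DOWN holds the timestamp when the service was first seen missing;
# it persists between calls in the calling shell. Returns 0 (true) only
# when a restart should fire.
should_restart() {
    up=$1; now=$2              # $1 = 1 if service is up, 0 if down; $2 = epoch seconds
    GRACE=${GRACE:-600}        # grace period, default 10 minutes
    if [ "$up" -eq 1 ]; then
        FIRST_DOWN=""          # service recovered on its own; reset the timer
        return 1
    fi
    if [ -z "$FIRST_DOWN" ]; then
        FIRST_DOWN=$now        # first poll that found it down
    fi
    [ $((now - FIRST_DOWN)) -ge "$GRACE" ]
}
```

This would tolerate Snort's self-restart after a rules update (a few minutes of apparent downtime) while still catching a genuine crash.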
-
Just thinking out loud: same hardware ... same software ... same moment.
What about a DDoS?
The master breaks down and the slave takes over the load; the software and hardware are the same, so same result.
-
@shaunjstokes said in Multiple pfSense firewalls deployed in a DC environment crash on the same day:
There have been no hardware issues that we can find, so I believe software is the most likely cause. Given what you've said about Service Watchdog, it's possible it was in some way related to the crashes in this incident: if Service Watchdog tried to start Snort while Snort was updating, that could have caused a crash, which may explain some of what we see in the dumps. As updates happen at set intervals, it could also explain why the crashes were precisely 12 hours apart.
It's possible Service Watchdog is no longer needed if Snort is now stable, but we can't be sure, so we will just have to monitor. If Snort does still occasionally crash, what might be useful is an option within Service Watchdog for a 10 or even 20 minute delay, so an application has to be continuously down for that long before Service Watchdog initiates the start.
Snort should be pretty stable these days. About the only thing that might take it down (really, it would just prevent startup after a rules update) is a bad rule. This has happened a few times over the last few years: a rule syntax error gets introduced via a rules update and prevents Snort from restarting after the update.
I suspect Service Watchdog played a role in your Snort crashes. You could try greatly extending the delay in Service Watchdog, but that only fixes one problem. Another, larger one pops up if you run Snort on multiple interfaces: in that case there is a separate Snort process for each interface, which fools the check Service Watchdog does to see whether Snort is running. Service Watchdog simply does the equivalent of:
ps -ax | grep snort
If it gets a response, it assumes Snort is good. The problem is that Snort may have crashed on the WAN but still be running on the LAN, or crashed on one VLAN but be running on others. Service Watchdog doesn't know how to look for multiple Snort instances and match them to the configured interfaces.
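To illustrate the difference, a per-instance check would have to test each configured interface separately rather than grep the whole process list once. A rough sketch, assuming one PID file per interface under a known directory (the `snort_<iface>*.pid` naming and the directory are illustrative assumptions, not pfSense's exact layout):

```shell
# Sketch: check each configured Snort instance individually instead of a
# single "ps -ax | grep snort". Prints one status line per interface and
# returns the number of instances found NOT running.
check_snort_instances() {
    piddir=$1; shift           # remaining args: configured interface names
    missing=0
    for ifc in "$@"; do
        # Illustrative naming: one pid file per interface, snort_<iface>*.pid
        pidfile=$(ls "$piddir"/snort_"$ifc"*.pid 2>/dev/null | head -n 1)
        if [ -n "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
            echo "snort on $ifc: running"
        else
            echo "snort on $ifc: NOT running"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

With this style of check, Snort crashed on the WAN but alive on the LAN would be detected, whereas the single `grep` would report everything as fine.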
-
Everything is monitored and there were no abnormal fluctuations in traffic, unless it happened between polling periods, which is possible. Although, technically, if the software crashed because of a DDoS, that would be a problem to overcome in the software.
-
This thread has given me an idea for a new "health" feature, though. I might be able to put some checks into a cron task and let the Snort GUI itself (through that cron task) send the admin an email if a configured Snort instance crashes. I could add some configuration options to the GLOBAL SETTINGS tab. I will consider this for a future update.
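The core of such a cron health check could be very small: count the Snort processes actually running and compare against how many instances are configured. A hedged sketch (not the actual feature; `mail(1)` in the comment is only a stand-in for whatever notification hook the pfSense GUI exposes):

```shell
# Sketch of a cron-driven health check: compare running Snort processes
# against the number of configured instances, alert if any are missing.
snort_health_check() {
    expected=$1                                        # configured instance count
    running=$(pgrep -x snort 2>/dev/null | wc -l | tr -d ' ')
    if [ "$running" -lt "$expected" ]; then
        echo "ALERT: $running of $expected snort instances running"
        return 1
    fi
    echo "OK: $running snort instance(s) running"
    return 0
}

# A cron entry might do something like (addresses/commands illustrative):
#   snort_health_check 2 || mail -s "Snort instance down" admin@example.com
```

Counting processes is cruder than matching instances to interfaces, but as a first-pass alert it would catch the "one interface's Snort died" case that a simple grep misses.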
-
The health feature would be a good idea. It's been over a month now and Snort has been stable without Service Watchdog; the problems we had with Snort in earlier versions of pfSense no longer appear to be present.
At this stage I suspect the crashes may have been the result of a conflict between Snort and Service Watchdog, possibly while Snort was updating.