Intermittent crashes on 2.2 (bare metal)
-
Hi everyone
I admin a pair of pfSense 2.2 servers, fw1 and fw2, running on bare metal Dell R210 IIs, with dual onboard Broadcom bce interfaces and a quad Intel igb card each. We're using CARP between the two firewalls, on top of VLANs and LAGGs. fw1 is the default CARP master node with fw2 as backup. We're also using the load-balancer and OpenVPN and have 3 packages installed: Snort, mtr-nox11 and OpenVPN Client Export Utility.
They were originally installed with 2.1 and upgraded through 2.1.4 and 2.1.5 to 2.2. The 2.2 upgrade brought a month of major headaches, so we've been reluctant to upgrade further unless strictly necessary. I have also seen a number of people on these forums reporting crashy 2.2.4 systems.
The servers came with fakeraid, so we configured software gmirrors post-install as install time RAID creation wasn't supported in the 2.1 installer. Essentially we created a gmirror for the whole disk as described at:
https://doc.pfsense.org/index.php/Create_a_Software_RAID1_(gmirror)#Older_manual_instructions
The upgrade to 2.2 meant that such arrays were no longer supported so we had to add:
kern.geom.part.check_integrity=0
in /boot/loader.conf.local to be able to boot.
About 6 weeks ago, we were notified that fw1 had crashed, rebooted and wouldn't come up. Investigation showed that the boot process couldn't find the root device /dev/mirror/gm0s1a:
Trying to mount root from ufs:/dev/mirror/gm0s1a [rw]... mountroot: waiting for device /dev/mirror/gm0s1a ... Mounting from ufs:/dev/mirror/gm0s1a failed with error 19.
I was able to bring the machine up using the first hard disk as the root device, ada0s1a. I later rebooted to a pfSense install USB disk, rebuilt the gmirror manually and the system booted from the gmirror normally.
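A minimal sketch for confirming the mirror state after a rebuild like the one described above. It assumes the default `gmirror status` layout (a header line, then one line per mirror with the name and status in the first two columns); the function name `mirror_state` is hypothetical.

```shell
# mirror_state: read `gmirror status` output on stdin and print
# "<mirror name> <status>" for each mirror (e.g. COMPLETE or DEGRADED).
# Assumption: mirror lines start with "mirror/" and carry the status
# in the second column; component continuation lines are skipped.
mirror_state() {
    awk '$1 ~ /^mirror\// { print $1, $2 }'
}

# Usage on the firewall itself (not run here):
#   gmirror status | mirror_state
```

Anything other than COMPLETE for mirror/gm0 would suggest the rebuild has not finished or a component is missing.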
This crash and failure to boot from the gmirror has since occurred another 4 or 5 times on fw1. As I wasn't seeing any errors in /var/log/system.log before the crash and there weren't any crash dumps in /var/crash/, I initially suspected a hardware fault or flaky power provision, so we had the hosting provider check the security of the power cables. I ran memtest for 36 hours, checked the BIOS hardware event log and ran the onboard Dell hardware diagnostics long test, and all came up clean.
Last week the same problem occurred on fw2 for the first time, and as both are now crashy and unable to boot when it occurs, it's getting pretty worrying. It seems notable that the firewalls have only crashed while they are the CARP master node and handling all of the traffic passing through them. Most of the crashes seem to occur around 07:00-08:00 when traffic is fairly moderate; there certainly aren't any spikes at those times.
The other thing I have noticed is that in most cases, the crashes seem to occur when inactive memory is high (75-90% of 4GB) and free memory is very low (<5%). The RRD graphs show fairly clear trajectories in opposite directions for the two values in the lead-up to a crash. The only reason I can think of for this is Snort, so I have just temporarily disabled it and have observed the beginning of an increase in free memory and a decrease in inactive memory. It may be a false lead.
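The two values can also be sampled from the command line, independently of the RRD graphs. The sketch below assumes the FreeBSD sysctl names vm.stats.vm.v_page_count, vm.stats.vm.v_free_count and vm.stats.vm.v_inactive_count; the function name `mem_pct` is hypothetical, and the percentages are rounded down to whole numbers.

```shell
# mem_pct TOTAL FREE INACTIVE: print free and inactive page counts
# as whole percentages of the total page count.
mem_pct() {
    total=$1; free=$2; inactive=$3
    echo "free=$(( free * 100 / total ))% inactive=$(( inactive * 100 / total ))%"
}

# On FreeBSD, feed in the live counters; on systems without these
# sysctl names the block is skipped silently.
if sysctl -n vm.stats.vm.v_page_count >/dev/null 2>&1; then
    mem_pct "$(sysctl -n vm.stats.vm.v_page_count)" \
            "$(sysctl -n vm.stats.vm.v_free_count)" \
            "$(sysctl -n vm.stats.vm.v_inactive_count)"
fi
```

Run from cron with a timestamp, this would give a record of the last values before a crash even if the graphs are lost.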
Can anybody help or suggest how to diagnose the cause?
One thing that occurs to me is that as we're using a whole disk gmirror, when the system comes up on one disk, no swap space is enabled as it's on gm0s1b in fstab and that is perhaps why we're not seeing any crash dumps. Is it possible to specify the swap space as ada0s1b while using a whole disk gmirror or will that cause the sky to fall in? Is there a better way to handle this?
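To check whether the swap device named in fstab is actually active after a single-disk boot, something like the following could be used. It is only a sketch: the function name `missing_swap` is hypothetical, and it assumes `swapinfo` output has been saved to a file with the device path in the first column. It does not answer whether pointing swap at ada0s1b is safe.

```shell
# missing_swap FSTAB SWAPINFO_FILE: print every swap device listed in
# FSTAB (third field "swap", comment lines ignored) that does not
# appear at the start of a line in the saved swapinfo output.
missing_swap() {
    fstab=$1; active=$2
    awk '$1 !~ /^#/ && $3 == "swap" { print $1 }' "$fstab" |
    while read -r dev; do
        grep -q "^$dev[[:space:]]" "$active" || echo "$dev"
    done
}

# Usage on the firewall itself (not run here):
#   swapinfo > /tmp/swapinfo.out
#   missing_swap /etc/fstab /tmp/swapinfo.out
```

If /dev/mirror/gm0s1b is printed, no swap from fstab is active, which would be consistent with the absence of crash dumps.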
Additionally, if I were to perform a fresh install, can I safely export the config from a 2.2 pfSense and import it into a fresh 2.2.4 and expect it to work reliably?
Any assistance appreciated.
-
Why have you set up RAID on two pfSense boxes running CARP? That's double redundancy.
Maybe an upgrade to 2.2.4 will help.
I also used to struggle with rare crashes of a CARP setup on two Dell R210 IIs on 2.1.x, but only on the backup box. After some BIOS updates and clean pfSense installations it didn't crash anymore.
-
How long did the machines run fine on 2.2 before this issue started happening?
Some Googling says that this might be due to GPT instead of MBR being used on FreeBSD 9.X and later.
If you run gpart show on the command line, which partition table type is it using?
-
The machines were running fine for around 18 months before this problem first occurred, perhaps 6 months running 2.2. They may have crashed before but nobody noticed (we don't look after them day to day, just provide support when requested).
Output of gpart show:
[2.2-RELEASE][admin@fw1]/root: gpart show
=>     63  7831489  da0  MBR  (3.7G)
       63     1985       - free -  (993K)
     2048  7829504    1  !12  [active]  (3.7G)

=>     63  7831489  diskid/DISK-659D583A6DFBBC91  MBR  (3.7G)
       63     1985                                - free -  (993K)
     2048  7829504                             1  !12  [active]  (3.7G)

=>      0  15634432  da1  BSD  (7.5G)
        0    638656    1  freebsd-ufs  (312M)
   638656  14995776       - free -  (7.2G)

=>      0  15634432  diskid/DISK-AAZ5DUAAJGNYRQOA  BSD  (7.5G)
        0    638656                             1  freebsd-ufs  (312M)
   638656  14995776                                - free -  (7.2G)

=>       63  976773104  mirror/gm0  MBR  (466G) [CORRUPT]
         63  976773105           1  freebsd  [active]  (466G)

=>        0  976773105  mirror/gm0s1  BSD  (466G)
          0         16                - free -  (8.0K)
         16  959995873             1  freebsd-ufs  (458G)
  959995889   16777216             2  freebsd-swap  (8.0G)
So that's MBR, right?
I notice it reports mirror/gm0 is corrupt, which is news to me. Either it's because we are using a FreeBSD 8.3/pfSense 2.1 style gmirror which doesn't support 10.1/2.2 integrity checking (which is why we had to set kern.geom.part.check_integrity=0), or it's actually corrupted (though it rebuilt and booted fine).
fw2 is currently booted from a single disk while we wait for remote hands to insert a pfSense USB to boot from and rebuild the RAID pair, but running the gpart show on fw1 reports clean MBR filesystems on ada0 and ada1. I'll run the same check on fw2 once the RAID is rebuilt.
Thanks for your suggestion.
-
FWIW, the firewalls have been stable for 2 weeks since disabling Snort.
Memory usage graph attached. This node was in persistent maintenance mode at the beginning of the graph, but became CARP master at the point where free memory starts to decrease, because the other node had also crashed. This node crashed where the gap is, 3 days after becoming master. A few days after being brought back up, Snort was disabled (the slight increase in free memory) and persistent maintenance mode was disabled, so it became CARP master again. Free memory has remained high since.
As you will see, the complete inversion of free and inactive memory between becoming CARP master and crashing is telling. This has not occurred since Snort was disabled.
I've upgraded the Snort packages and will wait to see if they remain stable before re-enabling Snort.
-
Are you getting an out of memory error for your crash? That is a fatal situation, when your kernel can't find any free memory. It just means you need to reduce your memory usage or add more.
-
No, nothing logged prior to the crash.
I suspect this is either a Snort memory leak or badly configured Snort. I don't know Snort that well so I could have done something stupid.