Weird Issue: pfSense hangs/freezes silently with no log messages - not hardware?
-
Hi all,
I've got myself a new box to pfSense-ify after storm Doris destroyed the HDD in my Watchguard: A Celestix Scorpio II RSA SecurID Appliance. Features include:
-
Very nice 40x2 LCD & Jog Wheel
-
2x em(4) NICs
-
2x fxp(4) NICs
-
2x Serial port
-
VGA port
-
4GB RAM
I've run into a problem where pfSense installs fine and runs great, but after a seemingly random length of time, between 1 minute and an hour, the LCD displays "OFF" and the box becomes unresponsive on network and serial. A monitor connected to VGA goes black, yet the fans keep running at the same speed, and holding down the power button for 5 seconds still powers it off. Nothing is mentioned in the logs; the entries just stop.
This can even happen in the boot process before pfSense loads completely, in which case holding down the power button for 5 seconds doesn't work.
I'm 99.9% sure this isn't a hardware issue, and I'll explain what brings me to that conclusion. (Or at least the underlying hardware isn't broken; there may be an insidious incompatibility.)
So I've followed https://doc.pfsense.org/index.php/Boot_Troubleshooting and https://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards to the letter.
The first suggestion was to run a memory test, which did actually find a bad byte in one of the 4 sticks of RAM (Kingston Value RAM, unsurprisingly), so I removed that stick, taking the total down to 3GB (2 sticks in dual channel, one in single) until I order a new one.
I did get one kernel panic (see attached), but that was fixed after removing the faulty memory module. Interestingly, the mangled entry that caused it was at the EXACT same inode as another user's: https://technologyand.me/2015/08/25/pfsense-boot-loop/
Still not working - Memory not the cause.
Next I tried modifying the BIOS to ensure it was set to LBA and not Auto. There are extensive settings in the BIOS (Award, "04/27/2005-Springdale-G-6A79AWD9C-00") for Power Management, which control global timers for S1 and S3 standby. I wondered if the board was somehow going into standby, so I've tried disabling standby, resetting the timers on HDD, USB and network activity, and also allowing wake from USB keyboard. None of these helped.
Still not working - BIOS not the cause. (Although if you know of an updated BIOS I will try installing it, I couldn't find one though.)
I've tried running a pfSense live environment off a USB stick. This seems to last a bit longer before it hangs (30 mins as opposed to 10-20 mins), but it still dies like the rest. I don't know if this is due to the "using multiple small partitions" issue or not.
Still not working - HDD not the cause.
I've also tried running a Debian live system on the hardware, and even with stress-ng maxing everything out it doesn't hang or crash or anything. Just in case it needed to be installed on the hardware I also installed Debian to disk and ran another stress test for several hours - no crash. So the Linux kernel works faultlessly on this hardware.
Still not working - Hardware not the cause.
Originally I was restoring the XML config from my old router to this one, but I stopped doing that just in case there was an incompatibility. Nope, still crashes.
I've tried pfSense versions 2.2.3, 2.2.6 and 2.3.4 (The very latest i386 out there).
There are other steps I tried too, but I've forgotten them. ::) (It's been 3 days so far…)
So I need your help. Is there a way to get more verbose logging so I can diagnose the issue more thoroughly? At the moment I am completely in the dark as to the underlying cause.
Or perhaps you have experience with something similar?
Tomorrow I'm going to try to run it on a hypervisor over the same hardware and see if that abstraction layer has any effect.
Thanks.
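Since the logs simply stop with no error, one low-tech way to narrow down when the hang happens is a heartbeat file: write a timestamp every few seconds and flush it to disk, then read the last line after the forced power-off. This is only a sketch using standard sh, date, sync and sleep; the function name heartbeat and the log path in the usage line are made up for illustration, not anything pfSense ships.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamp to FILE every INTERVAL seconds and
# flush it to disk, so the last line shows roughly when the box froze.
# Usage: heartbeat FILE INTERVAL_SECONDS [COUNT]   (COUNT omitted = run forever)
heartbeat() {
    file=$1
    interval=$2
    count=${3:-0}
    i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        date '+%Y-%m-%d %H:%M:%S' >> "$file"
        sync    # push the write to disk before sleeping again
        sleep "$interval"
        i=$((i + 1))
    done
}
```

Something like `heartbeat /root/heartbeat.log 10 &` would leave a record accurate to about 10 seconds. It says nothing about why the box hung, only when, which can then be matched against whatever else was happening at that moment.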
-
-
Interesting development:
It doesn't crash in Debian (I have it running another stress-ng at 100% CPU and 90% RAM with 4 HDD workers right now), but it does crash in GRUB!
I tried to start Debian up (from the HDD) this morning and it got as far as GRUB and then crashed 1 second after GRUB displayed on the screen. It did this consistently 4 times in a row, until I removed the USB mouse and keyboard. I don't know if that was the cause or just a coincidence.
Thoughts?
-
At the risk of being lynched, OPNsense has been running on the box for 12 hours now, no crashes… :-X
As this router does a lot of business critical things for me, I need a solution or a workaround of any kind, and this will do.
I am happy to spend some time with someone knowledgeable about pfSense's internal workings trying to figure out the root cause. If you have any suggestions, I will still try them. I want to support the continued development of pfSense.
-
Take this for what it is worth but that is actually a pretty old box to be doing the job of "business critical". ;)
My guess is going to be a driver error of some kind. I'd be curious if pfSense 2.1.5 would run fine for you. But only as a test, as that version is not supported and includes all that goes with "not supported".
Also, have you tried with powerd disabled?
-
Take this for what it is worth but that is actually a pretty old box to be doing the job of "business critical". ;)
Fair point! I am cheap. ::) To paraphrase Sam Vimes; "there's nothing quite as expensive as being poor."
To stop powerd it would be as simple as "service powerd stop" yes? I'll try that.
Thanks for the tip.
-
Simpler..
System / Advanced / Miscellaneous / Power Savings: uncheck the box.. :)
I can't remember if it is checked by default or not..
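Either way, it may be worth confirming from a shell that powerd really is gone after changing the setting. A minimal sketch, assuming pgrep is available (it is part of the FreeBSD base system); the helper name is_running is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed if a process with exactly this name is running.
is_running() {
    pgrep -x "$1" > /dev/null 2>&1
}
```

For example, `is_running powerd && echo "powerd is still active"` prints the message only while the daemon is running.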
-
I think it could be a memory issue: it fails only in power-saving modes and can be stable under load, when full voltage is applied. To check this you need to install the memtester package.
For version 2.4 it would be:
fetch http://pkg.freebsd.org/freebsd:11:x86:64/latest/All/memtester-4.3.0.txz
pkg install memtester-4.3.0.txz
To run it, use:
memtester (size to test in MB) (loops)
memtester 512 10
Memtester for other FreeBSD versions can be found here: http://portsmon.freebsd.org/portoverview.py?category=sysutils&portname=memtest
For example, pfSense 2.3, which is based on FreeBSD 10, needs this package: http://pkg.freebsd.org/freebsd:10:x86:64/latest/All/memtester-4.3.0.txz
If pfSense also does not hang while memtester is running, then it is definitely a memory power-saving incompatibility issue.
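To pick the size argument for memtester without guessing, the amount to test can be derived from the installed RAM. This is just a sketch: the helper name memtest_mb is made up, and the 50% figure in the usage line is an arbitrary starting point, not a recommendation from the memtester documentation.

```shell
#!/bin/sh
# Hypothetical helper: given total RAM in bytes and a percentage, print the
# number of megabytes to pass to memtester.
# Usage: memtest_mb TOTAL_BYTES PERCENT
memtest_mb() {
    total_bytes=$1
    percent=$2
    echo $(( total_bytes / 1024 / 1024 * percent / 100 ))
}
```

On FreeBSD the total can be read with `sysctl -n hw.physmem`, so a run over roughly half the RAM for 10 loops would look like `memtester "$(memtest_mb "$(sysctl -n hw.physmem)" 50)" 10`. memtester can only test memory that is actually free, so leave headroom for the running system.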