pfSense critical faults at boot after a console halt
-
This is running on a Dell R320 (Xeon E5-2407 2.20GHz x1, 48GB RAM) with an 80GB SSD (Intel SSD X25-M Series 80GB, 2.5in SATA 3Gb/s, 34nm, MLC). The setup has been rock solid for over a year now. I recently upgraded to 2.4.5-release (patch 1) and rebooted without issue a few weeks ago. I am running Suricata and pfBlockerNG, but that's about it, and performance has been good.
We just had a power outage, so I manually powered off pfSense at the console to avoid an abrupt shutdown once the UPS started to get low. The power came back after a few hours and I had no web access to pfSense. I went to the VGA console and noticed several errors appearing during boot (they start before the 16-option menu appears), such as:
vm_fault: pager read error, pid 342 (php-fpm)
(it also shows this error with other pids, such as 477 (php), 503 (php-cgi), etc.)
Failed to fully fault in a core file segment at VA 0x800aa0000 with a size 0x7000 to be written at offset 0x146000 for process php-fpm
exited on signal 11 (core dumped)
Once at the console menu, it continues to fault periodically, again with "vm_fault: pager read error, pid 509 (php-cgi)", and so on.
Most of the menu options fail the same way; for example, option 1 to assign interfaces gives the same errors. pfSense is not pingable from other machines on the network and I cannot ping anything from it. Pretty much the only console menu option that works is the shell prompt.
I also tried rebooting into single-user mode and ran "/sbin/fsck -y /" to check the disk but got an error "fsck:cannot open '/dev/zroot/ROOT/default': No such file or directory". Not sure if it matters, but this is a ZFS installation.
I thought maybe my Suricata logs, or something else I forgot to look after, had filled up the disk, but df -h indicates there is free space: everything is at 0-2% capacity except for devfs, which is 1.0k in size with 1.0k used, at 100% capacity.
Can I fix this? My next step is to try to back up the configuration to USB (obviously I messed up by not doing that regularly) and nuke it if I can't.
I'd also like to prevent this from happening in the future if at all possible. I've had pfSense running on various machines for years and this is a first for me, but since the network is used for working from home, it's critical that I eliminate any chance of this occurring again if I can.
-
I would be surprised if that was a file system issue: you shut it down correctly and you're running ZFS.
It could be a bad drive. Do you run a swap slice? You don't need to with 48GB of RAM! Actually, bad RAM perhaps? I would expect more random and catastrophic errors though.
Steve
-
The Intel SSD is very old and has been through a few desktop PC builds before. It could be bad, but I'm not sure (I tried to run fsck and am not sure what else to try).
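On the "what else to try" question: fsck targets UFS-style file systems, so on a ZFS install the pool's own health report is the usual place to look instead. Below is a minimal sketch of that check, assuming the pool is named zroot (as the path in the fsck error suggests), that the zpool command is on the PATH, and that Python 3.7+ is available on the box, which may not be true of a stock install. The helper name zpool_health is made up for illustration; the same command can simply be typed at the shell prompt.

```python
import shutil
import subprocess


def zpool_health(pool="zroot"):
    """Return the text printed by `zpool status -x <pool>`.

    Returns None when the zpool command is not installed, so the
    sketch can be run safely on a machine without ZFS.
    The pool name "zroot" is an assumption taken from the fsck error.
    """
    if shutil.which("zpool") is None:
        return None
    result = subprocess.run(
        ["zpool", "status", "-x", pool],
        capture_output=True,
        text=True,
        check=False,  # a missing or faulted pool should not raise here
    )
    # zpool writes errors (e.g. unknown pool) to stderr, reports to stdout
    return (result.stdout or result.stderr).strip()


report = zpool_health()
print(report if report is not None else "zpool not found on this machine")
```

If the report shows read or checksum errors, running `zpool scrub zroot` and then `zpool status -v zroot` re-reads every block in the pool and lists any files that could not be read back correctly.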
-
I had to get this restored tonight, and after discovering how easy this would be I did a quick reinstall. The pfSense installer found the existing config file without issues and I'm back up and running.
I can check the drive health, but honestly, if the consensus is that it might have been a drive error, given that I'm using ZFS and did a proper halt from the console (it's not like I just pulled the plug), I might as well take this opportunity to get a new, reliable drive so I can have some peace of mind.
Any more advice would be appreciated, thanks!
-
You can check the SMART values from the GUI. Intel drives are usually pretty good at reporting that. They're also usually pretty close to indestructible!
Steve
-
OK, the SMART health from the GUI shows everything is OK...
Do I just chalk this up to a random event and cross my fingers it doesn't happen again? Or power down pfSense a few times to try to trigger it, which would indicate some kind of hardware issue?
Not really sure where to go from here in trying to prevent this in the future. Of course I can prepare for it, I have a USB stick with the config and pfSense install ready to go nearby now (but obviously I'd rather prevent it).
SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020 100   100   000    Old_age  Offline -           0
  4 Start_Stop_Count        0x0030 100   100   000    Old_age  Offline -           0
  5 Reallocated_Sector_Ct   0x0032 100   100   000    Old_age  Always  -           2
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           34518
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           1215
192 Unsafe_Shutdown_Count   0x0032 100   100   000    Old_age  Always  -           134
225 Host_Writes_32MiB       0x0030 200   200   000    Old_age  Offline -           394254
226 Workld_Media_Wear_Indic 0x0032 100   100   000    Old_age  Always  -           5231
227 Workld_Host_Reads_Perc  0x0032 100   100   000    Old_age  Always  -           1
228 Workload_Minutes        0x0032 100   100   000    Old_age  Always  -           1778753723
232 Available_Reservd_Space 0x0033 099   099   010    Pre-fail Always  -           0
233 Media_Wearout_Indicator 0x0032 095   095   000    Old_age  Always  -           0
184 End-to-End_Error        0x0033 100   100   099    Pre-fail Always  -           0
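For anyone reading these numbers later, here is a rough Python sketch of how the table can be interpreted. It assumes the usual smartctl convention that an attribute counts as failed when its normalized VALUE drops to or below a non-zero THRESH, and it converts two raw counters into friendlier units. The function names are made up for illustration and the rows are copied from the output above; this is a reading aid, not a verdict on the drive.

```python
# Attribute rows copied from the smartctl output in this post.
SMART_TABLE = """\
3 Spin_Up_Time 0x0020 100 100 000 Old_age Offline - 0
4 Start_Stop_Count 0x0030 100 100 000 Old_age Offline - 0
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 2
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 34518
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 1215
192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 134
225 Host_Writes_32MiB 0x0030 200 200 000 Old_age Offline - 394254
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 5231
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 1
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 1778753723
232 Available_Reservd_Space 0x0033 099 099 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 095 095 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 099 Pre-fail Always - 0
"""


def parse_smart(text):
    """Parse attribute rows into a dict keyed by attribute name."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # A data row has 10 columns and starts with a numeric ID.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        rows[parts[1]] = {
            "id": int(parts[0]),
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "type": parts[6],
            "raw": int(parts[9]),
        }
    return rows


def failing_attributes(rows):
    """Names whose normalized value is at or below a non-zero threshold."""
    return [
        name
        for name, r in rows.items()
        if r["thresh"] > 0 and r["value"] <= r["thresh"]
    ]


rows = parse_smart(SMART_TABLE)
host_writes_tib = rows["Host_Writes_32MiB"]["raw"] * 32 / 1024**2  # MiB -> TiB
power_on_years = rows["Power_On_Hours"]["raw"] / (24 * 365)

print("failing attributes:", failing_attributes(rows))
print("host writes (TiB):", round(host_writes_tib, 2))
print("power-on time (years):", round(power_on_years, 2))
print("reallocated sectors (raw):", rows["Reallocated_Sector_Ct"]["raw"])
print("unsafe shutdowns (raw):", rows["Unsafe_Shutdown_Count"]["raw"])
```

Nothing in the table crosses its threshold, which matches the GUI's "OK" verdict. The raw counts for reallocated sectors and unsafe shutdowns are non-zero, though, so they are the ones to compare against a later reading.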
-
How old is that system?
You could take it offline and run some RAM and disk tests. Not much more you can do except watch for errors and be prepared in case of another failure.
Steve
-
Had the same issue with an Intel SSD in a Netgate XG-1541. New SSD and all is well.
For some reason, I wasn't offered the option to format the drive with ZFS. Is this standard?
-
The factory installer uses the tested default values, which is a UFS install.
You could install to an XG-1541 with the CE image though and that will give you the option.
Steve