pfSense critical faults at boot after a console halt

Treefrog

This is running on a Dell R320 (1x Xeon E5-2407 2.20GHz, 48GB RAM, and an 80GB SSD: Intel SSD X25-M Series 80GB, 2.5in SATA 3Gb/s, 34nm, MLC).

The setup has been rock solid for over a year now. I recently upgraded to 2.4.5-release (patch 1) and rebooted without issue a few weeks ago. I am running Suricata and pfBlockerNG, but that's about it, and performance has been good.

We just had a power outage, so I manually powered off pfSense at the console to avoid an abrupt shutdown once the UPS started to get low. The power came back after a few hours and I had no web access to pfSense. I went to the VGA console and noticed several errors during boot (they start showing before the 16-option menu appears), such as:

vm_fault: pager read error, pid 342 (php-fpm)  << it also has this error with other pids such as 477 (php), pid 503 (php-cgi) etc.
Failed to fully fault in a core file segment at VA 0x800aa0000 with a size 0x7000 to be written at offset 0x146000 for process php-fpm
exited on signal 11 (core dumped)

Once at the console menu, it continues to fault periodically, again with "vm_fault: pager read error, pid 509 (php-cgi)", etc.

Trying most of the options, for example option 1 to assign interfaces, gives the same errors. pfSense is not pingable from other machines on the network and I cannot ping anything from it. Pretty much the only console menu option that works is entering the shell prompt.

I also tried rebooting into single-user mode and ran "/sbin/fsck -y /" to check the disk, but got the error "fsck:cannot open '/dev/zroot/ROOT/default': No such file or directory". Not sure if it matters, but this is a ZFS installation.
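A note on the fsck error: fsck checks UFS file systems, and ZFS has no fsck equivalent; pool health is normally inspected with `zpool status` (and verified with `zpool scrub`). The sketch below is only an illustration, assuming Python 3 is available on the machine and that `zpool status -x` prints "all pools are healthy" (or "pool 'zroot' is healthy" when a pool name is given) when nothing is wrong. The function names are hypothetical.

```python
import subprocess


def pool_is_healthy(status_output):
    """Interpret the text printed by `zpool status -x`.

    Assumption: a healthy system prints "all pools are healthy", and a
    healthy named pool prints "pool '<name>' is healthy".
    """
    text = status_output.strip()
    return text == "all pools are healthy" or text.endswith("is healthy")


def check_pools():
    """Run `zpool status -x` and return (healthy, raw_output)."""
    try:
        result = subprocess.run(
            ["zpool", "status", "-x"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # zpool is not installed on this machine (e.g. not a ZFS system).
        return False, "zpool command not found"
    return pool_is_healthy(result.stdout), result.stdout


if __name__ == "__main__":
    healthy, output = check_pools()
    print("healthy" if healthy else "needs attention")
    print(output)
```

If the pool reports errors, the full `zpool status -v` output lists the affected devices and files, which is more informative here than anything fsck could show.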

I thought maybe my Suricata logs or something I forgot to look after had filled up the disk, but df -h indicates there is free space: everything is at 0-2% capacity except for devfs, which is 1.0k in size with 1.0k used, at 100% capacity.
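For reference, devfs showing 100% is normal on FreeBSD and does not indicate a full disk. The same free-space check can be scripted; this is a minimal sketch, assuming Python 3 is available, and the helper names are made up for illustration. Note that df computes capacity slightly differently (it excludes reserved space), so the percentages may not match exactly.

```python
import shutil


def percent_used(used, total):
    """Return used space as a percentage of total (0.0 for an empty total)."""
    if total == 0:
        return 0.0
    return 100.0 * used / total


def report(paths):
    """Return (path, total_bytes, used_bytes, percent_used) for each mount point."""
    rows = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free
        rows.append((path, usage.total, usage.used,
                     percent_used(usage.used, usage.total)))
    return rows


if __name__ == "__main__":
    # "/" is just an example; add other mount points as needed.
    for path, total, used, pct in report(["/"]):
        print(f"{path}: {used} of {total} bytes used ({pct:.1f}%)")
```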

Can I fix this? My next step is to try to back up the configuration to USB (obviously I messed up by not doing that regularly) and nuke it if I can't.

I'd also like to prevent this from happening in the future if at all possible. I've had pfSense running on various machines for years and this is a first for me, but since the network is used for working from home, it's critical that I eliminate any chance of this occurring again if I can.

stephenw10

I would be surprised if that was a file system issue; you shut it down correctly and you're running ZFS.
It could be a bad drive. Do you run a swap slice? You don't need to with 48GB of RAM!

Actually bad RAM perhaps? I would expect more random and catastrophic errors though.

Steve

Treefrog

The Intel SSD is very old and has been through a few desktop PC builds before. It could be bad, but I'm not sure (I tried to run fsck, not sure what else to try).

Treefrog

I had to get this restored tonight, and after discovering how easy this would be I did a quick reinstall. The pfSense installer found the existing config file without issues and I'm back up and running.

I can check the drive health, but honestly, if the consensus is that it might have been a drive error, seeing as I'm using ZFS and did a proper halt from the console (it's not like I just pulled the plug), I might as well take this opportunity to get a new, reliable drive so I can have some peace of mind.

Any more advice would be appreciated, thanks!

stephenw10

You can check the SMART values from the GUI. Intel drives are usually pretty good at reporting that. They're also usually pretty close to indestructible!

Steve

Treefrog

OK, the SMART health from the GUI shows everything is OK...

Do I just chalk this up to a random event and cross my fingers it doesn't happen again? Or should I power down pfSense a few times and try to trigger it, which would indicate some kind of hardware issue?

Not really sure where to go from here in trying to prevent this in the future. Of course I can prepare for it: I now have a USB stick with the config and pfSense install ready to go nearby (but obviously I'd rather prevent it).

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       2
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       34518
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1215
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       134
225 Host_Writes_32MiB       0x0030   200   200   000    Old_age   Offline      -       394254
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       5231
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       1
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       1778753723
232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   095   095   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0
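To keep an eye on these values over time rather than reading the table by hand, the attribute rows can be parsed with a short script. This is only a sketch: it assumes the column layout shown above, and the choice of which attributes to watch is a guess for illustration, not a vendor recommendation. The sample rows are copied from the output above.

```python
import re

# Attributes picked for illustration; names match the table in this post.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Unsafe_Shutdown_Count",
    "Available_Reservd_Space",
    "Media_Wearout_Indicator",
    "End-to-End_Error",
)

# One attribute row: ID, name, flag, value, worst, thresh, type,
# updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)

SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       2
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       134
232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail  Always       -       0
"""


def parse_smart_attributes(text):
    """Map attribute name -> dict of value, worst, thresh, raw (all ints)."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header and non-table lines simply do not match
            attrs[match.group("name")] = {
                "value": int(match.group("value")),
                "worst": int(match.group("worst")),
                "thresh": int(match.group("thresh")),
                "raw": int(match.group("raw")),
            }
    return attrs


def summarize(attrs):
    """Return (name, raw, value - thresh) for each watched attribute present."""
    return [
        (name, attrs[name]["raw"], attrs[name]["value"] - attrs[name]["thresh"])
        for name in WATCHED
        if name in attrs
    ]


if __name__ == "__main__":
    for name, raw, margin in summarize(parse_smart_attributes(SAMPLE)):
        print(f"{name}: raw={raw}, margin above threshold={margin}")
```

Saving the raw counts after each reboot and comparing them would show whether Reallocated_Sector_Ct or Unsafe_Shutdown_Count is still climbing.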

stephenw10

How old is that system?

You could take it off-line and run some RAM and disk tests. Not much more you can do except watch for errors and be prepared in case of another failure.

Steve

ericnix

Had the same issue with an Intel SSD in a Netgate XG-1541. New SSD and all is well.

For some reason, I wasn't offered the ability to format the drive in ZFS. Is this the standard?

stephenw10

The factory installer uses the tested default values, which is a UFS install.

You could install to an XG-1541 with the CE image though and that will give you the option.

Steve