Dying hard drive, replacing with SSD? How to perform this as quickly as possible

pftdm007

Hello guys,

So I recently had some issues with hard drives on a FreeNAS storage server, and that made me look more closely at the condition of the HDD in my pfSense box. I think the hard drive I currently use in this pfSense box is dying:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       190793353
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       210
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       20
  7 Seek_Error_Rate         0x000f   091   060   030    Pre-fail  Always       -       1346666632
  9 Power_On_Hours          0x0032   014   014   000    Old_age   Always       -       75727
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       210
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   019   000    Old_age   Always       -       25770262658
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   057   045    Old_age   Always       -       32 (Min/Max 30/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       34
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       445142
194 Temperature_Celsius     0x0022   032   043   000    Old_age   Always       -       32 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   058   051   000    Old_age   Always       -       190793353
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

The "Reallocated sector count" is WAY higher than what I'm willing to deal with, and the "Command Timeout" is INSANELY high (I almost wonder if this is some kind of misreporting)...
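One possible explanation for the huge number, offered as an assumption rather than a confirmed diagnosis: on some Seagate drives the raw value of attribute 188 appears to be three 16-bit counters packed into one 48-bit number, which is why smartctl offers a `raw16` display format (for example `-v 188,raw16`). A minimal sketch that splits the value by hand (the function name `split_raw16` is just made up for illustration):

```sh
#!/bin/sh
# Split a 48-bit SMART raw value into three 16-bit words (high, middle, low).
# What each word means is vendor specific and is NOT decoded here.
split_raw16() {
    raw=$1
    hi=$(( (raw >> 32) & 65535 ))
    mid=$(( (raw >> 16) & 65535 ))
    lo=$(( raw & 65535 ))
    echo "$hi $mid $lo"
}

# Command_Timeout raw value from the table above
split_raw16 25770262658   # prints "6 7 130"
```

If that reading is right, the value is three small counters rather than 25 billion timeouts, but the reallocated sectors are still a real concern.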

The drive is a SEAGATE Momentus (Yeah I know..)

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 5400.6
Device Model:     ST9160314AS
Serial Number:    5VCLVMT1
LU WWN Device Id: 5 000c50 02e9f697a
Firmware Version: 0001SDM1
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Jul 15 08:06:15 2021 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

I'm thinking of replacing this drive ASAP, but I'd like to avoid rebuilding the entire OS and reinstalling / reconfiguring everything.

Some questions:

1. Would you guys recommend an SSD? If so, I'd personally be willing to spend a bit more and get an enterprise-grade SSD.
2. Is it possible to "clone" the drive to a new one and avoid reinstalling everything? Would pfSense be able to deal with this? The idea here is to clone only the data to a new drive and swap the dying one for the new one, without reinstalling everything and having a lot of downtime.
3. With such BAD SMART values, why is pfSense not issuing notifications (via email or in the GUI)? Some sort of cron job that watches the most relevant SMART attributes and warns the admin that the drive is "dying" would be nice. It's both a question and a suggestion for the pfSense devs.

I made a similar suggestion a few years ago (notify via email if the hardware temps go above a threshold), but this was never implemented.

Thanks to all!

KOM

@pftdm007 If you have a backup of your config you can reinstall and restore in under 10 minutes.

1. SSD is better from a power use perspective
2. Any disk clone software can do that. Acronis TrueImage, Clonezilla etc
3. No idea
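For those without dedicated clone software, a raw block copy with plain `dd` is another common way to do the same thing. This is a swapped-in technique, not what the tools named above do internally, and only a sketch: the helper name `clone_disk` and the device names in the example are hypothetical, and the destination must be at least as large as the source.

```sh
#!/bin/sh
# clone_disk SRC DST: raw block-for-block copy of SRC onto DST.
# conv=noerror,sync tells dd to keep going past unreadable blocks and
# pad them with zeros, which matters on a failing source drive.
# WARNING: everything on DST is overwritten.
clone_disk() {
    src=$1
    dst=$2
    if [ "$src" = "$dst" ]; then
        echo "clone_disk: source and destination are the same" >&2
        return 1
    fi
    dd if="$src" of="$dst" bs=1048576 conv=noerror,sync
}

# Example, run from a live/rescue system with both disks attached
# (hypothetical device names, double-check them before running):
# clone_disk /dev/ada0 /dev/ada1
```

Blocks that could not be read end up as zeros on the new disk, so keeping a config backup alongside the clone is still a good idea.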

fireodo

@pftdm007 said in Dying hard drive, replacing with SSD? How to perform this as quickly as possible:

Would you guys recommend a SSD?

I agree with @KOM but I also had a sudden SSD death without any S.M.A.R.T warnings! (Transcend 32GB mSATA) Maybe I had bad luck, I don't know ...

Just my 2 cents ...

Regards,
fireodo

AndyRH

Any SSD with trim (they all may do this) should be fine, although I would use a known brand. pfSense does not write enough to wear out an SSD in a reasonable amount of time.
IMO a fresh install with a restore is safer; failing drives are tricky things.

Gertjan

Several options / ideas..

First solution : why bother ? This excellent tool makes a backup of your pfSense config.
The "install USB" image is small and can be downloaded fast; you'll be back online 10 minutes after you start reinstalling.
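If you also want a copy made from the shell, something like the sketch below would do. It assumes the running config lives in /cf/conf/config.xml; the helper name `backup_config` and the destination directory are made up for the example.

```sh
#!/bin/sh
# backup_config SRC DESTDIR: copy the config file into DESTDIR under a
# timestamped name and print the path of the copy.
backup_config() {
    src=$1
    destdir=$2
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$destdir" || return 1
    cp -p "$src" "$destdir/config-$stamp.xml" || return 1
    echo "$destdir/config-$stamp.xml"
}

# Example (assumed config path, hypothetical destination):
# backup_config /cf/conf/config.xml /root/config-backups
```

A copy that stays on the dying disk is of little use, so move the file to another machine afterwards.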

Next : Is your pfSense essential ? Use a new drive every 3 to 4 years, and after that period, use the old disk in a less essential place.
Related : Use a UPS, and all risks are divided by a positive number N, where N is bigger than 1.
Keep a spare drive on the shelves.

Next : You have a "server" somewhere running on the Internet (for your own sites, mails, games, private DDOS attacks and such). Use a data collector tool like Munin - see here - and as soon as one of the values reaches a critical point, you get a mail.
Btw : I never received a mail from Munin; the drive was always fine one moment and dead 10 minutes later, taking pfSense with it (so - see first point). My Munin example is from my dedicated server, which uses a "Raid 1" with two identical drives. For such a setup, smartctl makes more sense: if one drive fails, the system will continue to run on a single drive, and I will have some time to prepare the swap and resync.

Next : Using the new ZFS filesystem with pools (a Raid 1 or bigger), a manual, monthly smartctl check will do.

As you said yourself, a basic cron, some grep and mail isn't that hard.

/usr/local/sbin/smartctl -H -c -l error -l selftest -l selective -a /dev/ada0

(because my drive's driver name is "ada")
This will show a boatload of info.
Just 'grep' the possible bad-ass values, and mail them up to yourself.
Your mini scripts / cron will be update proof.
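A rough sketch of such a mini script, not a tested pfSense package: it filters the attribute table from `smartctl -A` with awk instead of grep and reports the usual suspects (IDs 5, 187, 197, 198) when their raw value is above zero. The function name `check_smart`, the choice of attributes and the mail address are my assumptions, and `mail` has to be able to deliver from your box.

```sh
#!/bin/sh
# check_smart: read `smartctl -A` output on stdin and print one line for
# each watched attribute whose raw value is greater than zero.
# Exit status is 1 if at least one warning was printed, 0 otherwise.
check_smart() {
    awk '
        # Attribute rows start with a numeric ID; the raw value is field 10.
        $1 ~ /^[0-9]+$/ &&
        ($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198) &&
        $10 + 0 > 0 {
            printf "WARNING: %s (ID %s) raw value %s\n", $2, $1, $10
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Hypothetical cron usage (device name and address are placeholders):
# report=$(/usr/local/sbin/smartctl -A /dev/ada0 | check_smart) || \
#     echo "$report" | mail -s "SMART warning on pfSense" admin@example.com
```

With the table from the first post, this would flag Reallocated_Sector_Ct with its raw value of 20 and stay quiet about the rest.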