Dying hard drive, replacing with SSD? How to perform this as quickly as possible
-
Hello guys,
So I recently had some issues with hard drives on a freenas storage server and that made me look more closely at my pfsense HDD's condition. I think the hard drive I currently use in this pfsense box is dying:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       190793353
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       210
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       20
  7 Seek_Error_Rate         0x000f   091   060   030    Pre-fail  Always       -       1346666632
  9 Power_On_Hours          0x0032   014   014   000    Old_age   Always       -       75727
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       210
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   019   000    Old_age   Always       -       25770262658
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   057   045    Old_age   Always       -       32 (Min/Max 30/33)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       34
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       445142
194 Temperature_Celsius     0x0022   032   043   000    Old_age   Always       -       32 (0 15 0 0 0)
195 Hardware_ECC_Recovered  0x001a   058   051   000    Old_age   Always       -       190793353
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0
The "Reallocated sector count" is WAY higher than what I'm willing to deal with, and the "Command Timeout" is INSANELY high (I almost wonder if this is some kind of misreporting)...
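On the misreporting suspicion: Seagate drives are commonly reported to pack several 16-bit counters into the raw value of attribute 188, so the huge number may just be three small counters glued together. That reading is an assumption, not something confirmed for this ST9160314AS, and the helper name `split_raw16` below is made up for illustration. A minimal sketch of the arithmetic:

```python
def split_raw16(raw):
    """Split a 48-bit SMART raw value into three 16-bit words, high to low."""
    return ((raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF)


# Raw value of 188 Command_Timeout from the table above.
print(split_raw16(25770262658))  # (6, 7, 130)
```

If that assumption holds, the drive is reporting counts of 6, 7 and 130 rather than 25 billion timeouts. smartctl should be able to print the attribute this way itself with `-v 188,raw16`.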
The drive is a SEAGATE Momentus (Yeah I know..)
=== START OF INFORMATION SECTION ===
Model Family:     Seagate Momentus 5400.6
Device Model:     ST9160314AS
Serial Number:    5VCLVMT1
LU WWN Device Id: 5 000c50 02e9f697a
Firmware Version: 0001SDM1
User Capacity:    160,041,885,696 bytes [160 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Thu Jul 15 08:06:15 2021 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
I'm thinking to replace this drive ASAP but I'd like to avoid rebuilding the entire OS and reinstall / reconfigure everything.
Some questions:
- Would you guys recommend a SSD? If so, I'd personally be willing to spend a bit more and get an enterprise grade SSD.
- Is it possible to "clone" the drive to a new one and avoid reinstalling everything? Would pfsense be able to deal with this? The idea here is to only clone the data to a new drive and swap the dying one for the new one, without reinstalling everything and having a lot of downtime.
- With such BAD SMART values, why is pfsense not issuing notifications (via email or in the GUI)? Some sort of cron job that looks at the most relevant SMART attributes and warns the admin that the drive is "dying" would be nice. It's both a question and a suggestion for the pfsense devs.
I made a similar suggestion a few years ago (notify via email) if the hardware temps are going > threshold but this was never implemented.
Thanks to all!
-
@pftdm007 If you have a backup of your config you can reinstall and restore in under 10 minutes.
- SSD is better from a power use perspective
- Any disk clone software can do that. Acronis TrueImage, Clonezilla etc
- No idea
-
@pftdm007 said in Dying hard drive, replacing with SSD? How to perform this as quickly as possible:
Would you guys recommend a SSD?
I agree with @KOM but I also had a sudden SSD death without any S.M.A.R.T warnings! (Transcend 32GB mSATA) Maybe I had bad luck, I don't know ...
Just my 2 cents ...
Regards,
fireodo
-
Any SSD with trim (they all may do this) should be fine, although I would use a known brand. pfSense does not write enough to wear out an SSD in a reasonable amount of time.
IMO a fresh install with a restore is safer; failing drives are tricky things.
-
Several options / ideas..
First solution : why bother ? This excellent tool makes a backup of your pfSense config.
The "install USB" is small and can be downloaded fast; you'll be back online 10 minutes after you start reinstalling.
Next : Is your pfSense essential ? Use a new drive every 3 or 4 years, and after that period, use the old disk in a less essential place.
Related : Use a UPS, and all risks are divided by a positive number N, where N is bigger than 1.
Keep a spare drive on the shelf.
Next : You have a "server" somewhere running on the Internet (for your own sites, mails, games, private DDOS attacks and such). Use a data collector tool like Munin - see here - and as soon as one of the values reaches a critical point, you get a mail.
Btw : I never received a mail from Munin; the drive was always fine one moment, and dead 10 minutes later, taking pfSense with it (so - see first point). My Munin example is from my dedicated server, which uses a "Raid 1" with two identical drives. For such a setup, smartctl makes more sense. If one drive fails, the system will continue to run on a single drive, and I will have some time to prepare the swap and resync.
Next : Using the new ZFS filesystem, with pools (with a Raid 1 or bigger), a manual, monthly smartctl run will do.
As you said yourself, a basic cron, some grep and mail isn't that hard.
/usr/local/sbin/smartctl -H -c -l error -l selftest -l selective -a /dev/ada0
(because my drive's driver name is "ada")
This will show a boatload of info.
Just 'grep' the possible bad-ass values, and mail them up to yourself.
Your mini scripts / cron will be update proof.
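If grep gets unwieldy, the same idea can be sketched in Python, assuming Python 3.7+ is available on the box. This is only a rough example: the list of watched attributes, the function names and the "raw value above zero" rule are my own assumptions, not anything pfSense ships. It parses the attribute table printed by `smartctl -A` and returns a line for every attribute that looks worrying.

```python
import subprocess

SMARTCTL = "/usr/local/sbin/smartctl"  # same path as in the command above

# Assumption: attributes whose raw value should normally stay at zero.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def parse_attributes(text):
    """Turn the 'smartctl -A' attribute table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have a hex flag in column 3.
        if len(parts) >= 10 and parts[0].isdigit() and parts[2].startswith("0x"):
            rows.append({
                "name": parts[1],
                "value": int(parts[3]),
                "thresh": int(parts[5]),
                "when_failed": parts[8],
                "raw": parts[9],
            })
    return rows


def find_problems(rows):
    """Return one human-readable line per suspicious attribute."""
    problems = []
    for row in rows:
        if row["when_failed"] != "-":
            # The drive itself says this attribute has crossed its threshold.
            problems.append("%s: failed (%s)" % (row["name"], row["when_failed"]))
        elif row["name"] in WATCHED and row["raw"].isdigit() and int(row["raw"]) > 0:
            problems.append("%s: raw value %s" % (row["name"], row["raw"]))
    return problems


def check_device(device="/dev/ada0"):
    """Run smartctl on one device and return the list of problems found."""
    result = subprocess.run([SMARTCTL, "-A", device],
                            capture_output=True, text=True)
    return find_problems(parse_attributes(result.stdout))
```

A cron job could then call `check_device("/dev/ada0")` and mail the result whenever the list is not empty; how the mail is sent is left out here on purpose, since it depends on your setup.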