pfSense eats hard drives
I have been using pfSense for several years now at three different locations. At those locations, I have gone through 11 different motherboards as upgrades. I say this to show I have been having HDD problems across a variety of hardware.
The problem is that whenever I reboot pfSense, there is a chance the HDD will not be recognized. When it happens, I can hear the drive seeking and it is usually not recognized by the BIOS. This happens about 10% of the time on reboot. If pfSense is not shut down gracefully, I would bet better than even odds that I have lost my hard drive.
Over the years I have gone through 28 HDDs. Several of them were old drives pulled from a desktop, but many were brand new. I don't know what is causing this, but I am just about through. I just ordered an SSD and I am thinking about looking for a new firewall.
Anyone have this same problem? Or maybe know what is going on?
What packages are you running? HTTP proxies, if using caching, can write a lot of data to an HDD. You probably shouldn't be having that many issues, though. I'm not having any issues over here, just to add to your data points. I'm not proxying any of my interfaces either.
You'll get farther by mentioning your hardware.
Anyone have this same problem?
Quite the contrary. I have one box with a drive that's been running since 2001. I put pfSense 1.2.1 (when it first came out) on it and upgraded it as far as 2.1.3 before it was finally retired this year in favor of a WatchGuard box running pfSense on a CF card.
@Mikeisfly, The only package I use is OpenVPN.
@gonzopancho, I originally stated I have had this problem on a wide variety of hardware.
The two remaining installs, including the one that just died, both use Jetway mini-ITX boards and Western Digital HDDs. But I have had this problem with Asus, MSI, and Intel boards, and with Western Digital, Seagate, and Toshiba drives.
I have been using PFsense for several years now at three different locations. At those locations, I have gone through 11 different motherboards as upgrades
Over the years I have gone through 28 HDDs.
That is an extraordinarily high failure rate (assuming all 28 are failures, and not capacity or speed upgrades). My experience is that FreeBSD is extremely stable, capable of uptimes measured in years. I cannot even think of how pfSense could induce hardware failures …
How is the environment of these three systems? Dirty, hot, bad power? Perhaps a good UPS/filter would be good to have.
They are not in commercial server rooms or anything. They have all been in office environments. There shouldn't be too much dust.
As far as power, two of the three (including the one that just died) have UPSs. Just cheap consumer-grade APC units for home use, nothing fancy.
Yes the failure rate is high, but a number of those drives were pulled from desktops and whatever I had on the bench at the time.
So you're using secondhand hardware, and probably inexpensive mini-ITX boards.
I know of Alix boards that have run for years non-stop.
We attempted to sell a Jetway dual-core Atom board, but the failure/return rate was extremely high, so we stopped. (Yet I have employees who use them as their main firewall at home with no issue.)
Over the years I have gone through 28 HDDs.
Ouch! Something's not right there. Any common elements between the systems? Bad power supplies seem like a likely suspect. It would be interesting to get some drive stats from any of them. The total writes figure for example. Perhaps they came from previously high load systems, hard to believe all 28 did though.
If you're only running OpenVPN then you could run Nano where there are very few writes to the boot drive.
I've lost a few HDDs over the years in my main pfSense box at home. I suspect the primary causes are heat and running a tiny laptop drive 24/7 that wasn't meant for that workload.
On 2.1 and later you can also move /var/ and /tmp/ to RAM disks if you have enough spare RAM, so that the constant writing of log files and RRD databases doesn't impact your disk (spinning disk or SSD).
Have you monitored the SMART status of the drives? I agree that heat, coupled with secondhand hard drives being spun 24/7, could be the culprit.
Interesting news: not even the BIOS was recognizing the HDD at boot. The POST screen was hanging and I could hear the drive seeking loudly. Out of frustration I banged the drive with the handle of the screwdriver.
I backed everything up again. I did order a new SSD and will be reinstalling soon.
The power supplies in there now have been in use for some time, but not the entire time. Almost a year or so.
Thanks for the tip I will try moving those partitions.
I have used laptop-sized drives in the past, maybe 3 or 4 of those. And yes, they all died and I wasn't too surprised.
One occasional (and rather wonky) workaround for failing drives I've used in the past:
If the failure appears in one area of the drive (rather than everywhere and always), I roughly calculate the relative position of the "BAD" zone on the disk and create a partition that fully encompasses it (perhaps with some spare space on either side).
In a manual install, force a mountpoint for something you won't use, e.g. /bmnt, and assign it to the bad partition. Depending on what space you have left, you may need to assign other mount points around the "bad partition".
Once pfSense boots, you can remove the reference to /bmnt from fstab and the bad spot remains isolated. Not a permanent fix by any means (more like bubblegum and baling wire ::) ) but it often stops the drive from stuttering over a failing area in normal usage and generating errors, etc. It's given me time to find a suitable replacement drive/system while keeping the unit operational for a while longer.
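For anyone wanting to try this, the "roughly calculated" step can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the post above: the bad zone is already known as a range of LBAs (for example from the LBA_of_first_error column of a SMART self-test log), sectors are 512 bytes, and the margin and alignment values are arbitrary. The function name and the example numbers are hypothetical.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def isolation_partition(bad_first_lba, bad_last_lba, total_sectors,
                        margin_sectors=2_000_000, align_sectors=2048):
    """Return (start, end), inclusive sector numbers, of a partition that
    covers the bad LBA range plus a safety margin on both sides."""
    if not 0 <= bad_first_lba <= bad_last_lba < total_sectors:
        raise ValueError("bad zone must lie inside the disk")
    # Pad the bad zone on both sides, clamped to the disk boundaries.
    start = max(0, bad_first_lba - margin_sectors)
    end = min(total_sectors - 1, bad_last_lba + margin_sectors)
    # Round the start down and the end up to alignment boundaries.
    start -= start % align_sectors
    end = min(total_sectors - 1, end + (-(end + 1)) % align_sectors)
    return start, end


# Hypothetical example: a bad zone around LBA 500,000,000 on a ~500 GB drive.
start, end = isolation_partition(500_000_000, 500_100_000, 976_773_168)
size_gib = (end - start + 1) * SECTOR_BYTES / 2**30
print(f"sacrificial partition: sectors {start}..{end} ({size_gib:.2f} GiB)")
```

The resulting start and end sectors are what you would hand to the partitioning tool for the throwaway /bmnt partition; everything before and after that range stays available for real mount points.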
In general though I find my pfSense installs to be mostly harmless to hard drives, YMMV :)
I just checked the SMART self-test logs on the drive, and it says passed. Weird though, it only shows 410 LifeTime hours. That is only 17 days. That drive is much older than that. Am I reading that wrong?
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
Try to read the raw SMART data from the drive and then check the manufacturer's datasheet. Some values can be a little non-standard. Does 410 days seem more likely?
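To put numbers on the 410 figure, here is a small arithmetic sketch. It assumes something not stated in this thread: that the LifeTime(hours) column of the self-test log is a 16-bit counter that wraps back to zero after 65536 hours, which is my understanding of how that field behaves on ATA drives. Treat it as a possibility to check against the drive's raw Power_On_Hours attribute, not as a diagnosis. The function names are made up for the example.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25
WRAP_HOURS = 65536  # assumption: 16-bit hour counter in the self-test log


def hours_to_days(hours):
    """Convert power-on hours to days."""
    return hours / HOURS_PER_DAY


def candidate_ages(logged_hours, max_wraps=1):
    """Possible true power-on hours if the logged counter wrapped 0..max_wraps times."""
    return [logged_hours + wraps * WRAP_HOURS for wraps in range(max_wraps + 1)]


for hours in candidate_ages(410):
    days = hours_to_days(hours)
    print(f"{hours} h = {days:.1f} days = {days / DAYS_PER_YEAR:.2f} years")

# The other reading suggested above: the raw value counts days, not hours.
print(f"410 days = {410 / DAYS_PER_YEAR:.2f} years")
```

Also keep in mind that the self-test log records the power-on time at the moment each test ran, so an old log entry will show an old number even on a drive that has been running for years since.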
What does SMART report for your drive temperature? In my experience, if you run consumer drives 24/7 at sustained temperatures above 40 °C, you're asking for trouble. Above 45 °C, you're asking for trouble and are going to get it.
I'm not familiar with using SMART in pfSense, since I use CF. FYI, the FreeBSD command is smartctl -a /dev/da0.
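If you want to pull the temperature and power-on hours out of the smartctl attribute table with a script, something like the following Python sketch might do. It does not run smartctl itself; it parses text you paste in. The sample table only mimics the usual attribute layout and its values are made up, not taken from anyone's drive here. Attribute names vary between vendors, the 40/45 thresholds are the rule of thumb above (assumed to be Celsius), and the function names are hypothetical.

```python
import re

# Illustrative sample only: the layout imitates a smartctl attribute table,
# but the values are invented for this example.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   045   045   000    Old_age   Always       -       40210
194 Temperature_Celsius     0x0022   103   095   000    Old_age   Always       -       47
"""


def parse_attributes(text):
    """Map attribute name -> leading integer of its RAW_VALUE column."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            match = re.match(r"\d+", fields[9])
            if match:
                attrs[fields[1]] = int(match.group())
    return attrs


def temperature_verdict(celsius):
    """Apply the 40/45 rule of thumb from the post above (assumed Celsius)."""
    if celsius > 45:
        return "too hot"
    if celsius > 40:
        return "warm"
    return "ok"


attrs = parse_attributes(SAMPLE)
temp = attrs.get("Temperature_Celsius")
if temp is not None:
    print(f"drive temperature: {temp} C ({temperature_verdict(temp)})")
print(f"power-on hours: {attrs.get('Power_On_Hours')}")
```

Comparing the raw Power_On_Hours value against the LifeTime(hours) column in the self-test log is also a quick way to see whether the log entry is simply old.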
It has been my experience that HDs tend to outlive their usefulness, no matter how hard you run them, as long as they stay cool. If you have a lot of HDs dying, that's not a software issue, that's a hardware issue.
Spinning down an HD is horribly hard on the motor, so make sure you don't have any idle drive power saving going on somehow.
If you get SSDs, that can pretty much remove all mechanical issues, but if you have an HTTP proxy or anything, you may need to be concerned with data being written to the drive. Making sure TRIM support is enabled can be very useful for these situations. I personally haven't gone through and manually enabled TRIM or checked on it to see if it was auto-detected. My SSDs rarely get written to and I have no services that want to write, for the most part.
I agree with everything you say, Harvey. Spinning up a drive IS stressful. And heat IS the path to death.
There is a tradeoff. Keeping a drive running 24/7 reduces the stress of spin-up. But, without good ventilation, it also ensures that drives run hot and stay hot.
I spin down my archival storage arrays, but keep frequently used drives up 24/7. The challenge is to keep them cool as well.
This is why I never trust HDDs in appliances like pfSense. Thank God there exists such a great thing as NanoBSD! Running from RAM the whole thing.