pfSense killing hard drives

ericwilkison

I have a very basic pfSense install on desktop-class HP boxes that keeps killing hard drives. I'm using version 1.2.3-RELEASE with no add-on packages. It's set up as a basic packet filter between the corporate and lab networks. It has very low utilization, with about 30 users performing admin functions on the lab across it.

I've had it running for about a year now, and three times the hard drive has died. Each time, the drive malfunctioned to the point that the system BIOS would not even recognise it any more. In the interest of getting the system back up quickly, each time the entire box was replaced with identical hardware, pfSense was reloaded from disk, and the config was restored from backup.

I have another identical box that is set up as a firewall between the lab network and the Internet. This box has been running for over a year without issue.

Since the hardware was replaced each time, malfunctioning hardware can be ruled out as a possible cause. I don't see how anything pfSense does could kill the hard drives to the point that the OS would not even see them. But at this point I don't have any other places to look. Could this be caused by something that pfSense is doing?

Any help, suggestions, or troubleshooting recommendations that you can provide would be helpful.

Thanks,

gderf

I can't imagine what pfsense could have to do with your HDs failing.

I have a Windows XP box that killed two WD Caviar Blue 500GB drives with less than a year on them, but I don't blame the OS. The drives are junk and they just happened to be plugged into that XP box.

WD replaced the drives under warranty, but I don't dare put them into service.

mrbostn

What mfg are the drives?
Are they laptop drives?
IDE? SATA?

Power/power supply issues maybe?

wallabybob

A few years ago I became aware of "premature" failure of laptop drives in firewalls. The particular drives were not rated for 24x7 operation, and the way they were driven by that particular firewall caused an abnormally high number of operations (head parking, if I recall correctly) that led to the premature failure.
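To gauge whether head parking is wearing a drive out, one rough approach is to compare its load-cycle rate against the rated limit. The sketch below is only an illustration: the function name is hypothetical, and the default of 300,000 rated cycles is an assumed figure, not taken from any particular datasheet.

```python
def hours_until_rated_limit(load_cycles, power_on_hours, rated_cycles=300_000):
    """Estimate how many more hours until load_cycles reaches rated_cycles.

    rated_cycles is an ASSUMED default; use the figure from the drive's
    own datasheet. The estimate assumes the past parking rate continues.
    """
    if power_on_hours <= 0:
        raise ValueError("power_on_hours must be positive")
    rate = load_cycles / power_on_hours  # average cycles per hour so far
    if rate == 0:
        return float("inf")  # no parking observed, no projected wear-out
    remaining = max(rated_cycles - load_cycles, 0)
    return remaining / rate
```

A drive that parks its heads a hundred times an hour would use up such a budget in a matter of months, which fits the kind of early failure described above.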

jimp

See my post here: http://forum.pfsense.org/index.php/topic,26626.0.html

dhatz

Does the ataidle command need to be executed on the disk(s) at boot (i.e. after each powerdown)?

jimp

No, just once. It sets a bit in the drive's firmware, if I remember right.

ericwilkison

The current system that was just re-built has a Seagate Barracuda 7200rpm SATA drive in it. I don't know specifically what model was in the other boxes that died, but it would have been something comparable.

I installed smartctl on the system and there is no value for Load_Cycle_Count. I assume that the drive does not record it.
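For anyone wanting to check this on their own box, here is a minimal sketch that pulls attributes out of `smartctl -A` output. It assumes the usual ten-column attribute table that smartctl prints for ATA drives; the function names are hypothetical, and the device path is whatever your disk is called on your system.

```python
import subprocess


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} parsed from `smartctl -A` text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is everything from the tenth column onward.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = " ".join(fields[9:])
    return attrs


def read_smart_attributes(device):
    """Run smartctl on the given device (needs root) and parse its output."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    )
    return parse_smart_attributes(result.stdout)
```

Calling `.get("Load_Cycle_Count")` on the result returns `None` when the drive does not report that attribute, which is the situation described above.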

wallabybob

Have you ever tried the apparently broken drives in another system?

Have you tried discussing the problem with the support line of the drive manufacturer?

chpalmer

Seagate Barracuda 7200rpm SATA

320gig by chance? There seems to be a bad batch of these that made it out there a couple of years ago… I've got one in the range of those that have had problem drives, so I've been hesitant to use it for anything serious...

I've never had any problems before or after that period (the dates escape me); in fact, I have no trouble with the later models we use...

dreamslacker

@chpalmer:

Seagate Barracuda 7200rpm SATA

320gig by chance? There seems to be a bad batch of these that made it out there a couple of years ago… I've got one in the range of those that have had problem drives, so I've been hesitant to use it for anything serious...

I've never had any problems before or after that period (the dates escape me); in fact, I have no trouble with the later models we use...

That would be the Cuda 11 series. It's a firmware related issue and the 500GB and under drives are particularly hard hit. I used to help recover the drives but the 500GB units proved to be a hit and miss affair.
The 1TB generally worked fine (and continued to do so for quite a long time), but the 500GB units I fixed (it's a temporary fix by clearing the test logs area) usually only lasted 2 to 3 additional power cycles and became unfixable after that. AFAIK, the 750GB and above models had a new firmware that resolved the issue, but there never was one for the lower-capacity models.

Each drive's firmware contains a drive testing algorithm built-in for factory QC/ reliability testing. Before testing for each batch, the drives concerned are loaded with settings to conduct the self-test and write out the logs to a specific area on the drives. These logs cannot be overwritten by subsequent tests and have to be cleared manually (automatically on the test machines which interface directly to the microcontroller).
The problem came when, somehow, the drives sent for production did not have the self-tests disabled, so each time the drive powered up, it performed the test and wrote the log. When the log area is filled, the drive shuts down operations (this is a safety measure for testing so that logs won't get discarded during the QC phase). To recover the drive with data intact, one needed to directly interface with the MCU on the drive; there are 4 pins on the back of each Seagate drive for serial communications and power - TX/ RX/ V+/ GND.
You'd need to temporarily disconnect the motor (can be achieved by sliding a plastic piece between the contacts) because the drive would conduct tests upon spin-up and lock up immediately.
Once a link is established with the MCU, you needed to tell it to ramp down the motor and connect the board back to the motor. Then disable the tests, clear the log area and ramp the motor back on. This gives several extra power cycles during which you can recover the data, flash a fixed firmware, etc.

I know this for a fact since I was an IT consultant with one of their sub-contractors building QC equipment racks for them and I encountered the exact same issue with the refurbished drives used during our equipment testing (we were manually running tests and the logs clearing issue wasn't made clear to us until all our test drives locked up and everyone started pointing fingers). After we were informed of the issue, the drives had the logs cleared and we were back in business.
I'm just glad that we made an exit before the 11th series was out wreaking havoc. Otherwise, I'd be sure that they would have been pointing fingers our way (the engineer for the project was quite hostile towards us since he wanted another sub-con handling the project but we won it).
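The lock-up mechanism described above can be pictured with a toy model. This is purely illustrative: the class, its names and the log capacity are hypothetical and say nothing about how the real firmware is written. It only mirrors the behaviour as described: one log entry per power-up while self-tests are enabled, a lock-up once the log area is full, and a reset when the logs are cleared.

```python
class ToyDrive:
    """Toy model of the described self-test log lock-up (not real firmware)."""

    def __init__(self, log_capacity, self_test_enabled=True):
        self.log_capacity = log_capacity          # hypothetical log area size
        self.self_test_enabled = self_test_enabled
        self.log_entries = 0
        self.locked = False

    def power_on(self):
        """Return True if the drive is usable after this power cycle."""
        if self.locked:
            return False
        if self.self_test_enabled:
            self.log_entries += 1                 # each power-up writes a log
            if self.log_entries >= self.log_capacity:
                self.locked = True                # log area full: drive locks
                return False
        return True

    def clear_logs(self, disable_self_test=False):
        """Model the manual fix: wipe the log area, optionally stop the tests."""
        self.log_entries = 0
        self.locked = False
        if disable_self_test:
            self.self_test_enabled = False
```

Clearing the logs alone only buys a few more power cycles in this model, while also disabling the self-test stops the log from filling again.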

ericwilkison

All of the systems have had 500GB drives and so this could very well be the firmware issue that you described. In my original post I mentioned that we had an identical box running another instance of pfsense that has not had an issue. I checked it out and it's using a Hitachi drive. That tends to support the theory of firmware issues on the hard drive as well.

I'm going to see if HP has any firmware updates for the drive and / or see if I can dig up another box with one of the Hitachi hard drives.

Thanks for the valuable input!

stuxhost

I've been doing just fine with a Seagate 5GB CF II microdrive. Haven't had any HDD issues yet; in fact, it's been far more reliable than when I was using actual flash memory.*

I am probably going to go back to pure CF for pfSense, now that I'm not upgrading/downgrading regularly. I'd recommend that you do the same, and use the HDD just for simpler storage (e.g. if you've got a cache server set up). I barely see the need for the 5GB I have right now, and can't imagine a FW needing more. If you're not doing something like IDS, you might also want to consider just using NFS if you have those large storage needs.

Keep that firewall light and tight, it's the brainstem of your network.

* This is not the case with Voyage Linux running on a WRAP on the same drive (and revision), but that system does A LOT of writes.

chpalmer

I've got a couple of these, one of which I've had for 10 years and is still running in one of my firewalls… (Got a standby ready in case...)

http://www.google.com/products/catalog?hl=en&sugexp=pfwc&cp=26&gs_id=3&xhr=t&q=FUJITSU+MPD3043AT&gs_upl=&bav=on.2,or.r_gc.r_pw.&biw=1920&bih=980&um=1&ie=UTF-8&tbm=shop&cid=13466709221322952821&sa=X&ei=aZd6TvKnBMTXiAKXt8XBDw&sqi=2&ved=0CEgQ8wIwBA

Pretty bulletproof. No SMART unfortunately...

Alan87i

I had one motherboard that would kill hard drives. I ran 4 through it in 2 years, all different models and brands. I swapped out the board and have run the system with the current drive for 3 years.

fastcon68

I can't get either the WatchGuard X500 or the X1000 to install to a hard drive. I tried 4 different drives with no luck. CF is just about the only way to go, unfortunately.
RC