Hard drive install - Netgate FW-7535 1U read_DMA errors after a couple weeks

TheCeltic

I recently purchased a Netgate FW-7535 1U device and added a 750GB Seagate Momentus drive (removing the SD card). Installation was easy and the firewall worked great for about three weeks.

One morning, it stopped working and the following error was displayed:

ad5: failure - read_dma status=51 error=40 LBA=3386711

I am using the full install of pfSense 2.0 and running the Snort package. I've seen posts about other hardware issues when setting up drives for use with pfSense. Is there any way to fix this? I had three partitions: / (4 GB), swap (256 MB), and /usr (the remaining disk space, ~745 GB).

Any tips/advice are appreciated. I was previously running pfSense on a Mac mini without issue and had to revert to the old hardware…

Thanks!!

-TheCeltic

wallabybob

Your drive has probably developed a bad spot around LBA=3386711.
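The numbers in the error message can be decoded to check this reading. The following is a minimal sketch, not something posted in the thread. It assumes the status and error values are hexadecimal (as FreeBSD's ATA driver prints them), that the bit names follow the standard ATA status and error registers, and that the drive uses 512-byte sectors. The helper names `decode` and `lba_to_gib` are made up for illustration.

```python
# Bit names follow the standard ATA status and error registers.
# Assumption: the values in the log line are hexadecimal.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "AMNF"}


def decode(value, names):
    """Return the names of all bits that are set in value."""
    return [name for bit, name in names.items() if value & bit]


def lba_to_gib(lba, sector_size=512):
    """Byte offset of a logical block, in GiB (512-byte sectors assumed)."""
    return lba * sector_size / 2**30


status = int("51", 16)   # from "status=51"
error = int("40", 16)    # from "error=40"

print(decode(status, STATUS_BITS))     # ['DRDY', 'DSC', 'ERR']
print(decode(error, ERROR_BITS))       # ['UNC']
print(round(lba_to_gib(3386711), 2))   # 1.61
```

UNC means the drive could not correct the data it read from the medium, which points at the disk surface rather than at cabling or the controller. The block sits about 1.61 GiB into the disk, so it would fall inside the 4 GB / partition if that partition comes first on the disk (an assumption; the thread does not give the layout order).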

I understand hard drives have a number of replacement blocks that can be used to recover from this sort of problem and the drive firmware can relocate a certain number of logical blocks from bad physical spots to good physical spots.

You could boot a Linux live CD on the system and try a few things to see if the bad spot is still visible:

Use the Linux dd utility to do a physical read of the whole drive. It might turn up other bad spots, perhaps even enough to make you check the drive warranty.
Use the Linux badblocks utility to investigate the drive, including a write test.
Depending on the distribution, there might be a utility to invoke the drive's self-test functionality.
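If the live CD includes Python, the first suggestion (a full sequential read) can be sketched as below. This is only an illustration of the idea, not a replacement for dd or badblocks. `scan_device` is a made-up helper name, and the demo runs against a temporary file so the sketch can be tried without root or a real disk. On the real system you would pass the drive's device node instead (its name depends on the live CD) and run as root.

```python
import os
import tempfile


def scan_device(path, chunk_size=1024 * 1024):
    """Read path from start to end.

    Returns (bytes_read, bad_offsets), where bad_offsets lists the starting
    byte offset of every chunk that could not be read.
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size of a regular file or block device.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, length, offset)
            except OSError:
                # Unreadable chunk: note where it starts and skip past it.
                bad_offsets.append(offset)
                offset += length
                continue
            if not data:
                break
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets


# Demo on a temporary file (stands in for a disk; no errors are expected).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (3 * 1024 * 1024 + 123))
    demo_path = tmp.name

demo_read, demo_bad = scan_device(demo_path)
os.remove(demo_path)
print(demo_read, demo_bad)  # 3145851 []
```

A failed read marks the whole chunk as bad, so a smaller `chunk_size` narrows down the affected area at the cost of a slower scan. Dividing a reported byte offset by the sector size gives an LBA to compare with the one in the error message.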

Best case: you might have discovered the only bad spot on the disk, the logical block has been remapped to another block, and you will get a few years of trouble-free life out of the disk.

Worst case: there are lots of bad blocks scattered over the disk and you have just been lucky not to encounter them.

TheCeltic

I'd like to think it is as simple as a faulty disk… however, there were several similar reported errors (I only listed the one), it is a brand new drive and I've read about several other people having similar issues. Is it possible that the bios settings need to be manually set? What about other issues? I do have a replacement drive and will try it, but using pfsense quickly becomes unpopular with management when hardware fails (I know it may not be the fault of pfsense, but that's how they see it). I REALLY want to ensure management continues to support the decision to use pfsense and not require that I change to some other (commercial and inflexible) solution. Thanks!

wallabybob

@TheCeltic:

there were several similar reported errors (I only listed the one),

If there is a bad spot on the disk it is likely to affect a number of blocks. Did the other reported errors indicate a block "close" to the one in the report you posted?
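One way to answer the "close" question is to sort the reported LBAs and group those that lie within some distance of each other. A minimal sketch follows. The `max_gap` of 2048 blocks is an arbitrary choice, `group_nearby` is a made-up helper name, and every LBA in the sample list except 3386711 is hypothetical, because the thread only quotes one error.

```python
def group_nearby(lbas, max_gap=2048):
    """Group sorted LBAs so neighbours in a group are at most max_gap apart."""
    groups = []
    for lba in sorted(lbas):
        if groups and lba - groups[-1][-1] <= max_gap:
            groups[-1].append(lba)
        else:
            groups.append([lba])
    return groups


# Only 3386711 comes from the thread; the other values are hypothetical.
reported = [3386711, 3386712, 3386719, 9000000]

for group in group_nearby(reported):
    print(group[0], group[-1], len(group))
# 3386711 3386719 3
# 9000000 9000000 1
```

A single tight group suggests one damaged area. Several groups far apart suggest wider damage.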

@TheCeltic:

Is it possible that the bios settings need to be manually set?

What BIOS settings did you have in mind and how do you think they would make a difference?

@TheCeltic:

What about other issues? I do have a replacement drive and will try it, but using pfsense quickly becomes unpopular with management when hardware fails (I know it may not be the fault of pfsense, but that's how they see it).

I suggest you read http://en.wikipedia.org/wiki/Bad_sector and the linked references and articles to give you more information about bad blocks on magnetic disks.

By all means replace the disk, but if I were paying for it I would at least think twice before discarding a hard drive that developed one (reported) bad spot. If the bad spots were all close enough together, I could repartition the disk so the bad spots were not used.

TheCeltic

There were about 10 consecutive errors on the drive. Is it possible that the disk gets too hot writing Snort log data (there is no fan in the enclosure) and that caused it to fail prematurely? I replaced the drive and will be bringing the firewall back online this weekend… I just want to do all I can to ensure I don't experience the same thing again in 2-3 weeks.

-TheCeltic

wallabybob

@TheCeltic:

I just want to do all I can to ensure I don't experience the same thing again in 2-3 weeks.

Then you should take some steps to ensure the drive is kept within its ratings, including its temperature ratings. Some drives can report their temperature. If you look around you will probably find some software for capturing drive temperature; one of my Linux systems warns me if the hard drives get "too hot".
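As one possible way to capture the temperature, smartmontools prints SMART attributes with `smartctl -A`, and many drives expose attribute 194 (Temperature_Celsius). The sketch below parses that table. It assumes smartmontools is installed and that the drive reports attribute 194, neither of which the thread confirms. The sample text is illustrative, not output from the drive in question, and the 55 °C limit is a placeholder to be replaced with the figure from the drive's datasheet.

```python
import subprocess


def parse_smart_attribute(smartctl_output, attr_id):
    """Return the leading integer of an attribute's RAW_VALUE, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10+ columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0] == str(attr_id):
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def read_temperature(device):
    """Run `smartctl -A` on device (needs root) and return degrees Celsius."""
    result = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True)
    return parse_smart_attribute(result.stdout, 194)


# Illustrative sample only; these values are not from the thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       4521
194 Temperature_Celsius     0x0022   046   060   000    Old_age   Always       -       46 (0 18 0 0 0)
"""

LIMIT_C = 55  # hypothetical limit; check the drive's datasheet

temperature = parse_smart_attribute(SAMPLE, 194)
print(temperature)  # 46
if temperature is not None and temperature > LIMIT_C:
    print("drive is too hot")
```

Calling `read_temperature` from a scheduled job and logging the result would show whether the drive heats up under sustained logging load.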

Petrus4

Hi, I know this post is from November last year, but I think this is most likely a temperature issue. I have the FW-7535 and deliberately put in only an SSD, because this is a fanless unit and would not be able to dissipate the extra heat produced by an HDD. Lanner, the company that produces these boxes, also recommends using only SSDs in the fanless versions of these units.

At what temperatures was (or is) your box running, and were you able to fix your problems?

cmb

@TheCeltic:

I'd like to think it is as simple as a faulty disk… however, there were several similar reported errors (I only listed the one), it is a brand new drive and I've read about several other people having similar issues.

You were probably seeing NID_NOT_FOUND with CF in the other instances, which is just a normal quirk of SanDisk CF cards and not indicative of any actual problem. I haven't heard of or seen any HD issues on 7535s. I'm actually running one here myself and have worked on a number of customer systems that have them. Run the Seagate diagnostics on that drive; I expect it'll fail.

jimp

You may also have hit this:

http://forum.pfsense.org/index.php/topic,26626.15.html

I recently discovered that the ataidle command must be run once per power cycle, and changed the code in pfSense to run that automatically in future versions.

It's easy to kill a laptop HDD if the Load Cycle count gets far too high.
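To judge whether the load cycle count is climbing too fast, two readings of the drive's load cycle counter (commonly SMART attribute 193, Load_Cycle_Count) taken some hours apart can be turned into a rough projection. The sketch below is plain arithmetic. The readings and the 600,000-cycle rating are hypothetical placeholders, and the real rating should come from the drive's datasheet. `hours_until_limit` is a made-up helper name.

```python
def hours_until_limit(count_then, count_now, hours_elapsed, rated_cycles):
    """Project how many hours remain until rated_cycles at the observed rate.

    Returns None if the counter did not increase between the two readings.
    """
    rate_per_hour = (count_now - count_then) / hours_elapsed
    if rate_per_hour <= 0:
        return None
    return (rated_cycles - count_now) / rate_per_hour


# Hypothetical readings taken one hour apart, and a hypothetical rating.
remaining_hours = hours_until_limit(count_then=10_000, count_now=10_450,
                                    hours_elapsed=1, rated_cycles=600_000)
print(round(remaining_hours / 24, 1))  # 54.6 (days)
```

With these made-up figures, 450 load cycles per hour would use up the assumed rating in under two months. That is the kind of rate at which a laptop drive in an always-on firewall could wear out quickly.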