HDD dying although SMART says ok?



  • As of this morning, my system log is cluttered with the following:

    Sep 7 07:00:48	kernel		vnode_pager_putpages: residual I/O 4096 at 24
    Sep 7 07:00:48	kernel		vnode_pager_putpages: I/O error 5
    Sep 7 07:00:48	kernel		g_vfs_done():ufsid/5799c06539f8a71c[READ(offset=359244398592, length=32768)]error = 5
    Sep 7 07:00:48	kernel		(ada0:ata0:0:0:0): Error 5, Retries exhausted
    Sep 7 07:00:48	kernel		(ada0:ata0:0:0:0): RES: 51 40 9a 51 d2 29 29 00 00 00 00
    Sep 7 07:00:48	kernel		(ada0:ata0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
    Sep 7 07:00:48	kernel		(ada0:ata0:0:0:0): CAM status: ATA Status Error
    Sep 7 07:00:48	kernel		(ada0:ata0:0:0:0): READ_DMA48. ACB: 25 00 8f 51 d2 40 29 00 00 00 40 00
    Sep 7 07:00:43	kernel		(ada0:ata0:0:0:0): Retrying command
    Sep 7 07:00:43	kernel		(ada0:ata0:0:0:0): RES: 51 40 99 51 d2 29 29 00 00 00 00
    Sep 7 07:00:43	kernel		(ada0:ata0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
    Sep 7 07:00:43	kernel		(ada0:ata0:0:0:0): CAM status: ATA Status Error
    Sep 7 07:00:43	kernel		(ada0:ata0:0:0:0): READ_DMA48. ACB: 25 00 8f 51 d2 40 29 00 00 00 40 00
    Sep 7 07:00:39	kernel		(ada0:ata0:0:0:0): Retrying command
    Sep 7 07:00:39	kernel		(ada0:ata0:0:0:0): RES: 51 40 98 51 d2 29 29 00 00 00 00
    Sep 7 07:00:39	kernel		(ada0:ata0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
    Sep 7 07:00:39	kernel		(ada0:ata0:0:0:0): CAM status: ATA Status Error
    Sep 7 07:00:39	kernel		(ada0:ata0:0:0:0): READ_DMA48. ACB: 25 00 8f 51 d2 40 29 00 00 00 40 00
    Sep 7 07:00:35	kernel		(ada0:ata0:0:0:0): Retrying command
    Sep 7 07:00:35	kernel		(ada0:ata0:0:0:0): RES: 51 40 98 51 d2 29 29 00 00 00 00
    Sep 7 07:00:35	kernel		(ada0:ata0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
    Sep 7 07:00:35	kernel		(ada0:ata0:0:0:0): CAM status: ATA Status Error
    Sep 7 07:00:35	kernel		(ada0:ata0:0:0:0): READ_DMA48. ACB: 25 00 8f 51 d2 40 29 00 00 00 40 00
    Sep 7 07:00:32	kernel		(ada0:ata0:0:0:0): Retrying command
    Sep 7 07:00:32	kernel		(ada0:ata0:0:0:0): RES: 51 40 98 51 d2 29 29 00 00 00 00
    Sep 7 07:00:32	kernel		(ada0:ata0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
    Sep 7 07:00:32	kernel		(ada0:ata0:0:0:0): CAM status: ATA Status Error
    Sep 7 07:00:32	kernel		(ada0:ata0:0:0:0): READ_DMA48. ACB: 25 00 8f 51 d2 40 29 00 00 00 40 00
    Sep 7 07:00:27	kernel		(ada0:ata0:0:0:0): Retrying command
    Sep 7 07:00:27	kernel		(ada0:ata0:0:0:0): RES: 51 40 96 51 d2 29 29 00 00 00 00
    Sep 7 07:00:27	kernel		(ada0:ata0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 40 (UNC )
    Sep 7 07:00:27	kernel		(ada0:ata0:0:0:0): CAM status: ATA Status Error
    Sep 7 07:00:27	kernel		(ada0:ata0:0:0:0): READ_DMA48. ACB: 25 00 8f 51 d2 40 29 00 00 00 08 00
    

    This repeats in the first minute of every hour, until it reports 'Retries exhausted'. The system has been running for over 7 years without any trouble, and the HDD (a Seagate Momentus 5400.6) has been in it for about 3 years.

    SMART still says the drive is healthy, but I can't see any other reason for these entries. The RAM is OK and I have swapped the cables. Nothing else would cause this, right?



  • In my experience SMART is hit or miss. I'd trust the syslog messages first. Make a config backup ASAP, then swap that drive for a known good one (an SSD if you can swing it), do a fresh install, and restore your config.
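
    For anyone who wants to see what those syslog lines are pointing at, here is a rough Python sketch that decodes the `RES:` and `ACB:` byte strings from the log above. The byte layout (status, error, then the LBA bytes split around the device byte) is an assumption about how the FreeBSD CAM layer prints these lines, and the helper names are made up for this example, so treat the result as a hint rather than gospel.

```python
# Sketch: decode the "RES:" and "ACB:" hex dumps from the kernel log.
# ASSUMPTION: field order follows the FreeBSD CAM print format:
#   RES: status error lba_low lba_mid lba_high device
#        lba_low_exp lba_mid_exp lba_high_exp count count_exp
#   ACB: command features lba_low lba_mid lba_high device
#        lba_low_exp lba_mid_exp lba_high_exp features_exp count count_exp
# All function names here are hypothetical helpers, not part of any tool.

SECTOR_SIZE = 512  # bytes per logical sector (assumed for this drive)


def parse_hex_bytes(text):
    """Turn '51 40 9a ...' into a list of integers."""
    return [int(token, 16) for token in text.split()]


def lba48(low, mid, high, low_exp, mid_exp, high_exp):
    """Assemble a 48-bit LBA from its six register bytes."""
    return (high_exp << 40 | mid_exp << 32 | low_exp << 24
            | high << 16 | mid << 8 | low)


def decode_res(res_text):
    """Decode the result registers returned by the drive."""
    b = parse_hex_bytes(res_text)
    return {
        "status": b[0],
        "error": b[1],
        "err_flag": bool(b[0] & 0x01),       # ERR bit in the status register
        "uncorrectable": bool(b[1] & 0x40),  # UNC bit in the error register
        "lba": lba48(b[2], b[3], b[4], b[6], b[7], b[8]),
    }


def decode_acb(acb_text):
    """Decode the command block that was sent to the drive."""
    b = parse_hex_bytes(acb_text)
    return {
        "command": b[0],
        "lba": lba48(b[2], b[3], b[4], b[6], b[7], b[8]),
        "sectors": (b[11] << 8) | b[10],
    }


# Byte strings copied from the 07:00:48 entries in the log above.
res = decode_res("51 40 9a 51 d2 29 29 00 00 00 00")
acb = decode_acb("25 00 8f 51 d2 40 29 00 00 00 40 00")

print("request starts at LBA", acb["lba"], "for", acb["sectors"], "sectors")
print("request size in bytes:", acb["sectors"] * SECTOR_SIZE)
print("drive reports failure at LBA", res["lba"])
print("uncorrectable read error:", res["uncorrectable"])
```

    Under that assumed layout, the request is 64 sectors (32768 bytes, which matches the `length=32768` in the `g_vfs_done()` line), and the drive reports the uncorrectable sector 11 sectors into it. The same small area failing every time points at the media rather than at cables or RAM.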



  • What is the output of `smartctl -a /dev/ada0`?


  • Rebel Alliance Developer Netgate

    Given the errors you're seeing, odds are high that there is actually a problem with the drive.

    In all the years I've been dealing with SMART, two things have been evident:

    1. SMART is prone to false negatives – Just because SMART says a drive is OK, doesn't mean it is. Especially when it comes to physical defects of various kinds or serious controller problems.

    2. If SMART says a drive has a problem, it has a problem.

    So you can trust that if SMART finds a problem, it's definitely a problem, but if SMART says it's OK, you have more work to do.

    Same with software RAM tests like memtest86.
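
    One possible piece of that extra work, sketched in Python: ignore the overall health verdict and look at the raw values of a few sector-related attributes in the `smartctl -A` table. The list of attributes to watch and the function name are choices made for this example, and the sample table below is made-up illustrative data, not output from the drive in this thread.

```python
# Sketch: flag non-zero raw values for sector-related SMART attributes.
# ASSUMPTIONS: input is the attribute table printed by `smartctl -A`,
# with the raw value in the tenth column; WATCHED is an illustrative
# selection, and suspicious_attributes() is a hypothetical helper.

WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
)


def suspicious_attributes(smartctl_output, watched=WATCHED):
    """Return {attribute_name: raw_value} for watched attributes above zero."""
    findings = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in watched and raw.isdigit() and int(raw) > 0:
            findings[name] = int(raw)
    return findings


# Made-up sample in the usual smartctl table layout (NOT real data).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""

findings = suspicious_attributes(SAMPLE)
for name, raw in findings.items():
    print(name, "raw value:", raw)
```

    On a real system the text would come from running `smartctl -A /dev/ada0` instead of the sample string. A non-zero pending or uncorrectable count alongside a "healthy" overall verdict is the kind of false negative described above.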