Ad4: TIMEOUT - WRITE_DMA retrying (1 retry left)



  • After setting up Squid on a SATA 300 MB/s 2.5" hard disk on a D945GSEJT, I keep getting these errors, which eventually lead to a kernel panic. I have tried switching off AHCI and using IDE mode, but the errors continue. One strange thing is that pfSense detects the SATA controller as SATA150 even though the Intel spec page claims it is 300:

    atapci1: <Intel ICH7M SATA150 controller> port 0xf0e0-0xf0e7,0xf0d0-0xf0d3,0xf0c0-0xf0c7,0xf0b0-0xf0b3,0xf0a0-0xf0af mem 0xdff40000-0xdff403ff irq 19 at device 31.2 on pci0



  • There appears to be a fix here: http://linux-bsd-sharing.blogspot.co.uk/2009/03/howto-fix-sata-dma-timeout-issues-on.html but it looks like it means patching the kernel. Is this even possible with the files shipped with pfSense?



  • What version of pfSense are you using? You might get better results with a pfSense build based on a more up-to-date version of FreeBSD.



  • I was using the 2.0.1 release, but I tested with 2.1 dev and it's the same. Apparently the ATA subsystem has been updated for FreeBSD 9, but that's a while off for pfSense, I think :(


  • You could try disabling DMA (check the wiki), but those sorts of timeouts are usually a sign of a disk or controller problem. A driver problem is possible, but less likely, especially if it happens on multiple versions. Check Diag > SMART Status, run a report on the drive, and see if it shows any errors (post the output here and we can look it over).



  • Unfortunately, I don't have the disk in my system anymore. I did check the SMART logs and ran a full SMART test; none of the logged errors matched the times of the WRITE_DMA errors. The disk was working fine in a laptop, and the motherboard was working correctly under Windows 2008 with another disk. I may give it another go sometime. I was using it as a Squid disk; I will update if I do.



  • Using this disk again. The errors are popping up again on a fresh install. I have disabled DMA and am now getting this at boot; I will see if it continues:
    Jul 1 00:25:26 kernel: ad4: TIMEOUT - READ_MUL48 retrying (1 retry left) LBA=430935279
    Jul 1 00:25:26 kernel: ad4: TIMEOUT - READ_MUL48 retrying (1 retry left) LBA=430940303

    smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.3-RELEASE-p3 i386] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                     Remaining  LifeTime(hours)  LBA_of_first_error
    1    Extended offline    Completed without error    00%        799              -
    2    Extended offline    Completed without error    00%        559              -
    3    Extended offline    Aborted by host            70%        557              -
    4    Short offline       Completed without error    00%        552              -
    5    Short offline       Completed without error    00%        386              -

    smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.3-RELEASE-p3 i386] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate    0x000b  100  100  062    Pre-fail  Always      -      0
      2 Throughput_Performance  0x0005  100  100  040    Pre-fail  Offline      -      0
      3 Spin_Up_Time            0x0007  147  147  033    Pre-fail  Always      -      2
      4 Start_Stop_Count        0x0012  100  100  000    Old_age  Always      -      849
      5 Reallocated_Sector_Ct  0x0033  100  100  005    Pre-fail  Always      -      0
      7 Seek_Error_Rate        0x000b  100  100  067    Pre-fail  Always      -      0
      8 Seek_Time_Performance  0x0005  100  100  040    Pre-fail  Offline      -      0
      9 Power_On_Hours          0x0012  099  099  000    Old_age  Always      -      801
    10 Spin_Retry_Count        0x0013  100  100  060    Pre-fail  Always      -      0
    12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      458
    191 G-Sense_Error_Rate      0x000a  100  100  000    Old_age  Always      -      0
    192 Power-Off_Retract_Count 0x0032  100  100  000    Old_age  Always      -      37
    193 Load_Cycle_Count        0x0012  095  095  000    Old_age  Always      -      54937
    194 Temperature_Celsius    0x0002  148  148  000    Old_age  Always      -      37 (Min/Max 11/45)
    196 Reallocated_Event_Count 0x0032  100  100  000    Old_age  Always      -      9
    197 Current_Pending_Sector  0x0022  100  100  000    Old_age  Always      -      0
    198 Offline_Uncorrectable  0x0008  100  100  000    Old_age  Offline      -      0
    199 UDMA_CRC_Error_Count    0x000a  200  200  000    Old_age  Always      -      7
    223 Load_Retry_Count        0x000a  100  100  000    Old_age  Always      -      0

    smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.3-RELEASE-p3 i386] (local build)
    Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

    === START OF READ SMART DATA SECTION ===
    SMART Error Log Version: 1
    ATA Error Count: 171 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
    Powered_Up_Time is measured from power on, and printed as
    DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
    SS=sec, and sss=millisec. It "wraps" after 49.710 days.

    Error 171 occurred at disk power-on lifetime: 748 hours (31 days + 4 hours)
      When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 80 7f 24 64 ea  Error: ICRC, ABRT at LBA = 0x0a64247f = 174335103

    Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      ca 00 00 ff 23 64 ea 00  7d+22:31:00.000  WRITE DMA
      ca 00 00 ff 22 64 ea 00  7d+22:31:00.000  WRITE DMA
      ca 00 00 ff 21 64 ea 00  7d+22:31:00.000  WRITE DMA
      ca 00 00 ff 20 64 ea 00  7d+22:31:00.000  WRITE DMA
      ca 00 00 ff 1f 64 ea 00  7d+22:31:00.000  WRITE DMA

    Error 170 occurred at disk power-on lifetime: 744 hours (31 days + 0 hours)
      When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 10 0f e6 0c ea  Error: ICRC, ABRT 16 sectors at LBA = 0x0a0ce60f = 168617487

    Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      c8 00 20 ff e5 0c ea 00  7d+18:14:48.500  READ DMA
      ca 00 0c ff 10 63 ea 00  7d+18:14:48.400  WRITE DMA
      ca 00 20 bf d6 0c ea 00  7d+18:14:48.400  WRITE DMA
      ca 00 20 bf d6 0c ea 00  7d+18:14:48.400  WRITE DMA
      ca 00 0c df f8 67 ea 00  7d+18:14:48.400  WRITE DMA

    Error 169 occurred at disk power-on lifetime: 610 hours (25 days + 10 hours)
      When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 00 5a e8 0c ea  Error: ICRC, ABRT at LBA = 0x0a0ce85a = 168618074

    Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      ca 00 04 57 e8 0c ea 00  2d+04:30:10.300  WRITE DMA
      ca 00 20 3f 7f 12 ea 00  2d+04:30:10.300  WRITE DMA
      ca 00 04 57 e8 0c ea 00  2d+04:30:10.300  WRITE DMA
      ca 00 20 3f 7f 12 ea 00  2d+04:30:10.300  WRITE DMA
      ca 00 04 57 e8 0c ea 00  2d+04:30:10.300  WRITE DMA

    Error 168 occurred at disk power-on lifetime: 533 hours (22 days + 5 hours)
      When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 10 4f d9 0c ea  Error: ICRC, ABRT at LBA = 0x0a0cd94f = 168614223

    Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      ca 00 20 3f d9 0c ea 00  2d+04:59:56.000  WRITE DMA
      ca 00 08 df 92 77 eb 00  2d+04:59:56.000  WRITE DMA
      ca 00 20 1f d9 0c ea 00  2d+04:59:56.000  WRITE DMA
      ca 00 08 df 92 77 eb 00  2d+04:59:56.000  WRITE DMA
      ca 00 20 ff d8 0c ea 00  2d+04:59:56.000  WRITE DMA

    Error 167 occurred at disk power-on lifetime: 532 hours (22 days + 4 hours)
      When the command that caused the error occurred, the device was active or idle.

    After command completion occurred, registers were:
      ER ST SC SN CL CH DH
      -- -- -- -- -- -- --
      84 51 00 3e 9c 66 ea  Error: ICRC, ABRT at LBA = 0x0a669c3e = 174496830

    Commands leading to the command that caused the error were:
      CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
      -- -- -- -- -- -- -- --  ----------------  --------------------
      ca 00 04 3b 9c 66 ea 00  2d+04:05:37.400  WRITE DMA
      ca 00 04 37 9c 66 ea 00  2d+04:05:37.400  WRITE DMA
      ca 00 04 33 9c 66 ea 00  2d+04:05:37.400  WRITE DMA
      ca 00 14 bf a0 6b ea 00  2d+04:05:37.400  WRITE DMA
      ca 00 04 2f 9c 66 ea 00  2d+04:05:37.400  WRITE DMA

