Ad4: TIMEOUT - WRITE_DMA retrying (1 retry left)
-
After setting up Squid on a 2.5" SATA 300 MB/s hard disk attached to a D945GSEJT, I am getting these errors, which eventually lead to a kernel panic. I have tried switching off AHCI and using IDE mode, but the errors continue. One strange thing is that pfSense detects the SATA controller as SATA 150 even though the Intel spec page claims it's 300:
atapci1: <Intel ICH7M SATA150 controller> port 0xf0e0-0xf0e7,0xf0d0-0xf0d3,0xf0c0-0xf0c7,0xf0b0-0xf0b3,0xf0a0-0xf0af mem 0xdff40000-0xdff403ff irq 19 at device 31.2 on pci0
-
There appears to be a fix here: http://linux-bsd-sharing.blogspot.co.uk/2009/03/howto-fix-sata-dma-timeout-issues-on.html but it looks like it means patching the kernel. Is that even possible with the files shipped with pfSense?
-
What version of pfSense are you using? You might get better results using a pfSense with a more up to date version of FreeBSD.
-
I was using the 2.0.1 release, but I tested with 2.1 dev and it's the same. Apparently the ata subsystem has been updated for FreeBSD 9, but that's a while off for pfSense, I think :(
-
You could try disabling DMA (check the wiki), but timeouts like these are usually a sign of a disk or controller problem. A driver issue is possible but less likely, especially if it happens on multiple versions. Check Diag > SMART Status, run a report on the drive, and see if it shows any errors (post the output here and we can look it over).
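When reading such a report, a few attributes are commonly checked first for this kind of timeout. Below is a rough sketch in Python, assuming the ten-column attribute table that `smartctl -A` prints. The function name, the choice of watched IDs, and the hint texts are editorial assumptions rather than anything official, and the sample rows are illustrative. A nonzero raw value on attribute 199 is often attributed to the cable, connector, or controller link rather than the platters, but treat that as a hint, not a diagnosis.

```python
# Attribute IDs often looked at first for DMA timeouts.
# This selection and the hint texts are assumptions, not an official list.
WATCH = {
    5: "reallocated sectors (media)",
    197: "sectors pending reallocation (media)",
    198: "offline uncorrectable sectors (media)",
    199: "interface CRC errors (cable, connector, or controller link)",
}


def flag_attributes(report):
    """Return {name: (raw_value, hint)} for watched attributes with raw > 0."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID;
        # the header row starts with "ID#" and is skipped here.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCH and raw.isdigit() and int(raw) > 0:
            flagged[name] = (int(raw), WATCH[attr_id])
    return flagged


# Illustrative rows in the smartctl attribute-table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 7
"""

for name, (raw, hint) in flag_attributes(SAMPLE).items():
    print(name, raw, hint)
```

Only the raw value is checked here; the normalized VALUE/WORST/THRESH columns are ignored for simplicity.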
-
Unfortunately I don’t have the disk in my system anymore. I did check the SMART logs and ran a full SMART test; none of the logged errors matched the times of the WRITE_DMA errors. The disk was working fine in a laptop, and the motherboard was working correctly under Windows 2008 with another disk. I may give it another go sometime. I was using it as a Squid disk; I will update if I do.
-
Using this disk again. Errors are popping up again on a fresh install. I have disabled DMA and am getting this at boot; will see if it continues:
Jul 1 00:25:26 kernel: ad4: TIMEOUT - READ_MUL48 retrying (1 retry left) LBA=430935279
Jul 1 00:25:26 kernel: ad4: TIMEOUT - READ_MUL48 retrying (1 retry left) LBA=430940303

smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.3-RELEASE-p3 i386] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
1 Extended offline Completed without error 00% 799 -
2 Extended offline Completed without error 00% 559 -
3 Extended offline Aborted by host 70% 557 -
4 Short offline Completed without error 00% 552 -
5 Short offline Completed without error 00% 386 -

smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.3-RELEASE-p3 i386] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 062 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 040 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 147 147 033 Pre-fail Always - 2
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 849
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 040 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 099 099 000 Old_age Always - 801
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 458
191 G-Sense_Error_Rate 0x000a 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 37
193 Load_Cycle_Count 0x0012 095 095 000 Old_age Always - 54937
194 Temperature_Celsius 0x0002 148 148 000 Old_age Always - 37 (Min/Max 11/45)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 9
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 7
223 Load_Retry_Count 0x000a 100 100 000 Old_age Always - 0

smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.3-RELEASE-p3 i386] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Error Log Version: 1
ATA Error Count: 171 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 171 occurred at disk power-on lifetime: 748 hours (31 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 80 7f 24 64 ea Error: ICRC, ABRT at LBA = 0x0a64247f = 174335103

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 00 ff 23 64 ea 00 7d+22:31:00.000 WRITE DMA
ca 00 00 ff 22 64 ea 00 7d+22:31:00.000 WRITE DMA
ca 00 00 ff 21 64 ea 00 7d+22:31:00.000 WRITE DMA
ca 00 00 ff 20 64 ea 00 7d+22:31:00.000 WRITE DMA
ca 00 00 ff 1f 64 ea 00 7d+22:31:00.000 WRITE DMA

Error 170 occurred at disk power-on lifetime: 744 hours (31 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 10 0f e6 0c ea Error: ICRC, ABRT 16 sectors at LBA = 0x0a0ce60f = 168617487

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 20 ff e5 0c ea 00 7d+18:14:48.500 READ DMA
ca 00 0c ff 10 63 ea 00 7d+18:14:48.400 WRITE DMA
ca 00 20 bf d6 0c ea 00 7d+18:14:48.400 WRITE DMA
ca 00 20 bf d6 0c ea 00 7d+18:14:48.400 WRITE DMA
ca 00 0c df f8 67 ea 00 7d+18:14:48.400 WRITE DMA

Error 169 occurred at disk power-on lifetime: 610 hours (25 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 5a e8 0c ea Error: ICRC, ABRT at LBA = 0x0a0ce85a = 168618074

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 04 57 e8 0c ea 00 2d+04:30:10.300 WRITE DMA
ca 00 20 3f 7f 12 ea 00 2d+04:30:10.300 WRITE DMA
ca 00 04 57 e8 0c ea 00 2d+04:30:10.300 WRITE DMA
ca 00 20 3f 7f 12 ea 00 2d+04:30:10.300 WRITE DMA
ca 00 04 57 e8 0c ea 00 2d+04:30:10.300 WRITE DMA

Error 168 occurred at disk power-on lifetime: 533 hours (22 days + 5 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 10 4f d9 0c ea Error: ICRC, ABRT at LBA = 0x0a0cd94f = 168614223

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 20 3f d9 0c ea 00 2d+04:59:56.000 WRITE DMA
ca 00 08 df 92 77 eb 00 2d+04:59:56.000 WRITE DMA
ca 00 20 1f d9 0c ea 00 2d+04:59:56.000 WRITE DMA
ca 00 08 df 92 77 eb 00 2d+04:59:56.000 WRITE DMA
ca 00 20 ff d8 0c ea 00 2d+04:59:56.000 WRITE DMA

Error 167 occurred at disk power-on lifetime: 532 hours (22 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 00 3e 9c 66 ea Error: ICRC, ABRT at LBA = 0x0a669c3e = 174496830

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
ca 00 04 3b 9c 66 ea 00 2d+04:05:37.400 WRITE DMA
ca 00 04 37 9c 66 ea 00 2d+04:05:37.400 WRITE DMA
ca 00 04 33 9c 66 ea 00 2d+04:05:37.400 WRITE DMA
ca 00 14 bf a0 6b ea 00 2d+04:05:37.400 WRITE DMA
ca 00 04 2f 9c 66 ea 00 2d+04:05:37.400 WRITE DMA
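
For anyone reading the error log above, the ER column and the SN/CL/CH/DH columns can be decoded by hand. Here is a minimal sketch in Python, assuming 28-bit LBA addressing (the logged commands are plain READ DMA / WRITE DMA) and the usual ATA error-register bit names. The function names are made up for illustration, and only a few error bits are listed.

```python
# Subset of ATA error-register bits, highest bit first.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error during an Ultra DMA transfer
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # requested sector ID not found
    0x04: "ABRT",  # command aborted
}


def decode_error(er):
    """Names of the known bits set in the error register value."""
    return [name for bit, name in ERROR_BITS.items() if er & bit]


def lba28(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA: the low nibble of DH holds bits 24-27."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def parse_register_row(row):
    """Decode one 'ER ST SC SN CL CH DH' row of hex bytes."""
    er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in row.split()[:7])
    return {"error": decode_error(er), "lba": lba28(sn, cl, ch, dh)}


# Register row logged for error 171 in the dump above.
result = parse_register_row("84 51 80 7f 24 64 ea")
print(result["error"], hex(result["lba"]), result["lba"])
# prints: ['ICRC', 'ABRT'] 0xa64247f 174335103
```

The decoded LBA matches the one smartctl printed for error 171. ICRC flags a CRC failure on the interface during the transfer, which fits the nonzero UDMA_CRC_Error_Count in the attribute table.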