Strange WRITE_DMA errors when switching on network port



  • Hi Guys,

    I have two pfSense boxes on which CARP was working fine. However, I have now replaced my two Cisco 2950 switches with two stacked Cisco 3750Gs.

    I have one interface that carries all VLANs. The first firewall and first switch run fine; the port is a dot1q trunk.

    However, as soon as I turn on the switchport on sw2 (connected to fw2), I see the following errors:

    Jun 15 18:18:04 kernel: ad0: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> dma=0x06 LBA=1129359
    Jun 15 18:18:04 kernel: g_vfs_done():ad0s1a[WRITE(offset=578191360, length=16384)]error = 5
    Jun 15 18:18:04 kernel: ad0: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> dma=0x06 LBA=1505711
    Jun 15 18:18:04 kernel: g_vfs_done():ad0s1a[WRITE(offset=770883584, length=16384)]error = 5
    Jun 15 18:18:05 kernel: ad0: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> dma=0x06 LBA=2258575
    Jun 15 18:18:05 kernel: g_vfs_done():ad0s1a[WRITE(offset=1156349952, length=16384)]error = 5
    Jun 15 18:18:11 kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=3387471
    Jun 15 18:18:11 kernel: ad0: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> dma=0x06 LBA=5645583
    Jun 15 18:18:11 kernel: g_vfs_done():ad0s1a[WRITE(offset=2890498048, length=16384)]error = 5
    Jun 15 18:18:11 kernel: ad0: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> dma=0x06 LBA=6398287
    Jun 15 18:18:11 kernel: g_vfs_done():ad0s1a[WRITE(offset=3275882496, length=16384)]error = 5
    Jun 15 18:18:11 kernel: ad0: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> dma=0x06 LBA=7151599
    Jun 15 18:18:11 kernel: g_vfs_done():ad0s1a[WRITE(offset=3661578240, length=16384)]error = 5
    Jun 15 18:18:12 kernel: ad0: FAILURE - WRITE_DMA status=51<READY,DSC,ERROR> error=4<ABORTED> dma=0x06 LBA=7527343
    Jun 15 18:18:12 kernel: g_vfs_done():ad0s1a[WRITE(offset=3853959168, length=16384)]error = 5
    Jun 15 18:18:18 kernel: ad0: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=376655

    Once I turn the switchport off, the errors disappear, but then I obviously can't access my VLANs.

    I have tried everything I can think of, including reinstalling pfSense and even creating a whole new config.

    Any ideas what is causing this?

    Many Thanks,


  • Rebel Alliance Developer Netgate

    Your HDD controller might be sharing an IRQ with that port. You can check with:

    vmstat -i
    

    Run it at a shell prompt or from Diagnostics > Command.

    You might have to change some options in the BIOS to fix that, or shut off DMA for the hard drive.

    Usually that error is indicative of a hard drive, cable, or controller error (typically one of them is faulty) but if it only happens when you enable something else, there may be hope.



  • Hi Jimp,

    Thanks for that information, I have disabled DMA and the exact same thing is happening. When I enable the switchport the errors appear.

    I don't think it is sharing an IRQ either. The HDD is a 4GB CF card connected through a SATA-CF converter, and it had been working fine until we upgraded to 1.2.3 and changed our switches.

    Do you have any idea what else could be causing the problem?

    Output from vmstat -i:

    $ vmstat -i
    interrupt                          total      rate
    irq1: atkbd0                          12          0
    irq14: ata0                        2539        10
    irq16: re3 uhci3                      35          0
    irq18: re1 uhci2                    521          2
    irq19: re2 uhci1                    3634        14
    irq23: uhci0 ehci0                    1          0
    cpu0: timer                      500965      1995
    irq256: re0                        5466        21
    cpu1: timer                      500908      1995
    cpu3: timer                      500908      1995
    cpu2: timer                      500909      1995
    Total                            2015898      8031

    Many Thanks,
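
    The shared-IRQ check on the `vmstat -i` output above can be sketched as a small script. This is only an illustration, assuming the output has been saved as text and is analysed off the firewall; the function names and the lists of disk and NIC driver prefixes are hypothetical choices, not part of pfSense or FreeBSD.

```python
import re

# Assumed driver-name prefixes, chosen for illustration only.
DISK_DRIVERS = {"ata", "ahci"}
NIC_DRIVERS = {"re", "em", "fxp", "bge"}

# The vmstat -i output quoted in the thread.
SAMPLE = """\
interrupt                          total      rate
irq1: atkbd0                          12          0
irq14: ata0                        2539        10
irq16: re3 uhci3                      35          0
irq18: re1 uhci2                    521          2
irq19: re2 uhci1                    3634        14
irq23: uhci0 ehci0                    1          0
cpu0: timer                      500965      1995
irq256: re0                        5466        21
Total                            2015898      8031
"""


def parse_vmstat_i(text):
    """Map each IRQ label (e.g. 'irq16') to the devices listed on it."""
    devices = {}
    for line in text.splitlines():
        # An IRQ line is: label, device names, total count, rate.
        m = re.match(r"\s*(irq\d+):\s+(.*?)\s+\d+\s+\d+\s*$", line)
        if m:
            devices[m.group(1)] = m.group(2).split()
    return devices


def driver(name):
    """Strip the unit number from a device name: 're3' -> 're'."""
    return name.rstrip("0123456789")


def shared_irqs(devices):
    """Return the IRQs that carry more than one device."""
    return {irq: names for irq, names in devices.items() if len(names) > 1}


def disk_nic_conflicts(devices):
    """Return the IRQs where a disk controller and a NIC share a line."""
    conflicts = {}
    for irq, names in devices.items():
        disks = [n for n in names if driver(n) in DISK_DRIVERS]
        nics = [n for n in names if driver(n) in NIC_DRIVERS]
        if disks and nics:
            conflicts[irq] = (disks, nics)
    return conflicts


if __name__ == "__main__":
    parsed = parse_vmstat_i(SAMPLE)
    print("shared IRQs:", shared_irqs(parsed))
    print("disk/NIC conflicts:", disk_nic_conflicts(parsed))
```

    On this output, ata0 sits alone on irq14, so no disk/NIC conflict is reported, although re1, re2 and re3 each share a line with a USB controller.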


  • Rebel Alliance Developer Netgate

    Try editing /boot/loader.conf and adding this line:

    hw.ata.ata_dma=0
    

    Then reboot.

    CF converters are not known for their great DMA compatibility…
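
    For anyone scripting the loader.conf change, here is a minimal sketch of adding the tunable without creating duplicate entries. It assumes you work on a copy of the file's text; `set_tunable` is a hypothetical helper name, not a pfSense or FreeBSD tool.

```python
import re


def set_tunable(conf_text, key, value):
    """Return conf_text with exactly one active 'key=value' line.

    An existing active line for the key is replaced in place and any
    further duplicates are dropped. Commented-out lines are left alone.
    """
    pattern = re.compile(r"^\s*" + re.escape(key) + r"\s*=")
    new_line = f"{key}={value}"
    out = []
    replaced = False
    for line in conf_text.splitlines():
        if pattern.match(line):
            if not replaced:
                out.append(new_line)
                replaced = True
            # Later duplicates of the same key are skipped.
        else:
            out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical existing contents, for illustration only.
    existing = 'autoboot_delay="1"\n'
    print(set_tunable(existing, "hw.ata.ata_dma", 0), end="")
```

    Running the helper a second time on its own output leaves the text unchanged, so it is safe to apply repeatedly.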



  • Hi Jimp,

    I tried that. It did reduce the errors, but they were still there. As a last-ditch attempt I put in a 160GB SATA disk I had lying around, and that worked perfectly. So it must have been something strange with the converter.

    The strange thing is that I have the exact same setup on my primary firewall, with a 4GB CF card and converter. I upgraded that to 1.2.3 and it worked without any problems, so I am not sure why the backup firewall had issues. It would be a very strange coincidence for hardware to fail at the same time as the software upgrade.

    Either way things are back up and running, thanks for your help, much appreciated.

