Bad disk issue with FreeBSD 6.1 based builds



  • I'm still using build 01-24-07 since my last bad experience with an upgrade, but I'm now getting strange errors under heavy disk activity, as described here:

    http://www.freebsd.org/cgi/query-pr.cgi?pr=103435

    These cause temporary disk deadlocks, which drop one of my network interfaces and break its CARP virtual IP.

    In particular, I was just looking at the Snort blocked addresses when all our network connections dropped simultaneously (I had to relaunch pfsync to make it work correctly again; it was stuck in INIT), as the system logs show:

    re0: watchdog timeout
    re0: link state changed to DOWN
    ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
    ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
    ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    ad5: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad5: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad5: WARNING - SET_MULTI taskqueue timeout - completing request directly
    ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=33122223
    re0: 10 link states coalesced
    re0: link state changed to DOWN
    re0: link state changed to UP
    arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
    arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
    re0: watchdog timeout
    ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
    ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
    ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad5: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
    ad5: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad5: WARNING - SET_MULTI taskqueue timeout - completing request directly
    ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=33122223
    re0: 11 link states coalesced
    re0: link state changed to DOWN
    re0: link state changed to UP
    arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
    arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
    re0: watchdog timeout
    ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
    ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad5: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
    ad5: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
    re0: watchdog timeout
    ad5: WARNING - SET_MULTI taskqueue timeout - completing request directly
    ad5: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=33122223
    re0: 11 link states coalesced
    re0: link state changed to DOWN
    re0: link state changed to UP
    arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
    arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
    re0: watchdog timeout
    re0: link state changed to DOWN
    re0: link state changed to UP
    arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
    arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
    re0: watchdog timeout
    re0: link state changed to DOWN
    ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=156301487
    re0: link state changed to UP
    arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
    arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
    arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
    arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
    arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
    arp_rtrequest: bad gateway 62.2.160.97 (!AF_LINK)

    It seems to have been solved in the 6.2 kernels from last November, so I'd like to know whether the latest pfSense build is reliable for production use and whether the "preemption" kernel config option is still disabled (see http://forum.pfsense.org/index.php/topic,3664.0.html).

    Regards!!
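To see how the NIC and disk messages line up in a log like the one above, a small tally per device can help. The following is only a minimal sketch: the sample lines are copied from the excerpt, while the category names (`watchdog`, `taskqueue`, `dma_timeout`, `link_change`) and the helper functions are made up for illustration and are not part of pfSense or FreeBSD.

```python
import re
from collections import Counter

# A few lines copied verbatim from the log excerpt in the post above.
LOG = """\
re0: watchdog timeout
re0: link state changed to DOWN
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad5: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=33122223
re0: link state changed to UP
arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
"""

# "<source>: <message>", where source is a device or kernel function name.
LINE_RE = re.compile(r"^\s*(?P<source>[A-Za-z_]+\d*):\s*(?P<message>.*)$")


def classify(message):
    """Map a message to a coarse, hypothetical category name."""
    if "watchdog timeout" in message:
        return "watchdog"
    if "taskqueue timeout" in message:
        return "taskqueue"
    if message.startswith("TIMEOUT"):
        return "dma_timeout"
    if "link state changed" in message:
        return "link_change"
    return "other"


def tally(log_text):
    """Count messages per (source, category) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            key = (match.group("source"), classify(match.group("message")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for (source, category), n in sorted(tally(LOG).items()):
        print(f"{source:15s} {category:12s} {n}")
```

Feeding the full excerpt through `tally` instead of the short sample would show whether the `re0` watchdog timeouts and the `ad4`/`ad5` timeouts occur in comparable numbers. It only counts messages and says nothing about the cause.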



  • Set up the BIOS so the NICs are not sharing the hard disks' IRQ.



  • You can try a snapshot from http://snapshots.pfsense.org/FreeBSD6/RELENG_1/ for a 6.2-based version. No debugging is enabled.



  • Thanks for the support and here's the update.

    I've managed to update the BIOS. By the way, it's a Dell machine and there are no IRQ settings available. I found out that two of the installed network cards share the same IRQ; I hope that's not a big issue, since I think it's common practice.

    Anyway, I've also installed the latest build and it worked flawlessly; I just had a duplication problem in the services list (multiple "squid" rows). I've rerun the file system tests that previously failed (even if only after a long uptime) and they now seem to work fine, but before praising it too much I'd like to see what happens under heavy load, including heavy network load.

    Also, the latest build seems to have solved the issue outlined in the last post here:

    http://forum.pfsense.org/index.php/topic,3325.0.html

    BTW, what stops you from releasing the latest builds?

    Alberto



  • Known bugs are why the snapshots are not yet a release version. The first 1.2 beta will be out soon and will be free of all known issues. The 1.2 release should follow shortly after.

    As for IRQ sharing, yes, it shouldn't cause any problems, but it will reduce performance. I'm not sure by how much, but with devices as active as a disk and a NIC, both of which can be interrupt-heavy under load, the impact could be significant.
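When the BIOS offers no IRQ settings, the interrupt counters can at least show which devices sit on the same line. The sketch below assumes the `irqN: dev1 dev2   total   rate` row layout of FreeBSD's `vmstat -i`. The sample text is invented for illustration (including the second NIC name `re1` and all counter values) and is not output from the machine in this thread. `shared_irqs` is a hypothetical helper.

```python
import re

# Invented sample in the assumed `vmstat -i` layout; NOT real output
# from the poster's machine. Device names and counters are placeholders.
SAMPLE = """\
interrupt                          total       rate
irq14: ata0                        52310          4
irq16: re0 re1                    918274         71
irq18: atapci1                    401122         31
cpu0: timer                     25800000       2000
"""

# irq number, one or more device names, then the total and rate columns.
ROW_RE = re.compile(r"^(irq\d+):\s+(.*?)\s+(\d+)\s+(\d+)\s*$")


def shared_irqs(text):
    """Return {irq: [devices]} for IRQ lines listing more than one device."""
    shared = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if not match:
            continue  # header, cpu timer rows, or anything unexpected
        devices = match.group(2).split()
        if len(devices) > 1:
            shared[match.group(1)] = devices
    return shared


if __name__ == "__main__":
    for irq, devices in shared_irqs(SAMPLE).items():
        print(f"{irq} is shared by: {', '.join(devices)}")
```

On a real system the text would come from running `vmstat -i` on the firewall. If the layout differs from the assumption here, the regular expression needs adjusting.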

