Bad disk issue with FreeBSD 6.1 based builds
-
I'm still using build 01-24-07 since my last bad experience with an upgrade, but I'm now getting weird errors under heavy disk activity, as described here:
http://www.freebsd.org/cgi/query-pr.cgi?pr=103435
These cause temporary disk deadlocks, which drop one of my network interfaces and break its CARP virtual IP.
In particular, I was just looking at the Snort blocked addresses when all our network connections dropped simultaneously (I had to relaunch pfsync to make it work correctly again; it was stuck in INIT). The system logs show:
re0: watchdog timeout
re0: link state changed to DOWN
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
re0: watchdog timeout
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
re0: watchdog timeout
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
re0: watchdog timeout
ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad5: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
re0: watchdog timeout
ad5: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
re0: watchdog timeout
ad5: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=33122223
re0: 10 link states coalesced
re0: link state changed to DOWN
re0: link state changed to UP
arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
re0: watchdog timeout
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
re0: watchdog timeout
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
re0: watchdog timeout
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
re0: watchdog timeout
ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
re0: watchdog timeout
ad5: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad5: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
re0: watchdog timeout
ad5: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=33122223
re0: 11 link states coalesced
re0: link state changed to DOWN
re0: link state changed to UP
arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
re0: watchdog timeout
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
re0: watchdog timeout
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
re0: watchdog timeout
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
re0: watchdog timeout
ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad5: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
re0: watchdog timeout
ad5: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad5: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
re0: watchdog timeout
ad5: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad5: TIMEOUT - WRITE_DMA retrying (0 retries left) LBA=33122223
re0: 11 link states coalesced
re0: link state changed to DOWN
re0: link state changed to UP
arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
re0: watchdog timeout
re0: link state changed to DOWN
re0: link state changed to UP
arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
re0: watchdog timeout
re0: link state changed to DOWN
ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=156301487
re0: link state changed to UP
arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)
arp_rtrequest: bad gateway 62.2.160.66 (!AF_LINK)
arp_rtrequest: bad gateway 62.2.160.97 (!AF_LINK)
It seems that this has been solved in the 6.2 kernels from last November, so I'd like to know whether the latest pfSense build is reliable for production use and whether the "preemption" kernel config key is still disabled (see http://forum.pfsense.org/index.php/topic,3664.0.html).
Regards!!
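For anyone comparing logs like the ones quoted in this post, here is a minimal Python sketch that tallies the kernel messages per device, which makes it easier to see how the re0 watchdog timeouts line up with the ad4/ad5 timeouts. The function name `tally` and the category labels are my own invention, not anything from pfSense or FreeBSD, and the sample lines are copied from the log quoted here.

```python
import re
from collections import Counter

# Matches "<device><unit>: <message>", e.g. "re0: watchdog timeout".
# Lines such as "arp_rtrequest: ..." have no unit number and are skipped.
LINE_RE = re.compile(r"^(?P<dev>[a-z]+\d+): (?P<msg>.*)$")


def tally(lines):
    """Count messages per (device, kind). The kinds are illustrative labels."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        dev, msg = m.group("dev"), m.group("msg")
        if "watchdog timeout" in msg:
            kind = "watchdog timeout"
        elif "taskqueue timeout" in msg:
            kind = "taskqueue timeout"
        elif msg.startswith("TIMEOUT"):
            kind = "command timeout"
        elif "link state changed" in msg:
            kind = "link state change"
        else:
            kind = "other"
        counts[(dev, kind)] += 1
    return counts


sample = [
    "re0: watchdog timeout",
    "re0: link state changed to DOWN",
    "ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly",
    "ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly",
    "ad5: TIMEOUT - WRITE_DMA retrying (1 retry left) LBA=33122223",
    "arp_rtrequest: bad gateway 10.100.0.1 (!AF_LINK)",
]

counts = tally(sample)
for (dev, kind), n in sorted(counts.items()):
    print(f"{dev:4} {kind:20} {n}")
```

To run it against a real log, pass the lines of the file instead of `sample`, for example `tally(open(path))` with whatever path your system log uses.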
-
Set up the BIOS so the NICs are not sharing the hard disks' IRQ.
-
You can try a snapshot from http://snapshots.pfsense.org/FreeBSD6/RELENG_1/ for a 6.2-based version. No debugging is enabled.
-
Thanks for the support and here's the update.
I've managed to update the BIOS. By the way, it's a Dell machine and there are no IRQ settings available. I found out that two of the installed network cards share the same IRQ; I hope that's not a big issue, since I think it's common practice.
Anyway, I've also installed the latest build and it worked flawlessly, apart from a bad duplication problem in the services list (multiple "squid" rows). I've rerun the file system tests that previously failed (albeit only after a long uptime) and they now seem to work fine, but before praising it too much I'd like to see what happens under heavy load, including heavy network load.
Also, the latest build seems to have the issue outlined here, which was solved in the last post of that thread:
http://forum.pfsense.org/index.php/topic,3325.0.html
BTW, what stops you from releasing the latest builds?
Alberto
-
Known bugs are why the snapshots are not yet a release version. The first 1.2 beta will be out soon, which will be free of all known issues. The 1.2 release should follow shortly after.
As for IRQ sharing, yeah, it shouldn't cause any problems, but it will reduce performance. I'm not sure by how much, but with devices as active as a disk and a NIC, both of which can be interrupt-heavy under load, it could be significant.
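If you want to check which devices share an IRQ without digging through the BIOS, the output of `vmstat -i` on FreeBSD is a reasonable place to look. Below is a rough Python sketch that picks out IRQ lines naming more than one device. It assumes the lines look like `irqN: dev1 dev2 <total> <rate>`, which may differ between FreeBSD versions. The function name `shared_irqs` is made up for this example, and the sample output is hypothetical: apart from re0, the device names and counters are invented, not taken from the machine in this thread.

```python
import re

# Assumed line shape: "irq16: re0 atapci1    52341    12"
# (IRQ number, one or more device names, total count, rate).
IRQ_RE = re.compile(r"^irq(\d+):\s+(.*?)\s+\d+\s+\d+\s*$")


def shared_irqs(vmstat_output):
    """Return {irq_number: [devices]} for IRQ lines listing more than one device."""
    shared = {}
    for line in vmstat_output.splitlines():
        m = IRQ_RE.match(line.strip())
        if not m:
            continue  # header, timer and total lines do not start with "irq"
        devices = m.group(2).split()
        if len(devices) > 1:
            shared[int(m.group(1))] = devices
    return shared


# Hypothetical sample, NOT real output from the machine discussed here.
SAMPLE = """interrupt                          total       rate
irq1: atkbd0                          10          0
irq16: re0 atapci1                 52341         12
irq17: re1 em0                     81234         20
cpu0: timer                      8000000       2000
Total                            8133585       2032
"""

result = shared_irqs(SAMPLE)
for irq, devices in sorted(result.items()):
    print(f"irq{irq} is shared by: {', '.join(devices)}")
```

On a live system you would feed it the text captured from `vmstat -i` instead of `SAMPLE`. A disk controller and a NIC showing up on the same line would be the case worth worrying about here.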