1.2 on PowerEdge 860: WANs stop passing traffic within an hour



  • Hello, and sorry for my English.

    Since upgrading to 1.2, I have had trouble with my dual-WAN setup on a Dell PowerEdge 860 (Celeron D 3 GHz, 1 GB RAM) with 2x Broadcom embedded NICs and 2x Intel PRO/1000 NICs. After a random amount of time, generally within an hour of uptime, both WAN interfaces die and stop routing traffic, even though the NIC links stay up.

    In the dashboard, the load balancer and both failover pools turn red, and from the CLI I can't reach any outside address.
    The only "fix" is to reboot.

    I had 1.2-RC2 running without any issues for weeks. The problem started right after upgrading to the 1.2 release; I also tried reinstalling, with no luck.

    Here's the log, starting from when the trouble begins. The same config works flawlessly in a VMware ESX VM. Could there be new or faulty drivers in 1.2-RELEASE?

    Thanks.

    Mar 13 22:11:22 kernel: em1: watchdog timeout -- resetting
    Mar 13 22:11:22 kernel: em1: link state changed to DOWN
    Mar 13 22:11:24 kernel: em1: link state changed to UP
    Mar 13 22:11:24 check_reload_status: rc.linkup starting
    Mar 13 22:11:24 php: : Processing em1 - start
    Mar 13 22:11:24 php: : Hotplug event detected for em1 but ignoring since interface is not set for DHCP
    Mar 13 22:11:24 php: : Processing start -
    Mar 13 22:11:24 php: : Not a valid interface action ""
    Mar 13 22:11:24 php: : Processing -
    Mar 13 22:11:24 php: : Not a valid interface action ""
    Mar 13 22:11:26 slbd[426]: ICMP poll failed for WAN2_GW, marking service DOWN
    Mar 13 22:11:26 slbd[426]: ICMP poll failed for WAN_GW, marking service DOWN
    Mar 13 22:11:26 slbd[426]: Service tlc2bt changed status, reloading filter policy
    Mar 13 22:11:26 slbd[426]: ICMP poll failed for WAN_GW, marking service DOWN
    Mar 13 22:11:26 slbd[426]: ICMP poll failed for WAN_GW, marking service DOWN
    Mar 13 22:11:28 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:11:29 check_reload_status: reloading filter
    Mar 13 22:11:33 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:11:33 slbd[426]: ICMP poll failed for WAN2_GW, marking service DOWN
    Mar 13 22:11:33 slbd[426]: ICMP poll failed for WAN2_GW, marking service DOWN
    Mar 13 22:11:33 slbd[426]: Service bt2tlc changed status, reloading filter policy
    Mar 13 22:11:33 slbd[426]: Service LB01 changed status, reloading filter policy
    Mar 13 22:11:38 check_reload_status: reloading filter
    Mar 13 22:11:38 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:11:38 last message repeated 2 times
    Mar 13 22:11:42 kernel: em1: watchdog timeout -- resetting
    Mar 13 22:11:42 kernel: em1: link state changed to DOWN
    Mar 13 22:11:43 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:11:43 last message repeated 2 times
    Mar 13 22:11:44 kernel: em1: link state changed to UP
    Mar 13 22:11:46 check_reload_status: rc.linkup starting
    Mar 13 22:11:46 php: : Processing em1 - start
    Mar 13 22:11:46 php: : Hotplug event detected for em1 but ignoring since interface is not set for DHCP
    Mar 13 22:11:46 php: : Processing start -
    Mar 13 22:11:46 php: : Not a valid interface action ""
    Mar 13 22:11:46 php: : Processing -
    Mar 13 22:11:46 php: : Not a valid interface action ""
    Mar 13 22:11:48 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:12:23 last message repeated 21 times
    Mar 13 22:14:28 last message repeated 75 times
    Mar 13 22:14:28 last message repeated 2 times
    Mar 13 22:14:31 kernel: em1: watchdog timeout -- resetting
    Mar 13 22:14:31 kernel: em1: link state changed to DOWN
    Mar 13 22:14:33 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:14:33 last message repeated 2 times
    Mar 13 22:14:33 kernel: em1: link state changed to UP
    Mar 13 22:14:36 check_reload_status: rc.linkup starting
    Mar 13 22:14:36 php: : Processing em1 - start
    Mar 13 22:14:36 php: : Hotplug event detected for em1 but ignoring since interface is not set for DHCP
    Mar 13 22:14:36 php: : Processing start -
    Mar 13 22:14:36 php: : Not a valid interface action ""
    Mar 13 22:14:36 php: : Processing -
    Mar 13 22:14:36 php: : Not a valid interface action ""
    Mar 13 22:14:38 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:14:38 last message repeated 2 times
    Mar 13 22:14:39 kernel: em1: watchdog timeout -- resetting
    Mar 13 22:14:39 kernel: em1: link state changed to DOWN
    Mar 13 22:14:41 kernel: em1: link state changed to UP
    Mar 13 22:14:41 check_reload_status: rc.linkup starting
    Mar 13 22:14:42 php: : Processing em1 - start
    Mar 13 22:14:42 php: : Hotplug event detected for em1 but ignoring since interface is not set for DHCP
    Mar 13 22:14:42 php: : Processing start -
    Mar 13 22:14:42 php: : Not a valid interface action ""
    Mar 13 22:14:42 php: : Processing -
    Mar 13 22:14:42 php: : Not a valid interface action ""
    Mar 13 22:14:43 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:15:13 last message repeated 20 times
    Mar 13 22:15:16 kernel: bge1: watchdog timeout -- resetting
    Mar 13 22:15:16 kernel: bge1: link state changed to DOWN
    Mar 13 22:15:18 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:15:18 last message repeated 2 times
    Mar 13 22:15:18 kernel: bge1: link state changed to UP
    Mar 13 22:15:22 check_reload_status: rc.linkup starting
    Mar 13 22:15:22 php: : Processing bge1 - start
    Mar 13 22:15:22 php: : Hotplug event detected for bge1 but ignoring since interface is not set for DHCP
    Mar 13 22:15:22 php: : Processing start -
    Mar 13 22:15:22 php: : Not a valid interface action ""
    Mar 13 22:15:22 php: : Processing -
    Mar 13 22:15:22 php: : Not a valid interface action ""
    Mar 13 22:15:23 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:15:58 last message repeated 21 times
    Mar 13 22:16:03 last message repeated 5 times
    Mar 13 22:16:06 php: /index.php: XMLRPC communication error: No route to host
    Mar 13 22:16:08 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:16:43 last message repeated 21 times
    Mar 13 22:18:38 last message repeated 71 times
    Mar 13 22:18:43 last message repeated 3 times
    Mar 13 22:18:43 kernel: em1: watchdog timeout -- resetting
    Mar 13 22:18:43 kernel: em1: link state changed to DOWN
    Mar 13 22:18:45 kernel: em1: link state changed to UP
    Mar 13 22:18:47 check_reload_status: rc.linkup starting
    Mar 13 22:18:47 php: : Processing em1 - start
    Mar 13 22:18:47 php: : Hotplug event detected for em1 but ignoring since interface is not set for DHCP
    Mar 13 22:18:47 php: : Processing start -
    Mar 13 22:18:47 php: : Not a valid interface action ""
    Mar 13 22:18:47 php: : Processing -
    Mar 13 22:18:47 php: : Not a valid interface action ""
    Mar 13 22:18:48 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
    Mar 13 22:19:23 last message repeated 21 times
    Mar 13 22:19:33 last message repeated 8 times
    Mar 13 22:19:36 syslogd: exiting on signal 15
    Mar 13 22:19:36 syslogd: kernel boot file is /boot/kernel/kernel
    Mar 13 22:19:38 slbd[426]: Switching to sitedown for VIP 127.0.0.1:666
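The log can be summarized mechanically to see which NICs are affected. The following is a minimal Python sketch that assumes the syslog lines have been saved as plain text. It counts watchdog resets per interface and lists the distinct drivers involved. The function names `count_watchdog_resets` and `drivers_affected` are hypothetical helpers, not part of any pfSense tool. In this excerpt, em1 (Intel PRO/1000, `em` driver) resets five times and bge1 (Broadcom, `bge` driver) resets once. Two different drivers hitting watchdog timeouts is consistent with a shared cause, such as the ACPI angle raised in the reply that follows, rather than a single faulty driver. It does not prove it, though.

```python
import re
from collections import Counter

# Matches FreeBSD NIC watchdog lines such as
# "kernel: em1: watchdog timeout -- resetting". Only the text up to
# "timeout" is matched, so copies where "--" became an en dash still count.
WATCHDOG_RE = re.compile(r"kernel: ([a-z]+\d+): watchdog timeout\b")


def count_watchdog_resets(log_text: str) -> Counter:
    """Return a Counter mapping interface name -> number of watchdog resets."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def drivers_affected(counts: Counter) -> set:
    """Strip the trailing unit number ("em1" -> "em") to get driver names."""
    return {re.sub(r"\d+$", "", iface) for iface in counts}


# Watchdog lines copied from the log above (other lines are simply ignored).
SAMPLE = """\
Mar 13 22:11:22 kernel: em1: watchdog timeout -- resetting
Mar 13 22:11:22 kernel: em1: link state changed to DOWN
Mar 13 22:11:42 kernel: em1: watchdog timeout -- resetting
Mar 13 22:14:31 kernel: em1: watchdog timeout -- resetting
Mar 13 22:14:39 kernel: em1: watchdog timeout -- resetting
Mar 13 22:15:16 kernel: bge1: watchdog timeout -- resetting
Mar 13 22:18:43 kernel: em1: watchdog timeout -- resetting
"""

if __name__ == "__main__":
    resets = count_watchdog_resets(SAMPLE)
    for iface, n in sorted(resets.items()):
        print(f"{iface}: {n}")  # bge1: 1, then em1: 5
    print("drivers:", sorted(drivers_affected(resets)))  # ['bge', 'em']
```

To run it against a full log, read the saved file into a string and pass it to `count_watchdog_resets` in place of `SAMPLE`.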



  • Try disabling ACPI and see whether your NIC watchdog timeouts go away: http://devwiki.pfsense.org/BootOptions



  • Hi, no luck: with ACPI disabled, I get a kernel panic.



  • Anyone?

    Meanwhile, booting the SMP kernel seems to prevent the NIC watchdog timeouts. Is there anything wrong with running an SMP kernel on a single-processor machine?

    Thanks.



  • Nothing wrong with SMP on single-core machines :)

