Kernel: em0: Watchdog timeout – resetting



  • Hi there,

    Recently I've noticed that 3 times a day, every 8 hours and 15 minutes, my primary pfSense firewall has been resetting one of it's interfaces.

    The first event was on Aug 15, and no changes were made to the firewall that day.

    The interface was em2, the 3rd port of a 4-port LACP trunk in a SuperMicro 5017P-TLN4F with 4 onboard Intel Gigabit em LAN ports.
    The cards are driven by an Intel 82574L chip.

    It has then happened every 8 hours and 15 minutes since then, like clockwork, up until the firewall was updated from 2.1.3-RELEASE to 2.1.5-RELEASE, on September 15th.

    The firewall was rebooted for the upgrade and there were no more watchdog timeouts for 4 days, however this morning at 2AM and then again at 10AM, the event has re-occurred (albeit on a different interface) - em0 this time.  Similar behavior seems to be following, every 8h15 em0 is resetting due to watchdog timeout – I only have the one reoccurrence so far, but I assume in a few hours we'll see another.

    
    Sep 11 02:19:13 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 11 10:34:17 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 11 18:49:20 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 12 03:04:24 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 12 11:19:27 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 12 19:34:31 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 13 03:49:35 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 13 12:04:38 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 13 20:19:42 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 14 04:34:46 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 14 12:49:49 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep 14 21:04:52 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    
    

    Each event is followed by the same context:

    
    Sep  2 12:02:48 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
    Sep  2 12:02:48 cw-fw-01 kernel: em2: Queue(0) tdh = 0, hw tdt = 992
    Sep  2 12:02:48 cw-fw-01 kernel: em2: TX(0) desc avail = 32,Next TX to Clean = 0
    Sep  2 12:02:48 cw-fw-01 kernel: em2: link state changed to DOWN
    
    

    And the numbers are always the same, for tdh, tdt, descriptors avail, etc.  The very first timeout has different numbers, but all subsequent timeouts had the same numbers.

    It seems to be happening exactly every 8 hours and 15 minutes – shifting the time of day it happens by 15 minutes each time.  I'm currently unaware of any traffic pattern in the network or job that might cause this, and there aren't any meaningful spikes in the graphs to account for it. (EDIT: especially as further investigation reveals that the port is actually inactive in the LAGG group ever since the very first instance/timeout).

    The affected NIC (first em2, and now em0 after the upgrade/reboot) is part of a 4-NIC LAGG (em0, em1, em2, em3) and the issue went completely unnoticed for a month -- only this morning (the first occurance of the problem after the reboot) it managed to trip apinger and cause a failover to the secondary, prompting me to dig deeper into the cause and discover that it had been going on for some time now.

    Any clues?  Anything I should try?  It's very strange, and I'm not comfortable at all as it's quite a critical piece of our network infrastructure.

    It seems that once the first reset occurs, em0 gets removed as an active port in lagg0 and thus subsequent failures don't actually affect connectivity (from ifconfig -v lagg0):

    
    lagg0: flags=8943 <up,broadcast,running,promisc,simplex,multicast>metric 0 mtu 1500
            options=4219b <rxcsum,txcsum,vlan_mtu,vlan_hwtagging,vlan_hwcsum,tso4,wol_magic,vlan_hwtso>ether 00:25:90:95:de:d0
            inet6 fe80::225:90ff:fe95:ded0%lagg0 prefixlen 64 scopeid 0x9
            nd6 options=3 <performnud,accept_rtadv>media: Ethernet autoselect
            status: active
            groups: lagg
            laggproto lacp
            lag id: [(8000,00-25-90-95-DE-D0,0130,0000,0000),
                     (8000,EC-CD-6D-38-C5-FD,FD01,0000,0000)]
            laggport: em3 flags=1c <active,collecting,distributing>state=3D
                    [(8000,00-25-90-95-DE-D0,0130,8000,0004),
                     (8000,EC-CD-6D-38-C5-FD,FD01,0003,0003)]
            laggport: em2 flags=1c <active,collecting,distributing>state=3D
                    [(8000,00-25-90-95-DE-D0,0130,8000,0003),
                     (8000,EC-CD-6D-38-C5-FD,FD01,0001,0001)]
            laggport: em1 flags=1c <active,collecting,distributing>state=3D
                    [(8000,00-25-90-95-DE-D0,0130,8000,0002),
                     (8000,EC-CD-6D-38-C5-FD,FD01,0004,0004)]
            laggport: em0 flags=18 <collecting,distributing>state=7D
                    [(8000,00-25-90-95-DE-D0,0130,8000,0001),
                     (FFFF,00-00-00-00-00-00,0000,FFFF,0000)]</collecting,distributing></active,collecting,distributing></active,collecting,distributing></active,collecting,distributing></performnud,accept_rtadv></rxcsum,txcsum,vlan_mtu,vlan_hwtagging,vlan_hwcsum,tso4,wol_magic,vlan_hwtso></up,broadcast,running,promisc,simplex,multicast> 
    

    Notice that em0 (the troubled port this particular reboot cycle) is not ACTIVE (and hasn't been since the first watchdog timeout), and tcpdump confirms this - no traffic whatsoever going across it - which means that the port is locking up every 8h15 even while completely idle and doing absolutely nothing?

    sysctl output below.  I've failed CARP over to the secondary to see what happens there (will it start exhibiting symptoms?  will the primary still continue to suffer watchdog timeouts even as it idles?).

    
    # sysctl hw.em
    hw.em.num_queues: 0
    hw.em.eee_setting: 1
    hw.em.rx_process_limit: 100
    hw.em.enable_msix: 1
    hw.em.sbp: 0
    hw.em.smart_pwr_down: 0
    hw.em.txd: 1024
    hw.em.rxd: 1024
    hw.em.rx_abs_int_delay: 66
    hw.em.tx_abs_int_delay: 66
    hw.em.rx_int_delay: 0
    hw.em.tx_int_delay: 66
    
    
    
    # sysctl dev.em.0
    dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.3.8
    dev.em.0.%driver: em
    dev.em.0.%location: slot=0 function=0 handle=\_SB_.PCI0.RP01.PXSX
    dev.em.0.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x15d9 subdevice=0x0000 class=0x020000
    dev.em.0.%parent: pci1
    dev.em.0.nvm: -1
    dev.em.0.debug: -1
    dev.em.0.fc: 3
    dev.em.0.rx_int_delay: 0
    dev.em.0.tx_int_delay: 66
    dev.em.0.rx_abs_int_delay: 66
    dev.em.0.tx_abs_int_delay: 66
    dev.em.0.itr: 488
    dev.em.0.rx_processing_limit: 100
    dev.em.0.eee_control: 1
    dev.em.0.link_irq: 28290
    dev.em.0.mbuf_alloc_fail: 0
    dev.em.0.cluster_alloc_fail: 0
    dev.em.0.dropped: 0
    dev.em.0.tx_dma_fail: 0
    dev.em.0.rx_overruns: 0
    dev.em.0.watchdog_timeouts: 2
    dev.em.0.device_control: 1074790984
    dev.em.0.rx_control: 67141658
    dev.em.0.fc_high_water: 18432
    dev.em.0.fc_low_water: 16932
    dev.em.0.queue0.txd_head: 0
    dev.em.0.queue0.txd_tail: 176
    dev.em.0.queue0.tx_irq: 258561999
    dev.em.0.queue0.no_desc_avail: 0
    dev.em.0.queue0.rxd_head: 0
    dev.em.0.queue0.rxd_tail: 1023
    dev.em.0.queue0.rx_irq: 297119049
    dev.em.0.mac_stats.excess_coll: 0
    dev.em.0.mac_stats.single_coll: 0
    dev.em.0.mac_stats.multiple_coll: 0
    dev.em.0.mac_stats.late_coll: 0
    dev.em.0.mac_stats.collision_count: 0
    dev.em.0.mac_stats.symbol_errors: 0
    dev.em.0.mac_stats.sequence_errors: 0
    dev.em.0.mac_stats.defer_count: 0
    dev.em.0.mac_stats.missed_packets: 28465
    dev.em.0.mac_stats.recv_no_buff: 0
    dev.em.0.mac_stats.recv_undersize: 0
    dev.em.0.mac_stats.recv_fragmented: 0
    dev.em.0.mac_stats.recv_oversize: 0
    dev.em.0.mac_stats.recv_jabber: 0
    dev.em.0.mac_stats.recv_errs: 0
    dev.em.0.mac_stats.crc_errs: 0
    dev.em.0.mac_stats.alignment_errs: 0
    dev.em.0.mac_stats.coll_ext_errs: 0
    dev.em.0.mac_stats.xon_recvd: 0
    dev.em.0.mac_stats.xon_txd: 0
    dev.em.0.mac_stats.xoff_recvd: 0
    dev.em.0.mac_stats.xoff_txd: 0
    dev.em.0.mac_stats.total_pkts_recvd: 328827928
    dev.em.0.mac_stats.good_pkts_recvd: 328799463
    dev.em.0.mac_stats.bcast_pkts_recvd: 3400907
    dev.em.0.mac_stats.mcast_pkts_recvd: 107503
    dev.em.0.mac_stats.rx_frames_64: 3766
    dev.em.0.mac_stats.rx_frames_65_127: 131108817
    dev.em.0.mac_stats.rx_frames_128_255: 87956012
    dev.em.0.mac_stats.rx_frames_256_511: 55962296
    dev.em.0.mac_stats.rx_frames_512_1023: 24330820
    dev.em.0.mac_stats.rx_frames_1024_1522: 29437752
    dev.em.0.mac_stats.good_octets_recvd: 104670970717
    dev.em.0.mac_stats.good_octets_txd: 119934667773
    dev.em.0.mac_stats.total_pkts_txd: 279223751
    dev.em.0.mac_stats.good_pkts_txd: 279223751
    dev.em.0.mac_stats.bcast_pkts_txd: 1780
    dev.em.0.mac_stats.mcast_pkts_txd: 929946
    dev.em.0.mac_stats.tx_frames_64: 22207294
    dev.em.0.mac_stats.tx_frames_65_127: 77265601
    dev.em.0.mac_stats.tx_frames_128_255: 50849837
    dev.em.0.mac_stats.tx_frames_256_511: 53313263
    dev.em.0.mac_stats.tx_frames_512_1023: 37681244
    dev.em.0.mac_stats.tx_frames_1024_1522: 37906512
    dev.em.0.mac_stats.tso_txd: 28
    dev.em.0.mac_stats.tso_ctx_fail: 0
    dev.em.0.interrupts.asserts: 705
    dev.em.0.interrupts.rx_pkt_timer: 0
    dev.em.0.interrupts.rx_abs_timer: 0
    dev.em.0.interrupts.tx_pkt_timer: 0
    dev.em.0.interrupts.tx_abs_timer: 0
    dev.em.0.interrupts.tx_queue_empty: 0
    dev.em.0.interrupts.tx_queue_min_thresh: 0
    dev.em.0.interrupts.rx_desc_min_thresh: 0
    dev.em.0.interrupts.rx_overrun: 2
    dev.em.0.wake: 0
    
    


  • i'm also get  them timeouts also. since upgraded from 2.1.5 to 2.2 beta for testing ;)

    i'm busy testing some boot changes, but it should work to stop that issue.
    so far so good no problem yet

    add this the boot loader

    /boot/loader.conf:

    hw.pci.enable_msi=0
    hw.pci.enable_msix=0



  • I have the same problem.

    Is there any final solution.

    I use 2.3.1_1



  • what's your hardware?


  • Rebel Alliance Developer Netgate

    This thread was almost two years old, and highly unlikely to be related. Start a new thread.


Log in to reply