Kernel: em0: Watchdog timeout – resetting

Ashmodai

Hi there,

Recently I've noticed that 3 times a day, every 8 hours and 15 minutes, my primary pfSense firewall has been resetting one of it's interfaces.

The first event was on Aug 15, and no changes were made to the firewall that day.

The interface was em2, the 3rd port of a 4-port LACP trunk in a SuperMicro 5017P-TLN4F with 4 onboard Intel Gigabit em LAN ports.
The cards are driven by an Intel 82574L chip.

It has then happened every 8 hours and 15 minutes since then, like clockwork, up until the firewall was updated from 2.1.3-RELEASE to 2.1.5-RELEASE, on September 15th.

The firewall was rebooted for the upgrade and there were no more watchdog timeouts for 4 days, however this morning at 2AM and then again at 10AM, the event has re-occurred (albeit on a different interface) - em0 this time. Similar behavior seems to be following, every 8h15 em0 is resetting due to watchdog timeout – I only have the one reoccurrence so far, but I assume in a few hours we'll see another.


Sep 11 02:19:13 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 11 10:34:17 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 11 18:49:20 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 12 03:04:24 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 12 11:19:27 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 12 19:34:31 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 13 03:49:35 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 13 12:04:38 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 13 20:19:42 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 14 04:34:46 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 14 12:49:49 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep 14 21:04:52 cw-fw-01 kernel: em2: Watchdog timeout -- resetting

Each event is followed by the same context:


Sep  2 12:02:48 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
Sep  2 12:02:48 cw-fw-01 kernel: em2: Queue(0) tdh = 0, hw tdt = 992
Sep  2 12:02:48 cw-fw-01 kernel: em2: TX(0) desc avail = 32,Next TX to Clean = 0
Sep  2 12:02:48 cw-fw-01 kernel: em2: link state changed to DOWN

And the numbers are always the same, for tdh, tdt, descriptors avail, etc. The very first timeout has different numbers, but all subsequent timeouts had the same numbers.

It seems to be happening exactly every 8 hours and 15 minutes – shifting the time of day it happens by 15 minutes each time. I'm currently unaware of any traffic pattern in the network or job that might cause this, and there aren't any meaningful spikes in the graphs to account for it. (EDIT: especially as further investigation reveals that the port is actually inactive in the LAGG group ever since the very first instance/timeout).

The affected NIC (first em2, and now em0 after the upgrade/reboot) is part of a 4-NIC LAGG (em0, em1, em2, em3) and the issue went completely unnoticed for a month -- only this morning (the first occurance of the problem after the reboot) it managed to trip apinger and cause a failover to the secondary, prompting me to dig deeper into the cause and discover that it had been going on for some time now.

Any clues? Anything I should try? It's very strange, and I'm not comfortable at all as it's quite a critical piece of our network infrastructure.

It seems that once the first reset occurs, em0 gets removed as an active port in lagg0 and thus subsequent failures don't actually affect connectivity (from ifconfig -v lagg0):


lagg0: flags=8943 <up,broadcast,running,promisc,simplex,multicast>metric 0 mtu 1500
        options=4219b <rxcsum,txcsum,vlan_mtu,vlan_hwtagging,vlan_hwcsum,tso4,wol_magic,vlan_hwtso>ether 00:25:90:95:de:d0
        inet6 fe80::225:90ff:fe95:ded0%lagg0 prefixlen 64 scopeid 0x9
        nd6 options=3 <performnud,accept_rtadv>media: Ethernet autoselect
        status: active
        groups: lagg
        laggproto lacp
        lag id: [(8000,00-25-90-95-DE-D0,0130,0000,0000),
                 (8000,EC-CD-6D-38-C5-FD,FD01,0000,0000)]
        laggport: em3 flags=1c <active,collecting,distributing>state=3D
                [(8000,00-25-90-95-DE-D0,0130,8000,0004),
                 (8000,EC-CD-6D-38-C5-FD,FD01,0003,0003)]
        laggport: em2 flags=1c <active,collecting,distributing>state=3D
                [(8000,00-25-90-95-DE-D0,0130,8000,0003),
                 (8000,EC-CD-6D-38-C5-FD,FD01,0001,0001)]
        laggport: em1 flags=1c <active,collecting,distributing>state=3D
                [(8000,00-25-90-95-DE-D0,0130,8000,0002),
                 (8000,EC-CD-6D-38-C5-FD,FD01,0004,0004)]
        laggport: em0 flags=18 <collecting,distributing>state=7D
                [(8000,00-25-90-95-DE-D0,0130,8000,0001),
                 (FFFF,00-00-00-00-00-00,0000,FFFF,0000)]</collecting,distributing></active,collecting,distributing></active,collecting,distributing></active,collecting,distributing></performnud,accept_rtadv></rxcsum,txcsum,vlan_mtu,vlan_hwtagging,vlan_hwcsum,tso4,wol_magic,vlan_hwtso></up,broadcast,running,promisc,simplex,multicast>

Notice that em0 (the troubled port this particular reboot cycle) is not ACTIVE (and hasn't been since the first watchdog timeout), and tcpdump confirms this - no traffic whatsoever going across it - which means that the port is locking up every 8h15 even while completely idle and doing absolutely nothing?

sysctl output below. I've failed CARP over to the secondary to see what happens there (will it start exhibiting symptoms? will the primary still continue to suffer watchdog timeouts even as it idles?).


# sysctl hw.em
hw.em.num_queues: 0
hw.em.eee_setting: 1
hw.em.rx_process_limit: 100
hw.em.enable_msix: 1
hw.em.sbp: 0
hw.em.smart_pwr_down: 0
hw.em.txd: 1024
hw.em.rxd: 1024
hw.em.rx_abs_int_delay: 66
hw.em.tx_abs_int_delay: 66
hw.em.rx_int_delay: 0
hw.em.tx_int_delay: 66


# sysctl dev.em.0
dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.3.8
dev.em.0.%driver: em
dev.em.0.%location: slot=0 function=0 handle=\_SB_.PCI0.RP01.PXSX
dev.em.0.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x15d9 subdevice=0x0000 class=0x020000
dev.em.0.%parent: pci1
dev.em.0.nvm: -1
dev.em.0.debug: -1
dev.em.0.fc: 3
dev.em.0.rx_int_delay: 0
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_abs_int_delay: 66
dev.em.0.itr: 488
dev.em.0.rx_processing_limit: 100
dev.em.0.eee_control: 1
dev.em.0.link_irq: 28290
dev.em.0.mbuf_alloc_fail: 0
dev.em.0.cluster_alloc_fail: 0
dev.em.0.dropped: 0
dev.em.0.tx_dma_fail: 0
dev.em.0.rx_overruns: 0
dev.em.0.watchdog_timeouts: 2
dev.em.0.device_control: 1074790984
dev.em.0.rx_control: 67141658
dev.em.0.fc_high_water: 18432
dev.em.0.fc_low_water: 16932
dev.em.0.queue0.txd_head: 0
dev.em.0.queue0.txd_tail: 176
dev.em.0.queue0.tx_irq: 258561999
dev.em.0.queue0.no_desc_avail: 0
dev.em.0.queue0.rxd_head: 0
dev.em.0.queue0.rxd_tail: 1023
dev.em.0.queue0.rx_irq: 297119049
dev.em.0.mac_stats.excess_coll: 0
dev.em.0.mac_stats.single_coll: 0
dev.em.0.mac_stats.multiple_coll: 0
dev.em.0.mac_stats.late_coll: 0
dev.em.0.mac_stats.collision_count: 0
dev.em.0.mac_stats.symbol_errors: 0
dev.em.0.mac_stats.sequence_errors: 0
dev.em.0.mac_stats.defer_count: 0
dev.em.0.mac_stats.missed_packets: 28465
dev.em.0.mac_stats.recv_no_buff: 0
dev.em.0.mac_stats.recv_undersize: 0
dev.em.0.mac_stats.recv_fragmented: 0
dev.em.0.mac_stats.recv_oversize: 0
dev.em.0.mac_stats.recv_jabber: 0
dev.em.0.mac_stats.recv_errs: 0
dev.em.0.mac_stats.crc_errs: 0
dev.em.0.mac_stats.alignment_errs: 0
dev.em.0.mac_stats.coll_ext_errs: 0
dev.em.0.mac_stats.xon_recvd: 0
dev.em.0.mac_stats.xon_txd: 0
dev.em.0.mac_stats.xoff_recvd: 0
dev.em.0.mac_stats.xoff_txd: 0
dev.em.0.mac_stats.total_pkts_recvd: 328827928
dev.em.0.mac_stats.good_pkts_recvd: 328799463
dev.em.0.mac_stats.bcast_pkts_recvd: 3400907
dev.em.0.mac_stats.mcast_pkts_recvd: 107503
dev.em.0.mac_stats.rx_frames_64: 3766
dev.em.0.mac_stats.rx_frames_65_127: 131108817
dev.em.0.mac_stats.rx_frames_128_255: 87956012
dev.em.0.mac_stats.rx_frames_256_511: 55962296
dev.em.0.mac_stats.rx_frames_512_1023: 24330820
dev.em.0.mac_stats.rx_frames_1024_1522: 29437752
dev.em.0.mac_stats.good_octets_recvd: 104670970717
dev.em.0.mac_stats.good_octets_txd: 119934667773
dev.em.0.mac_stats.total_pkts_txd: 279223751
dev.em.0.mac_stats.good_pkts_txd: 279223751
dev.em.0.mac_stats.bcast_pkts_txd: 1780
dev.em.0.mac_stats.mcast_pkts_txd: 929946
dev.em.0.mac_stats.tx_frames_64: 22207294
dev.em.0.mac_stats.tx_frames_65_127: 77265601
dev.em.0.mac_stats.tx_frames_128_255: 50849837
dev.em.0.mac_stats.tx_frames_256_511: 53313263
dev.em.0.mac_stats.tx_frames_512_1023: 37681244
dev.em.0.mac_stats.tx_frames_1024_1522: 37906512
dev.em.0.mac_stats.tso_txd: 28
dev.em.0.mac_stats.tso_ctx_fail: 0
dev.em.0.interrupts.asserts: 705
dev.em.0.interrupts.rx_pkt_timer: 0
dev.em.0.interrupts.rx_abs_timer: 0
dev.em.0.interrupts.tx_pkt_timer: 0
dev.em.0.interrupts.tx_abs_timer: 0
dev.em.0.interrupts.tx_queue_empty: 0
dev.em.0.interrupts.tx_queue_min_thresh: 0
dev.em.0.interrupts.rx_desc_min_thresh: 0
dev.em.0.interrupts.rx_overrun: 2
dev.em.0.wake: 0

magnetoza

i'm also get them timeouts also. since upgraded from 2.1.5 to 2.2 beta for testing ;)

i'm busy testing some boot changes, but it should work to stop that issue.
so far so good no problem yet

add this the boot loader

/boot/loader.conf:

hw.pci.enable_msi=0
hw.pci.enable_msix=0

Cezar R46 Tecnologia

I have the same problem.

Is there any final solution.

I use 2.3.1_1

Hugovsky

what's your hardware?

jimp

This thread was almost two years old, and highly unlikely to be related. Start a new thread.