    Kernel: em0: Watchdog timeout – resetting

      Ashmodai

      Hi there,

      Recently I've noticed that, 3 times a day (every 8 hours and 15 minutes), my primary pfSense firewall has been resetting one of its interfaces.

      The first event was on Aug 15, and no changes were made to the firewall that day.

      The interface was em2, the 3rd port of a 4-port LACP trunk in a SuperMicro 5017P-TLN4F, which has 4 onboard Intel Gigabit em LAN ports driven by the Intel 82574L chip.

      It has then happened every 8 hours and 15 minutes, like clockwork, until the firewall was updated from 2.1.3-RELEASE to 2.1.5-RELEASE on September 15th.

      The firewall was rebooted for the upgrade and there were no more watchdog timeouts for 4 days. However, this morning at 2AM, and again at 10AM, the event re-occurred, albeit on a different interface: em0 this time. The same pattern seems to be emerging: every 8h15m, em0 resets due to a watchdog timeout. I only have the one recurrence so far, but I assume we'll see another in a few hours.

      
      Sep 11 02:19:13 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 11 10:34:17 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 11 18:49:20 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 12 03:04:24 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 12 11:19:27 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 12 19:34:31 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 13 03:49:35 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 13 12:04:38 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 13 20:19:42 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 14 04:34:46 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 14 12:49:49 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep 14 21:04:52 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      
      

      Each event is followed by the same context:

      
      Sep  2 12:02:48 cw-fw-01 kernel: em2: Watchdog timeout -- resetting
      Sep  2 12:02:48 cw-fw-01 kernel: em2: Queue(0) tdh = 0, hw tdt = 992
      Sep  2 12:02:48 cw-fw-01 kernel: em2: TX(0) desc avail = 32,Next TX to Clean = 0
      Sep  2 12:02:48 cw-fw-01 kernel: em2: link state changed to DOWN
      
      

      And the numbers (tdh, tdt, descriptors avail, etc.) are always the same. The very first timeout had different numbers, but all subsequent timeouts show identical values.

      It seems to be happening exactly every 8 hours and 15 minutes, shifting the time of day it happens by 15 minutes each time. I'm currently unaware of any traffic pattern in the network, or any scheduled job, that might cause this, and there aren't any meaningful spikes in the graphs to account for it. (EDIT: especially as further investigation reveals that the port has actually been inactive in the LAGG group ever since the very first timeout.)
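
      To double-check the interval, something like the following works against the log above. It's only a rough sketch, assuming the system log is in pfSense's clog format at /var/log/system.log:

      # list each watchdog event and the gap since the previous one
      clog /var/log/system.log | grep 'Watchdog timeout' | while read mon day time rest; do
          now=$(date -j -f '%b %d %T' "$mon $day $time" +%s)
          [ -n "$prev" ] && printf '%s %s %s  +%dh%02dm\n' "$mon" "$day" "$time" \
              $(( (now - prev) / 3600 )) $(( ((now - prev) % 3600) / 60 ))
          prev=$now
      done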

      The affected NIC (first em2, and now em0 after the upgrade/reboot) is part of a 4-NIC LAGG (em0, em1, em2, em3), and the issue went completely unnoticed for a month. Only this morning (the first occurrence of the problem after the reboot) did it manage to trip apinger and cause a failover to the secondary, prompting me to dig deeper into the cause and discover that it had been going on for some time.

      Any clues?  Anything I should try?  It's very strange, and I'm not comfortable at all as it's quite a critical piece of our network infrastructure.

      It seems that once the first reset occurs, em0 gets removed as an active port in lagg0 and thus subsequent failures don't actually affect connectivity (from ifconfig -v lagg0):

      
      lagg0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
              options=4219b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4,WOL_MAGIC,VLAN_HWTSO>
              ether 00:25:90:95:de:d0
              inet6 fe80::225:90ff:fe95:ded0%lagg0 prefixlen 64 scopeid 0x9
              nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
              media: Ethernet autoselect
              status: active
              groups: lagg
              laggproto lacp
              lag id: [(8000,00-25-90-95-DE-D0,0130,0000,0000),
                       (8000,EC-CD-6D-38-C5-FD,FD01,0000,0000)]
              laggport: em3 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3D
                      [(8000,00-25-90-95-DE-D0,0130,8000,0004),
                       (8000,EC-CD-6D-38-C5-FD,FD01,0003,0003)]
              laggport: em2 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3D
                      [(8000,00-25-90-95-DE-D0,0130,8000,0003),
                       (8000,EC-CD-6D-38-C5-FD,FD01,0001,0001)]
              laggport: em1 flags=1c<ACTIVE,COLLECTING,DISTRIBUTING> state=3D
                      [(8000,00-25-90-95-DE-D0,0130,8000,0002),
                       (8000,EC-CD-6D-38-C5-FD,FD01,0004,0004)]
              laggport: em0 flags=18<COLLECTING,DISTRIBUTING> state=7D
                      [(8000,00-25-90-95-DE-D0,0130,8000,0001),
                       (FFFF,00-00-00-00-00-00,0000,FFFF,0000)]
      

      Notice that em0 (the troubled port this particular reboot cycle) is not ACTIVE, and hasn't been since the first watchdog timeout. tcpdump confirms this: no traffic whatsoever is going across it. Which means the port is locking up every 8h15m even while completely idle and doing absolutely nothing?
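
      To confirm the port really is idle and to keep an eye on its LACP state around the 8h15m mark, standard tools are enough (a quick sketch):

      # any frames at all (even LACPDUs) on the dead port?
      tcpdump -ni em0 -c 20

      # per-second packet/error counters for the port
      netstat -I em0 -w 1

      # LACP flags/state for every port in the lagg
      ifconfig -v lagg0 | grep -A 2 laggport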

      sysctl output is below. I've failed CARP over to the secondary to see what happens there (will it start exhibiting symptoms? will the primary continue to suffer watchdog timeouts even as it idles?).

      
      # sysctl hw.em
      hw.em.num_queues: 0
      hw.em.eee_setting: 1
      hw.em.rx_process_limit: 100
      hw.em.enable_msix: 1
      hw.em.sbp: 0
      hw.em.smart_pwr_down: 0
      hw.em.txd: 1024
      hw.em.rxd: 1024
      hw.em.rx_abs_int_delay: 66
      hw.em.tx_abs_int_delay: 66
      hw.em.rx_int_delay: 0
      hw.em.tx_int_delay: 66
      
      
      
      # sysctl dev.em.0
      dev.em.0.%desc: Intel(R) PRO/1000 Network Connection 7.3.8
      dev.em.0.%driver: em
      dev.em.0.%location: slot=0 function=0 handle=\_SB_.PCI0.RP01.PXSX
      dev.em.0.%pnpinfo: vendor=0x8086 device=0x10d3 subvendor=0x15d9 subdevice=0x0000 class=0x020000
      dev.em.0.%parent: pci1
      dev.em.0.nvm: -1
      dev.em.0.debug: -1
      dev.em.0.fc: 3
      dev.em.0.rx_int_delay: 0
      dev.em.0.tx_int_delay: 66
      dev.em.0.rx_abs_int_delay: 66
      dev.em.0.tx_abs_int_delay: 66
      dev.em.0.itr: 488
      dev.em.0.rx_processing_limit: 100
      dev.em.0.eee_control: 1
      dev.em.0.link_irq: 28290
      dev.em.0.mbuf_alloc_fail: 0
      dev.em.0.cluster_alloc_fail: 0
      dev.em.0.dropped: 0
      dev.em.0.tx_dma_fail: 0
      dev.em.0.rx_overruns: 0
      dev.em.0.watchdog_timeouts: 2
      dev.em.0.device_control: 1074790984
      dev.em.0.rx_control: 67141658
      dev.em.0.fc_high_water: 18432
      dev.em.0.fc_low_water: 16932
      dev.em.0.queue0.txd_head: 0
      dev.em.0.queue0.txd_tail: 176
      dev.em.0.queue0.tx_irq: 258561999
      dev.em.0.queue0.no_desc_avail: 0
      dev.em.0.queue0.rxd_head: 0
      dev.em.0.queue0.rxd_tail: 1023
      dev.em.0.queue0.rx_irq: 297119049
      dev.em.0.mac_stats.excess_coll: 0
      dev.em.0.mac_stats.single_coll: 0
      dev.em.0.mac_stats.multiple_coll: 0
      dev.em.0.mac_stats.late_coll: 0
      dev.em.0.mac_stats.collision_count: 0
      dev.em.0.mac_stats.symbol_errors: 0
      dev.em.0.mac_stats.sequence_errors: 0
      dev.em.0.mac_stats.defer_count: 0
      dev.em.0.mac_stats.missed_packets: 28465
      dev.em.0.mac_stats.recv_no_buff: 0
      dev.em.0.mac_stats.recv_undersize: 0
      dev.em.0.mac_stats.recv_fragmented: 0
      dev.em.0.mac_stats.recv_oversize: 0
      dev.em.0.mac_stats.recv_jabber: 0
      dev.em.0.mac_stats.recv_errs: 0
      dev.em.0.mac_stats.crc_errs: 0
      dev.em.0.mac_stats.alignment_errs: 0
      dev.em.0.mac_stats.coll_ext_errs: 0
      dev.em.0.mac_stats.xon_recvd: 0
      dev.em.0.mac_stats.xon_txd: 0
      dev.em.0.mac_stats.xoff_recvd: 0
      dev.em.0.mac_stats.xoff_txd: 0
      dev.em.0.mac_stats.total_pkts_recvd: 328827928
      dev.em.0.mac_stats.good_pkts_recvd: 328799463
      dev.em.0.mac_stats.bcast_pkts_recvd: 3400907
      dev.em.0.mac_stats.mcast_pkts_recvd: 107503
      dev.em.0.mac_stats.rx_frames_64: 3766
      dev.em.0.mac_stats.rx_frames_65_127: 131108817
      dev.em.0.mac_stats.rx_frames_128_255: 87956012
      dev.em.0.mac_stats.rx_frames_256_511: 55962296
      dev.em.0.mac_stats.rx_frames_512_1023: 24330820
      dev.em.0.mac_stats.rx_frames_1024_1522: 29437752
      dev.em.0.mac_stats.good_octets_recvd: 104670970717
      dev.em.0.mac_stats.good_octets_txd: 119934667773
      dev.em.0.mac_stats.total_pkts_txd: 279223751
      dev.em.0.mac_stats.good_pkts_txd: 279223751
      dev.em.0.mac_stats.bcast_pkts_txd: 1780
      dev.em.0.mac_stats.mcast_pkts_txd: 929946
      dev.em.0.mac_stats.tx_frames_64: 22207294
      dev.em.0.mac_stats.tx_frames_65_127: 77265601
      dev.em.0.mac_stats.tx_frames_128_255: 50849837
      dev.em.0.mac_stats.tx_frames_256_511: 53313263
      dev.em.0.mac_stats.tx_frames_512_1023: 37681244
      dev.em.0.mac_stats.tx_frames_1024_1522: 37906512
      dev.em.0.mac_stats.tso_txd: 28
      dev.em.0.mac_stats.tso_ctx_fail: 0
      dev.em.0.interrupts.asserts: 705
      dev.em.0.interrupts.rx_pkt_timer: 0
      dev.em.0.interrupts.rx_abs_timer: 0
      dev.em.0.interrupts.tx_pkt_timer: 0
      dev.em.0.interrupts.tx_abs_timer: 0
      dev.em.0.interrupts.tx_queue_empty: 0
      dev.em.0.interrupts.tx_queue_min_thresh: 0
      dev.em.0.interrupts.rx_desc_min_thresh: 0
      dev.em.0.interrupts.rx_overrun: 2
      dev.em.0.wake: 0
      
      
        magnetoza

        I'm also getting these timeouts, since I upgraded from 2.1.5 to the 2.2 beta for testing ;)

        I'm busy testing some boot loader changes that should stop this issue.
        So far so good, no problems yet.

        Add this to the boot loader:

        /boot/loader.conf:

        hw.pci.enable_msi=0
        hw.pci.enable_msix=0
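
        One note (an assumption on my part, based on general pfSense practice rather than anything specific to this fix): custom tunables are usually better placed in /boot/loader.conf.local, since /boot/loader.conf can be regenerated by the system, e.g. on upgrade:

        # /boot/loader.conf.local -- survives config regeneration
        hw.pci.enable_msi=0
        hw.pci.enable_msix=0

        A reboot is needed for loader tunables to take effect; afterwards "sysctl hw.pci.enable_msix" should report 0 if the setting took.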

          Cezar R46 Tecnologia

          I have the same problem.

          Is there any final solution?

          I'm on 2.3.1_1.

          Cezar Pordeus

            Hugovsky

            What's your hardware?

              jimp Rebel Alliance Developer Netgate

              This thread is almost two years old, and your issue is highly unlikely to be related. Start a new thread.

