Netgate Discussion Forum

    Any gigabit hardware downstream causes – em0: Watchdog timeout -- resetting

      wallabybob

      This might not be particularly relevant to the problem you have reported, but your I/O capacity is way over-committed. All your NICs are on the same PCI bus. If that is a standard 32-bit/33 MHz PCI bus it tops out at roughly 133 MB/s shared across every card on it, so you will be hard pressed to get 1Gbps (b = BITS) through it. Your NICs could want 3*2Gbps + 200Mbps, plus PCI bus overheads.

      I suggest you try connecting at most one Gigabit-capable device and forcing everything else to run at 100Mbps, or even better, forcing everything to run at 100Mbps or less.
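
      If you want to pin a NIC to 100Mbps for a test, something like this should do it from a shell (untested here, and em1 is just an example interface; the same media names show up in the ifconfig -m output later in this thread):

          # force em1 to 100Mbps full duplex; "media autoselect" reverts it
          ifconfig em1 media 100baseTX mediaopt full-duplex

      I believe the interface configuration page in the webGUI also has a speed and duplex setting if you want the change to persist.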

      If your WAN speed to the Internet is 100Mbps or less there is nothing to be gained by running the link to your modem at 1Gbps.

        stephenw10 Netgate Administrator

        That is a strange problem. You connect things to a downstream interface and the upstream NIC fails.  :-\

        One thing I would try here is disabling any hardware you don't need in the BIOS. I'd start with the floppy controller and parallel port, but also try the USB controllers if you're not using them. Is the Broadcom NIC on-board?

        You have a number of things sharing IRQ 16, including em0; that's not necessarily a problem.
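
        You can see how the interrupts are shared from a shell with vmstat (just an aside; the exact grouping of devices per line varies by FreeBSD version):

            # per-device interrupt counts; devices sharing an IRQ are listed
            # against the same irqN entry
            vmstat -i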

        Steve

          jmyers

          Hello!

          Thanks for the replies!!  Currently I have a Netgear 10/100 hub connected to the LAN side (em1) and then to a Netgear 10/100/1000 hub with my wireless router and other machines connected to it.  This is the only way I can connect the 10/100/1000 hardware to the router.

          I definitely understand about the PCI bus limits and the upstream speed being the limiting factor on Internet-bound traffic, and I appreciate the reminder. To that end I disabled everything possible in the BIOS, including the on-board NIC, and removed one of the PCI cards, with no luck.

          So, there are two reasons why I need to get this figured out.

          1. We sometimes have high internal traffic and multiple endpoints making large transfers, so the Gigabit switches make a big difference in that case.  To be sure the router speed wasn't involved in this I did the following test:

          ---- run iperf between two nodes on the gigabit hub
          ---- move the cables to the 10/100 hub

          Both tests were run with the 10/100 hub between the gigabit hub and the router; here are the results:
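
          For reference, the test was run with plain default iperf settings, roughly the following (the port 5001 and 8 KByte window in the output below are iperf's defaults):

              # on the receiving node (192.168.1.107)
              iperf -s
              # on the sending node (192.168.1.112), repeated after moving the cables
              iperf -c 192.168.1.107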


          Server listening on TCP port 5001
          TCP window size: 8.00 KByte (default)

          [180] local 192.168.1.107 port 5001 connected with 192.168.1.112 port 1625
          [ ID] Interval       Transfer     Bandwidth
          [180]  0.0-10.1 sec   666 MBytes   552 Mbits/sec
          [196] local 192.168.1.107 port 5001 connected with 192.168.1.112 port 1638
          [196]  0.0-10.0 sec  51.5 MBytes  43.0 Mbits/sec

          It's 10+ times faster between local end-points!

          2. I want to be able to use gigabit hardware in the LAN. I can isolate the LAN NIC in the pfSense box for now with a 10/100 hub, but that's not a scalable solution, as I'm testing this for potential build-outs for small business networks.

          Considering that I may just have some funky hardware, I'm not averse to building a different machine, but I don't want to hit a similar problem.  I'm hoping to understand what's causing it before investing in more test hardware.

          I've been able to run Linux/iptables routers on this same hardware with the same gigabit hardware downstream with no troubles.

          Thanks again for any insight you can provide!  I love pfSense and hope to get this worked out so I can use it on a larger scale!!!

          Is there a Golden hardware list somewhere that I haven't stumbled upon yet?

          Thanks,
          -j.

            wallabybob

            @jmyers:

            It's 10+ times faster between local end-points!

            "local" (same subnet)  traffic will normally bypass the router.

            @jmyers:

            2. I want to be able to use gigabit hardware in the LAN. I can isolate the LAN NIC in the pfSense box for now with a 10/100 hub, but that's not a scalable solution, as I'm testing this for potential build-outs for small business networks.

            1. I'm guessing you won't be using Pentium 4 equipment in the "build-outs for small business networks". Any "modern" PC-based system is likely to provide considerably higher I/O bandwidth than a single PCI bus.

            2. You haven't given much detail of what you want the equipment to do for small business networks. If it is for basic firewalling, an ALIX could well be suitable for Internet speeds up to about 85Mbps. Because "local" systems would normally communicate directly through a switch (bypassing the router) there might not be a need for the router/firewall to run interfaces at Gigabit speeds.

            @jmyers:

            I've been able to run Linux/iptables routers on this same hardware with the same gigabit hardware downstream with no troubles.

            Maybe the Linux configuration enables flow control on the NICs and pfSense doesn't. I'll need to look for another thread about flow control. I'll update this reply when I find that thread.

            @jmyers:

            Is there a Golden hardware list somewhere that I haven't stumbled upon?

            Such a list can't tell you whether a particular configuration will be suitable for your particular application. How would you characterise the requirements you are attempting to address? In particular, where do you need gigabit speeds and why?

            Edit: The thread I mentioned is at http://forum.pfsense.org/index.php/topic,57037.0.html though it takes a few replies to get into the relevant section.

              stephenw10 Netgate Administrator

              Whilst I completely agree with everything Wallabybob said above, there is no reason why your hardware shouldn't be working as you have it, or no obvious reason at least.

              @jmyers:

              Considering that I may just have some funky hardware, I'm not averse to building a different machine, but I don't want to hit a similar problem.  I'm hoping to understand what's causing it before investing in more test hardware.

              I think this is probably it. You have some combination of hardware that is conflicting somehow and causing your box to run out of some resource. It still seems very odd that em0 goes down when that isn't the interface the GigE hardware is being connected to.  :-\ Since you have spare NICs I would try leaving em0 unassigned and using the others instead. As Wallabybob said, your WAN connection may not be >100Mbps (you haven't said yet), so you could use the Broadcom NIC for that.
              My home pfSense box is P4-based with 10 interfaces, including 3 Intel 'em' GigE NICs, and I've never seen anything like this.

              Steve

                wallabybob

                @stephenw10:

                My home pfSense box is P4-based with 10 interfaces, including 3 Intel 'em' GigE NICs, and I've never seen anything like this.

                Different chipset? You don't drive your NICs so hard? Or maybe whatever you have connected to them doesn't drive them so hard?

                  stephenw10 Netgate Administrator

                  I think it's very likely a different chipset. The X-Peak has the server-specific 875P/6300ESB, which I'm sure helps. However, my own box has an underclocked CPU and I have run tests through it as fast as I could push. It's never even blinked, even if it's not that quick. I say this just as an example that it's not inherently a "P4 is too slow" problem. This is something that shouldn't be happening.

                  Steve

                    jmyers

                    Hi all –

                    First, thanks again for all the help! Here's the latest...

                    I moved different NICs to different PCI slots and ran with only two: any combination will cause a watchdog timeout when a GbE device is connected.  With the on-board 10/100 Broadcom NIC in use, I can connect GbE devices to any of the em* interfaces without getting the watchdog timeout, so it does look like something is getting resource-constrained on the PCI bus.

                    While my upstream Internet connection is only about 20Mbps down, I do have a GbE LAN connection on the modem.  I sometimes place a hub/switch there as well to mirror traffic.  Strangely, the GbE LAN port on the modem did not trigger the watchdog timeout, but a GbE hub did when connected between the WAN and the modem.

                    Again, it's not important to have gigabit speeds; it's just the hardware that I have and that is most readily available. It's problematic to have to use a 10/100 hub to avoid hard failures.

                    The GbE hub does work in-line with the Broadcom NIC, where it failed on the Intel PCI controller.

                    For this box at home, I'm going to call it solved unless anyone wants me to experiment to see if we can find the root cause.  Again, a straight Linux install like CentOS 6.3 or Ubuntu 12.04 Server with iptables for routing doesn't have any problems with this same setup.

                    As far as services I use in small office environments --

                    Generally it's basic firewall/VPN for about 20 users, usually no more than 5 concurrent.  We also generally use the LAN-side DHCP server, the DNS forwarder, dynamic DNS, and sometimes captive portal at some locations.  This box works fine for this setup, except for this issue with the watchdog timeout.  The reason to use this type of hardware is that it's readily available and has worked fine with Linux-based routers.  Also, I can remove the motherboard from the tower enclosure, place it in a cheap rack-mount ATX case, and have a good rack-mount router/firewall solution.

                    Or so I thought; I'm sure I can find slightly newer hardware for about the same cost that has PCI-e instead of PCI for the expansion NICs.

                    Steve -- are your Intel NICs PCI?

                    Thanks for the help!

                      stephenw10 Netgate Administrator

                      My NICs are all on-board, so it's hard to say for sure. They may be PCI-X. The bus may not be 33MHz. They are not PCIe though.
                      The 875P MCH also has the CSA interface, which offers 266MB/s; again, I don't know if this is used or how it would appear in FreeBSD.
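
                      If you want to poke at this on your own box, pciconf can list each device's capabilities, which should include a PCI-X or PCI Express capability entry where one exists (just a pointer; I haven't checked what it reports for these NICs):

                          # list devices with their capability lists; look at the emX entries
                          pciconf -lc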

                      Steve

                        jmyers

                        What model mainboard? Sounds interesting.

                          stephenw10 Netgate Administrator

                          It's a re-purposed Watchguard Firebox X6000. See:
                          http://forum.pfsense.org/index.php/topic,25011.0.html

                          Steve

                            wallabybob

                            @jmyers:

                            Again, it's not important to have gigabit speeds; it's just the hardware that I have and that is most readily available. It's problematic to have to use a 10/100 hub to avoid hard failures.

                            Generally you can configure a Gigabit-capable device to operate at 100Mbps.

                            @jmyers:

                            For this box at home, I'm going to call it solved unless anyone wants me to experiment to see if we can find the root cause.  Again, a straight Linux install like CentOS 6.3 or Ubuntu 12.04 Server with iptables for routing doesn't have any problems with this same setup.

                            I would be interested to see if enabling flow control "fixes" the behaviour. See my earlier reply with a link to another topic for some clues about enabling flow control on the NICs.
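
                            If the em driver on that build exposes the usual per-device knob (a dev.em.X.flow_control sysctl), checking and setting it would look roughly like this; treat it as a sketch, since I haven't tried it on your hardware:

                                # show the current setting on em0 (3 = full rx/tx pause, 0 = off)
                                sysctl dev.em.0.flow_control
                                # change it at runtime; pfSense's System Tunables page should make it persistent
                                sysctl dev.em.0.flow_control=3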

                               stephenw10 Netgate Administrator

                               I agree. Flow control seems like exactly the sort of thing that would solve this; however, none of my Intel NICs appear to offer it.  :-\

                               [2.0.3-RELEASE][root@pfsense.fire.box]/root(2): ifconfig -m em1
                               em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
                                       options=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
                                       capabilities=100db<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,POLLING,VLAN_HWCSUM,VLAN_HWFILTER>
                                       ether 00:90:7f:31:4b:ee
                                       inet 192.168.2.1 netmask 0xffffff00 broadcast 192.168.2.255
                                       inet6 fe80::290:7fff:fe31:4bee%em1 prefixlen 64 scopeid 0x2
                                       nd6 options=43<PERFORMNUD,ACCEPT_RTADV>
                                       media: Ethernet autoselect (1000baseT <full-duplex>)
                                       status: active
                                       supported media:
                                               media autoselect
                                               media 1000baseT
                                               media 1000baseT mediaopt full-duplex
                                               media 100baseTX mediaopt full-duplex
                                               media 100baseTX
                                               media 10baseT/UTP mediaopt full-duplex
                                               media 10baseT/UTP

                              Steve

                                 stephenw10 Netgate Administrator

                                Looks like flow control is not managed in the same way in Intel cards:
                                http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Flow_Control

                                Though I'm still seeing nothing.  :-\

                                [2.0.3-RELEASE][root@pfsense.fire.box]/root(1): sysctl hw.em
                                hw.em.eee_setting: 0
                                hw.em.rx_process_limit: 100
                                hw.em.enable_msix: 1
                                hw.em.sbp: 0
                                hw.em.smart_pwr_down: 0
                                hw.em.txd: 1024
                                hw.em.rxd: 1024
                                hw.em.rx_abs_int_delay: 66
                                hw.em.tx_abs_int_delay: 66
                                hw.em.rx_int_delay: 0
                                hw.em.tx_int_delay: 66
                                
                                

                                Steve

                                   stephenw10 Netgate Administrator

                                   However, flow control does appear to be operational:

                                  [2.0.3-RELEASE][root@pfsense.fire.box]/root(5): sysctl dev.em
                                  dev.em.0.%desc: Intel(R) PRO/1000 Legacy Network Connection 1.0.4
                                  dev.em.0.%driver: em
                                  dev.em.0.%location: slot=1 function=0
                                  dev.em.0.%pnpinfo: vendor=0x8086 device=0x1075 subvendor=0x8086 subdevice=0x1075 class=0x020000
                                  dev.em.0.%parent: pci2
                                  dev.em.0.nvm: -1
                                  dev.em.0.rx_int_delay: 0
                                  dev.em.0.tx_int_delay: 66
                                  dev.em.0.rx_abs_int_delay: 66
                                  dev.em.0.tx_abs_int_delay: 66
                                  dev.em.0.rx_processing_limit: 100
                                  dev.em.0.flow_control: 3
                                  dev.em.0.mbuf_alloc_fail: 0
                                  dev.em.0.cluster_alloc_fail: 0
                                  dev.em.0.dropped: 0
                                  dev.em.0.tx_dma_fail: 0
                                  dev.em.0.tx_desc_fail1: 0
                                  dev.em.0.tx_desc_fail2: 0
                                  dev.em.0.rx_overruns: 0
                                  dev.em.0.watchdog_timeouts: 0
                                  dev.em.0.device_control: 1077674561
                                  dev.em.0.rx_control: 32770
                                  dev.em.0.fc_high_water: 28672
                                  dev.em.0.fc_low_water: 27172
                                  dev.em.0.fifo_workaround: 0
                                  dev.em.0.fifo_reset: 0
                                  dev.em.0.txd_head: 131
                                  dev.em.0.txd_tail: 131
                                  dev.em.0.rxd_head: 194
                                  dev.em.0.rxd_tail: 193
                                  dev.em.0.mac_stats.excess_coll: 0
                                  dev.em.0.mac_stats.single_coll: 0
                                  dev.em.0.mac_stats.multiple_coll: 0
                                  dev.em.0.mac_stats.late_coll: 0
                                  dev.em.0.mac_stats.collision_count: 0
                                  dev.em.0.mac_stats.symbol_errors: 0
                                  dev.em.0.mac_stats.sequence_errors: 0
                                  dev.em.0.mac_stats.defer_count: 0
                                  dev.em.0.mac_stats.missed_packets: 0
                                  dev.em.0.mac_stats.recv_no_buff: 0
                                  dev.em.0.mac_stats.recv_undersize: 0
                                  dev.em.0.mac_stats.recv_fragmented: 0
                                  dev.em.0.mac_stats.recv_oversize: 0
                                  dev.em.0.mac_stats.recv_jabber: 0
                                  dev.em.0.mac_stats.recv_errs: 0
                                  dev.em.0.mac_stats.crc_errs: 0
                                  dev.em.0.mac_stats.alignment_errs: 0
                                  dev.em.0.mac_stats.coll_ext_errs: 0
                                  dev.em.0.mac_stats.xon_recvd: 0
                                  dev.em.0.mac_stats.xon_txd: 0
                                  dev.em.0.mac_stats.xoff_recvd: 0
                                  dev.em.0.mac_stats.xoff_txd: 0
                                  dev.em.0.mac_stats.total_pkts_recvd: 850882
                                  dev.em.0.mac_stats.good_pkts_recvd: 850882
                                  dev.em.0.mac_stats.bcast_pkts_recvd: 699
                                  dev.em.0.mac_stats.mcast_pkts_recvd: 0
                                  dev.em.0.mac_stats.rx_frames_64: 715
                                  dev.em.0.mac_stats.rx_frames_65_127: 847331
                                  dev.em.0.mac_stats.rx_frames_128_255: 985
                                  dev.em.0.mac_stats.rx_frames_256_511: 555
                                  dev.em.0.mac_stats.rx_frames_512_1023: 90
                                  dev.em.0.mac_stats.rx_frames_1024_1522: 1206
                                  dev.em.0.mac_stats.good_octets_recvd: 71709190
                                  dev.em.0.mac_stats.good_octets_txd: 72480859
                                  dev.em.0.mac_stats.total_pkts_txd: 849527
                                  dev.em.0.mac_stats.good_pkts_txd: 849527
                                  dev.em.0.mac_stats.bcast_pkts_txd: 22
                                  dev.em.0.mac_stats.mcast_pkts_txd: 5
                                  dev.em.0.mac_stats.tx_frames_64: 726
                                  dev.em.0.mac_stats.tx_frames_65_127: 846203
                                  dev.em.0.mac_stats.tx_frames_128_255: 152
                                  dev.em.0.mac_stats.tx_frames_256_511: 361
                                  dev.em.0.mac_stats.tx_frames_512_1023: 424
                                  dev.em.0.mac_stats.tx_frames_1024_1522: 1661
                                  dev.em.0.mac_stats.tso_txd: 0
                                  dev.em.0.mac_stats.tso_ctx_fail: 0
                                  
                                  

                                  How do those numbers compare with your failing NIC?
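
                                   If it's easier, something like this should pull out just the interesting counters on the failing box (assuming the failing interface there is still em0):

                                       # watchdog resets, pause frames, and drop/miss counters for em0
                                       sysctl dev.em.0 | egrep 'watchdog|xon|xoff|missed|no_buff|dropped|overruns'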

                                  Steve
