Any gigabit hardware downstream causes – em0: Watchdog timeout -- resetting
-
Whilst I completely agree with everything Wallabybob said above, there is no obvious reason why your hardware shouldn't be working as you have it.
Considering that I may just have some funky hardware, I'm not averse to building a different machine, but I don't want to hit a similar problem. I'm hoping to understand what's causing it before investing in more test hardware.
I think this is probably it. You have some combination of hardware that is conflicting somehow and causing your box to run out of some resource. It still seems very odd that em0 goes down when that isn't even the interface connected to the GigE hardware. :-\ Since you have spare NICs I would try leaving em0 unassigned and using the others instead. As Wallabybob said, your WAN connection may not be >100Mbps (you haven't said yet), so you could use the Broadcom NIC for that.
My home pfSense box is P4 based with 10 interfaces including 3 Intel 'em' GigE NICs and I've never seen anything like this.
Steve
-
My home pfSense box is P4 based with 10 interfaces including 3 Intel 'em' GigE NICs and I've never seen anything like this.
Different chipset? You don't drive your NICs so hard? Or maybe whatever you have connected to them doesn't drive them so hard?
-
I think it's very likely a different chipset. The X-Peak has the server-specific 875P/6300ESB, which I'm sure helps. However, my own box has an underclocked CPU and I have run tests through it as fast as I could push. It's never even blinked, even if it's not that quick. I say this just as an example that it's not inherently a "P4 is too slow" problem. This is something that shouldn't be happening.
Steve
-
Hi all –
First, thanks again for all the help! Here's the latest...
I moved different NICs to different PCI slots, and ran with only two: any combination will cause a watchdog timeout when a GbE device is connected. Using the on-board 10/100 Broadcom NIC I can connect GbE devices to any of the em* interfaces without getting the watchdog timeout, so it does look like something is getting resource constrained on the PCI bus.
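For anyone wanting to confirm the bus sharing, one way to see which NICs sit on the same PCI bus under FreeBSD/pfSense is pciconf. This is just a sketch; driver names and bus numbers will differ per box:

```shell
# List devices with their pci<domain>:<bus>:<slot>:<func> selectors and
# vendor strings. NICs reporting the same bus number share that bus's
# bandwidth (at most ~133MB/s on a 32-bit/33MHz PCI bus).
pciconf -lv | grep -B 3 -i ethernet
```

If all the em NICs show the same bus number, a single gigabit flow can plausibly starve the others.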
While my upstream internet connection is only about 20Mbps down, I do have a GbE LAN connection on the modem. I sometimes place a hub/switch there as well to mirror traffic. Strange that the GbE LAN port on the modem did not trigger the watchdog timeout, but a GbE hub did when connected between WAN and modem.
Again, it's not important to have Gigabit speeds; it's just the hardware I have most readily available. It's problematic to have to use a 10/100 hub to avoid hard failures.
The GbE hub does work inline with the Broadcom NIC, where it failed with the Intel PCI controllers.
For this box at home, I'm going to call it solved unless anyone wants me to experiment to see if we can find the root cause. Again, a straight Linux box such as a CentOS 6.3 or Ubuntu 12.04 server with iptables for routing doesn't have any problems with this same setup.
As far as services I use in small office environments --
Generally it's a basic firewall / VPN for about 20 users, usually no more than 5 concurrent. We also generally use the LAN-side DHCP server, the DNS forwarder, dynamic DNS, and sometimes captive portal at some locations. This box works fine for this setup, except for this issue with the watchdog timeout. The reason to use this type of hardware is that it's readily available and has worked fine with Linux-based routers. Also, I can remove the motherboard from the tower enclosure, place it in a cheap rack-mount ATX case, and have a good rack-mount router/firewall solution.
Or so I thought; I'm sure I can find slightly newer hardware for about the same cost that has PCI-e instead of PCI for the expansion NICs.
Steve -- are your Intel NICs PCI?
Thanks for the help!
-
My NICs are all on board so it's hard to say for sure. They may be PCI-X. The bus may not be 33MHz. They are not PCIe though.
The 875P MCH also has the CSA interface, which offers 266MB/s; again, I don't know if this is used or how it would appear in FreeBSD.
Steve
-
What model mainboard? Sounds interesting.
-
It's a re-purposed Watchguard Firebox X6000. See:
http://forum.pfsense.org/index.php/topic,25011.0.html
Steve
-
Again, it's not important to have Gigabit speeds; it's just the hardware I have most readily available. It's problematic to have to use a 10/100 hub to avoid hard failures.
Generally you can configure a Gigabit capable device to operate at 100Mbps.
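On FreeBSD this can be done through ifconfig's media selection; em1 here is just an example interface name, and a gigabit link partner will then negotiate down to 100Mbps:

```shell
# Pin the NIC to 100Mbps full duplex instead of autoselecting 1000baseT.
ifconfig em1 media 100baseTX mediaopt full-duplex

# Revert to autonegotiation:
ifconfig em1 media autoselect
```

pfSense also exposes a speed/duplex selector on the interface configuration page, which is the way to make the setting persistent there.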
For this box at home, I'm going to call it solved unless anyone wants me to experiment to see if we can find the root cause. Again, a straight Linux box such as a CentOS 6.3 or Ubuntu 12.04 server with iptables for routing doesn't have any problems with this same setup.
I would be interested to see if enabling flow control "fixes" the behaviour. See my earlier reply with a link to another topic for some clues about enabling flow control on the NICs.
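For reference, on this era of the em(4) driver the knob appears as the per-device sysctl dev.em.X.flow_control (it shows up later in this thread with value 3). A sketch, assuming the OID is writable on your driver version:

```shell
# Flow control modes: 0 = no pause frames, 1 = receive pause,
# 2 = transmit pause, 3 = both (full flow control).
sysctl dev.em.0.flow_control=3
# The same line can go in /etc/sysctl.conf to apply it at every boot.
```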
-
I agree. Flow control seems like exactly the sort of thing that would solve this; however, none of my Intel NICs appear to offer it. :-\
[2.0.3-RELEASE][root@pfsense.fire.box]/root(2): ifconfig -m em1
em1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
        capabilities=100db<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,POLLING,VLAN_HWCSUM,VLAN_HWFILTER>
        ether 00:90:7f:31:4b:ee
        inet 192.168.2.1 netmask 0xffffff00 broadcast 192.168.2.255
        inet6 fe80::290:7fff:fe31:4bee%em1 prefixlen 64 scopeid 0x2
        nd6 options=43<PERFORMNUD,ACCEPT_RTADV>
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        supported media:
                media autoselect
                media 1000baseT
                media 1000baseT mediaopt full-duplex
                media 100baseTX mediaopt full-duplex
                media 100baseTX
                media 10baseT/UTP mediaopt full-duplex
                media 10baseT/UTP
Steve
-
Looks like flow control is not managed in the same way on Intel cards:
http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Flow_Control
Though I'm still seeing nothing. :-\
[2.0.3-RELEASE][root@pfsense.fire.box]/root(1): sysctl hw.em
hw.em.eee_setting: 0
hw.em.rx_process_limit: 100
hw.em.enable_msix: 1
hw.em.sbp: 0
hw.em.smart_pwr_down: 0
hw.em.txd: 1024
hw.em.rxd: 1024
hw.em.rx_abs_int_delay: 66
hw.em.tx_abs_int_delay: 66
hw.em.rx_int_delay: 0
hw.em.tx_int_delay: 66
Steve
-
However, flow control does appear to be operational:
[2.0.3-RELEASE][root@pfsense.fire.box]/root(5): sysctl dev.em
dev.em.0.%desc: Intel(R) PRO/1000 Legacy Network Connection 1.0.4
dev.em.0.%driver: em
dev.em.0.%location: slot=1 function=0
dev.em.0.%pnpinfo: vendor=0x8086 device=0x1075 subvendor=0x8086 subdevice=0x1075 class=0x020000
dev.em.0.%parent: pci2
dev.em.0.nvm: -1
dev.em.0.rx_int_delay: 0
dev.em.0.tx_int_delay: 66
dev.em.0.rx_abs_int_delay: 66
dev.em.0.tx_abs_int_delay: 66
dev.em.0.rx_processing_limit: 100
dev.em.0.flow_control: 3
dev.em.0.mbuf_alloc_fail: 0
dev.em.0.cluster_alloc_fail: 0
dev.em.0.dropped: 0
dev.em.0.tx_dma_fail: 0
dev.em.0.tx_desc_fail1: 0
dev.em.0.tx_desc_fail2: 0
dev.em.0.rx_overruns: 0
dev.em.0.watchdog_timeouts: 0
dev.em.0.device_control: 1077674561
dev.em.0.rx_control: 32770
dev.em.0.fc_high_water: 28672
dev.em.0.fc_low_water: 27172
dev.em.0.fifo_workaround: 0
dev.em.0.fifo_reset: 0
dev.em.0.txd_head: 131
dev.em.0.txd_tail: 131
dev.em.0.rxd_head: 194
dev.em.0.rxd_tail: 193
dev.em.0.mac_stats.excess_coll: 0
dev.em.0.mac_stats.single_coll: 0
dev.em.0.mac_stats.multiple_coll: 0
dev.em.0.mac_stats.late_coll: 0
dev.em.0.mac_stats.collision_count: 0
dev.em.0.mac_stats.symbol_errors: 0
dev.em.0.mac_stats.sequence_errors: 0
dev.em.0.mac_stats.defer_count: 0
dev.em.0.mac_stats.missed_packets: 0
dev.em.0.mac_stats.recv_no_buff: 0
dev.em.0.mac_stats.recv_undersize: 0
dev.em.0.mac_stats.recv_fragmented: 0
dev.em.0.mac_stats.recv_oversize: 0
dev.em.0.mac_stats.recv_jabber: 0
dev.em.0.mac_stats.recv_errs: 0
dev.em.0.mac_stats.crc_errs: 0
dev.em.0.mac_stats.alignment_errs: 0
dev.em.0.mac_stats.coll_ext_errs: 0
dev.em.0.mac_stats.xon_recvd: 0
dev.em.0.mac_stats.xon_txd: 0
dev.em.0.mac_stats.xoff_recvd: 0
dev.em.0.mac_stats.xoff_txd: 0
dev.em.0.mac_stats.total_pkts_recvd: 850882
dev.em.0.mac_stats.good_pkts_recvd: 850882
dev.em.0.mac_stats.bcast_pkts_recvd: 699
dev.em.0.mac_stats.mcast_pkts_recvd: 0
dev.em.0.mac_stats.rx_frames_64: 715
dev.em.0.mac_stats.rx_frames_65_127: 847331
dev.em.0.mac_stats.rx_frames_128_255: 985
dev.em.0.mac_stats.rx_frames_256_511: 555
dev.em.0.mac_stats.rx_frames_512_1023: 90
dev.em.0.mac_stats.rx_frames_1024_1522: 1206
dev.em.0.mac_stats.good_octets_recvd: 71709190
dev.em.0.mac_stats.good_octets_txd: 72480859
dev.em.0.mac_stats.total_pkts_txd: 849527
dev.em.0.mac_stats.good_pkts_txd: 849527
dev.em.0.mac_stats.bcast_pkts_txd: 22
dev.em.0.mac_stats.mcast_pkts_txd: 5
dev.em.0.mac_stats.tx_frames_64: 726
dev.em.0.mac_stats.tx_frames_65_127: 846203
dev.em.0.mac_stats.tx_frames_128_255: 152
dev.em.0.mac_stats.tx_frames_256_511: 361
dev.em.0.mac_stats.tx_frames_512_1023: 424
dev.em.0.mac_stats.tx_frames_1024_1522: 1661
dev.em.0.mac_stats.tso_txd: 0
dev.em.0.mac_stats.tso_ctx_fail: 0
How do those numbers compare with your failing NIC?
Steve