Bridged Setup Losing Interface - Watchdog Timeout
I've had the same pfsense system running for about 10-11 months without issue. The system consists of:
2x Rosewill dual gigabit NIC
System is on an APC UPS
The system has three interfaces bridged and runs pfsense 2.1 x64. Aside from initial setup issues there had been no issues with the system until yesterday morning. Upon waking up I noticed that the interface that feeds the upstairs of my house had lost connectivity. Checking ipconfig, I noticed that the subnet had changed and that the computer had a new IP. Going downstairs to check the modem and pfsense system, I noticed everything down there was still running, but the system showed re1 watchdog timeout. Eventually, I just swapped the interface to a different port on the other NIC, rebooted, and bam, things started working upstairs. I thought that the port had simply died, so I let it run to make sure that was the issue. Everything ran fine for the rest of the night and I decided to test the "dead" port when I got home the next day.
Today I woke up and everything was running without issue. I went to get lunch and upon returning home the same issue had popped up again. At this point I don't think two ports on separate NICs die a day apart from each other, so I started doing some more testing.
-Test all cat5e cables running through the upstairs interface with a cable tester. All test in working order.
-Test both switches in the upstairs interface. Swap their positions. Try one running with the other turned off, etc… No issues with the switches.
-Run new cat5e cable from downstairs through the floor to upstairs as the old cable had a couple nicks in it from the carpet tack strip. No change.
Having ruled hardware out of the equation, I started moving into the pfsense logs. Considering it has been running so long without any issue I don't see what could have caused anything to stop working, but there appear to be issues. The first thing I noticed is really high CPU utilization, 75-90% from:
Searching the forum, I learned that this can be caused by interfaces rapidly going from a working to a nonworking state. I'm not very good at this, but checking pfinfo I don't see anything obvious besides:
At this point I'm not too sure what to do or test, but it seems the issue lies somewhere in pfsense having gone haywire the other night and now hating the upstairs of my house for some odd reason. Considering the system was at a 33+ day uptime before that, and was only shut down then due to an electrical upgrade to my house, I'm not too sure what could be going on. For now I've run another line from upstairs to the switch on another interface and have the switches running in series. It is a complete hack, but it works. It also happens to drop the CPU usage down to 1-2%. Any help debugging this and getting it working would be greatly appreciated.
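For anyone wanting to check whether one interface really is flapping, here is a rough Python sketch that tallies watchdog timeouts and link state changes per interface from a copy of the system log. The two message patterns and the sample lines are assumptions for illustration only, not output from my box, so adjust the regexes to whatever your log actually shows.

```python
import re
from collections import Counter

# Assumed message shapes; the real log lines on a given system may differ.
WATCHDOG = re.compile(r"\b(re\d+): watchdog timeout")
LINK = re.compile(r"\b(re\d+): link state changed to (UP|DOWN)")


def summarize(log_text):
    """Count watchdog timeouts and link state changes per interface."""
    timeouts = Counter()
    flaps = Counter()
    for line in log_text.splitlines():
        match = WATCHDOG.search(line)
        if match:
            timeouts[match.group(1)] += 1
            continue
        match = LINK.search(line)
        if match:
            flaps[match.group(1)] += 1
    return timeouts, flaps


# Hypothetical sample lines, not taken from the system described above.
SAMPLE = """\
Nov 10 07:01:12 pfsense kernel: re1: watchdog timeout
Nov 10 07:01:12 pfsense kernel: re1: link state changed to DOWN
Nov 10 07:01:15 pfsense kernel: re1: link state changed to UP
Nov 10 07:03:40 pfsense kernel: re1: watchdog timeout
Nov 10 07:05:02 pfsense kernel: re3: link state changed to UP
"""

if __name__ == "__main__":
    timeouts, flaps = summarize(SAMPLE)
    for ifname in sorted(set(timeouts) | set(flaps)):
        print(ifname, "timeouts:", timeouts[ifname], "link changes:", flaps[ifname])
```

If one interface racks up far more timeouts and link changes than the others, that at least confirms which segment the problem follows.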
-System running for 10-11 months with no issues.
-PFSense 2.1 x64 Release
-2x Dual Nic used as LAN interfaces. Onboard ethernet port used as WAN.
-3x interfaces on 2x dual NIC are running in a LAN Bridge
-Watchdog timeout issues on the interface that runs to my upstairs, regardless of which network port it is plugged into.
-All network ports work properly
-Physical hardware; switches, cat5e cables, etc… all test working.
-Help greatly appreciated.
Things I Won't Be Doing:
-Updating the system to a newer version of pfsense. I have 37x machines downstairs that have to be on at all times. Updating the system, even if it goes without issue, turns those machines off. If an issue does arise I am then stuck shuffling parts around until I can fix the pfsense install.
-Leaving upstairs interface plugged into other switch as a hack fix.
They are all re(4) interfaces? Do you know exactly what chips they are using?
They are realtek based NICs. They enumerate re(0), re(1), re(3), and re(4) in my lan bridge installation, with re(2) being the onboard ethernet port used as the WAN. Re(0) and re(1) are one NIC and re(3) and re(4) are on the other.
Hmm. Well, unfortunately Realtek NICs don't enjoy a good reputation, and whilst most of that is based on their earlier 10/100 NICs, the Gigabit NICs are still prone to odd behaviour. There were updates that went into the Realtek driver to support newer versions of the rtl8111 chip, but I don't know if any other fixes went in and I can't remember when. ::) However I'm not sure there is much to suggest if you can't update to 2.1.5 or even straight to 2.2.
If you absolutely can't have any downtime then Realtek NICs are not the best choice. I recommend you swap them out for some Intel NICs at your earliest convenience. Probably not what you want to hear. :-\
I don't think it is an update based issue as the system has been working without issue since the beginning of the year. What makes a system work for that long and then start producing errors where there is no parts failure?
Intermittent fault. Failing switch, failing NICs. Failing under high load or memory use conditions. Failing due to some unusual network traffic.
The older Realtek NICs used to suffer watchdog timeouts with monotonous regularity on some hardware/driver combinations. Despite some concerted effort to determine a cause none was found but suspicion fell on fragmented packets being a common cause. Many people were able to eliminate or massively reduce the issue by placing a good quality switch immediately connected to the Realtek NIC. I'm not saying that applies here though, that was a much older NIC, but you can see how it could work fine for months and then suddenly fail when some new or updated piece of software starts sending differently formatted packets.
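To give a feel for what "fragmented packets" means here, this is a minimal Python sketch of standard IPv4 fragmentation arithmetic. It assumes a 20-byte IP header with no options and a default MTU of 1500; the function name is made up for illustration, and it says nothing about whether fragmentation is actually involved in this case.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options


def fragment_count(total_length, mtu=1500):
    """Number of IPv4 fragments for a datagram of total_length bytes (header included)."""
    if total_length <= mtu:
        return 1
    payload = total_length - IPV4_HEADER
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(payload / per_fragment)


if __name__ == "__main__":
    for size in (1500, 1501, 4000):
        print(size, "bytes ->", fragment_count(size), "fragment(s)")
```

So anything on the network that starts sending datagrams larger than the MTU will produce fragments that were never on the wire before.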