Network Drops on my LAN side, a quick "5" then "r" for reroot will bring everything back, but I can't find the root



  • Hey Everyone,

    I'm starting to pull my hair out over my network drops. I've got pfSense ver. 2.4.3-Release-p1 running on a Dell PE R610 with great specs (see attached photo), but I keep having network outages, only on my LAN. The pfSense box can ping out on the WAN but not internally (really choppy pings, lots of drops). I've tried port mirroring my internal traffic, but I cannot find a specific root cause. I removed Snort as I thought it might be the issue, but the problem still persists. I do have a few VLANs, but nothing too complicated. I'm using DHCP on 2 VLANs and DNS forwarding to a Windows ADDS server, which runs DHCP for my main wired network. I do have OpenVPN running, as well as Squid and SquidGuard. All was running fine until a month ago. If I do a quick 'reroot' from the box itself, everything immediately works. I can't find anything too odd in the system logs either. Please HELP!!!! Thanks!

    [Attached image: 0_1537293231549_pfspecs.PNG]



  • Sorry, for a few more details:

    I have roughly 500 Windows 10 machines, 15 server 2016 machines, 1000 phones connected, and roughly 200 chromebooks connected at all times. Besides that there are a few linux boxes around the building.

    The problem is very similar to this thread: https://forum.netgate.com/topic/65540/lan-connection-drops-all-the-time
    except that the above thread was resolved with an update.



  • Sorry, One more round of specs:

    I actually swapped boxes. This one was a free server that I was going to build as a HyperV server, but it is currently running pfSense so that I could eliminate the possibility of bad hardware. That is the reason the specs are so ridiculous. I previously had pfSense running on an R610 with milder specs, but I thought the problem might be hardware, so I moved the installation onto this currently unused machine.



  • If you have a Cisco switch, you can run a TDR test to test the wire from your switch to pfSense.

    Without knowing much about your network, the first thing I would do is try a new NIC.

    Off topic, but is there a reason you haven't enabled AES-NI?



  • I'm using all Ubiquiti Unifi switches, so it's a no-go on the cable test, but I've already replaced the cable and switched to new hardware. I turned off AES-NI just for troubleshooting a while ago. The problem followed from the old hardware to the new hardware, so it seems to be in my settings. I did a fresh install, then restored the settings to the new box using the XML file from the old box. Oddly enough, I do notice that when the LAN side hangs, if I run a quick Wireshark PCAP on my mirrored port, there is often a print job going through the network. It could be because this is a school and the teachers print a lot (just coincidence?), or possibly something about the print jobs is hanging up pfSense?



  • Drop Squid & SquidGuard to see if they are the source of the issues.
    What NICs are in there? What's in the syslogs?



  • I appreciate the help. I tried cutting off Squid and SquidGuard and the problem persists. Tomorrow I'm going to try to kill OpenVPN and see if that does it. The system logs are pretty clean. During the downtime panic, no log entries are triggered. I'm going to try killing OpenVPN, though, because I do notice there are a lot of people "knocking at the door" from all around the world. I know that everyone gets people trying to enter, but we have been seeing a very high number lately (anywhere from 5-15 random IPs from around the world trying to get in through the VPN but failing the TLS handshake, or I catch them doing a port scan and block and report them).

    The NIC is an Intel card and shows up as igb0 and igb1. It's the card with two gigabit RJ45 ports and two 10 Gb fiber ports, so I just use the two RJ45 ports. I will say that the problem seems to be intensifying: it now occurs 3 to 10 times a day while users are in the building. Over the past weekend, though, I was running a ping watchdog and never had a single drop, so it does seem to be triggered by traffic.
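
A ping watchdog like the one mentioned above can be sketched in a few lines of Python. This is only an illustration, not the tool the poster used: the function names (`ping_once`, `watch`, `find_outages`) and the three-failure threshold are assumptions. It relies on the system `ping` command and timestamps each probe so that outage windows can be lined up against switch and firewall logs.

```python
import platform
import subprocess
import time


def ping_once(host, timeout_s=2.0):
    """Send one echo request with the system ping; True if a reply came back."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # give up if ping itself hangs
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def watch(host, interval_s=1.0, duration_s=60.0):
    """Probe host repeatedly for duration_s seconds; return (timestamp, ok) samples."""
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append((time.time(), ping_once(host)))
        time.sleep(interval_s)
    return samples


def find_outages(samples, min_failures=3):
    """Return (start, end) timestamps for runs of >= min_failures failed probes."""
    outages = []
    run = []
    for timestamp, ok in samples:
        if ok:
            if len(run) >= min_failures:
                outages.append((run[0], run[-1]))
            run = []
        else:
            run.append(timestamp)
    if len(run) >= min_failures:  # outage still in progress at the end
        outages.append((run[0], run[-1]))
    return outages
```

For example, `find_outages(watch("192.168.1.1", duration_s=3600))` would list the outage windows seen over an hour; the address is a placeholder for whatever LAN host is being monitored.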





  • Yeah, I tried that too. At this point I've ruled out all of the plug-ins / add-ons for pfSense. The network drops are very frequent now, as in once an hour. I haven't been able to find any network loops throughout the building. I have RSTP enabled on the switches, so they should be catching any loops anyway. Wireshark PCAPs from a mirrored port on the switch feeding pfSense don't show any crazy traffic. I cannot seem to find any pattern in the system logs either.

    There has been one new development: the VoIP system cuts out during the pfSense LAN blackouts. The phones are on their own VLAN with no need to use pfSense for routing, since they should be able to route through the switches and get a phone line out on their T1 connection. So I'm not sure why they are cutting out. Intuition would tell me that the problem must be originating in the switches, but if so, then why does a quick re-root of pfSense fix the issue?
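
One way to make "crazy traffic" in a capture more concrete is to count broadcast and multicast frames per second, since a loop or storm usually shows up as a spike in those. The sketch below is a minimal illustration that assumes the capture has already been exported as `(timestamp_seconds, destination_mac)` pairs (for instance from a Wireshark CSV export); the function names and the threshold are hypothetical.

```python
from collections import Counter


def is_broadcast_or_multicast(mac):
    """True if the group bit (lowest bit of the first octet) is set in the MAC."""
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    return bool(first_octet & 1)


def flooded_seconds(frames, threshold):
    """Return the whole seconds in which at least `threshold` group frames were seen.

    frames: iterable of (timestamp_seconds, destination_mac) pairs.
    """
    per_second = Counter()
    for timestamp, dst_mac in frames:
        if is_broadcast_or_multicast(dst_mac):
            per_second[int(timestamp)] += 1
    return sorted(sec for sec, count in per_second.items() if count >= threshold)
```

A sensible threshold depends on the network; on a segment with well over a thousand clients, the normal broadcast rate is already substantial, so it helps to compare against a capture taken while everything is working.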


  • Banned

    @jonahparks said in Network Drops on my LAN side, a quick "5" then "r" for reroot will bring everything back, but I can't find the root:

    Intuition would tell me that the problem then must be originating in the switches, but if so then why does a quick re-root of pfSense fix the issue?

    For one thing, it's a reboot, not a reroot. As for your question: rebooting usually cycles the power to the NICs too (or at least resets them), which is similar to pulling and re-plugging the Ethernet cable. So you could try that instead of rebooting and see if it helps; if it does, it could very well be an issue with the switch.
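
The "reset the NIC instead of rebooting" test can also be done in software by taking the interface down and bringing it back up with `ifconfig`. The sketch below only illustrates that idea under some assumptions: the interface name passed in (for example `igb1`) is a guess, since the thread does not say which port is the LAN, and the helper names are hypothetical. It defaults to a dry run that just reports the commands, and a real run should be done from the console, because bouncing the LAN interface drops any session that arrives over it.

```python
import subprocess
import time


def bounce_commands(iface):
    """Build the ifconfig commands that take an interface down and back up."""
    return [["ifconfig", iface, "down"], ["ifconfig", iface, "up"]]


def bounce(iface, pause_s=2.0, dry_run=True):
    """Bounce an interface; with dry_run=True only report what would be run."""
    down, up = bounce_commands(iface)
    issued = []
    for cmd in (down, up):
        issued.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # needs root on the firewall
            if cmd is down:
                time.sleep(pause_s)  # let the switch port see the link drop
    return issued
```

If the LAN recovers after the bounce alone, that points at link or switch state rather than at anything the re-root restarts in userland.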


  • Netgate Administrator

    I imagine he really does mean re-root here. Entering r at option 5 does a re-root.

    But, yes, re-connecting the interfaces could be a clue if that brings back the connection.

    If it's affecting traffic that doesn't go through the firewall at all, that does seem like a switch issue, I agree.

    Steve



  • So after further digging, I found that my Unifi system, in its previous update, automatically turned on "Wireless Meshing" between 2 of my 50 WAPs. Both are hardwired, so there is no need for meshing, and no problems occurred for the first 3 weeks after the update (which is why I never suspected it). But if one WAP gets overloaded with traffic and misses its heartbeat, it creates a temporary wireless bridge to reconnect, which creates a network loop. For some reason the switches' RSTP setting isn't picking up the loop, and it's making its way all the way up the food chain to the pfSense box, since the wireless bridge resides on a VLAN and needs routing through to the LAN. Since I turned off the Wireless Meshing setting, everything has cleared up. I'm hopeful that this was the root of the problem and that the peacefulness continues. I'll keep you guys informed, and I appreciate all the help!
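
The loop described above can be illustrated with a toy model of the layer-2 topology: as long as the links form a tree there is exactly one path between any two devices, and any extra link (here, the wireless mesh bridge between two wired WAPs) closes a cycle. The sketch below uses a union-find structure to flag such redundant links. It is a model of the topology only, not of how RSTP or the Unifi controller works, and the device names are made up.

```python
def find_loop_links(links):
    """Return the links that close a cycle, given (device_a, device_b) pairs."""
    parent = {}

    def find(node):
        # Follow parent pointers to the root of the node's group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected, so this link creates a loop.
            redundant.append((a, b))
        else:
            parent[root_a] = root_b
    return redundant


# Hypothetical topology: two WAPs wired to the same switch.
wired = [("core-switch", "wap-1"), ("core-switch", "wap-2")]
mesh_bridge = [("wap-1", "wap-2")]  # temporary wireless uplink

print(find_loop_links(wired))                # no redundant links
print(find_loop_links(wired + mesh_bridge))  # the mesh bridge closes a loop
```

In a real network spanning tree is supposed to block one of the redundant paths; the point of the model is just that a wireless bridge between two already-wired access points is exactly the kind of extra edge that needs blocking.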

