Problem pinpointing network issues
-
For the past few weeks I've been having these bizarre network issues where I lose the connection to another computer on the network via RDP or SSH, and the whole network drops all packets for a few seconds when this happens.
I've noticed a sure trigger is to download something, for instance a software update of less than 100MB. But several simultaneous streams from Netflix, HBO, YouTube, you name it, don't crash the network. I'm not saying that merely starting a download will crash the network, it's just a sure way to cause it... when it's "having the issue". By that I mean it's not always happening, it's sporadic, and that makes it so much harder to pinpoint. I've been trying to learn Wireshark but I'm nowhere near having a clue of WTF to look for.
I didn't know where to file this post to begin with! I think I've found where it starts but I don't know how to interpret it. I didn't even know how to export a range from Wireshark, so in desperation I just took this screenshot:
That's a capture done from pfSense, focusing on a single host that I was connected to via SSH from another computer on the network; I then started a download to trigger the network crash. After that, there are a lot of errors and a lot of these packets:
5530 15.374172 0.001244 52.85.35.230 10.0.0.32 TLSv1.2 1466 Ignored Unknown Record
I assume that would be a legitimate incoming packet that got screwed up along the way, and then some machine, I don't know which, doesn't know how to deal with it.
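When a capture shows a lot of errors like this, a quick tally of what dominates it can help. Below is a minimal sketch, assuming the packet list has been exported from Wireshark as CSV with the default column names ("No.", "Time", "Source", "Destination", "Protocol", "Length", "Info"); a customised column profile would need different keys. The function name `summarize_info` and the sample rows are hypothetical, modelled on the packet line quoted above.

```python
import csv
import io
from collections import Counter

# Hypothetical sample shaped like a Wireshark packet-list CSV export.
# Only row 5530 mirrors the post; the other rows are made up for illustration.
SAMPLE_CSV = '''"No.","Time","Source","Destination","Protocol","Length","Info"
"5529","15.372928","52.85.35.230","10.0.0.32","TLSv1.2","1466","Application Data"
"5530","15.374172","52.85.35.230","10.0.0.32","TLSv1.2","1466","Ignored Unknown Record"
"5531","15.375001","52.85.35.230","10.0.0.32","TLSv1.2","1466","Ignored Unknown Record"
'''


def summarize_info(csv_text, top=10):
    """Return the most common (Protocol, Info) pairs in a CSV packet list.

    Assumes the default Wireshark column headers; change the keys below
    if the export uses a different column profile.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter((row["Protocol"], row["Info"]) for row in reader)
    return counts.most_common(top)


if __name__ == "__main__":
    for (protocol, info), count in summarize_info(SAMPLE_CSV):
        print(f"{count:6d}  {protocol:10s}  {info}")
```

For a real export, read the file with `open(path, newline="")` and pass its contents to `summarize_info`. This only shows which messages are most frequent; it doesn't say which device is at fault.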
The sheer transfer rate doesn't seem to affect it, as I regularly move gigabytes over and over, and media streaming, the other big consumer, keeps up fine. I assume streaming works over UDP or some sort of multicast-type technology that's not lighter but easier to process, because several TB each month are consumed by streaming media. I haven't really analyzed it deeply, and even if I did I probably wouldn't know what to make of it. The big transfers I mentioned earlier, moving VM disks and stuff, don't require NAT; downloads, however, do.
So I thought I'd dig into that, but the router is nowhere near capacity in any way: not in RAM, not in CPU, not in MBUFs or whatever those things that are measured on the NICs are called. In fact, that one bar never goes over 3% or so. CPU is at around 20-35% under heavy load, and memory sits at about 80% of 8GB when Squid is on, because I give it 2GB. ntopng and Suricata are the other big consumers, but they have always worked without causing issues. When Squid is off, because I thought something wasn't right or for some other reason, RAM is about halfway.
Recently I've been getting lovely hack attempts in Suricata's logs, but not more than usual, and they aren't enough to weaken pfSense. At most they would bring down the connection because of the traffic, and I'd get another IP address on redial, effectively locking them out. That's about the only good side of dynamic IP addresses. Another thing ruled out.
I thought my local network was overloaded because I was running several VMs, all with networked storage, and all the traffic was just too much, but some digging around on the internet keeps pointing back to the firewall. It is pfSense that does inter-VLAN routing after all, and this really heavy traffic is confined to a single VLAN, but since VLAN trunks go from switch-to-switch-to-switch-..., at some point the firewall gets involved. I've ruled out M/R/STP, multicast VLANs/IGMP Snooping and every other technology that might divert or block traffic from a VLAN.
And to finish me off completely, as if the network was mocking me, STREAMING DOESN'T SKIP A BEAT when the network crashes! It is bursty traffic, but still, it takes several seconds for the network to come back, and Apple TV, Roku, Chromecast, none of them even realize there is no network. The one media-related device that does is the Harmony Hub: its little green LED changes to red to indicate it's lost its connection, but it's not actually disconnected from the wireless network, and I can ping it to force the LED to turn green again when it detects traffic.
Do you have some advice on what to look for? Anything is good, I'm desperate! Big big thank you if you do, even if it's not a sure thing; it couldn't hurt given my ignorance on the subject. I could always restore one of my many many backups. (Already tried previous configs BTW, didn't work, something is causing it)
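Because the drops are sporadic and only last a few seconds, it may help to log them with timestamps so they can be lined up against a capture. Here is a minimal sketch, assuming a TCP connect to the SSH port is an acceptable reachability test; the host `10.0.0.32`, port 22, the one-second interval and the function names are illustrative assumptions, not anything from pfSense or Wireshark.

```python
import socket
import time


def probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_outages(samples, min_gap=2.0):
    """Turn (timestamp, reachable) samples into (start, end) outage windows.

    samples must be sorted by timestamp. A window runs from the first failed
    probe to the next successful one and is kept only if it lasts at least
    min_gap seconds. A failure run still open at the end is not reported.
    """
    outages = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            if ts - start >= min_gap:
                outages.append((start, ts))
            start = None
    return outages


if __name__ == "__main__":
    # Assumed target: the host from the capture, probed on the SSH port.
    samples = []
    for _ in range(60):
        samples.append((time.time(), probe("10.0.0.32", 22)))
        time.sleep(1.0)
    for start, end in find_outages(samples):
        print(time.ctime(start), "->", time.ctime(end))
```

Running it from two machines at once (one on the same VLAN as the target, one on a different VLAN) would hint at whether only routed traffic is affected, though that is a rough test rather than proof.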
-
Solved it, it wasn't the network. It was pfSense itself. I tried another box and everything went back to normal. I guess some code was misbehaving within pfSense itself.
I was this close *makingfingersabouttotouchhandgesture* to dusting off my license for Mikrotik's CHR to see whether its complicated management at least came with insights to match. Glad I didn't, I needed to sleep already! :)