Potentially solved issue with network outages, but why?
Posting this, in part, to see if it may help others experiencing similar problems, but would love to see if anyone could explain why it was happening or why what I did may have fixed it.
I was using pfSense in an ESXi VM for over a year with no issues on my home network. The ESXi setup was pretty straightforward, with a NIC dedicated to the pfSense VM for WAN connected to my cable modem, a NIC dedicated to the pfSense VM for LAN connected to my core switch, and a NIC for the VM Network used by all other VMs connected to the core switch. The network also uses 2 Ubiquiti APs for Wi-Fi. There is also a site-to-site IPsec tunnel going to my vacation house, which also runs pfSense.
Everything ran smoothly until I replaced my ESXi box with new hardware and migrated all of the VMs over. After reconfiguring, everything worked, but I would experience frequent, intermittent network dropouts, where I would lose network access to the pfSense VM and the APs, and all connected devices would drop offline. This would occur 2-3 times a day. Wired devices, for the most part, could still connect to internal hosts via IP, but DNS and the Internet were inaccessible. pfSense would start working again in about 2-3 minutes and restore connectivity, but it was not a reboot.
I've never really been able to confirm what was stopping or restarting within pfSense, but I suspected the DNS resolver, as the prominent symptom (aside from pfSense becoming inaccessible) was the lack of DNS resolution, both on the internal network and out to the Internet.
Although the drops happened at seemingly random times, they would frequently coincide with when my wife needed to use online conferencing for work (they use UberConference). 95% of the time she had a work call, the network would drop out. The best I could get out of Wireshark was that the number of TCP retries would spike, but there was no corresponding "smoking gun" as to what was breaking during these events.
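For anyone trying to pin down a similar pattern (internal hosts reachable by IP, but pfSense, DNS, and the Internet gone), here is a minimal sketch of a probe that could run on a wired client and log which layer fails first. It is not something I used at the time. The addresses, the port numbers, and the function names are all placeholders I made up for illustration, so adjust them to your own network.

```python
import socket
import time
from datetime import datetime

# All targets below are placeholders (assumptions), not values from this post.
GATEWAY = ("192.168.1.1", 443)   # hypothetical pfSense LAN address and web UI port
EXTERNAL_IP = ("1.1.1.1", 443)   # public host reached by IP, so DNS is not involved
DNS_NAME = "example.com"         # name looked up through the system resolver


def tcp_reachable(addr, timeout=2.0):
    """Return True if a TCP connection to addr (host, port) succeeds."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False


def dns_resolves(name):
    """Return True if the system resolver returns at least one address.

    Note: getaddrinfo has no timeout argument, so this call can block
    for a while during an outage, and local caches may mask a failure.
    """
    try:
        return bool(socket.getaddrinfo(name, None))
    except OSError:
        return False


def classify(gateway_ok, internet_ok, dns_ok):
    """Map the three probe results to a short label for the log."""
    if not gateway_ok:
        return "gateway unreachable"
    if not internet_ok:
        return "gateway up, no route to Internet"
    if not dns_ok:
        return "Internet reachable by IP, DNS failing"
    return "ok"


def monitor(interval=10.0, rounds=None):
    """Probe every `interval` seconds and print a line when the state changes."""
    last = None
    count = 0
    while rounds is None or count < rounds:
        state = classify(
            tcp_reachable(GATEWAY),
            tcp_reachable(EXTERNAL_IP),
            dns_resolves(DNS_NAME),
        )
        if state != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {state}")
            last = state
        count += 1
        time.sleep(interval)
```

Calling `monitor()` from a Python prompt and leaving it running gives a timestamped line each time the state changes, which can then be lined up against the pfSense logs for the same minute.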
It's worth noting that my ISP connection has been incredibly stable for over a decade. I've never had serious networking issues with my cable modem, and these issues don't appear to stem from the ISP connection, as it remains stable once pfSense recovers from whatever is going on.
In troubleshooting this over the past ~6 months, I've pretty much replaced every part of my network infrastructure. I replaced my unmanaged gigabit switch (which was giving me occasional lockups) with a Ubiquiti managed switch to try to get some more insight into the problem. I still had issues, and the only insight the Ubiquiti would give me was that the affected clients' "experience rating" would drop to 0%.
I then decided to replace my pfSense VM with a bare-metal install on a Protectli box. As the configuration on my VM was a holdover from my m0n0wall days, I decided to go as close to default as possible on the new box. Frustratingly, the dropouts still occurred on the new hardware with a very basic pfSense install.
After shutting down the old VM, I decided (partly out of desperation) to go as far as disconnecting the old VM's cables from my switch.
The only indicator I got from the pfSense logs on the new box around the time of these dropouts was that there was an address change on one or more of the IPsec tunnel endpoints, yet my IP address has remained consistent for years at a time (my ISP assigns them dynamically, but leaves you with the same IP unless there's a prolonged disconnection or a MAC change). Same ISP on both ends of the tunnel, same DHCP policy. I confirmed that the IP on the remote end of the tunnel hasn't changed.
The interesting thing is that the other day I disabled DHCP6 on the WAN interface, since my ISP doesn't use IPv6, and we've had 3 days with no network drops. I've confirmed with their support channel that they are not trying to assign IPv6 addresses via DHCP. I also confirmed that DHCP6 had been disabled on the old VM, but it was turned on by default on the fresh install.
So, the only troubleshooting/configuration changes that have occurred since the last network dropout several days ago were:
- Disconnected the VM LAN connection from the switch.
- Disabled DHCP6 on the WAN interface on pfSense.
So, I'm thrilled that I may have resolved the issue of the network drops. What I'm still frustrated with is that neither of these changes seems like it should have resolved the problem, yet one or both apparently did, and I have no idea why.