Upgrade to 21.02 - High latency between local VLANs
-
After our upgrade to 21.02-p1, we are seeing very high latency between some hosts on different local VLANs / subnets. I believe we upgraded from whatever the latest 2.4.x release was.
In addition to being used as a firewall, our Netgate XG-7100 is serving as a layer 3 router between multiple local private VLANs. There are no firewall rules, beyond the default allow all, between VLANs.
For the sake of example and simplicity, let's say that we have two VLANs, and let's call them 101 and 102 for reference. Each VLAN is set up as a /24 subnet. Let's also say that there are two hosts on 101, which we'll call 101.2 & 101.3, and two hosts on 102, called 102.2 & 102.3. Let's call the gateways for the two VLANs/subnets 101.1 and 102.1 respectively.
The following are ping tests that demonstrate the issue we are seeing:
- 101.2 pings 101.3 host (ok < 1ms)
- 101.2 pings 101.1 gateway (ok < 1ms)
- 101.2 pings 102.1 gateway (ok < 1ms)
- 101.2 pings 102.2 host (high latency - 500+ ms)
- 101.2 pings 102.3 host (ok < 1ms)
...and from a reversed point of view...
- 102.2 pings 102.3 host (ok < 1ms)
- 102.2 pings 102.1 own gateway (ok < 1ms)
- 102.2 pings 101.1 gateway (ok < 1ms)
- 102.2 pings 101.2 host (high latency - 500+ ms)
- 102.2 pings 101.3 host (ok < 1ms latency)
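For anyone who wants to repeat this kind of test matrix, here is a rough sketch in Python, assuming a Linux host with the iputils `ping` (`-c` sets the packet count, `-q` prints only the summary). The function names and the IP addresses in the usage comment are made up for illustration; they are not our real addressing.

```python
import subprocess


def ping_command(target, count=5):
    """Build the ping command line for one target (iputils ping on Linux)."""
    # -c: number of echo requests to send, -q: quiet, print only the summary
    return ["ping", "-c", str(count), "-q", target]


def run_ping(target, count=5):
    """Run ping against one target and return its text output."""
    result = subprocess.run(
        ping_command(target, count), capture_output=True, text=True
    )
    return result.stdout


# Example usage from host 101.2 (hypothetical addresses, adjust to your network):
#
#   for name, ip in {"101.3 host": "10.0.101.3",
#                    "101.1 gateway": "10.0.101.1",
#                    "102.1 gateway": "10.0.102.1",
#                    "102.2 host": "10.0.102.2",
#                    "102.3 host": "10.0.102.3"}.items():
#       print(name)
#       print(run_ping(ip))
```

Running the same loop from a host on the other VLAN gives the reversed view.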
Here are some actual ping statistics for one of the problematic scenarios described above:
--- high latency host ping statistics (101.2 to 102.2) ---
3508 packets transmitted, 3507 received, 0.0285063% packet loss, time 4168ms
rtt min/avg/max/mdev = 0.271/661.461/2898.586/504.081 ms, pipe 3
And to a different host on the same subnet as the first example, which does not show the issue:
--- normal latency ping statistics (101.2 to 102.3) ---
64 packets transmitted, 64 received, 0% packet loss, time 174ms
rtt min/avg/max/mdev = 0.339/0.719/1.673/0.287 ms
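To compare results like these across many host pairs, the rtt summary line can be parsed with a small script. This is only a sketch: it assumes the Linux iputils output format shown above, and `parse_rtt` and the 100 ms threshold are names and values picked for illustration.

```python
import re

# Matches the rtt summary line printed by iputils ping, for example:
# "rtt min/avg/max/mdev = 0.339/0.719/1.673/0.287 ms"
RTT_RE = re.compile(
    r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)


def parse_rtt(summary):
    """Extract min/avg/max/mdev (in ms) from ping summary text."""
    match = RTT_RE.search(summary)
    if match is None:
        raise ValueError("no rtt summary line found")
    rtt_min, rtt_avg, rtt_max, rtt_mdev = (float(x) for x in match.groups())
    return {"min": rtt_min, "avg": rtt_avg, "max": rtt_max, "mdev": rtt_mdev}


# The two summary lines quoted in this post.
slow = parse_rtt("rtt min/avg/max/mdev = 0.271/661.461/2898.586/504.081 ms, pipe 3")
normal = parse_rtt("rtt min/avg/max/mdev = 0.339/0.719/1.673/0.287 ms")

THRESHOLD_MS = 100.0  # arbitrary cutoff chosen for this example

for label, stats in (("101.2 -> 102.2", slow), ("101.2 -> 102.3", normal)):
    verdict = "HIGH" if stats["avg"] > THRESHOLD_MS else "ok"
    print(f"{label}: avg {stats['avg']:.3f} ms ({verdict})")
```

Note that the minimum RTT on the slow path (0.271 ms) is about the same as on the normal path, so at least some packets get through at normal speed; the average and mdev are what blow up.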
There doesn't seem to be any rhyme or reason to which hosts exhibit the behaviour. Some are Windows, some are Linux, and they run different versions of each. I have not been able to detect a pattern. Everything was smooth for the last year; it was only after updating to 21.02-p1 that we started seeing these issues. I'm at a loss here, and any feedback or requests for additional information would be very much appreciated.
-
I ended up downgrading back to 2.4.5-p1 and everything is fine again. Maybe there's something specific to our configuration, but even with Netgate hardware, it doesn't look to me like 21.x is ready for prime time just yet. Maybe it's just us.
In case it helps anyone, there were at least three show-stopper issues that we found before we gave up:
1) Severe routing latency between VLANs
2) DNS Resolver (unbound) crashing if "Register DHCP leases in the DNS Resolver" is enabled.
3) OpenVPN completely unusable (users can't connect, widget says there is a problem, services say everything is fine).