Intermittent high latency between two LAN interfaces
-
Is there a specific reason you're not running 22.01 yet?
-
No reason, just haven't spent the time to do the upgrade yet.
I don't see anything in the release notes that would indicate a fix for this issue, so I'd prefer to invest time in gathering additional debug data first rather than upgrading if the act of upgrading by itself is unlikely to resolve this problem.
-
Indeed, and in fact there is a known issue in 22.01 that might present like that. As far as I know, though, there's nothing in 21.05.2 that did.
What is lagg1 there?
Do you see errors on any interfaces in Status > Interfaces?
Are you running any packages?
Steve
-
Thanks, it's good to know about that similar issue in 22.01.
The lagg1 interface is a lagg between ix0 and ix1 with LAGG Protocol set to FAILOVER and Failover Master Interface set to auto.
"Do you see errors on any interfaces in Status > Interfaces?"
Yes, the LAN interface (gray on the above MTR) has the following:
In/out errors 0/456 Collisions 0
And the OPT1 interface (blue above) has the following:
In/out errors 0/166 Collisions 0
Installed packages:
- Cron
- mtr-nox11
- openvpn-client-export
- pfBlockerNG-devel
- zabbix-agent4
-
Hmm, that's not ideal. What does netstat -i show at the command line?
Could be errors on the lagg or one of the members, not just the VLAN.
Did this just start happening?
-
This has been happening for a few months but seems to have gotten worse over the past few weeks. Here's the output of netstat -i. I removed the specific MAC addresses and the IPv6 lines since those aren't relevant here:

Name  Mtu   Network        Address           Ipkts       Ierrs Idrop Opkts       Oerrs Coll
ix0   1500  <Link#1>       00:11:22:33:44:55 70322185682 423   0     70804435446 0     0
ix1   1500  <Link#2>       00:11:22:33:44:55 24934970    2     0     19091       0     0
ix2   1500  <Link#3>       66:77:aa:bb:cc:dd 1012533026  0     0     493073228   0     0
ix3   1500  <Link#14>      66:77:aa:bb:cc:dd 4914        0     0     0           0     0
lo0   16384 <Link#15>      lo0               2183050     0     0     2183050     0     0
lo0   -     localhost      localhost         0           -     -     0           -     -
lo0   -     your-net       localhost         2183049     -     -     2183050     -     -
enc0* 1536  <Link#16>      enc0              0           0     0     0           0     0
pflog 33160 <Link#17>      pflog0            0           0     0     134021      0     0
pfsyn 1500  <Link#18>      pfsync0           0           0     0     0           0     0
lagg0 1500  <Link#19>      66:77:aa:bb:cc:dd 1012537940  0     0     493073228   3     0
lagg1 1500  <Link#20>      00:11:22:33:44:55 70347118689 425   0     70804454537 622   0
lagg0 1500  <Link#21>      66:77:aa:bb:cc:dd 1004528099  0     0     492383298   2     0
lagg0 -     <WAN-ONE>      <WAN-ONE-IP>      14472999    -     -     35          -     -
lagg0 1500  <Link#22>      66:77:aa:bb:cc:dd 0           0     0     7           0     0
lagg1 1500  <Link#23>      00:11:22:33:44:55 45500619462 0     0     24028805736 456   0
lagg1 -     <GRAY-NETWORK> <GRAY-NETWORK-IP> 40479752    -     -     80336493    -     -
lagg1 1500  <Link#24>      00:11:22:33:44:55 24821582530 0     0     46775648818 166   0
lagg1 -     <BLUE-NETWORK> <BLUE-NETWORK-IP> 358175778   -     -     358086480   -     -
lagg0 1500  <Link#25>      66:77:aa:bb:cc:dd 8004823     0     0     689918      1     0
lagg0 -     <WAN-TWO>      <WAN-TWO-IP>      526569      -     -     0           -
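For anyone who wants to pick the non-zero error counters out of a listing like this without reading it by eye, here is a minimal Python sketch. It assumes the ten-column layout shown above with every column populated; the function names parse_netstat_i and rows_with_errors are made up for illustration, and the sample text is a subset of the rows already posted.

```python
def parse_netstat_i(text):
    """Parse link-level rows of `netstat -i` output into dicts.

    Assumes ten whitespace-separated columns, as in the output posted above.
    Address rows (which show "-" for Mtu) and the header row are skipped.
    """
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 10 or not fields[1].isdigit():
            continue
        name, _mtu, link, _addr, ipkts, ierrs, _idrop, opkts, oerrs, _coll = fields
        rows.append({
            "name": name,
            # Names appear truncated to five characters in the posted output,
            # so the <Link#N> tag is kept to tell VLAN rows apart.
            "link": link,
            "ipkts": int(ipkts),
            "ierrs": int(ierrs),
            "opkts": int(opkts),
            "oerrs": int(oerrs),
        })
    return rows


def rows_with_errors(rows):
    """Return only the rows whose input or output error counter is non-zero."""
    return [r for r in rows if r["ierrs"] or r["oerrs"]]


# Subset of the rows posted above.
SAMPLE = """\
Name Mtu Network Address Ipkts Ierrs Idrop Opkts Oerrs Coll
ix0 1500 <Link#1> 00:11:22:33:44:55 70322185682 423 0 70804435446 0 0
ix1 1500 <Link#2> 00:11:22:33:44:55 24934970 2 0 19091 0 0
ix2 1500 <Link#3> 66:77:aa:bb:cc:dd 1012533026 0 0 493073228 0 0
lagg1 1500 <Link#20> 00:11:22:33:44:55 70347118689 425 0 70804454537 622 0
lagg1 1500 <Link#23> 00:11:22:33:44:55 45500619462 0 0 24028805736 456 0
lagg1 - <GRAY-NETWORK> <GRAY-NETWORK-IP> 40479752 - - 80336493 - -
"""

if __name__ == "__main__":
    for row in rows_with_errors(parse_netstat_i(SAMPLE)):
        print(row["name"], row["link"], "in:", row["ierrs"], "out:", row["oerrs"])
```

Rows with a missing Address column would not have ten fields and would be skipped by this sketch, so treat it as a starting point rather than a general parser.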
-
What is lagg1 connected to? Are you unable to run LACP there?
You might try swapping the lagg connections and see if the input errors follow that.
None of that explains 9s pings though.
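To put numbers on that, here is a rough Python calculation of errors per packet using the counters from the netstat -i output posted earlier. The helper name error_ratio is just for illustration. Each ratio works out to fewer than one error per hundred million packets.

```python
def error_ratio(errors, packets):
    """Return errors per packet, or 0.0 when no packets were counted."""
    return errors / packets if packets else 0.0


# (errors, packets) pairs copied from the netstat -i output in this thread.
COUNTERS = {
    "ix0 input": (423, 70322185682),
    "lagg1 input": (425, 70347118689),
    "lagg1 output": (622, 70804454537),
}

if __name__ == "__main__":
    for label, (errors, packets) in COUNTERS.items():
        print(f"{label}: {error_ratio(errors, packets):.2e} errors per packet")
```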
I would try going to 22.01 if you can. There is a workaround patch for the known issue there if you hit it. Even if you do though you still wouldn't see pings at 9000ms.
Steve
-
Correct, lagg1 cannot support LACP on the other end because the other end of lagg1 is a set of two independent switches (not stacked). I am really only interested in fault tolerance when aggregating the links, so using the FAILOVER protocol is sufficient.
Is there a way for me to see which VLAN (e.g. lagg1.X) had those errors?
Is there something about 22.01 that would make it more likely to not have this issue, or give me better visibility? I can't explain why I saw 9000ms pings on 21.05.2-RELEASE, so I'm concerned that same situation will just occur again on 22.01.
-
Not beyond what you see from 'netstat -i' above. You can see the out errors there on the VLAN subnets.
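As a quick cross-check of the figures already posted, the out errors on the two VLAN rows add up to the out errors on the parent lagg1 row. The short Python sketch below only does that arithmetic; the mapping of <Link#23> to the gray (LAN) network and <Link#24> to the blue (OPT1) network is read off the address rows in the output above. That the parent counter is an aggregate of the VLAN counters is an inference from these numbers, not something confirmed in this thread.

```python
def total_out_errors(per_vlan):
    """Sum the out-error counters of the VLAN rows."""
    return sum(per_vlan.values())


# Oerrs values copied from the netstat -i output posted earlier.
PARENT_LAGG1_OERRS = 622
VLAN_OERRS = {
    "<Link#23> gray (LAN)": 456,
    "<Link#24> blue (OPT1)": 166,
}

if __name__ == "__main__":
    total = total_out_errors(VLAN_OERRS)
    print("VLAN total:", total, "parent:", PARENT_LAGG1_OERRS)
    print("match:", total == PARENT_LAGG1_OERRS)
```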
Do you see any link changes when you see the very high latency?
Can you test running without the lagg?
Steve
-
I haven't noticed any link changes when I see the high latency. As far as testing without the lagg, can I just unplug one of the cables to do the test (forcing all of the traffic to go through one port with no other option) or is that not sufficient (do I actually need to reconfigure to remove the lagg interface itself in the config)?
-
I captured another instance of this disruption today; this time the latency is much lower (but still high) and there's some packet loss:
When this happens, TCP connections (e.g. an SSH session) get dropped.
The errors on ix0 and lagg1 have increased slightly:
Name  Mtu   Network   Address           Ipkts       Ierrs Idrop Opkts       Oerrs Coll
ix0   1500  <Link#1>  00:11:22:33:44:55 70482923217 426   0     70965439161 0     0
lagg1 1500  <Link#20> 00:11:22:33:44:55 70507907764 428   0     70965458252 622   0
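To make "increased slightly" concrete, here is a small Python sketch that subtracts the first set of counters from the second. The values are copied from the two netstat -i outputs in this thread, and the helper name counter_deltas is made up for illustration. Between the two captures, ix0 and lagg1 each gained three input errors while the lagg1 output errors did not change.

```python
def counter_deltas(before, after):
    """Return per-interface counter differences (after minus before).

    Interfaces missing from the second snapshot are skipped.
    """
    deltas = {}
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            continue
        deltas[name] = {field: new[field] - old[field] for field in old}
    return deltas


# Counters copied from the two netstat -i outputs posted in this thread.
FIRST = {
    "ix0": {"ipkts": 70322185682, "ierrs": 423, "opkts": 70804435446, "oerrs": 0},
    "lagg1": {"ipkts": 70347118689, "ierrs": 425, "opkts": 70804454537, "oerrs": 622},
}
SECOND = {
    "ix0": {"ipkts": 70482923217, "ierrs": 426, "opkts": 70965439161, "oerrs": 0},
    "lagg1": {"ipkts": 70507907764, "ierrs": 428, "opkts": 70965458252, "oerrs": 622},
}

if __name__ == "__main__":
    for name, delta in counter_deltas(FIRST, SECOND).items():
        print(name, delta)
```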
-
Hmm, you would not expect some minor packet loss to cause TCP connections to fail. You would just see retransmissions.
Unless all of those failures were happening at the same time so it times out. That would take a while though.
This starts to look more like a duplicate IP or a packet loop. You can see that if you have a loop that's blocked by STP and it periodically resets.
Removing one link from the lagg entirely might prove that.
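One way to look for evidence of a duplicate IP is to scan the system log for the kernel's "arp: ... moved from ... to ... on ..." messages and see whether any address keeps flipping between MAC addresses. The Python sketch below assumes that message format (worth checking against your own log before relying on it). The function name flapping_ips, the sample log lines, the 192.0.2.x addresses, and the lagg1.10 interface name are all made up for illustration and are not taken from this system.

```python
import re
from collections import defaultdict

# Assumed log format: "arp: <ip> moved from <mac> to <mac> on <interface>".
MOVED_RE = re.compile(
    r"arp: (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) moved from "
    r"(?P<old>[0-9a-f:]{17}) to (?P<new>[0-9a-f:]{17}) on (?P<ifname>\S+)"
)


def flapping_ips(log_lines, min_moves=2):
    """Map each IP that moved at least min_moves times to the MACs involved."""
    macs = defaultdict(set)
    moves = defaultdict(int)
    for line in log_lines:
        match = MOVED_RE.search(line)
        if not match:
            continue
        ip = match.group("ip")
        moves[ip] += 1
        macs[ip].update((match.group("old"), match.group("new")))
    return {ip: sorted(macs[ip]) for ip, count in moves.items() if count >= min_moves}


# Hypothetical log lines, invented for this example.
SAMPLE_LINES = [
    "Feb 20 10:00:01 fw kernel: arp: 192.0.2.10 moved from 00:11:22:33:44:55 to 66:77:aa:bb:cc:dd on lagg1.10",
    "Feb 20 10:00:07 fw kernel: arp: 192.0.2.10 moved from 66:77:aa:bb:cc:dd to 00:11:22:33:44:55 on lagg1.10",
    "Feb 20 10:05:00 fw kernel: arp: 192.0.2.20 moved from 00:11:22:33:44:55 to 66:77:aa:bb:cc:dd on lagg1.10",
    "Feb 20 10:06:00 fw kernel: ix0: link state changed to UP",
]

if __name__ == "__main__":
    for ip, seen in flapping_ips(SAMPLE_LINES).items():
        print(ip, "seen with", ", ".join(seen))
```

An address that moves once may just be a replaced device, which is why the sketch only reports addresses that moved at least twice; the threshold is an arbitrary choice.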
Steve