Intermittent high latency between two LAN interfaces

stephenw10

Is there a specific reason you're not running 22.01 yet?

amartin

No reason, just haven't spent the time to do the upgrade yet.

I don't see anything in the release notes that would indicate a fix for this issue, so I'd prefer to invest time in gathering additional debug data first rather than upgrading if the act of upgrading by itself is unlikely to resolve this problem.

stephenw10

Indeed, and in fact there is a known issue in 22.01 that might present like that. As far as I know, though, nothing in 21.05.2 did.

What is lagg1 there?

Do you see errors on any interfaces in Status > Interfaces?

Are you running any packages?

Steve

amartin

Thanks, it's good to know about that similar issue in 22.01.

The lagg1 interface is a lagg between ix0 and ix1 with LAGG Protocol set to FAILOVER and Failover Master Interface set to auto.

Regarding "Do you see errors on any interfaces in Status > Interfaces?":

Yes, the LAN interface (gray on the above MTR) has the following:

In/out errors 0/456 
Collisions 0

And the OPT1 interface (blue above) has the following:

In/out errors 0/166 
Collisions 0

Installed packages:

Cron
mtr-nox11
openvpn-client-export
pfBlockerNG-devel
zabbix-agent4

stephenw10

Hmm, that's not ideal. What does netstat -i show at the command line?

The errors could be on the lagg itself or on one of its members, not just the VLAN.

Did this just start happening?

amartin

This has been happening for a few months but seems to have gotten worse over the past few weeks. Here's the output of netstat -i (I removed the specific MAC addresses and the IPv6 lines since those aren't relevant here):

Name   Mtu    Network        Address            Ipkts        Ierrs  Idrop  Opkts        Oerrs  Coll
ix0    1500   <Link#1>       00:11:22:33:44:55  70322185682  423    0      70804435446  0      0
ix1    1500   <Link#2>       00:11:22:33:44:55  24934970     2      0      19091        0      0
ix2    1500   <Link#3>       66:77:aa:bb:cc:dd  1012533026   0      0      493073228    0      0
ix3    1500   <Link#14>      66:77:aa:bb:cc:dd  4914         0      0      0            0      0
lo0    16384  <Link#15>      lo0                2183050      0      0      2183050      0      0
lo0    -      localhost      localhost          0            -      -      0            -      -
lo0    -      your-net       localhost          2183049      -      -      2183050      -      -
enc0*  1536   <Link#16>      enc0               0            0      0      0            0      0
pflog  33160  <Link#17>      pflog0             0            0      0      134021       0      0
pfsyn  1500   <Link#18>      pfsync0            0            0      0      0            0      0
lagg0  1500   <Link#19>      66:77:aa:bb:cc:dd  1012537940   0      0      493073228    3      0
lagg1  1500   <Link#20>      00:11:22:33:44:55  70347118689  425    0      70804454537  622    0
lagg0  1500   <Link#21>      66:77:aa:bb:cc:dd  1004528099   0      0      492383298    2      0
lagg0  -      <WAN-ONE>      <WAN-ONE-IP>       14472999     -      -      35           -      -
lagg0  1500   <Link#22>      66:77:aa:bb:cc:dd  0            0      0      7            0      0
lagg1  1500   <Link#23>      00:11:22:33:44:55  45500619462  0      0      24028805736  456    0
lagg1  -      <GRAY-NETWORK> <GRAY-NETWORK-IP>  40479752     -      -      80336493     -      -
lagg1  1500   <Link#24>      00:11:22:33:44:55  24821582530  0      0      46775648818  166    0
lagg1  -      <BLUE-NETWORK> <BLUE-NETWORK-IP>  358175778    -      -      358086480    -      -
lagg0  1500   <Link#25>      66:77:aa:bb:cc:dd  8004823      0      0      689918       1      0
lagg0  -      <WAN-TWO>      <WAN-TWO-IP>       526569       -      -      0            -      -
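
For anyone who wants to pull the error counters out of a table like this, here is a rough Python sketch. It assumes the sanitized ten-column layout shown above (real netstat -i output can have blank fields, which this does not handle), and the function names are made up for illustration. It keeps only the <Link#N> rows and shows each error count next to the packet count it belongs to.

```python
# Column order copied from the header row of the netstat -i output above.
COLUMNS = ["Name", "Mtu", "Network", "Address", "Ipkts",
           "Ierrs", "Idrop", "Opkts", "Oerrs", "Coll"]
COUNTERS = ("Ipkts", "Ierrs", "Idrop", "Opkts", "Oerrs", "Coll")

# A few rows pasted from the output above (sanitized addresses kept as posted).
SAMPLE = """\
ix0    1500   <Link#1>       00:11:22:33:44:55  70322185682  423    0      70804435446  0      0
ix1    1500   <Link#2>       00:11:22:33:44:55  24934970     2      0      19091        0      0
lagg1  1500   <Link#20>      00:11:22:33:44:55  70347118689  425    0      70804454537  622    0
lagg1  1500   <Link#23>      00:11:22:33:44:55  45500619462  0      0      24028805736  456    0
lagg1  -      <GRAY-NETWORK> <GRAY-NETWORK-IP>  40479752     -      -      80336493     -      -
lagg1  1500   <Link#24>      00:11:22:33:44:55  24821582530  0      0      46775648818  166    0
"""


def parse_link_rows(text):
    """Return one dict per <Link#N> row; address rows with '-' counters are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) or not fields[2].startswith("<Link#"):
            continue
        row = dict(zip(COLUMNS, fields))
        for key in COUNTERS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def rows_with_errors(rows):
    """Keep only rows where the input or output error counter is non-zero."""
    return [r for r in rows if r["Ierrs"] or r["Oerrs"]]


def error_rate(errors, packets):
    """Errors as a fraction of packets; 0.0 when no packets were counted."""
    return errors / packets if packets else 0.0


if __name__ == "__main__":
    for r in rows_with_errors(parse_link_rows(SAMPLE)):
        print(r["Name"], r["Network"],
              "in:", r["Ierrs"], f"({error_rate(r['Ierrs'], r['Ipkts']):.2e})",
              "out:", r["Oerrs"], f"({error_rate(r['Oerrs'], r['Opkts']):.2e})")
```

Printing the rate alongside the raw count makes it easier to judge how significant a few hundred errors are against tens of billions of packets.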

stephenw10

What is lagg1 connected to? Are you unable to run LACP there?

You might try swapping the lagg connections and see if the input errors follow that.

None of that explains 9s pings though.

I would try going to 22.01 if you can. There is a workaround patch for the known issue there if you hit it. Even if you do hit it, though, you still wouldn't see pings at 9000ms.

Steve

amartin

Correct, lagg1 cannot support LACP on the other end because the other end of lagg1 is a set of two independent switches (not stacked). I am really only interested in fault tolerance when aggregating the links, so using the FAILOVER protocol is sufficient.

Is there a way for me to see which VLAN (e.g. lagg1.X) had those errors?

Is there something about 22.01 that would make it more likely to not have this issue, or give me better visibility? I can't explain why I saw 9000ms pings on 21.05.2-RELEASE, so I'm concerned that same situation will just occur again on 22.01.

stephenw10

Not beyond what you see from 'netstat -i' above. You can see the out errors there on the VLAN subnets.

Do you see any link changes when you see the very high latency?
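
If it helps, here is a minimal Python sketch for scanning a saved copy of the system log for link events. It assumes the kernel writes lines of the form "ix0: link state changed to DOWN", which is the usual FreeBSD wording. The sample log lines, the host name "fw", and the function name are made up for illustration.

```python
import re

# Assumed FreeBSD-style kernel message: "<ifname>: link state changed to UP|DOWN".
LINK_RE = re.compile(r"(?P<ifname>[\w.]+): link state changed to (?P<state>UP|DOWN)")

# Hypothetical log excerpt, not taken from the system in this thread.
SAMPLE_LOG = """\
Feb 20 10:15:01 fw kernel: ix0: link state changed to DOWN
Feb 20 10:15:02 fw kernel: some unrelated message
Feb 20 10:15:04 fw kernel: ix0: link state changed to UP
"""


def link_events(log_text):
    """Return (ifname, state, full_line) for every link state change found."""
    events = []
    for line in log_text.splitlines():
        match = LINK_RE.search(line)
        if match:
            events.append((match.group("ifname"), match.group("state"), line))
    return events


if __name__ == "__main__":
    for ifname, state, line in link_events(SAMPLE_LOG):
        print(ifname, state, "|", line)
```

Comparing the timestamps of any hits against the times of the latency spikes would show whether the two line up.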

Can you test running without the lagg?

Steve

amartin

I haven't noticed any link changes when I see the high latency. As far as testing without the lagg, can I just unplug one of the cables to do the test (forcing all of the traffic through one port with no other option), or is that not sufficient (do I actually need to remove the lagg interface itself from the config)?

amartin

I captured another instance of this disruption today; this time the latency is much lower (but still high) and there's some packet loss:

When this happens, TCP connections (e.g. an SSH session) get dropped.

The errors on ix0 and lagg1 have increased slightly:

Name   Mtu    Network        Address            Ipkts        Ierrs  Idrop  Opkts        Oerrs  Coll
ix0    1500   <Link#1>       00:11:22:33:44:55  70482923217  426    0      70965439161  0      0
lagg1  1500   <Link#20>      00:11:22:33:44:55  70507907764  428    0      70965458252  622    0
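
To make the change between the two snapshots explicit, here is a small Python sketch that subtracts the earlier counters from the later ones. The numbers are copied from the ix0 and lagg1 (<Link#20>) rows of the two netstat -i outputs in this thread; the function name is made up, and the sketch assumes the counters were not reset (for example by a reboot) between the snapshots.

```python
# Counters from the first netstat -i output in this thread.
BEFORE = {
    "ix0":   {"Ipkts": 70322185682, "Ierrs": 423, "Opkts": 70804435446, "Oerrs": 0},
    "lagg1": {"Ipkts": 70347118689, "Ierrs": 425, "Opkts": 70804454537, "Oerrs": 622},
}

# Counters from the second netstat -i output, taken after the latest disruption.
AFTER = {
    "ix0":   {"Ipkts": 70482923217, "Ierrs": 426, "Opkts": 70965439161, "Oerrs": 0},
    "lagg1": {"Ipkts": 70507907764, "Ierrs": 428, "Opkts": 70965458252, "Oerrs": 622},
}


def counter_deltas(before, after):
    """Per-interface difference (after - before) for interfaces present in both."""
    deltas = {}
    for name, counters in after.items():
        if name not in before:
            continue
        deltas[name] = {key: value - before[name][key]
                        for key, value in counters.items()}
    return deltas


if __name__ == "__main__":
    for name, delta in counter_deltas(BEFORE, AFTER).items():
        print(name, delta)
```

On these numbers the input errors on ix0 and lagg1 each rise by 3 while the output error counters do not move.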

stephenw10

Hmm, you would not expect some minor packet loss to cause TCP connections to fail. You would just see retransmissions.

Unless all of those failures happened at the same time, so the connection times out. That would take a while though.

This starts to look more like a duplicate IP or a packet loop. You can see that if you have a loop that's normally blocked by STP and STP periodically resets.
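
One way to look for a duplicate IP is to scan the system log for ARP entries that keep changing MAC address. The Python sketch below assumes the FreeBSD kernel message format "arp: <ip> moved from <mac> to <mac> on <ifname>". The sample lines use documentation addresses and are made up, as are the host name "fw" and the function name.

```python
import re
from collections import defaultdict

# Assumed FreeBSD-style kernel message for an ARP entry changing MAC address.
MOVED_RE = re.compile(
    r"arp: (?P<ip>\S+) moved from (?P<old>\S+) to (?P<new>\S+) on (?P<ifname>\S+)"
)

# Hypothetical log excerpt using documentation-range addresses.
SAMPLE_LOG = """\
Feb 20 10:15:01 fw kernel: arp: 192.0.2.10 moved from 00:00:5e:00:53:01 to 00:00:5e:00:53:02 on lagg1.10
Feb 20 10:15:09 fw kernel: some unrelated message
Feb 20 10:15:31 fw kernel: arp: 192.0.2.10 moved from 00:00:5e:00:53:02 to 00:00:5e:00:53:01 on lagg1.10
"""


def arp_moves(log_text):
    """Map each IP to (number of moves, sorted list of MACs it was seen on)."""
    counts = defaultdict(int)
    macs = defaultdict(set)
    for line in log_text.splitlines():
        match = MOVED_RE.search(line)
        if not match:
            continue
        ip = match.group("ip")
        counts[ip] += 1
        macs[ip].update((match.group("old"), match.group("new")))
    return {ip: (counts[ip], sorted(macs[ip])) for ip in counts}


if __name__ == "__main__":
    for ip, (moves, seen) in arp_moves(SAMPLE_LOG).items():
        print(ip, "moved", moves, "times between", ", ".join(seen))
```

An address that flips back and forth between two MACs is worth a closer look, though a single move can also be a legitimate device replacement.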

Removing one link from the lagg entirely might prove that.

Steve