WAN Gateway Latency

stephenw10

Setting an external monitoring IP is a good idea. The ISP gateway may not respond to pings with any priority.

How long do these incidents last?

What actual latency do you see?

If you only have one WAN I would disable the 'Gateway Monitoring Action' for the gateway so it doesn't run a bunch of scripts if an alarm is triggered.

Try running MTR from a client behind pfSense during an event. See if you can identify the hop where the latency is introduced.

vdsadmin

Same issue here at multiple sites. It is some variation on this theme...

Mar 2 22:01:06 dpinger 28964 WAN_DHCP 98.157.240.1: Clear latency 47053us stddev 50973us loss 1%
Mar 2 22:00:03 dpinger 28964 WAN_DHCP 98.157.240.1: Alarm latency 1891543us stddev 3930277us loss 0%
Mar 1 22:01:52 dpinger 28964 WAN_DHCP 98.157.240.1: Clear latency 44064us stddev 47457us loss 2%
Mar 1 22:00:58 dpinger 28964 WAN_DHCP 98.157.240.1: Alarm latency 20422262us stddev 16569847us loss 7%
Mar 1 22:00:49 dpinger 28964 WAN_DHCP 98.157.240.1: Alarm latency 20225490us stddev 16622996us loss 7%
Mar 1 22:00:14 dpinger 28964 WAN_DHCP 98.157.240.1: Alarm latency 8265us stddev 797us loss 21%

Or

Mar 21 10:18:17 dpinger 50215 WAN_DHCP 1.1.1.1: sendto error: 65
Mar 21 10:18:15 dpinger 50215 WAN_DHCP 1.1.1.1: sendto error: 65
Mar 21 10:18:14 dpinger 50215 WAN_DHCP 1.1.1.1: Alarm latency 0us stddev 0us loss 100%
Mar 21 10:18:13 dpinger 50215 WAN_DHCP 1.1.1.1: sendto error: 65
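
For anyone comparing these entries across sites, here is a minimal Python sketch that parses the two dpinger line shapes shown above and converts the microsecond values to milliseconds. The regular expressions are written against these samples only, and the function name parse_dpinger_line is a hypothetical helper, not part of pfSense or dpinger.

```python
import re

# Status lines, e.g.
# "... WAN_DHCP 98.157.240.1: Alarm latency 1891543us stddev 3930277us loss 0%"
STATUS_RE = re.compile(
    r"dpinger \d+ (?P<gateway>\S+) (?P<monitor>\S+): "
    r"(?P<state>Alarm|Clear) latency (?P<latency>\d+)us "
    r"stddev (?P<stddev>\d+)us loss (?P<loss>\d+)%"
)
# Send failures, e.g. "... WAN_DHCP 1.1.1.1: sendto error: 65"
SENDTO_RE = re.compile(
    r"dpinger \d+ (?P<gateway>\S+) (?P<monitor>\S+): sendto error: (?P<errno>\d+)"
)


def parse_dpinger_line(line):
    """Return a dict describing one dpinger log line, or None if unrecognized."""
    m = STATUS_RE.search(line)
    if m:
        return {
            "kind": m["state"].lower(),           # "alarm" or "clear"
            "monitor": m["monitor"],
            "latency_ms": int(m["latency"]) / 1000,  # log values are microseconds
            "stddev_ms": int(m["stddev"]) / 1000,
            "loss_pct": int(m["loss"]),
        }
    m = SENDTO_RE.search(line)
    if m:
        return {
            "kind": "sendto_error",
            "monitor": m["monitor"],
            "errno": int(m["errno"]),
        }
    return None
```

Fed the Mar 1 22:00:58 line, this reports a latency of roughly 20,422 ms, or about 20 seconds. On FreeBSD, which pfSense is built on, errno 65 is EHOSTUNREACH ("No route to host"), so the sendto errors suggest the router had no usable route to the monitor IP at that moment rather than just a slow path.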

Commonalities are the following.

All Spectrum links
All Netgate 3100 or 4100 routers
All Spectrum service techs report the same thing. "All good on Spectrum's end. No errors detected." I have had techs out to 4 different locations.
All Unifi switches and WAPs behind the routers
No apparent errors reported in other Netgate logs for same time period.
All locations have a combination of LAN and VLANs
Spectrum modems are of differing models

The results are the same. The WAN link goes offline permanently and the only solution is to reboot the Netgate router. After reboot the WAN link is restored for most of a day or for a few days. Rinse and repeat.

I have been dealing with this for about 6 months now and have tried numerous changes on the Netgate routers to no avail. The only way I have found to slow down the issues is with a cron job that reboots the routers every morning before employees arrive. That seemed to reduce the frequency of the WAN going offline but has not stopped it completely. Occasionally, a location will still report going offline during the day.

My best guess is that Spectrum changed something that Netgate doesn't like. Discovering what, however, is above my pay grade. I am currently planning on ripping out all equipment from one location and replacing it with different gear to see if that solves it. I would rather not do that.

Any assistance would be appreciated.

iterator23

@vdsadmin

Let me know how it goes with changing the equipment. I am also thinking about removing pfSense and putting in a UDM Pro to see if there is a difference.

stephenw10

That first log shows the latency triggering an alarm but then the alarm is cleared. That doesn't look like it would require a reboot.

The second log shows the link completely down, as if it lost the DHCP lease on the WAN entirely.

Those are very different. I assume they were from different sites?

When it's down what do you see on the WAN?

vdsadmin

I appreciate the quick response.

Same router. Different days. These are just samples of what appears in the Gateway logs at all locations. These Gateway log entries appear around the same time that the WAN link goes down permanently. The different monitor IPs are just me trying random things in an attempt to determine cause.

Your last question is one of the more frustrating aspects of all this. I am offsite. The WAN goes down. Since I am offsite and the WAN is down there is no way for me to get in there and see what is going on. People need to work so they reboot the router. Since it is so random I can't sit at a site for days and wait for it to occur. I have considered setting up a WWAN link at a site to access the router during an outage but since these sites are simple networks I determined it would be a shorter path to victory if I just ripped it all out and replaced it all.

If you have a better plan, I am all ears.

vdsadmin

Does any of this seem relevant?

https://forum.netgate.com/topic/135647/help-netgate-router-is-receving-frequent-gateway-alarms-resetting-causing-lost-connections/22

https://forum.netgate.com/topic/111733/interesting-case-of-wan-dropping-daily-dhcp-being-blocked-by-firewall/7

Changing the port speed did not solve it for me but if this triggers any memories or helps with the discussion please let me know.

stephenw10

Can you upload the complete system log covering an event?

https://nc.netgate.com/nextcloud/s/pPQWfi63ZeY5woM

vdsadmin

Sorry. Forgot to mention I am currently working through another theory. I have pieced together some random bits of information that may or may not be connected.

All routers are running 24.11 and I changed all the routers over to Kea DHCP from ISC DHCP months ago.
I have had a few locations running Netgate routers where Kea DHCP just stops for no reason I can tell.
Around the same time the Spectrum WAN links go down there is an entry in the DHCP logs that shows a DHCP request to the Spectrum DHCP server.
Some of the users have mentioned the strange behavior of some workstations going offline while others are still active. This could be explained by the DHCP server going offline.
I did a ping test to that Spectrum DHCP server at the affected site and the result was tremendous latency.

64 bytes from 142.254.150.237: icmp_seq=0 ttl=251 time=12.257 ms
64 bytes from 142.254.150.237: icmp_seq=1 ttl=251 time=13.415 ms
64 bytes from 142.254.150.237: icmp_seq=2 ttl=251 time=16.755 ms
64 bytes from 142.254.150.237: icmp_seq=3 ttl=251 time=13.201 ms
64 bytes from 142.254.150.237: icmp_seq=4 ttl=251 time=10329.483 ms
64 bytes from 142.254.150.237: icmp_seq=5 ttl=251 time=329.781 ms
64 bytes from 142.254.150.237: icmp_seq=6 ttl=251 time=260.305 ms
64 bytes from 142.254.150.237: icmp_seq=7 ttl=251 time=5631.755 ms
64 bytes from 142.254.150.237: icmp_seq=8 ttl=251 time=55.087 ms
64 bytes from 142.254.150.237: icmp_seq=9 ttl=251 time=84.499 ms

I did a ping test to the same Spectrum DHCP server from a Spectrum link where I am not experiencing these disconnect issues and I got expected latency.

PING 142.254.150.237 (142.254.150.237): 56 data bytes
64 bytes from 142.254.150.237: icmp_seq=0 ttl=255 time=8.200 ms
64 bytes from 142.254.150.237: icmp_seq=1 ttl=255 time=9.421 ms
64 bytes from 142.254.150.237: icmp_seq=2 ttl=255 time=8.494 ms
64 bytes from 142.254.150.237: icmp_seq=3 ttl=255 time=8.081 ms
64 bytes from 142.254.150.237: icmp_seq=4 ttl=255 time=8.991 ms
64 bytes from 142.254.150.237: icmp_seq=5 ttl=255 time=8.176 ms
64 bytes from 142.254.150.237: icmp_seq=6 ttl=255 time=7.587 ms
64 bytes from 142.254.150.237: icmp_seq=7 ttl=255 time=9.137 ms
64 bytes from 142.254.150.237: icmp_seq=8 ttl=255 time=8.731 ms
64 bytes from 142.254.150.237: icmp_seq=9 ttl=255 time=8.165 ms
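
To compare the two runs side by side, here is a small Python sketch that pulls the time= values out of pasted ping output and summarizes them. It assumes the output format shown above; rtt_summary is a hypothetical helper name.

```python
import re
import statistics

# Matches the "time=12.257 ms" field of each ping reply line.
TIME_RE = re.compile(r"time=(\d+(?:\.\d+)?) ms")


def rtt_summary(ping_output):
    """Summarize round-trip times (in ms) found in pasted ping output."""
    rtts = [float(t) for t in TIME_RE.findall(ping_output)]
    if not rtts:
        raise ValueError("no 'time=... ms' fields found")
    return {
        "count": len(rtts),
        "min_ms": min(rtts),
        "median_ms": statistics.median(rtts),
        "max_ms": max(rtts),
    }
```

For the ten replies from the affected site this gives a median of about 70 ms and a maximum of about 10.3 seconds. For the healthy link the median is about 8.3 ms and the maximum about 9.4 ms. The median is used instead of the mean so that one or two multi-second replies do not hide what a typical reply looks like.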

This led me to believe that Kea is issuing a DHCP request on the WAN link but the Spectrum DHCP server is not responding fast enough, so Kea is marking the link as down or just simply crashing and taking the WAN link down along the way.

In the interest of the tried and true method of making system changes on slimly supported random speculation, I reverted 2 of the locations back to ISC DHCP. They have been stable for 2 days now. I wish that meant a lot but it is too soon to tell. I am going to give it a week and if it remains stable and there are no more disconnects I am going to blame Kea DHCP whether it is to blame or not.

Best of luck to us all.

iterator23

@vdsadmin

Which UniFi switches and APs do you have at the sites?

I am running ISC. I never moved to Kea.

Not sure if related at all but in your unifi console under ports are you seeing TX drops? I am asking because I am running Gen 1 switches and see some TX drops across multiple but not the newer switches.

vdsadmin

@iterator23

At the most affected site I am running a single USW 16 PoE switch with a single Nano HD WAP.

On the average day I am not experiencing any drops on the switch. The Netgate 3100 is not reporting any errors on the interfaces.

stephenw10

Kea is only the server side for internal clients. It has nothing to do with the dhclient requests on WAN.

vdsadmin

@stephenw10

Understood. That is how I understand it as well. However, there are still entries in the DHCP log that reference the WAN DHCP server around the time of the WAN outage.

This is a current entry in the DHCP log from a Netgate 4100 and ix3 is the WAN Interface. This particular system is still running Kea and is on a stable Spectrum connection and is not experiencing any outages.

Mar 21 09:39:22 dhclient 44939 bound to xxx.xxx.xxx.xxx -- renewal in 43200 seconds.
Mar 21 09:39:22 dhclient 75840 Creating resolv.conf
Mar 21 09:39:22 dhclient 74947 RENEW
Mar 21 09:39:22 dhclient 44939 DHCPACK from 142.254.150.237
Mar 21 09:39:22 dhclient 44939 DHCPREQUEST on ix3 to 142.254.150.237 port 67

I have no idea if Kea is the culprit. I am just raising it up in case there is an outside chance this could be the issue.

vdsadmin

@stephenw10

I will do that as soon as I am able.

stephenw10

It could be Kea via some affected process but not directly.

If dhclient shows failing to pull a new lease at release time then that's certainly a problem.
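
One way to check that from the logs is to work out when dhclient said it would renew and then look at what the DHCP and gateway logs show around that time. Below is a minimal Python sketch based on the "bound to ... renewal in N seconds" line format posted above. The helper name next_renewal is hypothetical, the year has to be supplied because syslog timestamps omit it, and month parsing assumes an English locale.

```python
import re
from datetime import datetime, timedelta

# Matches e.g.
# "Mar 21 09:39:22 dhclient 44939 bound to xxx.xxx.xxx.xxx -- renewal in 43200 seconds."
BOUND_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) dhclient \d+ bound to \S+ -- "
    r"renewal in (?P<secs>\d+) seconds"
)


def next_renewal(line, year):
    """Return the datetime at which dhclient said it would renew, or None."""
    m = BOUND_RE.search(line)
    if not m:
        return None
    # Syslog timestamps carry no year, so the caller supplies one.
    bound_at = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
    return bound_at + timedelta(seconds=int(m["secs"]))
```

For the sample line, 43200 seconds is 12 hours, so a lease bound at 09:39:22 is due for renewal at 21:39:22 the same day. If outages at an affected site cluster around the computed renewal time and no DHCPACK follows, that would point toward the failed-renewal case. If they do not line up, the lease is probably not the trigger.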