Novice trying to diagnose internet dropouts. (log included)

laserbones

I'm having intermittent internet dropouts and maybe DNS errors. This all started recently. I'm still somewhat of a novice and have been using pfsense at home for 1.5 years. I'm trying to learn how to solve these issues by myself but don't know where to start. I've included my log and I'm hoping to find some help with the interpretation of it.

https://www.dropbox.com/scl/fi/hy0y68ds7pb4iac8r41la/system.log?rlkey=qmz2kzfp74vl4iy6lrfxb0wf5&dl=0

stephenw10

All that's shown there is that the WAN gateway becomes lossy or high-latency, triggering an alarm.

Do you have an external IP set for gateway monitoring?

What sort of WAN is it? You might just need to tune the monitoring values.

Or it could be an actual upstream WAN issue. A failing modem or a fault at the ISP etc.

Steve

laserbones

@stephenw10 I have IPv4 and IPv6 on the same WAN. I don't have dual WAN, so I just disabled gateway monitoring. My ISP said I have an old gateway, so they're sending me a new one.

I'll see how this works out. I'm assuming tuning monitoring values is irrelevant if I disable gateway monitoring? Do you have any other suggestions?

stephenw10

I would suggest disabling the 'monitoring action' but keeping monitoring active so you can see the state of the connection. And I would set the monitoring IP to something external like 8.8.8.8 for more useful data than just the first hop gateway.
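To make the "more useful data" point concrete, here is a minimal Python sketch of how loss and latency figures can be derived from a series of probe results. It is only an illustration and does not reflect how dpinger is implemented; the function name `summarize_probes` and the sample values are hypothetical.

```python
import statistics

def summarize_probes(rtts_ms):
    """Summarize probe round-trip times in ms; None marks a probe with no reply."""
    if not rtts_ms:
        return {"loss_pct": None, "avg_ms": None, "stddev_ms": None}
    replies = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    avg_ms = statistics.fmean(replies) if replies else None
    # Population standard deviation; 0.0 when there are too few replies to vary.
    stddev_ms = statistics.pstdev(replies) if len(replies) > 1 else 0.0
    return {"loss_pct": loss_pct, "avg_ms": avg_ms, "stddev_ms": stddev_ms}

# Hypothetical samples: three replies and two lost probes.
samples = [12.0, 14.0, None, 13.0, None]
summary = summarize_probes(samples)
print(summary)
```

Probing an address beyond the first hop means these figures describe the path through the ISP rather than just the link to the nearest gateway.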

michmoor

@stephenw10 I was just about to write this. The monitoring action is likely at fault. Having it enabled by default is problematic. Keeping it enabled in a single-connection setup will impact IPsec tunnels if you have them, and it also triggers restarts of various packages and services. Keep the monitoring action disabled.

jwns

I'm also experiencing intermittent drops of my WAN interface on my Netgate 2100, 24.03-RELEASE (arm64).

For OP, I don't think mine is DNS.
I occasionally lose my WAN interface until I wait 5-10 minutes or manually bounce it. I tend to observe the issue between 0300 and 0600 UTC, but it has happened at other times of day.

There is no apparent issue at the ISP.

The standard log configuration doesn't appear to provide any meaningful data related to the failure with the exception of gateway monitoring indicating latency. All other log content begins when I log in to investigate.

I'm looking for some Netgate/pfSense specific troubleshooting tips that may help me determine specifically why the WAN interface is dropping or becoming unresponsive.

Has anyone got anything on this?

stephenw10

What exactly is logged? Nothing but the gateway alarm?

jwns

@stephenw10

Gateway Monitoring Settings:
State Killing on Recovery: Don't Kill
State Killing on Failure: Do not kill
All boxes unchecked.

My most recent dropouts occurred between 0500-0515 UTC on 30 DEC 2024 and 0045-0050 on 31 DEC.

In the ten-minute period prior to the most recent dropout...

System

General: nothing for at least 30 minutes prior to me logging in to manually bounce the interface.
Gateways:

Looking back over the Gateway logs, the interfaces show a pattern of the monitor issuing two latency alarms before clearing.
This is occurring every couple of days, occasionally multiple times a day.
Individual incidents show packet loss in excess of 20% for 3-5 minutes before the issue resolves without manual intervention.
Cases where I intervene manually include additional exit and status messages triggered by disabling and enabling the WAN interface.

Routing: Nothing for months. I think nothing since I flashed the firmware upgrade.
DNS Resolver: the logs have all churned.
Wireless: No logs to display.
Gui Service: User Agents of my browsers in the web interface
OS Boot: Nothing interesting here.

Firewall: the logs have all churned.

DHCP: dhclient renewed the WAN IP. This happens every 15 minutes without incident.

Authentication: Me logging in to restart the interface

jwns

@stephenw10

Here are the six lines in the Gateways log for my most recent incident. It happened when I wasn't home and resolved itself in less than five minutes.

I have two gateways. One is my WAN, the other is a site-to-site VPN that relies upon the WAN interface.
Gateway monitoring actions are unchecked or set to "Do not kill".

2024-12-31 00:46:33.357371+00:00 dpinger 80792 WAN_DHCP X.X.205.1: Alarm latency 2837us stddev 2347us loss 22%
2024-12-31 00:46:33.966295+00:00 dpinger 81141 VPN_GW1 XX.XX.0.1: Alarm latency 4193us stddev 866us loss 22%
2024-12-31 00:48:50.648894+00:00 dpinger 81141 VPN_GW1 XX.XX.0.1: Alarm latency 14424198us stddev 8613417us loss 58%
2024-12-31 00:49:22.981368+00:00 dpinger 81141 VPN_GW1 XX.XX.0.1: Alarm latency 6384956us stddev 9171752us loss 5%
2024-12-31 00:49:45.228737+00:00 dpinger 80792 WAN_DHCP X.X.205.1: Clear latency 2879us stddev 2351us loss 5%
2024-12-31 00:49:53.230126+00:00 dpinger 81141 VPN_GW1 XX.XX.0.1: Clear latency 4300us stddev 1125us loss 0%
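For anyone reading along, the dpinger values are in microseconds, which makes the big numbers easy to misread. Below is a rough Python sketch that parses lines in the format shown above and converts the values to milliseconds. The regular expression is an assumption based only on these six lines, and `parse_line` is a hypothetical helper name.

```python
import re

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official description of the dpinger log format).
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) dpinger (?P<pid>\d+) (?P<gw>\S+) (?P<addr>\S+): "
    r"(?P<event>Alarm|Clear) latency (?P<lat>\d+)us "
    r"stddev (?P<sd>\d+)us loss (?P<loss>\d+)%$"
)

def parse_line(line):
    """Return a dict for one gateway log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        "timestamp": m["ts"],
        "gateway": m["gw"],
        "event": m["event"],
        "latency_ms": int(m["lat"]) / 1000,   # microseconds to milliseconds
        "stddev_ms": int(m["sd"]) / 1000,
        "loss_pct": int(m["loss"]),
    }

sample = ("2024-12-31 00:48:50.648894+00:00 dpinger 81141 VPN_GW1 XX.XX.0.1: "
          "Alarm latency 14424198us stddev 8613417us loss 58%")
rec = parse_line(sample)
print(rec["gateway"], rec["event"], rec["latency_ms"], rec["loss_pct"])
```

Read this way, the 00:48:50 line on VPN_GW1 reports an average latency of roughly 14.4 seconds, while the WAN_DHCP lines stay under 3 ms.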

stephenw10

OK, assuming the VPN runs over the WAN, that seems expected if the WAN is actually dropping packets.

Does it recover by itself?

jwns

@stephenw10

Yes, my VPN is dependent upon the WAN so it is expected to fail with the WAN.

Yes, the WAN has always recovered on its own within 3-5 minutes of the failure. I don't typically intervene because it recovers faster than I can fix it and faster than I can get on the phone with ISP support.
I can see this pattern recurring over a period of weeks.

I haven't been able to find any reliable correlation, e.g. it doesn't align with DHCP renewals, firewall schedules, or other configurations within the Netgate appliance.

I'm between a rock and a hard place with Frontier tech support because "they don't see anything" and I'm obviously not using their "Eero 6 Pro Edge device".

I'm not sure what to test next or what I can do to detect a fault and prove a problem short of installing a monitor between my router WAN uplink port and the ONT.

stephenw10

Can you put their router back in place to test?

It doesn't look like anything other than an upstream issue. I.e. a real WAN problem.

jwns

@stephenw10

I appreciate your input.
It's nice to hear I'm not completely delusional.

I'm back to work tomorrow, so risking the biscuit with their router is a non-starter. I don't think I can stomach the double NAT, remote access, and other potential disruptions that it may summon.

As this issue happens when the fickle pixies haven't received a blood sacrifice, I'm going to start documenting the occurrences of "Alarm Cleared" and make a written ticket to the ISP. Tier One will always go for the easy-out on the nonstandard equipment.
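For the documentation, something like the following Python sketch could turn Alarm/Clear pairs into a list of incidents with durations. It assumes the events have already been pulled out of the Gateways log as (timestamp, gateway, event) tuples; the function name `incidents` is hypothetical, and the sample data is the 31 DEC incident posted earlier.

```python
from datetime import datetime

def incidents(events):
    """Pair each gateway's first Alarm with its next Clear.

    events: iterable of (timestamp_str, gateway, event) in log order.
    Returns a list of (gateway, start, end, duration_seconds).
    """
    open_alarms = {}
    result = []
    for ts, gw, event in events:
        when = datetime.fromisoformat(ts)
        if event == "Alarm":
            # Keep only the first alarm so repeats do not shorten the incident.
            open_alarms.setdefault(gw, when)
        elif event == "Clear" and gw in open_alarms:
            start = open_alarms.pop(gw)
            result.append((gw, start, when, (when - start).total_seconds()))
    return result

events = [
    ("2024-12-31 00:46:33.357371+00:00", "WAN_DHCP", "Alarm"),
    ("2024-12-31 00:46:33.966295+00:00", "VPN_GW1", "Alarm"),
    ("2024-12-31 00:48:50.648894+00:00", "VPN_GW1", "Alarm"),
    ("2024-12-31 00:49:22.981368+00:00", "VPN_GW1", "Alarm"),
    ("2024-12-31 00:49:45.228737+00:00", "WAN_DHCP", "Clear"),
    ("2024-12-31 00:49:53.230126+00:00", "VPN_GW1", "Clear"),
]
report = incidents(events)
for gw, start, end, secs in report:
    print(f"{gw}: {start.isoformat()} to {end.isoformat()} ({secs:.0f} s)")
```

For that incident, this gives roughly 192 seconds on WAN_DHCP and 199 seconds on VPN_GW1, which is the kind of timestamped detail a written ticket can cite.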

jwns

@stephenw10
Just had a new failure mode a few minutes ago.

Gateway became unresponsive for greater than 5 minutes.
It didn't come back until I disabled and re-enabled the interface, and then local traffic wouldn't function until I manually bounced the DNS Resolver service...

I'm going to disable monitoring entirely for a month and see if it influences these evening outages.

I'm at my wits end.

stephenw10

If it happens again check if it's in the ARP table still. That sounds like it could be an issue with upstream ARP timeouts. We have seen some ISPs that just stop responding like that when ARP entries expire at their side.
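As a rough aid for that check, here is a Python sketch that looks for a gateway address in pasted ARP table text. It assumes the FreeBSD-style `arp -an` layout of `? (IP) at MAC on IFACE ...`; the sample output, the interface name, and the addresses (taken from the 192.0.2.0/24 documentation range) are made up, and `arp_lookup` is a hypothetical helper.

```python
import re

# Assumed line layout: "? (IP) at MAC on IFACE ..." as printed by `arp -an`.
ARP_RE = re.compile(r"\((?P<ip>[^)]+)\) at (?P<mac>\S+) on (?P<iface>\S+)")

def arp_lookup(arp_output, ip):
    """Return the MAC recorded for ip, or None if it is missing or incomplete."""
    for line in arp_output.splitlines():
        m = ARP_RE.search(line)
        if m and m["ip"] == ip:
            mac = m["mac"]
            return None if mac == "(incomplete)" else mac
    return None

# Illustrative text only, not output captured from a real system.
sample_output = """\
? (192.0.2.1) at 00:00:5e:00:53:01 on mvneta0 expires in 1187 seconds [ethernet]
? (192.0.2.5) at (incomplete) on mvneta0 expired [ethernet]
"""
print(arp_lookup(sample_output, "192.0.2.1"))
print(arp_lookup(sample_output, "192.0.2.5"))
```

A missing or incomplete entry for the WAN gateway during an outage would fit the ARP timeout explanation, while a valid entry would point elsewhere.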

jwns

@stephenw10

I’ve been running on “previous stable” firmware.

In response to this most recent drop, I upgraded the firmware on this SG-2100 from 24.03 to 24.11, removed or disabled several non-essential add-ons, and disabled gateway monitoring entirely.

crosses fingers