WAN connection dropping intermittently
-
Hi,
I've had a long-standing issue with my WAN connection going down intermittently, and I've been unable to identify the source of the problem.
The connection will drop for anywhere between 30 and 90 seconds at a time, several times a day. I haven't been able to identify a pattern for the drops - sometimes it remains up for 10 hours, sometimes it drops 2-3 times in an hour. When the WAN connection drops, pfsense remains up and I can ping it on my LAN side with no problem. There are no "link state" changes coinciding with the drops.
I don't have a modem or similar, but rather connect directly via ethernet. The ISP has verified that the cables are good with no faulty connections (and except when the WAN goes down, I have 0% packet loss). We have also tested the connection by plugging in various other devices directly, including a device from the ISP, and a couple of my laptops, and don't experience any drops then.
In terms of hardware I have a Netgate 6100, running pfsense+ 24.03-RELEASE
For the purpose of troubleshooting I am running no packages, nor VPNs
I have disabled gateway monitoring action (devices behind the firewall also see it as the connection being dropped when dpinger says it's down)
On the LAN side I have a unifi switch and access points.
Below is an excerpt from recent logs:
2024-11-07 18:55:28.386844+01:00 dpinger 32387 WANGW 194.xxx.xxx.3: Clear latency 5856us stddev 10542us loss 5%
2024-11-07 18:53:55.677166+01:00 dpinger 32387 WANGW 194.xxx.xxx.3: Alarm latency 3037us stddev 4246us loss 22%
2024-11-07 18:35:32.070210+01:00 dpinger 32387 WANGW 194.xxx.xxx.3: Clear latency 5903us stddev 8184us loss 5%
2024-11-07 18:33:55.751032+01:00 dpinger 32387 WANGW 194.xxx.xxx.3: Alarm latency 4777us stddev 7169us loss 22%
Would appreciate any help in troubleshooting the issue! I'm sure there are other data points required, so please let me know what would be helpful.
Best,
Alex
-
In this case, as your WAN is an Ethernet cable connection, place a switch on the WAN side of pfSense.
3 cables:
1 is the original ISP WAN connection.
1 goes to the WAN of pfSense.
1 goes to a PC that you use for monitoring.
As soon as the traffic stops flowing through the 6100 from the WAN and back, check with the monitoring PC whether it shows the same behavior or not.
A bad connection can be more than a bad cable connection to the uplink ISP equipment.
ISPs tend to have more than one client ^^
If upstream (ISP) devices are saturated, your data throughput will suffer. And no, none of us has ever seen an ISP admit that its network was unusable once in a while. That just can't happen ;)
The pfSense monitoring tool, dpinger, sends ICMP packets (ping packets). These are typically low priority, and the first to 'vanish' if some upstream device is under load.
-
Thanks Gertjan!
I will try that (at work now, so has to be later) but it strikes me as unlikely - it's not just ICMP packets that get dropped: all traffic stops, including video calls, youtube streaming, or simply loading a webpage.
This happens several times every day with the pfsense box, while the other devices went for days with no interruption.
I should probably add that I've tried using different ports for the WAN interface with no change in behaviour, so I don't think it's a hardware issue.
Best,
Alex
-
I'm using a 4100 with 24.03 behind an ISP router, on a 1 Gbit fiber connection.
I couldn't find a single lost ICMP packet over the last ... 6 months?
I used other hardware ("barebone") solutions before I got my 4100, but never had strange outages.
Broke the connection because I messed up 'something'? Yes, that has happened.
A LAN interface is like a WAN interface, why would it fail?
If in doubt, take another port ^^ A 6100 has 6 NICs, right?
-
Your latency looks pretty low, are you still monitoring the gateway IP directly?
The first thing I would do is set the monitoring IP to something external.
-
Good morning, and thanks for the suggestions!
When I got home I looked at the logs which showed a number of drops during the last 24-36 hours, including the two drops I showed in my original post:
2024-11-08 21:30:11.582551+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Clear latency 6351us stddev 8741us loss 5%
2024-11-08 21:28:51.537529+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Alarm latency 2441us stddev 3325us loss 22%
2024-11-08 17:30:58.777260+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Clear latency 7165us stddev 8798us loss 5%
2024-11-08 17:29:51.771257+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Alarm latency 5889us stddev 7967us loss 22%
2024-11-08 16:31:10.301016+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Clear latency 5627us stddev 7828us loss 5%
2024-11-08 16:29:52.558738+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Alarm latency 3908us stddev 4204us loss 22%
2024-11-08 12:31:58.081936+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Clear latency 6649us stddev 10061us loss 6%
2024-11-08 12:30:53.138370+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Alarm latency 3571us stddev 4800us loss 22%
2024-11-08 12:12:02.533386+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Clear latency 7163us stddev 9233us loss 5%
2024-11-08 12:10:52.796857+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Alarm latency 4311us stddev 5938us loss 21%
2024-11-08 05:13:25.405565+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Clear latency 5235us stddev 6291us loss 5%
2024-11-08 05:11:53.814595+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Alarm latency 5106us stddev 7377us loss 21%
2024-11-08 02:14:00.821345+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Clear latency 6536us stddev 9023us loss 5%
2024-11-08 02:12:54.527981+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Alarm latency 5851us stddev 7015us loss 22%
2024-11-07 19:15:23.892450+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Clear latency 4592us stddev 6255us loss 5%
2024-11-07 19:13:54.996799+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Alarm latency 4265us stddev 7476us loss 21%
2024-11-07 18:55:28.386844+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Clear latency 5856us stddev 10542us loss 5%
2024-11-07 18:53:55.677166+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Alarm latency 3037us stddev 4246us loss 22%
2024-11-07 18:35:32.070210+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Clear latency 5903us stddev 8184us loss 5%
2024-11-07 18:33:55.751032+01:00 dpinger 32387 WANGW 194.xxx.yyy.3: Alarm latency 4777us stddev 7169us loss 22%
I then made 3 changes as suggested:
- I changed the monitoring IP to 8.8.8.8
- I swapped the WAN to a different port
- I put an unmanaged switch on the WAN side, and also connected a laptop to monitor.
My laptop had no drops throughout the night, while the router had two:
2024-11-09 09:47:50.790223+01:00 dpinger 21182 WANGW 8.8.8.8: Clear latency 1397us stddev 28us loss 5%
2024-11-09 09:46:49.794245+01:00 dpinger 21182 WANGW 8.8.8.8: Alarm latency 1395us stddev 35us loss 21%
2024-11-09 08:48:02.362721+01:00 dpinger 21182 WANGW 8.8.8.8: Clear latency 1390us stddev 35us loss 6%
2024-11-09 08:46:49.711801+01:00 dpinger 21182 WANGW 8.8.8.8: Alarm latency 1383us stddev 46us loss 21%
I guess I can open up a firewall rule on the WAN interface to allow my laptop to ping it from the WAN side, and see if there are any issues when dpinger says there are.
Any other suggestions on what could be helpful to see to figure out what's going on?
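For anyone who wants to turn dpinger lines like the ones above into a list of alarm windows, here is a minimal Python sketch. It assumes the log format is exactly what was pasted in this thread; the names `LINE_RE`, `parse_events` and `alarm_windows` are made up for illustration. Note that the time between an Alarm and the following Clear reflects dpinger's averaged loss figure, so it is only a rough proxy for how long traffic was really interrupted.

```python
import re
from datetime import datetime

# Matches the dpinger lines as pasted in this thread (format assumed from those lines).
LINE_RE = re.compile(
    r"^(?P<ts>\S+ \S+) dpinger \d+ (?P<gw>\S+) (?P<target>\S+): "
    r"(?P<state>Alarm|Clear) latency \d+us stddev \d+us loss (?P<loss>\d+)%"
)

LOG = """\
2024-11-09 09:47:50.790223+01:00 dpinger 21182 WANGW 8.8.8.8: Clear latency 1397us stddev 28us loss 5%
2024-11-09 09:46:49.794245+01:00 dpinger 21182 WANGW 8.8.8.8: Alarm latency 1395us stddev 35us loss 21%
2024-11-09 08:48:02.362721+01:00 dpinger 21182 WANGW 8.8.8.8: Clear latency 1390us stddev 35us loss 6%
2024-11-09 08:46:49.711801+01:00 dpinger 21182 WANGW 8.8.8.8: Alarm latency 1383us stddev 46us loss 21%
"""


def parse_events(lines):
    """Return (timestamp, state) tuples, oldest first; non-matching lines are skipped."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            events.append((datetime.fromisoformat(match["ts"]), match["state"]))
    return sorted(events)


def alarm_windows(events):
    """Pair each Alarm with the next Clear and return (alarm_time, seconds) tuples."""
    windows = []
    alarm_at = None
    for ts, state in events:
        if state == "Alarm":
            alarm_at = ts
        elif alarm_at is not None:
            windows.append((alarm_at, (ts - alarm_at).total_seconds()))
            alarm_at = None
    return windows


for started, seconds in alarm_windows(parse_events(LOG.splitlines())):
    print(f"{started.isoformat()} alarm lasted {seconds:.1f}s")
```

The pfSense log view shows newest entries first, which is why the events are sorted before pairing.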
-
@alexnovice said in WAN connection dropping intermittently:
My laptop had no drops throughout the night, while the router had two:
I can't suggest a tool, but I'm pretty sure they exist: have your laptop do the same thing and ping 8.8.8.8 every 1/2 seconds as well.
At worst, open the command line and execute
ping -t 8.8.8.8
and leave it running there for the day.
@alexnovice said in WAN connection dropping intermittently:
I guess I can open up a firewall rule on the WAN interface to allow my laptop to ping it from the WAN side, and see if there are any issues when dpinger says there are.
Good idea !
On the laptop, open a second cmd box and ping it as well:
ping -t a.b.c.d
where a.b.c.d is your pfSense WAN IP.
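If a plain `ping -t` window is hard to line up with the dpinger log afterwards, a small script that timestamps every missed reply may help. The following is only a sketch: the function names (`ping_command`, `ping_once`, `outage_windows`, `monitor`) are invented here, the ping flags assume Windows or Linux iputils syntax (other systems differ), and the exit code of `ping` is only a rough signal of whether a reply arrived.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_command(host, system=None):
    """Build a single-probe ping command with a timeout of about one second."""
    system = system or platform.system()
    if system == "Windows":
        return ["ping", "-n", "1", "-w", "1000", host]
    # Linux iputils syntax (assumption); -W is in seconds there.
    return ["ping", "-c", "1", "-W", "1", host]


def ping_once(host):
    """Return True if the system ping exits with status 0 for one probe."""
    try:
        result = subprocess.run(
            ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (first_failure, last_failure, count) runs."""
    windows = []
    start = last = None
    count = 0
    for ts, ok in samples:
        if not ok:
            if start is None:
                start, count = ts, 0
            last = ts
            count += 1
        elif start is not None:
            windows.append((start, last, count))
            start = None
    if start is not None:
        windows.append((start, last, count))
    return windows


def monitor(hosts, interval=1.0, duration=60.0):
    """Probe each host every `interval` seconds for `duration` seconds."""
    samples = {host: [] for host in hosts}
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for host in hosts:
            ok = ping_once(host)
            now = datetime.now()
            samples[host].append((now, ok))
            if not ok:
                print(f"{now.isoformat(timespec='seconds')} no reply from {host}")
        time.sleep(interval)
    return {host: outage_windows(s) for host, s in samples.items()}
```

Calling `monitor(["8.8.8.8", "a.b.c.d"], duration=86400)` would watch both targets for a day, with a.b.c.d again standing in for the pfSense WAN IP.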
-
I've used a powershell-script to continuously ping various IPs, both when I've been behind the router and now "in front". That, plus actual impact on usability (like video calls dropping) are part of what I've used to identify that my connection has dropped.
So I had another WAN drop according to pfsense at noon. It looked identical to all the others from what I can see. During that time my laptop was able to ping the three IPs I was watching continuously without any drops:
- My ISP's gateway
- The internet (8.8.8.8)
- The WAN port on my router
My interpretation of that is that the issue resides somewhere in the pfsense box, but I have no clue what it could be. ARP table? Firewall rules?
Best,
Alex
-
@alexnovice said in WAN connection dropping intermittently:
My interpretation of that is that the issue resides somewhere in the pfsense box, but I have no clue what it could be. ARP table? Firewall rules?
Hummm.
Not rules.
If the 6100 gets very busy it might start to lose packets.
But, "I am running no packages, nor VPNs", so just plain vanilla pfSense on a 6100, that's ... strange. I have a 4100 and can't stress it enough to show packet loss.
And I use packages like pfBlockerNG with a couple of feeds, nothing big. No Squid/Suricata or other resource hogs.
For me ;) you would have to throw many Gbits at a 6100 before it has trouble doing its job.
Go to the console or SSH, choose menu option 8 and run 'top' for a while.
You can sort on processor activity percentage.
@alexnovice said in WAN connection dropping intermittently:
I should probably add that I've tried using different ports for the WAN interface with no change in behaviour, so I don't think it's a hardware issue.
That excludes individual NIC issues, I agree.
-
Seems mostly idle from what I can tell:
I also pulled out a bit more on what processes are running:
-
I would try to run a packet capture on WAN when it's showing as down and make sure the monitoring pings are actually being sent.
-
Hi Stephen,
As you suggested, I ran a packet capture on the WAN interface (not in promiscuous mode) on the ICMP protocol. It looks like this when the WAN goes down:
It seems the packets are sent, but with no response. I also noticed that for some reason it starts pinging a different IP after some time. Not just 8.8.8.8, which is the monitoring IP for dpinger, but also an IP that whois claims belongs to Apple?
I also looked a bit more at the logs for when the gateway is reported as down. The intervals between drops seem to be exactly 20 minutes (or multiples of 20 minutes), if that could signify something:
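As a rough check of that 20-minute pattern, the Alarm timestamps from the dpinger log posted earlier in this thread (truncated to whole seconds) can be compared against multiples of 1200 seconds. This is only a sketch; `ALARMS`, `PERIOD` and `gaps_vs_period` are names made up for the example.

```python
from datetime import datetime

# Alarm times copied from the dpinger log earlier in the thread, truncated to seconds.
ALARMS = [
    "2024-11-07 18:33:55", "2024-11-07 18:53:55", "2024-11-07 19:13:54",
    "2024-11-08 02:12:54", "2024-11-08 05:11:53", "2024-11-08 12:10:52",
    "2024-11-08 12:30:53", "2024-11-08 16:29:52", "2024-11-08 17:29:51",
    "2024-11-08 21:28:51",
]
PERIOD = 1200  # 20 minutes in seconds


def gaps_vs_period(timestamps, period=PERIOD):
    """For each pair of consecutive alarms return (gap_s, nearest_multiple, offset_s)."""
    times = sorted(datetime.fromisoformat(t) for t in timestamps)
    rows = []
    for earlier, later in zip(times, times[1:]):
        gap = (later - earlier).total_seconds()
        multiple = round(gap / period)
        rows.append((gap, multiple, gap - multiple * period))
    return rows


for gap, multiple, offset in gaps_vs_period(ALARMS):
    print(f"gap {gap:7.0f}s  ~ {multiple:2d} x 20 min  (offset {offset:+.0f}s)")
```

For these ten alarms every gap falls within about a minute of a whole multiple of 20 minutes. That supports the observation, but on its own it says nothing about the cause.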
Thanks!
Alex
-
20 mins sounds like an ARP issue. Check the actual pcap file or change the view type and make sure the MAC address it's sending those to doesn't change.
Those other pings could be from something on the LAN. In a WAN pcap they will have been translated to the WAN address.
The curious thing here is that, as I understood it, you said that during the outage LAN side clients could still ping 8.8.8.8. Anything upstream should see those identically to the pings from dpinger.
Is that correct?
One possibility is that you have one of the inconvenient ISPs that seem to forget your MAC address! We have seen a few users hit that and work around it by setting a lower ARP timeout. However, that breaks all traffic.
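For the MAC check suggested above, the downloaded capture can also be inspected with a few lines of Python instead of a GUI. This is a sketch under several assumptions: the capture is a classic libpcap file (not pcapng) with an Ethernet link type, frames are untagged IPv4, and the function name `echo_request_dest_macs` is invented for the example.

```python
import struct
from collections import Counter


def echo_request_dest_macs(pcap_bytes):
    """Count destination MACs of ICMP echo requests in a classic pcap capture."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written little-endian
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written big-endian
    else:
        raise ValueError("not a classic (microsecond) pcap file")
    (linktype,) = struct.unpack(endian + "I", pcap_bytes[20:24])
    if linktype != 1:
        raise ValueError("expected Ethernet link type (1)")

    counts = Counter()
    offset = 24  # skip the global header
    while offset + 16 <= len(pcap_bytes):
        _sec, _usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", pcap_bytes[offset:offset + 16]
        )
        frame = pcap_bytes[offset + 16:offset + 16 + incl_len]
        offset += 16 + incl_len
        if len(frame) < 34 or frame[12:14] != b"\x08\x00":
            continue  # too short, or not untagged IPv4
        ip_header_len = (frame[14] & 0x0F) * 4
        if frame[23] != 1 or len(frame) <= 14 + ip_header_len:
            continue  # not ICMP, or truncated before the ICMP header
        if frame[14 + ip_header_len] == 8:  # ICMP type 8 = echo request
            counts[frame[0:6].hex(":")] += 1
    return counts
```

Reading the capture with `open(path, "rb").read()` (where `path` is wherever the file was saved) and passing the bytes in should return a single MAC if the gateway's address never changes; more than one key would be worth a closer look.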
-
@stephenw10 said in WAN connection dropping intermittently:
20 mins sounds like an ARP issue. Check the actual pcap file or change the view type and make sure the MAC address it's sending those to doesn't change.
The destination MAC address remains unchanged before, during and after the connection drops.
@stephenw10 said in WAN connection dropping intermittently:
The curious thing here is that as I understood it you said that during the outage LAN side clients could still ping 8.8.8.8. Anything upstream should see those identically to the pings from dpinger.
Is that correct?
No, when dpinger can't get out, neither can the clients behind the router. However, other devices placed on the WAN side work.
@stephenw10 said in WAN connection dropping intermittently:
One possibility is that you have one of the inconvenient ISPs that seem to forget your MAC address! We have seen a few users hit that and work around it by setting a lower ARP timeout. However, that breaks all traffic.
It's a relatively small ISP and they've been pretty responsive - I could try asking them if I only knew what to ask :) But wouldn't that behaviour from the ISP have the same impact on other devices connected in place of pfsense?
-
Effectively the ISP gateway device loses your WAN from its ARP table and it doesn't ARP for it. Instead it waits until pfSense renews its ARP entry for the gateway.
Try setting:
sysctl net.link.ether.inet.max_age=300
That is 1200s by default, 20mins. If that seems to prevent it that confirms it's an ARP issue somewhere.
-
Thanks Stephen!
I've made that update - will report back either if it continues dropping or in ~24 hours, by which time it definitely would have dropped without this change.
-
Is this your WAN IP :
?
I thought it was an RFC1918 IP.
Using a switch on the WAN side, and with pfSense getting this 194.x.x.192 as its WAN IP, what IP was used by the PC that was also hooked up to that switch? How did this PC obtain a 'LAN' IP?
-
That is indeed the WAN IP.
The gateway is on the same subnet (just ending in 3 instead of 192). For the laptop on the WAN side I just grabbed another IP in the same subnet (it's a static IP setup so no DHCP), hoping they hadn't locked it down (which it turns out they hadn't).
Like I wrote a couple of responses above, it's a small ISP :-)
Cheers!
Alex
-
Ok, great, but the IP you assigned yourself could be assigned to someone else.
(and now 'ARP' gets confused, and the other person could experience WAN IP outages ... ^^)
-
True, so I stopped doing that as soon as I had results from the test :-)
That said, there are only a few (<5) other users on this subnet (which seems accurate when I stare at ARP broadcasts), since almost all apartments have their home networks managed directly by the ISP (sitting behind their firewall and gateway), whereas I'm bypassing that.