Intermittent loss of WAN routing

michmoor

@AGawthrope what IP are you monitoring? Something on the internet or your provider's gateway (a public IP, I would assume)?

AGawthrope

@michmoor The service provider's gateway. It is known to pfSense+ as its default gateway and is obtained via DHCP.

AGawthrope

A little further info from some analysis of the 'dpinger' log over the last six days. The problem has occurred each day and healed itself after 20 mins of outage. The times in the table are those from the log, i.e. 'Status|System Logs|System|Gateways', so I'm not surprised that the difference between the first Alarm latency message and the corresponding Clear latency isn't exactly 20 mins.
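For anyone wanting to repeat this kind of analysis, here is a minimal Python sketch that pairs each Alarm latency line with the next Clear latency line and reports the gap in minutes. The line layout (BSD-style syslog timestamp, then a dpinger message containing "Alarm latency" or "Clear latency") is an assumption, as are the function name `outage_durations` and the `year` parameter; adjust the pattern to whatever your own Gateways log actually looks like.

```python
import re
from datetime import datetime

# Assumed line shape (BSD syslog style); adjust the pattern to match your own log.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s.*dpinger.*?"
    r"(?P<event>Alarm|Clear) latency"
)


def outage_durations(lines, year=2024):
    """Pair each Alarm with the next Clear; return (start, minutes) tuples."""
    outages = []
    alarm_at = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        # BSD syslog timestamps carry no year, so one is supplied by the caller.
        stamp = " ".join(m["ts"].split())
        ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if m["event"] == "Alarm" and alarm_at is None:
            alarm_at = ts  # first Alarm of this outage
        elif m["event"] == "Clear" and alarm_at is not None:
            minutes = (ts - alarm_at).total_seconds() / 60
            outages.append((alarm_at, minutes))
            alarm_at = None
    return outages
```

Feeding it the lines of the exported log gives one tuple per outage, which makes it easy to check how close each one is to 20 mins.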

Each outage is very close to 20 mins, which is the ARP refresh timeout...

Edit: I should add that there were no alarm messages logged between the 1st and 3rd of June.

Screenshot 2024-06-06 at 17.44.12.png

stephenw10

If you see outgoing traffic but no replies that sounds like an ARP issue upstream. Perhaps something else sometimes using your IP address.

One thing you can try is setting net.link.ether.inet.max_age to something shorter than 1200 and seeing what difference that makes.

AGawthrope

@stephenw10 Thank you. Possibly, but that feels a little esoteric. My earlier investigations looked at the provision of an IP address and the ISP DHCP server always provides me with the same IPv4 address. If it was also providing it to another then I'd expect some variance from time to time.

Would I be correct in thinking that I could set the max_age variable via the System Tunables page?

Thanks
Andrew

stephenw10

Yes you can set that as a system tunable.

It doesn't have to be the ISP handing out your address via DHCP. Just some other device sending ARP packets with your IP. Potentially.

AGawthrope

@stephenw10 Thanks for that and understood. Getting anything sensible from the ISP regarding duplicate IPs is a non-starter. Their technical 'support' haven't even heard of IPv6! I'll leave changing the ARP timeout for a few days as I'm keen to see if there is any pattern to the problem.

I'm also keen to hear what others may suggest.

Thanks
Andrew

AGawthrope

I wanted to post an update to close-off this thread for now.

Further analysis has identified that the problem is only occurring Monday through Friday and during working hours. This alone makes me suspicious that the root cause is an ISP-triggered event.

Nominally my pfSense+ WAN interface receives an ARP Request from the ISP virtual gateway/router (the client-facing interface of a VRRP group router) every 60 secs. pfSense+ is configured with a 1200 sec ARP table timeout. Because pfSense+ relearns the MAC address of the ISP gateway/router from these 60 sec exchanges, it does not originate its own ARP Request every 1200 secs. So when the intermittent problem occurs (which it is still doing) and ALL packets cease, including ARP Requests from the ISP virtual gateway/router, the ARP table entry in pfSense+ ages for 19 to 20 mins until it expires. At that point pfSense+ originates an ARP Request and nominal service/routing is fully restored, including the 60 sec reception of ARP Requests from the ISP.
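The 19 to 20 min figure falls out of the two timers. A rough Python sketch of the arithmetic follows, assuming the ARP entry was last refreshed by an ISP ARP Request somewhere in the 60 secs before inbound traffic stopped, and that pfSense+ only originates its own request when the entry reaches its full age (a simplification of what the kernel really does). The function name `outage_window` is just for illustration.

```python
ISP_ARP_INTERVAL_S = 60   # spacing of the ISP's ARP Requests, from the post
ARP_MAX_AGE_S = 1200      # pfSense+ ARP table timeout, from the post


def outage_window(arp_interval_s, max_age_s):
    """Return (shortest, longest) outage in seconds before the entry expires."""
    # The last refresh happened between 0 and arp_interval_s seconds
    # before inbound traffic stopped, so that much of the age may be used up.
    shortest = max_age_s - arp_interval_s
    longest = max_age_s
    return shortest, longest


shortest, longest = outage_window(ISP_ARP_INTERVAL_S, ARP_MAX_AGE_S)
print(f"{shortest / 60:.0f} to {longest / 60:.0f} minutes")  # 19 to 20 minutes
```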

If, during the period when no traffic is being received from the ISP and before the pfSense+ ARP table entry for the ISP gateway/router times out, any traffic is sent to the ISP gateway/router, no response will be received.

On expiry of the ARP table entry in pfSense+, or by forcing pfSense+ to originate a new ARP Request (for any ISP host on the same subnet as the pfSense+ WAN interface that is not already in the pfSense+ ARP table), nominal service is immediately restored with the resumption of all incoming traffic, including the 60 sec ISP ARP Requests.

For now, I'm thinking that something is occurring on the ISP side which stops them originating ARP Requests and they quickly lose layer-2 addressing knowledge of my pfSense+ interface. Hence when pfSense+ transmits an ARP Request the ISP is able to relearn the Ethernet and IPv4 address of the pfSense+ WAN interface and is thus able to start passing traffic again.

As trying to communicate this to the ISP will be hell, I plan to explore all possible causes on my side. So my next step is to monitor all traffic passing between the Netgate 4100 and the ISP ONT on a separate computer with a promiscuous interface. If, when the problem occurs, I don't see incoming traffic then I'm confident it's an ISP problem; if I do see incoming traffic then I know it's a pfSense+ problem.

@stephenw10 suggested reducing the ARP timeout period. I'm confident this would reduce the duration during which no traffic is received, but to provide a usable workaround the timeout would need to be very short and would thus create a high volume of ARP traffic.
For now I'm running a script which pings the ISP default router/gateway and, when no response is received, pings a different host on the same subnet which is not in the pfSense+ ARP table. This restores service within the ping period (1 sec), making a usable workaround.
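The script itself isn't posted here, so the following is only a minimal Python sketch of the same idea, not the actual script. The names `ping_once`, `watchdog_step` and `run` are made up for illustration, the system `ping` command with `-c 1` is assumed to be available, and the caller is assumed to pick a probe address on the WAN subnet that has no ARP table entry.

```python
import subprocess
import time


def ping_once(host, timeout_s=1.0):
    """Return True if a single echo request to host is answered in time."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def watchdog_step(gateway, probe_host, ping=ping_once):
    """Ping the gateway; on failure ping probe_host to provoke an ARP Request.

    Returns True when the probe was sent.
    """
    if ping(gateway):
        return False
    # The reply does not matter; the outgoing ARP Request is the point.
    ping(probe_host)
    return True


def run(gateway, probe_host, interval_s=1.0):
    """Repeat the check forever, once per interval."""
    while True:
        watchdog_step(gateway, probe_host)
        time.sleep(interval_s)
```

Calling `run()` with the gateway address and a spare address on the same subnet would repeat the check every second. The `ping` argument is there only so the logic can be exercised without touching the network.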

Andrew

stephenw10

Fun! If you just have something pinging continually against another host does that also fail? Since it would already be in the arp table, if it's a real host.

AGawthrope

@stephenw10 Yes, indeed :-). When pinging something continually and the problem occurs, it will fail until pfSense+ ages and renews the ARP table entry or, as with my script, until any ARP Request containing the layer-2 and layer-3 addresses of the pfSense+ WAN interface is transmitted to the ISP.

Thanks @stephenw10.

Andrew