is there a Supported way to have DPinger restart the wan interface on failure.

mikek

Forgot to list the Version: 23.05.1-RELEASE (amd64)

mikek

Just experienced this again. The ARP table has entries for the WAN IP but not for the gateway. DHCP seems to retry many times prior to the actual event. Opening the WAN configuration screen and saving/applying with no changes brings everything back. I really want/need some help creating a firewall that can recover from this. I will provide whatever information you request if possible. Help please.

Gertjan

@mikek said in is there a Supported way to have DPinger restart the wan interface on failure.:

I am not getting a lease from modem. I reject leases from 192.168.100.1

That's not a Starlink ?

@mikek said in is there a Supported way to have DPinger restart the wan interface on failure.:

sendto error 64

64 EHOSTDOWN Host is down. A socket operation failed because the destination host was down.
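
The number after `sendto error` is a FreeBSD errno value. A minimal sketch for decoding it from a log line, assuming the line looks like the ones quoted in this thread. The table holds only a few FreeBSD values (Linux numbering differs), and `decode_sendto_error` is an illustrative name, not part of dpinger:

```python
import re

# A few FreeBSD errno values; Linux uses different numbers for these names.
FREEBSD_ERRNO = {
    50: "ENETDOWN",
    51: "ENETUNREACH",
    64: "EHOSTDOWN",
    65: "EHOSTUNREACH",
}

def decode_sendto_error(line):
    """Return (code, name) for a 'sendto error: N' log line, or None if absent."""
    match = re.search(r"sendto error:?\s*(\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, FREEBSD_ERRNO.get(code, "unknown")

print(decode_sendto_error("dpinger: wan_dhcp x.x.x.x : sendto error: 64"))
# (64, 'EHOSTDOWN')
```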

Check the System log: didn't the WAN NIC fire a DOWN (and UP) event? That is what happens when the cable is removed or plain bad, or when the other side pulled the connection DOWN for a moment.
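
One way to look for such events is to filter the system log text for link state changes. A minimal sketch, assuming the usual FreeBSD kernel wording `link state changed to UP/DOWN`. The sample lines are made up for illustration, and `igc0` is the WAN NIC name that shows up elsewhere in this thread:

```python
import re

# Assumed kernel message shape: "<nic>: link state changed to UP|DOWN"
LINK_EVENT = re.compile(r"(\w+): link state changed to (UP|DOWN)")

def link_events(lines, nic):
    """Return a list of (line, state) for link state changes of one NIC."""
    events = []
    for line in lines:
        match = LINK_EVENT.search(line)
        if match and match.group(1) == nic:
            events.append((line.rstrip("\n"), match.group(2)))
    return events

# Hypothetical sample lines, not taken from the thread.
sample = [
    "kernel: igc0: link state changed to DOWN",
    "dpinger: wan_dhcp x.x.x.x : sendto error: 64",
    "kernel: igc1: link state changed to UP",
    "kernel: igc0: link state changed to UP",
]
for line, state in link_events(sample, "igc0"):
    print(state, "|", line)
```

An empty result for the WAN NIC around the time of the failure would match the "no up or down events" observation.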

The WAN driver knows there is 'nobody' (no host) at the other side of the link, so it signals a '64'.

A test: place a switch between your upstream WAN device and pfSense.

mikek

Gertjan, thanks a lot for the reply.

Arris SB8500 cable modem connected to Comcast.
Had a switch between pfSense and the modem for a week. It made no difference to the behavior.
No WAN up or down events that I can find.

This is not a great example as i was sitting in front of the machine trying to connect to work when the event occurred. so i almost immediately reset the interface to get back online.

18:17:31 dpinger: sends an alarm stating wan_dhcp has packet loss of 31%
18:17:32 php-fpm: one or more tunnel endpoints may have changed ip addresses. reloading endpoints that may use wan_dhcp

18:17:33 I get a bunch of filterdns messages about resolving alias to tables

Then it's absolute silence until 18:35:43 when the "wan_dhcp x.x.x.x : sendto error: 64" events start showing up from dpinger.
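
To put a number on that silence, the gap between consecutive log timestamps can be computed. A minimal sketch using the times quoted in this post, assuming they all fall on the same day; `longest_gap` is an illustrative helper name:

```python
from datetime import datetime

def longest_gap(timestamps):
    """Return (seconds, start, end) for the longest gap between consecutive HH:MM:SS stamps."""
    times = [datetime.strptime(t, "%H:%M:%S") for t in timestamps]
    gaps = [
        ((later - earlier).total_seconds(), timestamps[i], timestamps[i + 1])
        for i, (earlier, later) in enumerate(zip(times, times[1:]))
    ]
    return max(gaps)

stamps = ["18:17:31", "18:17:32", "18:17:33", "18:35:43", "18:36:02"]
print(longest_gap(stamps))  # (1090.0, '18:17:33', '18:35:43')
```

That is a little over 18 minutes with nothing logged between the first packet loss alarm and the first sendto error.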

18:36:02 i get loss from opt1_vpnv4 from a gateway alarm. it restarts the openvpn tunnel.

still getting "wan_dhcp x.x.x.x : sendto error: 64" events repeating from dpinger

18:36:26 i get a single filterdns message saying it "failed to resolve host xxx will try later again." yes it says "later again" ;)

still getting "wan_dhcp x.x.x.x : sendto error: 64" events repeating from dpinger every second

18:38:38 openvpn tries to restart
18:38:58 openvpn tries to restart
18:39:08 openvpn tries to restart
18:39:18 openvpn tries to restart
18:39:28 openvpn tries to restart
.....

OpenVPN continues this behavior over and over. Meanwhile I am still getting "wan_dhcp x.x.x.x : sendto error: 64" events repeating from dpinger every second.

I then open the WAN configuration screen and click save/apply:
18:41:24 I get a flood of different messages:
18:41:24 dhclient : connection closed
18:41:24 dhclient : exiting,

some filterdns "failed to resolve host ... will retry later again" messages,
followed by filterdns events for resolving aliases to tables.

then these messages start repeating:
18:41:25 kernel : arpresolve: can't allocate llinfo for x.x.x.x on igc0

while that is happening dpinger exits a couple of times:
18:41:25 dpinger : exit on signal 15

then it happens
18:41:25 dhclient : PREINIT
dhclient : broadcast request
dhclient : dhcpack from server
I get new route and interface assignments etc...
then I get "bound to" the IP with a renewal in 11113 seconds

followed by a completely working firewall again.

Gertjan

After this initial event :
@mikek said in is there a Supported way to have DPinger restart the wan interface on failure.:

18:17:31

What happens on the other channel .... DHCP Logs ?

filterdns messages : that's normal, they complain because there is no working WAN connection, so "DNS" is out of order.
Same thing for the OpenVPN client : it uses the WAN to create its tunnel. That's also a no go while "WAN" is down. It tries to restart without being able to use the WAN interface.

For example these

@mikek said in is there a Supported way to have DPinger restart the wan interface on failure.:

18:41:24 dhclient : connection closed
18:41:24 dhclient : exiting,

dhclient, the DHCPv4 client process, quits because the WAN interface went down ( ? ), as if it was electrically (physically) disconnected.
Afaik : it gets started as soon as the WAN comes up.

The thing is : what is doing all this ?

You said :

@mikek said in is there a Supported way to have DPinger restart the wan interface on failure.:

The solution:
Simply open the (interfaces / wan) configuration screen, make no changes and click save.
Everything immediately returns to a functional state.

My turn :
My ISP router is up and running - it's a device using a fibre uplink.
As soon as I do something like this :
open the (interfaces / wan) configuration screen, make no changes and click save
(identical to what you said)
my WAN connection goes into a permanent UP down UP down sequence.

I've already stopped using the DHCPv4 WAN client; I now use a static IPv4 setup. That was no joy.
Soon, I'm going to use a static IPv6 WAN setup (assuming the prefix I use doesn't change; I don't know if this is even possible).
Then : I'm going to ditch all packages that are 'interface' related, one by one.
I'm not using the OpenVPN client, but I do have the OpenVPN server for remote administration.

This isn't a big issue for me, as my ISP router stays up and connected.
pfSense, once started up, is rock solid.

I have a Netgate 4100, using ix3 as my 1Gbit WAN.

I have the impression that we chase the same bug.
Some race condition during "WAN reconstruction".

edit : I don't have the "sendto 64" error messages.

mikek

dhcpv6 disabled on fw and all internal devices. have no real need for it currently.

dhcp logs last event
09:46:39 dhclient : bound to x.x.x.x --renewal in 43200 seconds
18:41:24 dhclient : connection closed

at 18:41:24 is when i initiated an action on the console by opening the wan configuration screen and clicking save then apply.

at 18:41:26 i have a completely working firewall that is stable again until the next event.

The next event could be in a few hours or a day or two later.

Gertjan

@mikek

The renewal comes later : 43200 seconds, or 12 hours, after the lease was bound.
If the event happens before that, then dhclient is innocent.
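
A quick check of that reasoning with the times from the logs above. This is only a sketch, assuming the 09:46:39 "bound" line and the 18:17:31 dpinger alarm are from the same day:

```python
from datetime import datetime, timedelta

def renewal_due(bound_at, renewal_seconds):
    """Time at which dhclient would attempt its renewal."""
    return bound_at + timedelta(seconds=renewal_seconds)

bound = datetime.strptime("09:46:39", "%H:%M:%S")        # "bound to x.x.x.x"
first_alarm = datetime.strptime("18:17:31", "%H:%M:%S")  # first dpinger alarm

due = renewal_due(bound, 43200)
print(due.strftime("%H:%M:%S"))  # 21:46:39
print(first_alarm < due)         # True: the failure came before the renewal was due
```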

mikek

Should dhcp client not have tried to renew half way through the lease at 6 hours? not a network engineer but that is my understanding.

I also get frequent long sections in my dhclient log with repeated DHCPREQUEST messages, which is why I suspected DHCP of possibly being involved.

hardware configuration:
intel: NUC13ANHI5 - 64GB RAM - 1TB drive - additional i226-V NIC expansion.

Gertjan

@mikek said in is there a Supported way to have DPinger restart the wan interface on failure.:

renew half way through the lease at 6 hours?

I vote for 43200, as it says "renewal in 43200".
If the "lease period" is 43200 then dhcp client will renew half way, true.

mikek

Yes, you are right. It appears the lease expiration is:
option dhcp-lease-time 86400;
option dhcp-renewal-time 43200;
option dhcp-rebinding-time 75600;

Which means I should see a renew attempt in the logs at 12 hours. but it never made it that long. Yet another bunny trail.
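
Those three values line up with the default DHCP timer ratios from RFC 2131: renewal (T1) at 0.5 of the lease time and rebinding (T2) at 0.875 of it. A small sketch of that arithmetic; `default_timers` is just an illustrative name:

```python
def default_timers(lease_seconds):
    """Return (T1, T2): default renewal and rebinding times for a lease, in seconds."""
    t1 = lease_seconds // 2        # renewal at 50% of the lease
    t2 = lease_seconds * 7 // 8    # rebinding at 87.5% of the lease
    return t1, t2

print(default_timers(86400))  # (43200, 75600)
```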

All of this and we are right where we started: with an issue we can't fully identify yet, and no way to help the firewall recover, even though we know the steps needed for recovery. Any idea how to force dpinger to initiate self recovery on failure in a supportable way? At least then the failures would be short until the issue/resolution could be discovered.

Guess I could cron a reboot or interface down/up every X hours, but that does seem a bit excessive, intrusive and may even be worse than the original failures.
Be nice if there was a "re-initialize all" option on failure instead of just state options for interface failure actions.

There are a couple of scripts I have found on the internet that reboot only on ping failure, but in the years I have used pfSense I have never needed them before. I don't really want to start down the road of adding unsupported scripts. Seems like a recipe for future troubles.
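
For reference, the general shape of such a ping watchdog is roughly the following. This is an unsupported sketch, not a pfSense feature: it assumes a Python interpreter is available on the box and FreeBSD's `ping` flags, and the `recover` callable is a placeholder, because what it should run in a supported way is exactly the open question here.

```python
import subprocess
import time

class FailureCounter:
    """Counts consecutive probe failures and signals when a threshold is hit."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.failures = 0

    def observe(self, ok):
        """Record one probe result; return True when recovery should fire."""
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.failures = 0  # start counting afresh after firing
            return True
        return False

def ping_once(host):
    # FreeBSD ping: -c 1 sends one packet, -t 2 exits after 2 seconds.
    # (On Linux, -t means TTL instead, so this flag set is FreeBSD-specific.)
    result = subprocess.run(
        ["ping", "-c", "1", "-t", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def watch(host, recover, threshold=5, interval=10,
          probe=ping_once, sleep=time.sleep, rounds=None):
    """Probe host every `interval` seconds; call recover() after `threshold` straight failures."""
    counter = FailureCounter(threshold)
    done = 0
    while rounds is None or done < rounds:
        if counter.observe(probe(host)):
            recover()  # hypothetical hook: whatever re-initializes the WAN
        sleep(interval)
        done += 1
```

Requiring several consecutive failures before acting keeps a single lost ping from bouncing the interface, which addresses part of the "worse than the original failures" worry.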

KGillesRomer

@mikek, it looks like I’m facing the same problem. Were you able to find a solution?

mikek

@KGillesRomer the solution i found was to buy a netgate 4200 appliance. the issue seems to be related to hardware. more specifically i believe it is related to the intel nic used on that series of NUC.

know that's not what you wanted to hear. but it is how i fixed it.

btw, the 4200 is rock solid with all the same other equipment and cables.

Mike

Gertjan

@mikek said in is there a Supported way to have DPinger restart the wan interface on failure.:

i believe it is related to the intel nic used on that series of NUC

While I agree with the "slap a 4x00 and you're good" approach (I have a 4100 myself and it's a set-it-and-forget-it device), any NIC, even a Realtek NIC, should have no issues with the world's most simple and basic protocol : DHCP : simple small size UDP packets. If NICs in 2025 can't handle a protocol dating from the nineties, last century, then the NIC (or its driver) belongs in the waste bin.
NICs can't influence lease time, or anything else whatsoever. NICs are there to put info on an electric wire, and take it off the wire, no more, no less.
Afaik, and I believe this without any proof, the compiled Kea on ARM (32-bit) devices could have issues. On x86_64 devices, Kea is rock solid.

Also, because you now use a 4200 you have access to pfSense plus, which uses the latest Kea, and the latest GUI (= one big config file generator) and a good DHCP-to-DNS integration.
Kea on 24.11 is rock solid for me.
My LAN has mostly DHCP MAC Lease devices.
My second LAN, a captive portal, with only unknown devices (every possible brand, type OS etc) as my hotel clients bring them along : works fine.
Imho : if you use 2.7.2, stay with ISC DHCP. 2.8.0 will bring a usable kea DHCP.

mikek

@Gertjan there may be a misunderstanding of the issue I was facing. My DHCP issue was with my ISP providing pfSense an IP, not with internal DHCP leases from pfSense. Ultimately I concluded that DHCP was a symptom of a deeper issue with the Intel I226-V implementation and BIOS on the specific series of Intel NUC I deployed on, or an issue with the implementation of the 2.5Gb NIC on my ARRIS cable modems (I have 2 different ones) communicating with pfSense. I believe this is why a different hardware configuration, i.e. the Netgate 4200, resolved my issues.

I believe the DHCP issue was a symptom, not the core issue. The core issue, I believe, was that the entire connection would self-implode without logging any event, so DHCP then tries to renew and fails.

btw, I was running pfSense Plus on the NUC and not the Community Edition. From a previous post: "Forgot to list the Version: 23.05.1-RELEASE (amd64)"

Hope this helps you some.

--Mike

Wylbur

@mikek I think I have had a problem similar to yours. My ISP uses IPv4 and fiber optic cable. The ONT (aka "cable modem") would be told by the ISP to change IP address. This caused Realtek and Intel Ethernet cards to stop communicating. An IPL (a reboot) would fix it (in my case). So I tried Watchdog. That seems to handle it (for me). Don't know if it will work for you. I'm told this is caused by the way the system changes IP addresses and the IP addresses it uses (class B?).

HTHs.
Wylbur

Wylbur

@Wylbur

Sorry I replied to the wrong person. But I think you were also having a similar problem.