WAN failover works, but it leaves the gateway in "Pending" status and does not delete orphaned states
-
I have a dual-WAN setup with fiber on WAN1 and a slow backup on WAN2. Both WANs have dynamic IPs assigned by DHCP. They are in a failover gateway group, with WAN1 in Tier 1 and WAN2 in Tier 2, with the intent that WAN2 will only be used if the fiber goes down. But when I have a failover event, my system logs are flooded with arpresolve error messages like the one pasted below, occurring 50 to 100 times per SECOND (in two minutes I had 8,000 of these entries in my log!):
2/28/2024 13:16 kernel arpresolve: can't allocate llinfo for XX.XX.XX.1 on igb0
(XX.XX.XX.1 is the gateway of WAN1, and igb0 is the WAN1 interface).
I suspect this is the result of some kind of configuration problem, but I've been scouring these forums and the Netgate support materials and haven't been able to find a solution. I did find some BSD documentation with this explanation of the error message:
**arpresolve: can't allocate llinfo for %d.%d.%d.%d** The route for the referenced host points to a device upon which ARP is required, but ARP was unable to allocate a routing table entry in which to store the host's MAC address. This usually points to a misconfigured routing table. It can also occur if the kernel cannot allocate memory.
Memory is not an issue: usage is currently at 12% of 4 GB. So I checked all the pfSense routing-related diagnostics and quickly found a smoking gun: dozens of orphaned state entries involving the offline WAN1 gateway remain in the state table after the failover. If I manually delete all the orphaned states, the arpresolve errors stop. So it seems the problem I need to solve is why the orphaned states aren't being removed when failover occurs.
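For anyone who wants to reproduce that check from a shell instead of the GUI, here's a minimal sketch, with XX.XX.XX.1 standing in for the WAN1 gateway address (newer pfSense builds may also accept a gateway-based kill; check `man pfctl` on your version):

```sh
# list state-table entries still referencing the dead WAN1 gateway
pfctl -ss | grep XX.XX.XX.1

# kill all states from any source to that address
pfctl -k 0.0.0.0/0 -k XX.XX.XX.1
```

This is roughly what Diagnostics > States does when you filter on the gateway IP and kill the matches.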
Interestingly, if I force a failover in software by checking the box "Mark Gateway as Down", the status of the gateway becomes "Offline (forced)", and the states associated with the downed gateway ARE deleted automatically. But if I force the failure by disconnecting the WAN Ethernet cable, the states aren't removed. The firewall does fail over to WAN2, but the WAN1 gateway, instead of being flagged as "Offline", goes into a "Pending" status, where it stays indefinitely. It seems that "Pending" isn't sufficient to trigger the cleanup of the orphaned states.
The log entries below show that pfSense is certainly aware that the interface and Gateway are in trouble. The Gateway alarm showing 100% loss occurs within seconds of disconnecting the Ethernet cable.
2/28/2024 13:16 kernel igb0: link state changed to DOWN
2/28/2024 13:16 php-fpm 58894 /rc.linkup: Hotplug event detected for WAN1(wan) dynamic IP address (4: dhcp)
2/28/2024 13:16 php-fpm 58894 /rc.linkup: DEVD Ethernet detached event for wan
2/28/2024 13:16 rc.gateway_alarm 6429 >>> Gateway alarm: WAN1_DHCP (Addr:XX.XX.XX.1 Alarm:down RTT:0ms RTTsd:0ms Loss:100%)
So why isn't this triggering a definitive "Offline" status for the gateway, and a corresponding cleanup of the state table?
I have tested all the options I could find related to state-killing behavior, including the global "Reset all states if WAN IP Address changes", but that didn't seem to help. I also tried changing the monitoring IP addresses to various recommended settings, making sure to use a unique one for each WAN. So far nothing has worked.
As soon as I plug the WAN1 cable back in, everything fails back immediately. So overall things seem to be working as they should, other than my logs getting overwhelmed by arpresolve errors. Perhaps the errors are benign, but I don't see other people complaining about them, so I suspect I have some kind of configuration problem that could bite me in some other way when I need this to work and I'm 3,000 miles away!
Would appreciate any suggestions for additional troubleshooting steps.
-
This may or may not be related to your issue. My config wasn't using fiber for my 3 WANs, but I found a problem with WANs showing a "Pending" status under Gateways when running pfSense version 23.09.1-RELEASE. I documented it and filed a bug report this week.
It appears to be caused by the dpinger process and is tied to the WAN getting its address from the ISP via DHCP; it does not happen if the ISP assigns the WAN a static IP.
It occurs if the cable (RJ45) between the modem and the router/firewall is unplugged and replugged, or if the modem is power-cycled. The problem happens with a single WAN and is not related to using a failover configuration. The gateway can stay stuck in the Pending state, and it's hard to get it out. Try disabling the interface, then make sure the modem is fully online and ready to issue an IP by DHCP, and then reactivate the interface. I've also tried releasing the IP and reassigning it under the Interfaces page.
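If you want to check from a shell whether dpinger itself is the process that's wedged, a quick look like this helps (just a sketch; the arguments shown for each instance will differ per gateway):

```sh
# show the dpinger instances pfSense launched and which monitor IPs they ping
ps auxww | grep '[d]pinger'
```

As I understand it, a gateway shows "Pending" when pfSense has not yet received monitoring results from dpinger for it, so a missing or freshly restarted instance here lines up with that status.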
If your ISP is Comcast, I also found a bug in their DHCP lease handling when the default lease time is changed: the modem fails at half the assigned lease time, and the logs show the monitoring pings going unanswered. (Half the lease time is exactly when a DHCP client first attempts renewal, the T1 timer, so the renewal exchange is the likely culprit.) The way to fix this was to issue another DHCP lease, or to release the IP and reassign it under the Interfaces page.
Another problem I found is that modems in bridge mode will randomly disconnect because they stop responding to ARP requests. I think the pfSense default for net.link.ether.inet.max_age (the ARP cache timeout, in seconds) is 1200; I found it necessary to reduce it to under 5 minutes, and I am using 240, set under System/Advanced/System Tunables. So far, this change appears to help stop the disconnects.
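If you'd rather try the tunable from a shell before committing it in the GUI, something like this works on FreeBSD-based systems (it takes effect immediately but does not survive a reboot, which is why the System Tunables entry is still needed):

```sh
# check the current ARP cache timeout in seconds (FreeBSD default is 1200)
sysctl net.link.ether.inet.max_age

# drop it to 4 minutes so stale entries get re-ARPed sooner
sysctl net.link.ether.inet.max_age=240
```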
I reported these bugs to Comcast, but it will likely take them a very long time to fix.
-
@halp thanks for your comments; I'll chase down those leads. Some interesting observations for sure.
I had read elsewhere that having static IP addresses on the WANs solved some similar problems, so I'll test whether it solves it in my case. I can't run a test right now, so I'll report back in a couple of days.
Thanks
-
Following up on @halp's comment about this problem not occurring with a static IP, I set up a fresh pfSense 2.7.2 test box and used a couple of other routers to mimic WAN1 and WAN2. I kept the test box as close as possible to its stock state, changing only the settings required to activate dual WANs and a failover gateway group. I used DHCP address assignments for both WANs and let pfSense create the default gateways. Then I pulled the plug on one of the WANs, and it did exactly what my live system did: filled my logs with arpresolve errors. But when I changed the WAN to a static IP, that resolved the problem. Failover now works perfectly, with no unexpected errors in the logs, just as @halp had indicated.
So is this a bug? It seems that the logic used to determine whether a gateway group member is down does not work properly for gateways assigned by DHCP.
-
SOLVED! On my test rig I tried a state-killing option that had NOT solved the problem on my live box, but on the test rig it worked. The setting is under System/Routing/Gateways: "State Killing on Gateway Failure". After changing it from the default to "Kill states using this gateway when it is down", subsequent failover events created a few arpresolve errors in the log, but within 1 second they stopped, right after a log entry showing the state-killing action:
/rc.filter_configure_sync: GW States: Killing states for dynamic down gateway: WAN_DHCP, XX.XX.XX.1
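To confirm the cleanup actually happened, the same hedged check from earlier applies (XX.XX.XX.1 again standing in for the failed gateway's address):

```sh
# after a failover this should now return nothing
pfctl -ss | grep XX.XX.XX.1
```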
Once that worked, I had to figure out why it solved the problem on my test rig but not on my live box. Eventually I traced it to a setting in System/Advanced/Miscellaneous, in the Gateway Monitoring section: "Skip rules when gateway is down". My live box has some traffic that must be routed only through a VPN, so years ago I had enabled that setting ("Do not create rules when gateway is down") to make sure that, if the VPN was down, pfSense wouldn't route the traffic through the non-VPN WAN. But as soon as I cleared that checkbox, my failover arpresolve problem went away. So apparently that setting interacts with failover in a way that prevents the state-killing action from working properly.
Next job is to figure out a different way to kill VPN-bound traffic if the VPN is down... Googling that now.
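One approach that comes up often (untested by me here, so treat it as a sketch) is to tag the VPN-bound traffic on its way in and add a floating rule that blocks anything carrying that tag from leaving via the regular WAN. In pf.conf-style notation it would look roughly like this; the interface macros and the client address are made-up placeholders, and pfSense generates its actual ruleset itself:

```
# policy-route the VPN-only client through the VPN gateway and tag its traffic
pass in quick on $lan_if route-to ($ovpn_if $ovpn_gw) from 192.168.1.50 to any tag VPN_ONLY

# if the VPN is down and the traffic falls through to the plain WAN, drop it
block out quick on $wan_if tagged VPN_ONLY
```

In the GUI this corresponds to setting the "Tag" field on the LAN pass rule and creating a floating block rule on WAN, direction out, matching the same value in "Tagged".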