pfSense 2.7 gateway groups (default gateway) functionality broken causing flapping
-
Hi, I have a simple setup that has been working forever.
These issues started since 2.7 (and did not exist in 2.6).
2 gateways in 1 gateway group (tier 1, tier 2), default gateway set to the failover group in routing.
In 2.6 this worked perfectly; in 2.7 pfSense can't figure out which gateway is the default (the globe icon next to the gateway on the main dashboard page disappears from both interfaces).
Setting the default gateway to the tier 1 gateway does not fix the issue.
Setting the default gateway to the gateway group does not fix the issue.
Setting the default gateway to automatic does not fix the issue.
Deleting the gateway group does not fix the issue.
Disabling the gateway monitoring action for the tier 1 interface seems to resolve the issue. Normally, when the gateway ping check fails, the link is first marked degraded and goes through several statuses before traffic switches to the failover (tier 2) interface. Here that normal behavior is not happening: the link is not marked down, it simply loses the "globe" (default gateway) icon, and the icon does not appear for the failover gateway interface in the dashboard either.
My primary gateway isn't flapping, it's rock solid. However, my secondary gateway is flapping; it's a 5G LTE link that can come up and down due to monitoring. It should not affect the primary link if that link is solid, but I think that may be what is happening here.
Maybe a regression?
https://redmine.pfsense.org/issues/11692
https://redmine.pfsense.org/issues/13228
https://redmine.pfsense.org/issues/14327
-
What do you have set for State Killing on Gateway Failure in Sys > Adv > Misc?
The behaviour sounds like it might be set to flush all states.
-
@stephenw10
sys -> adv -> misc is set to "kill states for all gateways which are down".
So now, even with the other gateway interface disabled, from time to time the only remaining (default) gateway loses its little "globe" icon, and the connectivity to the gateway goes down. I wish this was just a cosmetic issue, but connectivity really does drop when the globe icon is missing.
In this first screencap we can see the gateway configuration, notably absent is the globe icon that should normally display for the fiber interface:
Here we can see the dashboard displaying the relevant gateway interfaces; again the globe icon is missing:
Here we can see that I've "deleted" the gateway from the gateways dialog, and although fiber is still selected as the default gateway and is online and up, it's missing the globe icon again:
In this final screencap, we can see the globe displayed now, and it appears that dnscrypt is again able to resolve dns queries being forwarded to it.
Note that currently the globe icon is still missing from the UI. I'd like to understand what (technically) drives this "globe icon" indicator to display in the UI, because I think the UI issue may be cosmetic, with an underlying series of events causing the actual problem. Again, I am running dnscrypt-proxy bound to a virtual IP, and I note that when the globe icon disappears from the default gateway, dnscrypt-proxy appears to be unable to resolve DNS queries forwarded to it by unbound, which is also running locally.
If it was only a cosmetic issue, I probably wouldn't have bothered investigating this in the first place, however, I would also now note that despite the gateway group being deleted and only 1 active gateway, the issue is still present.
Disabling the other gateway interface from the interfaces page also doesn't restore the globe icon. So I'm stumped now.
The only thing that does restore the globe icon (at least temporarily, before it disappears again) is changing the default gateway in the routing->gateways dialog to something else, e.g. automatic.
When I did that, I noticed this in /var/log/system.log:
Dec 20 20:04:17 pfSense php-fpm[6686]: /system_gateways.php: Configuration Change: XXXX@192.168.1.XXX (Local Database): System - Gateways: save default gateway
Dec 20 20:04:17 pfSense check_reload_status[533]: Syncing firewall
Dec 20 20:04:18 pfSense php-fpm[22767]: /system_gateways.php: Removing static route for monitor 8.8.4.4 and adding a new route through X.X.X.X
Dec 20 20:04:19 pfSense php-fpm[22767]: /system_gateways.php: Gateway, NONE AVAILABLE <<<<<<<<<<<<<<<<<<<<<<<<<
Dec 20 20:04:19 pfSense php-fpm[22767]: /system_gateways.php: Default gateway setting XXXXX Fiber 2 Gigabit as default.
Dec 20 20:04:19 pfSense php-fpm[22767]: /system_gateways.php: The command '/sbin/route -n6 get 'default' 2>/dev/null | /usr/bin/egrep 'flags: <.*PROTO.*>'' returned exit code '1', the output was ''
Dec 20 20:04:20 pfSense check_reload_status[533]: Reloading filter
Running the command manually:
# /sbin/route -n6 get 'default'
route: route has not been found: No error: 0
How can I troubleshoot this issue further?
-
OK, I think I understand what happened. At some point I had 3 gateways in a gateway group.
One of these gateways was removed from the gateway group and its interface was disabled. However, looking at my routing table, I found 2 "default" routes present at the same time.
Destination   Gateway                       Flags   Netif   Expire
default       XXX.XXX.XXX.XXX               UGS     igb4
default       "supposedly.disabled.if.ip"   UGS     igb5
.......
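For anyone wanting to spot this condition quickly, here is a minimal sketch that counts the "default" entries in routing-table text. The function name `count_default_routes` is made up for illustration, and it assumes the column layout printed by `netstat -rn`, with the destination in the first column.

```shell
# Count "default" entries in routing-table text read from stdin.
# Assumes `netstat -rn` style output where the first column is the destination.
# The function name is hypothetical, not part of pfSense or FreeBSD.
count_default_routes() {
    awk '$1 == "default" { n++ } END { print n + 0 }'
}

# On the firewall itself (not executed here):
#   netstat -rn -f inet | count_default_routes
```

With a failover (non load balancing) setup like the one in this thread, a count above 1 for IPv4 would be worth a closer look.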
I issued a /sbin/route -n get 'default' command manually, showing:
   route to: 0.0.0.0
destination: 0.0.0.0
       mask: 0.0.0.0
    gateway: >>>>GATEWAY THAT WAS DISABLED IN THE UI IP ADDR HERE<<<<
        fib: 0
  interface: igb5
      flags: <UP,GATEWAY,DONE,STATIC>
 recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
       0         0         0         0      1500         0         0
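To compare what the kernel is actually using against what the UI claims, the gateway and interface can be pulled out of that output. This is only a sketch: `default_route_summary` is a made-up helper name, and it assumes the `gateway:` and `interface:` labels appear exactly as in the output quoted above.

```shell
# Print "<gateway> <interface>" from `route -n get default` style text on stdin.
# Prints nothing when either field is absent (for example, no default route).
# The function name is hypothetical, not part of pfSense or FreeBSD.
default_route_summary() {
    awk '
        $1 == "gateway:"   { gw = $2 }
        $1 == "interface:" { ifn = $2 }
        END { if (gw != "" && ifn != "") print gw, ifn }
    '
}

# On the firewall itself (not executed here):
#   /sbin/route -n get default | default_route_summary
```

If the interface printed here is one that is disabled in the UI, the routing table and the configuration have drifted apart.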
In the UI, I then enabled and disabled the interface for that gateway.
After doing that, I noticed in the ifconfig -a output that the interface no longer shows an IP address and is not in "UP" status.
I subsequently issued a route delete default command, which removed both default routes (the correct one and the stale one for the now-down interface), followed by adding a default route for the correct interface gateway.
I believe the issue is now resolved, since netstat -rn only shows 1 entry as 'default' now rather than two, and the route -n get default command now returns the correct gateway IP address.
FYI, this issue has been plaguing me for quite some time. It may be worth adding some logic to check for the presence of this issue if it is non-deterministic and/or non-reproducible. I unfortunately cannot provide reproduction steps that would lead to the loss of configuration sync between the UI and the OS, but I would note that the offending 3rd gateway interface was disabled in the UI, yet its interface was still up and had an IP address, and the routing table had two routes set to "default". I'm not sure what could be extrapolated from this as either a bug or an enhancement request to prevent the issue from recurring for others. It appears that the "disabled" state for the interface didn't quite make it down to the OS level, so the interface for the gateway was never brought down. The presence of two "default" routes might not be wrong in load balancing scenarios, but it is definitely a bad deal if the interface is disabled in the UI. Also, enabling/disabling the interface didn't seem to remove the duplicate default route entry corresponding to it.
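The manual repair described above (delete the default route, then re-add it via the correct gateway) could be sketched as follows. `reset_default_route` and the `APPLY` variable are made up for illustration; by default the function only prints the commands so they can be reviewed first. Whether a single `route delete default` clears every duplicate entry may differ between systems, so it is worth re-checking with netstat -rn afterwards.

```shell
# Print (dry run) or execute the manual default-route repair from the post.
# Argument: the IP address of the gateway that should be the default.
# Set APPLY=1 in the environment to really run the commands (needs root).
# The function name and APPLY variable are hypothetical helpers.
reset_default_route() {
    gw="$1"
    if [ -z "$gw" ]; then
        echo "usage: reset_default_route <gateway-ip>" >&2
        return 1
    fi
    run="echo"                          # dry run: just print each command
    [ "${APPLY:-0}" = "1" ] && run=""   # APPLY=1: execute for real
    $run /sbin/route -n delete default
    $run /sbin/route -n add default "$gw"
}
```

Calling `reset_default_route 203.0.113.1` (a documentation address used as a placeholder) prints the two route commands without changing anything.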
I would note that my gateway group is configured as "failover" rather than as a load balancing configuration, i.e. the interfaces in my gateway groups are usually categorized as tier 1, tier 2 and tier 3. So perhaps this config sync mixup between the UI and the OS happened at some point during a failover, and the configuration then remained in this state indefinitely until I intervened manually.
Anyhow, thank you kindly for your help and for responding to my forum post!