Netgate Discussion Forum

pfSense 2.7 gateway groups (default gateway) functionality broken causing flapping

General pfSense Questions
  • LC
    Dec 20, 2023, 4:31 PM (last edited by LC Dec 20, 2023, 4:49 PM)

    Hi, I have a simple setup that has been working forever.

    These issues started with 2.7 (they did not exist in 2.6).

    I have 2 gateways in 1 gateway group (tier 1, tier 2), with the default gateway set to the failover group in the routing settings.
    In 2.6 this worked perfectly; in 2.7 pfSense can't figure out which gateway is the default (the globe icon next to the gateway on the main dashboard page disappears from both interfaces).

    Setting the default gateway to the tier 1 gateway does not fix the issue.
    Setting the default gateway to the gateway group does not fix the issue.
    Setting the default gateway to automatic does not fix the issue.
    Deleting the gateway group does not fix the issue.

    Disabling the gateway monitoring action for the tier 1 interface seems to resolve the issue. Normally, when the gateway ping check fails, the link is first marked degraded and goes through several statuses before traffic switches to the failover (tier 2) interface. Here that normal behavior does not happen: the link is not marked down, it simply loses the "globe" (default gateway) icon, and the icon does not appear for the failover gateway interface on the dashboard either.

    My primary gateway isn't flapping; it's rock solid. My secondary gateway, however, is flapping: it's a 5G LTE link that can go up and down under monitoring. That should not affect the primary link as long as it is solid, but I think that may be what is happening here.

    Maybe a regression?

    https://redmine.pfsense.org/issues/11692
    https://redmine.pfsense.org/issues/13228
    https://redmine.pfsense.org/issues/14327

    • stephenw10 (Netgate Administrator)
      Dec 20, 2023, 6:36 PM

      What do you have set for State Killing on Gateway Failure in System > Advanced > Miscellaneous?

      The behaviour sounds like it might be set to flush all states.

      • LC @stephenw10
        Dec 21, 2023, 12:57 AM (last edited by LC Dec 21, 2023, 1:15 AM)

        @stephenw10
        System > Advanced > Miscellaneous is set to "kill states for all gateways which are down".

        So now, even with the other gateway interface disabled, the only remaining (default) gateway loses its little "globe" icon from time to time, and connectivity to the gateway goes down. I wish this were just a cosmetic issue, but it is not: when the globe icon is missing, connectivity is affected as well.

        In this first screencap we can see the gateway configuration; notably absent is the globe icon that should normally be displayed for the fiber interface:
        74df7a6e-9f08-4dce-9d75-14fbe336d33a-image.png
        Here we can see the dashboard displaying the relevant gateway interfaces; again the globe icon is missing:
        7b84ca82-3fb7-42bc-ab9f-8482039b02a4-image.png

        Here we can see that I've "deleted" the gateway from the gateways dialog. Although fiber is still selected as the default gateway and is online and up, it's missing the globe icon again:
        d73966a2-6052-4a5b-a618-c231f30b09b7-image.png

        In this final screencap we can see the globe displayed again, and it appears that dnscrypt is once more able to resolve the DNS queries being forwarded to it.
        fc225d59-e277-453d-a483-0862c0c8096b-image.png

        Note that currently the globe icon is still missing from the UI. I'd like to understand what (technically) drives this "globe icon" indicator in the UI, because I think the UI issue may be cosmetic and an underlying series of events may be causing the real problem. Again, I am running dnscrypt-proxy bound to a virtual IP, and I note that when the globe icon disappears from the default gateway, dnscrypt-proxy appears to be unable to resolve DNS queries forwarded to it by unbound, which is also running locally.

        If it were only a cosmetic issue, I probably wouldn't have bothered investigating it in the first place. However, I would also note that despite the gateway group being deleted and only 1 gateway being active, the issue is still present.
        Disabling the other gateway interface from the interfaces page also doesn't restore the globe icon. So I'm stumped now.

        The only thing that does restore the globe icon (at least temporarily, before it disappears again) is changing the default gateway in the Routing > Gateways dialog to something else, e.g. automatic.
        When I did that, I noticed this in /var/log/system.log:

        Dec 20 20:04:17 pfSense php-fpm[6686]: /system_gateways.php: Configuration Change: XXXX@192.168.1.XXX (Local Database): System - Gateways: save default gateway
        Dec 20 20:04:17 pfSense check_reload_status[533]: Syncing firewall
        Dec 20 20:04:18 pfSense php-fpm[22767]: /system_gateways.php: Removing static route for monitor 8.8.4.4 and adding a new route through X.X.X.X
        Dec 20 20:04:19 pfSense php-fpm[22767]: /system_gateways.php: Gateway, NONE AVAILABLE <<<<<<<<<<<<<<<<<<<<<<<<<
        Dec 20 20:04:19 pfSense php-fpm[22767]: /system_gateways.php: Default gateway setting XXXXX Fiber 2 Gigabit as default.
        Dec 20 20:04:19 pfSense php-fpm[22767]: /system_gateways.php: The command '/sbin/route -n6 get 'default' 2>/dev/null | /usr/bin/egrep 'flags: <.*PROTO.*>'' returned exit code '1', the output was ''
        Dec 20 20:04:20 pfSense check_reload_status[533]: Reloading filter
        
        

        Running the command manually:

        # /sbin/route -n6 get 'default'
        route: route has not been found: No error: 0
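
A note on reading that log line, offered as a sketch rather than a diagnosis: the status of a shell pipeline is the status of its last command, and grep/egrep exits with 1 when no line matches. The -6 flag also restricts the lookup to IPv6, so exit code '1' with empty output may only mean that no IPv6 default route with a PROTO flag exists, and may be unrelated to the IPv4 default gateway. The helper name has_proto_flag below is hypothetical, and the sample input is made up for illustration.

```shell
# has_proto_flag: reads `route -n get default`-style output on stdin and
# succeeds only if the flags line contains a PROTO flag (e.g. PROTO1).
has_proto_flag() {
    grep -E 'flags: <.*PROTO.*>' >/dev/null
}

# Case from the log: stderr is discarded, so nothing reaches the filter.
if printf '' | has_proto_flag; then
    echo "match"
else
    echo "no match, status 1"
fi

# Case with a static route and no PROTO flag: also no match.
if printf '      flags: <UP,GATEWAY,DONE,STATIC>\n' | has_proto_flag; then
    echo "match"
else
    echo "no match, status 1"
fi
```

On the firewall itself the same filter could be fed from `/sbin/route -n6 get default 2>/dev/null`.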
        

        How can I troubleshoot this issue further?

        • LC @LC
          Dec 21, 2023, 1:32 AM (last edited by LC Dec 21, 2023, 2:05 AM)

          @stephenw10

          OK, I think I understand what happened. At some point I had 3 gateways in a gateway group.

          One of these gateways was removed from the gateway group and its interface was disabled. However, looking at my routing table, I found 2 "default" entries present at the same time:

          Destination        Gateway            Flags     Netif Expire
          default            XXX.XXX.XXX.XXX        UGS        igb4
          default            "supposedly.disabled.if.ip"         UGS        igb5
          .......
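
A minimal sketch of how a duplicate like this could be spotted from the command line, assuming the four-column layout shown above (Destination, Gateway, Flags, Netif). The function names are hypothetical and the addresses are documentation placeholders, not the real gateways.

```shell
# count_default_routes: reads `netstat -rn` output on stdin and prints how
# many rows have "default" in the Destination column.
count_default_routes() {
    awk '$1 == "default" { n++ } END { print n + 0 }'
}

# list_default_routes: prints "<gateway> <interface>" for each default row.
list_default_routes() {
    awk '$1 == "default" { print $2, $4 }'
}

# Demo on sample text (192.0.2.1 and 198.51.100.1 are placeholders).
count_default_routes <<'EOF'
Destination        Gateway            Flags     Netif Expire
default            192.0.2.1          UGS        igb4
default            198.51.100.1       UGS        igb5
192.0.2.0/24       link#5             U          igb4
EOF
# prints: 2
```

On a live system the input would come from something like `netstat -rn -f inet | count_default_routes`; limiting the address family to inet keeps an IPv6 default route from being counted as a duplicate.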
          

          I then ran /sbin/route -n get 'default' manually, which showed:

             route to: 0.0.0.0
          destination: 0.0.0.0
                 mask: 0.0.0.0
              gateway: >>>>GATEWAY THAT WAS DISABLED IN THE UI IP ADDR HERE<<<<
                  fib: 0
            interface: igb5
                flags: <UP,GATEWAY,DONE,STATIC>
           recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
                 0         0         0         0      1500         0         0
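
The gateway and interface that the kernel actually selects can be pulled out of that output and compared with what the UI claims. The following is a sketch under the assumption that the output has the "gateway:" and "interface:" lines shown above; default_route_summary is a hypothetical helper name and 198.51.100.1 is a placeholder address.

```shell
# default_route_summary: reads `route -n get default` output on stdin and
# prints "<gateway> <interface>"; exits non-zero if either field is missing.
default_route_summary() {
    awk '
        $1 == "gateway:"   { gw = $2 }
        $1 == "interface:" { ifc = $2 }
        END {
            if (gw == "" || ifc == "") exit 1
            print gw, ifc
        }
    '
}

# Demo on sample text modeled on the output above.
default_route_summary <<'EOF'
   route to: 0.0.0.0
destination: 0.0.0.0
       mask: 0.0.0.0
    gateway: 198.51.100.1
        fib: 0
  interface: igb5
      flags: <UP,GATEWAY,DONE,STATIC>
EOF
# prints: 198.51.100.1 igb5
```

If the interface printed here belongs to a gateway that is disabled in the UI, the OS and the configuration have drifted apart in the way described in this post.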
          

          In the UI, I enabled and then disabled the interface for that gateway.
          After doing that, I noticed in the ifconfig -a output that the interface no longer shows an IP address and is not in "UP" status.

          I subsequently issued a route delete default command, which removed both default routes (the correct one and the stale one for the now-down interface), and then added a default route for the correct interface gateway.

          I believe the issue is now resolved, since netstat -rn now shows only 1 'default' entry rather than two, and route -n get default now returns the correct gateway IP address.
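
The manual repair described above can be written down as a dry-run sketch. The function name reset_default_route and the RUN variable are hypothetical, and 192.0.2.1 stands in for the real gateway address. By default the function only prints the commands, since changing the default route on a live firewall interrupts traffic, and routes changed by hand are not part of the saved pfSense configuration.

```shell
# reset_default_route: prints the two commands used above, or executes them
# when RUN=1 is set. $1 is the gateway address for the new default route.
reset_default_route() {
    gw=$1
    for cmd in "route delete default" "route add default $gw"; do
        if [ "${RUN:-0}" = "1" ]; then
            $cmd            # word-splits into the command and its arguments
        else
            echo "$cmd"     # dry run: show what would be executed
        fi
    done
}

# Dry run with a placeholder gateway address.
reset_default_route 192.0.2.1
# prints:
# route delete default
# route add default 192.0.2.1
```

In this case a single delete removed both entries; checking netstat -rn afterwards, as done above, confirms whether only one default route remains.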

          FYI, this issue has been plaguing me for quite some time. It may be worth adding some logic to check for this condition, especially if the issue is non-deterministic and/or non-reproducible. Unfortunately, I cannot provide reproduction steps that lead to the loss of configuration sync between the UI and the OS. What I can say is that the offending 3rd gateway interface was disabled in the UI, yet its interface was still up and had an IP address, and the routing table had two routes set to "default". I'm not sure what could be extrapolated from this as either a bug or an enhancement request to prevent the issue from recurring for others. It appears that the "disabled" state for the interface didn't quite make it down to the OS level, so the gateway's interface was never brought down. The presence of two "default" routes might not be wrong in load-balancing scenarios, but it is definitely bad when the interface is disabled in the UI. Also, enabling/disabling the interface didn't seem to remove the duplicate default route entry corresponding to it.

          I would note that my gateway group is configured as "failover" rather than load balancing, i.e. the interfaces in my gateway groups are assigned to tier 1, tier 2, and tier 3. So perhaps this config sync mix-up between the UI and the OS happened at some point during a failover, and the configuration remained in this state until I intervened manually.

          Anyhow, thank you kindly for your help and for responding to my forum post!

          Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.