Netgate Discussion Forum

    25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down

    Plus 25.07 Development Snapshots (Retired)
    66 Posts 7 Posters 1.5k Views 7 Watching
      stephenw10 Netgate Administrator

      It seems like a bug to me, because the WAN2 gateway would remain marked up for a while, even if dpinger starts to lose pings, and should be set as default.

      If there was any default route then dpinger would use it and pf would catch and reroute that via WAN2.

      It's an interesting issue. I don't think I've ever seen anyone using it without the static route set. I've seen numerous issues with conflicting routes for DNS and dpinger though 😉 But I have always resolved them by simply using a different target or making sure they both use the same gateway.

        stephenw10 Netgate Administrator

        Most of that code is script though so it should be patchable.

          marcosm Netgate

          At least there seem to be improvements to be made. I will dig further.

            marcosm Netgate

            @luckman212

            > and this causes WAN2 to then go down leaving the box dead as a doornail?

            Yes.

            > what about adding a simple option to the routing page something like "Do not remove a default gateway if there are no other online gateways in the group"

            When the WAN1 interface is detached the OS removes the (default) route using the gateway within that interface's subnet since the gateway address is no longer reachable. Hence there's not much that pfSense can do/prevent at that point since the default route has already been removed.

            Once the route is removed, the packet loss percentage starts climbing. However, other processes triggered as part of the interface event end up restarting dpinger, and hence the gateway immediately shows offline. As a test, I spent some time patching the various code paths so that the dpinger process would be kept running, allowing the packet loss to build up slowly. That didn't help because 1) regardless of whether dpinger is restarted or kept running, it still gets the sendto error due to the default gateway being removed by the OS (and hence its traffic cannot be forced with route-to by pf), and 2) by the time the new default gateway would be added, the gateway is already marked offline due to packet loss.

            I don't know why it worked for you previously. There are at least a couple of new related changes that were implemented to prevent the monitoring traffic from going out the wrong interface; perhaps that's part of it, or maybe your configuration and environment allow the timing to work out. I did find various ways to trigger the issue while I was testing. Ultimately, any workarounds I could think of would be prone to race conditions, and hence I don't think it's worth pursuing. That leads me to the conclusion that a correct multi-WAN setup that uses gateway failover/recovery requires the static routes.
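
To confirm whether the box has actually lost its default route at a given moment, the output of `netstat -rn4` can be checked for a default entry. The following is a minimal Python sketch, assuming the column layout shown elsewhere in this thread; the function name `find_default_gateway` and the sample rows are illustrative only and are not part of pfSense.

```python
def find_default_gateway(netstat_output):
    """Return (gateway, interface) of the IPv4 default route, or None if absent."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # A route row has at least: Destination, Gateway, Flags, Netif.
        # The default route may be printed as "default" or as "0.0.0.0".
        if len(fields) >= 4 and fields[0] in ("default", "0.0.0.0"):
            return fields[1], fields[3]
    return None


# Sample rows in the layout printed by `netstat -rn4` (illustrative data).
SAMPLE = """\
Destination        Gateway            Flags         Netif Expire
0.0.0.0            172.21.16.1        UGS            igb0
172.21.16.0/24     link#5             U              igb0
"""

print(find_default_gateway(SAMPLE))
```

If the function returns `None` right after the Tier 1 interface goes down, the system is in the state described above, where dpinger has no route to send its probes through.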

              dennypage @marcosm

              @marcosm said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:

              That leads me to the conclusion that a correct multi-WAN setup that uses gateway failover/recovery requires the static routes.

              Thanks @marcosm. What do you think about adding a note regarding this to the help text for the “Do not add static route for gateway monitor IP address via the chosen interface” option?

                Bob.Dig LAYER 8 @dennypage

                @dennypage Yeah. Or just remove them entirely.

                  Patch @marcosm

                  @marcosm Perhaps if the primary default pathway is removed, a secondary default pathway should be added (ideally until the primary default pathway is active again).

                    luckman212 LAYER 8 @marcosm

                    @marcosm When you say "interface detached", is that what you meant, or are you saying this occurs even with a simple Link Down event? I figured a link down would be treated differently than an actual interface being removed (i.e. the interface is no longer in the device tree, like yanking the Ethernet card out of a PCI slot).

                    I guess if you guys say this whole situation is an unsolvable problem I have to accept it. Yes I don't know why it's behaving like this either, when it used to work. I am now working on finding suitable monitor IPs for these WAN interfaces that don't cause other undesirable effects. People (or IoT crap) often use 8.8.8.8, 8.8.4.4 etc as hardcoded DNS servers and so I don't want to statically route those out of either WAN. I can run traceroute on the FIOS connection and get some reasonable targets there (I even wrote a script that does this on a cronjob and updates the monitor IP) but I have yet to find a pingable host anywhere along the route on the T-mobile LTE WAN2. I may just give up on monitoring that and just mark it always "up" as it's my failover anyway so even if it's down, the behavior is effectively the same.

                    Now, on to a new bug I found where the static routes are not being removed after changing the monitor IPs... will start a new thread / redmine about that. Possibly related to #16343
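
The traceroute-based script mentioned above is not shown in the thread. As a rough illustration of the hop-filtering step only, here is a minimal Python sketch that keeps publicly routed hops and skips addresses clients commonly hard-code as DNS servers. The names `NON_PUBLIC`, `AVOID`, and `monitor_candidates` are hypothetical, the hop list is made-up sample data using documentation addresses, and running traceroute or updating the gateway's monitor IP is deliberately left out.

```python
import ipaddress

# Ranges that make poor monitor targets because they are not routed on the
# public Internet: RFC 1918, CGNAT (RFC 6598), loopback, and link-local.
NON_PUBLIC = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "100.64.0.0/10", "127.0.0.0/8", "169.254.0.0/16",
)]

# Public DNS addresses mentioned in this thread; a static route to any of
# these would pin client DNS traffic to one WAN.
AVOID = {"8.8.8.8", "8.8.4.4", "1.1.1.1", "1.0.0.1"}


def monitor_candidates(hops, avoid=AVOID):
    """Return hops that are publicly routed and not in the avoid set, in path order."""
    result = []
    for hop in hops:
        addr = ipaddress.ip_address(hop)
        if any(addr in net for net in NON_PUBLIC):
            continue  # private, CGNAT, loopback, or link-local hop
        if hop in avoid:
            continue  # well-known DNS server that clients may hard-code
        result.append(hop)
    return result


# Made-up traceroute hops (198.51.100.0/24 is a documentation range).
hops = ["192.168.1.1", "100.64.0.1", "198.51.100.1", "198.51.100.9", "8.8.8.8"]
print(monitor_candidates(hops))
```

Whether any surviving hop actually answers ICMP echo still has to be tested separately, which is the part that fails on the LTE link described above.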

                      dennypage @luckman212

                      @luckman212 said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:

                      People (or IoT crap) often use 8.8.8.8, 8.8.4.4 etc as hardcoded DNS servers and so I don't want to statically route those out of either WAN.

                      Yeah, I have a lot of those as well. To address this, and prevent devices from bypassing the host overrides in the DNS resolver, I redirect all external DNS requests on my internal subnets to the firewall using port forwarding:

                      Screenshot 2025-08-06 at 18.28.17.png
                      Screenshot 2025-08-06 at 18.32.26.png

                        luckman212 LAYER 8 @dennypage

                        That's a smart trick, but it makes it impossible to use or test any external DNS servers, which is something I need to be able to do for work. It also won't work for DoT/DoH.

                          dennypage @luckman212

                          @luckman212 said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:

                          That's a smart trick, but it makes it impossible to use or test any external DNS servers, which is something I need to be able to do for work. It also won't work for DoT/DoH.

                          The rule above is for the device network, which is where the majority of the IoT devices are. The LAN rule looks like this:

                          Screenshot 2025-08-07 at 05.18.48.png

                          host_admin is an alias list of admin hosts, such as my workstation, that are permitted to make direct DNS enquiries outside the network as needed.

                          Yep, can't stop DoT. But not a lot of IoT devices using that yet. 😊
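
The screenshots are not reproduced here, so the exact rules are unknown. As a sketch of the decision logic described in the text (redirect plain DNS to the firewall unless the source is in an admin alias), the following Python model may help. It is not pf syntax and not the rule set from the screenshots; `dns_rule_action` and all addresses are hypothetical.

```python
def dns_rule_action(src, dst, dport, firewall_ip, admin_hosts):
    """Model the outcome for a packet from an internal client: 'pass' or 'redirect'."""
    if dport != 53:
        return "pass"      # only plain DNS on port 53 is matched; DoT/DoH are untouched
    if dst == firewall_ip:
        return "pass"      # already addressed to the firewall's own resolver
    if src in admin_hosts:
        return "pass"      # admin hosts may query external DNS servers directly
    return "redirect"      # everything else is forwarded to the firewall's resolver


FIREWALL = "192.168.182.1"          # assumed LAN address of the firewall
HOST_ADMIN = {"192.168.182.10"}     # stand-in for the host_admin alias

print(dns_rule_action("192.168.182.50", "8.8.8.8", 53, FIREWALL, HOST_ADMIN))
print(dns_rule_action("192.168.182.10", "8.8.8.8", 53, FIREWALL, HOST_ADMIN))
print(dns_rule_action("192.168.182.50", "8.8.8.8", 853, FIREWALL, HOST_ADMIN))
```

The third call shows why this approach does not catch DoT: traffic on port 853 never matches the redirect.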

                            luckman212 LAYER 8 @dennypage

                            @dennypage Ah, indeed - that's a nice way to handle it

                              stephenw10 Netgate Administrator

                              Mmm, this still seems painful but I think we need to accept it's not going to change at least in the short term. This must be solvable but the number of interacting pieces here makes it non-trivial!

                                stephenw10 Netgate Administrator

                                OK, here's a hacky workaround that works for me, which you might try.

                                Add a 3rd dummy gateway that always remains up to provide a default route. Add that to the failover group as some high tier.

                                So in my case I added the LAN interface as a gateway on LAN. It's local so always up and doesn't require a static route. It takes a few loops to come back up but does end up with the tier 2 gateway as default.

                                So:

                                [25.07-RELEASE][root@m470-3.stevew.lan]/root: netstat -rn4
                                Routing tables
                                
                                Internet:
                                Destination        Gateway            Flags         Netif Expire
                                0.0.0.0            172.21.16.1        UGS            igb0
                                10.0.5.1           link#14            UHS             lo0
                                10.0.5.128         link#20            UH           pppoe0
                                127.0.0.1          link#14            UH              lo0
                                172.21.16.0/24     link#5             U              igb0
                                172.21.16.1        link#5             UHS            igb0
                                172.21.16.182      link#14            UHS             lo0
                                192.168.182.0/24   link#6             U              igb1
                                192.168.182.1      link#14            UHS             lo0
                                

                                Before failover:

                                [25.07-RELEASE][root@m470-3.stevew.lan]/root: pfSsh.php playback gatewaystatus
                                Name             Monitor        Source             Delay   StdDev  Loss  Status  Substatus
                                LAN_GW           192.168.182.1  192.168.182.1    0.059ms   0.02ms  0.0%  online       none
                                PPPOE_WAN_PPPOE  1.1.1.1        10.0.5.1         5.694ms  0.199ms  0.0%  online       none
                                WAN_DHCP         1.0.0.1        172.21.16.182    6.011ms   0.15ms  0.0%  online       none
                                

                                Immediately after disconnecting igb0, the DHCP WAN:

                                [25.07-RELEASE][root@m470-3.stevew.lan]/root: pfSsh.php playback gatewaystatus
                                Name             Monitor  Source      Delay  StdDev  Loss  Status  Substatus
                                PPPOE_WAN_PPPOE  1.1.1.1  10.0.5.1      0ms     0ms  100%    down   highloss
                                

                                After a few restart loops:

                                [25.07-RELEASE][root@m470-3.stevew.lan]/root: pfSsh.php playback gatewaystatus
                                Name             Monitor        Source             Delay   StdDev  Loss  Status  Substatus
                                LAN_GW           192.168.182.1  192.168.182.1    0.056ms  0.016ms  0.0%  online       none
                                PPPOE_WAN_PPPOE  1.1.1.1        10.0.5.1         7.242ms  0.164ms  0.0%  online       none
                                

                                Might be able to improve that behaviour....
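
The three status snapshots above can be read against a simplified model of failover-group selection: the default goes to the lowest-tier gateway currently reported online. This Python sketch is only that model, not pfSense's code. The tier numbers are assumptions (the post says only that the dummy gateway sits at "some high tier"), and `pick_default` is a hypothetical name.

```python
def pick_default(group, status):
    """Return the lowest-tier gateway whose status is 'online', or None."""
    online = [(tier, name) for name, tier in group.items()
              if status.get(name) == "online"]
    return min(online)[1] if online else None


# Assumed tiers: DHCP WAN first, PPPoE second, dummy LAN gateway last.
GROUP = {"WAN_DHCP": 1, "PPPOE_WAN_PPPOE": 2, "LAN_GW": 3}

# Statuses taken from the three gatewaystatus outputs above.
before = {"LAN_GW": "online", "PPPOE_WAN_PPPOE": "online", "WAN_DHCP": "online"}
just_after = {"PPPOE_WAN_PPPOE": "down"}
recovered = {"LAN_GW": "online", "PPPOE_WAN_PPPOE": "online"}

for label, status in (("before", before), ("just after", just_after),
                      ("recovered", recovered)):
    print(label, "->", pick_default(GROUP, status))
```

The model does not capture why the dummy gateway helps; per the posts above, its role is to keep some default route present so that monitoring can recover after the restart loops.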

                                  marcosm Netgate

                                  I say "detached" because that's what the system log says when I disconnect the interface on the VM - it results in the interface being "UP" with a status of "no carrier".

                                  Let's keep in focus the following: what exactly is the problem that needs to be solved that necessitates avoiding a route? The checkbox in question removes the static route but I don't see much difference in the traffic being routed by the OS or being routed by pf. One way or another the traffic has to go out the intended interface. I'm not convinced that a pf-only routing solution is necessary.

                                    luckman212 LAYER 8 @marcosm

                                    @stephenw10 Interesting workaround you posted above, I will try it!

                                    @marcosm To answer the question, "what exactly is the problem that needs to be solved that necessitates avoiding a route?", my answer would be:

                                    Adding a static route to a monitor IP can (not will) cause 2 main problems:

                                    Problem 1

                                    Users who try to access a service (DNS, HTTP etc) hosted on that IP will always use that one specific gateway. This gateway might be:

                                    1. slow
                                    2. expensive
                                    3. both 1 & 2
                                    4. administratively down or limited (e.g. 4G with data cap)
                                    5. blocked at the far side by firewall rules

                                    People using the network, who are likely unaware of such a configuration, will not understand why certain things are slow or broken, and simply complain. These users might be business users or, worse, family (wife, kids etc).

                                    Problem 2

                                    As a network administrator, having such a static route in place makes troubleshooting certain things difficult. For example, using 8.8.8.8 as a monitor IP means that you can't perform DNS lookups to Google DNS without adding more layers of complexity to your setup such as static LAN IPs and firewall rules to redirect DNS queries (as mentioned in the clever solution by Denny above).

                                    One simple example:

                                    • WAN2 is a backup connection (LTE, metered) with monitor IP 8.8.8.8
                                    • A user joins the network and, being a savvy user, has their DNS server hard-coded to 8.8.8.8
                                    • Savvy user makes a lot of DNS requests
                                    • All of that traffic egresses WAN2
                                    • Company receives a $100 mobile data bill for exceeding their data cap for the month

                                    This might seem to be an extreme example, but it has happened to me.
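
The mechanism behind the example above is longest-prefix matching: a host route for the monitor IP is more specific than the default route, so it wins for any traffic that follows the routing table. Here is a minimal Python sketch of that lookup, assuming a two-entry table; the `lookup` function and the WAN1/WAN2 labels are illustrative and do not come from pfSense.

```python
import ipaddress


def lookup(routes, dst):
    """Return the gateway label of the most specific route containing dst, or None."""
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, gw) for net, gw in routes if addr in net]
    return max(matches)[1] if matches else None


# Default route via WAN1, plus the static host route added for WAN2's monitor IP.
routes = [
    (ipaddress.ip_network("0.0.0.0/0"), "WAN1"),
    (ipaddress.ip_network("8.8.8.8/32"), "WAN2"),
]

print(lookup(routes, "8.8.8.8"))   # the /32 wins over the default route
print(lookup(routes, "8.8.4.4"))   # only the default route matches
```

In this model every packet to 8.8.8.8 leaves via WAN2 regardless of who sent it, which is how the hard-coded DNS client ends up on the metered link.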

                                      stephenw10 Netgate Administrator

                                      Mmm, a good solution here would be some anycast ping targets that aren't DNS servers. But using DNS servers there is really convenient! 😉

                                        dennypage

                                        Effectively, @luckman212’s request is for a static route that only applies to ICMP echo requests originating from the firewall itself.

                                          marcosm Netgate @dennypage

                                          @dennypage FWIW that doesn't happen currently even with pf. The route-to rule is based on the interface's source address with any destination that's not in the interface's subnet. Still, a rule can be created that applies to the correct traffic.

                                          Given the feedback, it sounds like the issue isn't that a route should not exist, but rather some route is needed to allow pf to force the traffic. That's effectively the workaround @stephenw10 showed. Any potential undesired behavior from that kind of solution needs to be considered.
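
The route-to match described above (source equals the interface address, destination anywhere outside the interface's subnet) can be written as a simple predicate. The Python sketch below models only that description and is not the generated pf rule. `route_to_matches` is a hypothetical name, and the /24 subnet is taken from the routing table posted earlier in the thread.

```python
import ipaddress


def route_to_matches(src, dst, iface_addr, iface_net):
    """True if a packet would match: source is the interface address and
    the destination lies outside the interface's own subnet."""
    outside = ipaddress.ip_address(dst) not in ipaddress.ip_network(iface_net)
    return src == iface_addr and outside


WAN_ADDR = "172.21.16.182"     # WAN_DHCP source address from the status output
WAN_NET = "172.21.16.0/24"     # its subnet, per the routing table

print(route_to_matches(WAN_ADDR, "1.0.0.1", WAN_ADDR, WAN_NET))         # monitor probe
print(route_to_matches(WAN_ADDR, "172.21.16.1", WAN_ADDR, WAN_NET))     # on-link gateway
print(route_to_matches("192.168.182.1", "1.0.0.1", WAN_ADDR, WAN_NET))  # other source
```

Note that the predicate has no protocol test, which matches the point made above: the rule covers any traffic from that source address, not just ICMP echo.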

                                            dennypage @stephenw10

                                            @stephenw10 said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:

                                            Mmm, a good solution here would be some anycast ping targets that aren't DNS servers. But using DNS servers there is really convenient! 😉

                                            Convenient, yes, but from time to time Google and others get annoyed with everyone using their DNS servers as monitor targets and put temporary blocks in place. I generally recommend that people use regional routers at their ISP instead.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.