Netgate Discussion Forum

    25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down

    Plus 25.07 Development Snapshots
    36 Posts 6 Posters 986 Views 6 Watching
    • luckman212 @dennypage

      Denny, thank you for your help. I think we are now talking about several different things at once.

      I have "normal" outbound NAT rules on both WAN1 + WAN2. So by the time a packet has arrived on Verizon's or T-Mobile's network, its source address has already been rewritten to the public WAN side of either router in the diagram, right? So Verizon, T-Mobile, Cloudflare, etc. don't know about or care about 192.168.191.0/24. It's up to the routers: my 6100 as well as the Teltonika RUTX11 (which of course also does its own NAT), to keep track of the states (I'm sure I don't need to tell you any of this).

      Yes, in my diagram, I am aware that I am "double-NATting" on the WAN2 side. I know the limitations of that, but prefer it to trying to use IPPT (pass-thru) mode, which is not stable in my testing: T-Mobile rotates IPs very frequently on their LTE network, and when pfSense runs its various rc.newwanip* scripts it can be mildly disruptive.

      All that being said, I again want to point out that all of this routing/NATting was and is working fine, as long as I don't unplug my WAN1 cable. That's the strange part, and it's new behavior that wasn't happening before I installed 25.07.

      Side note: I find Cloudflare anycast DNS IPs (1.1.1.1, 1.0.0.1) to be highly unreliable for ICMP, they frequently drop packets and experience wide latency fluctuations. I don't recommend them as dpinger monitor IPs. YMMV.

      Here are a few more screenshots:

      Some pings (with source address binding) and routes

      Note the hugely different latencies: the first ping is clearly traversing the LTE network, while the latter two go out over FiOS.

      [25.07-RC][root@r1.lan]/root: ping -S 192.168.191.2 8.8.4.4
      PING 8.8.4.4 (8.8.4.4) from 192.168.191.2: 56 data bytes
      64 bytes from 8.8.4.4: icmp_seq=0 ttl=112 time=58.607 ms
      64 bytes from 8.8.4.4: icmp_seq=1 ttl=112 time=57.743 ms
      64 bytes from 8.8.4.4: icmp_seq=2 ttl=112 time=61.948 ms
      64 bytes from 8.8.4.4: icmp_seq=3 ttl=112 time=57.283 ms
      ^C
      --- 8.8.4.4 ping statistics ---
      4 packets transmitted, 4 packets received, 0.0% packet loss
      round-trip min/avg/max/stddev = 57.283/58.895/61.948/1.825 ms
      
      [25.07-RC][root@r1.lan]/root: ping -S 192.168.20.1 8.8.4.4
      PING 8.8.4.4 (8.8.4.4) from 192.168.20.1: 56 data bytes
      64 bytes from 8.8.4.4: icmp_seq=0 ttl=120 time=3.940 ms
      64 bytes from 8.8.4.4: icmp_seq=1 ttl=120 time=3.257 ms
      64 bytes from 8.8.4.4: icmp_seq=2 ttl=120 time=3.770 ms
      64 bytes from 8.8.4.4: icmp_seq=3 ttl=120 time=3.185 ms
      ^C
      --- 8.8.4.4 ping statistics ---
      4 packets transmitted, 4 packets received, 0.0% packet loss
      round-trip min/avg/max/stddev = 3.185/3.538/3.940/0.324 ms
      
      [25.07-RC][root@r1.lan]/root: ping -S 74.101.221.156 8.8.4.4
      PING 8.8.4.4 (8.8.4.4) from 74.101.221.156: 56 data bytes
      64 bytes from 8.8.4.4: icmp_seq=0 ttl=120 time=3.074 ms
      64 bytes from 8.8.4.4: icmp_seq=1 ttl=120 time=2.985 ms
      64 bytes from 8.8.4.4: icmp_seq=2 ttl=120 time=2.823 ms
      64 bytes from 8.8.4.4: icmp_seq=3 ttl=120 time=3.022 ms
      ^C
      --- 8.8.4.4 ping statistics ---
      4 packets transmitted, 4 packets received, 0.0% packet loss
      round-trip min/avg/max/stddev = 2.823/2.976/3.074/0.094 ms
      
      [25.07-RC][root@r1.lan]/root: route -n get 192.168.191.1
         route to: 192.168.191.1
      destination: 192.168.191.0
             mask: 255.255.255.0
              fib: 0
        interface: ix2
            flags: <UP,DONE,PINNED>
       recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
             0         0         0         0      1500         1         0
      
      [25.07-RC][root@r1.lan]/root: route -n get 8.8.4.4
         route to: 8.8.4.4
      destination: 0.0.0.0
             mask: 0.0.0.0
          gateway: 74.101.221.1
              fib: 0
        interface: ix0
            flags: <UP,GATEWAY,DONE,STATIC>
       recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
             0         0         0         0      1500         1         0
      
      Gateways + Interfaces

      screen 2.png

      System > Routing

      screen 4.png

      Routing > Gateway Groups

      screen 5.png

      Diags > Routes showing route to 192.168.191.0/24 via ix2

      screen 6.png

      Outbound NAT rules: one per WAN interface, matching traffic to any destination not in the rfc1918 alias

      screen.png

      My "rfc1918" alias definition

      6f2455e1-c395-4dc4-a750-89b67b3c5d94-image.png

      (yes I know 25.07 now has its own native _private4_ for the same...)
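
      For context, the alias and per-WAN outbound NAT rules shown here correspond roughly to pf rules like the following sketch. The interface names (ix0 for WAN1, ix2 for WAN2) come from the command output above; the 192.168.20.0/24 LAN network is an assumption based on the earlier ping source, and this is illustrative only, not the actual generated ruleset:

      ```
      # Illustrative pf.conf equivalent of the "rfc1918" alias and outbound NAT
      table <rfc1918> const { 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 }

      # NAT LAN traffic leaving WAN1 (ix0) unless the destination is private
      nat on ix0 inet from 192.168.20.0/24 to ! <rfc1918> -> (ix0)

      # Same on WAN2 (ix2); the upstream Teltonika then NATs again (double NAT)
      nat on ix2 inet from 192.168.20.0/24 to ! <rfc1918> -> (ix2)
      ```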

      • luckman212 @stephenw10

        "To be clear, does it work as expected if you allow it to create the static route?"

        Sorry @stephenw10 I missed this question before, yes I just tested it—removed the dpinger_dont_add_static_route option from WAN2, and failover works normally again.

        There should not need to be a static route to 8.8.8.8 bound to WAN2; in fact, requiring one would be very problematic (all DNS queries to that address would be routed over my slow LTE connection...). Also, to say it again, this used to work, so it feels like a regression. And as I wrote above, a system left with literally no default route makes no "sense".

        • dennypage @luckman212

          @luckman212 said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:

          Sorry @stephenw10 I missed this question before, yes I just tested it—removed the dpinger_dont_add_static_route option from WAN2, and failover works normally again.

          This is as expected. I don't see Multi-WAN monitoring working correctly without static routes for the monitor addresses. By the way, make sure you enable static routes for both gateways.
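
          For what it's worth, the automatic behavior being discussed boils down to pfSense installing a host route that pins each monitor IP to its own gateway, conceptually something like the following. The addresses are taken from the outputs earlier in this thread; which monitor belongs to which WAN is my inference, and these are illustrative commands, not what pfSense literally runs:

          ```
          # Host routes pinning each dpinger monitor IP to its own WAN gateway
          # (roughly what pfSense adds unless dpinger_dont_add_static_route is set)
          route add -host 8.8.4.4 74.101.221.1     # WAN1 monitor via the FiOS next hop
          route add -host 8.8.8.8 192.168.191.1    # WAN2 monitor via the Teltonika
          ```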

          I cannot explain why it appeared to work previously. Perhaps some interaction with Floating States? @stephenw10 might have thoughts on this.

          The only other possibility that occurs to me is that there might have been a configuration interaction if you were using the same destination as a DNS server with a gateway set in DNS Server Settings (see the doc on DNS Resolver and Multi-WAN). I can't speak directly to this because I've never used that style of DNS configuration. It's probably just a red herring, but worth a check.

          The issue of DNS queries being routed via the wrong WAN interface can easily be addressed by not using the same address for DNS that you use for monitoring. Instead of Google or Cloudflare for DNS, I recommend DNS over TLS with Quad9: better security, and you don't expose your queries to your ISPs.
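
          As a sketch of what that suggestion looks like in practice, here is a standard Unbound forward-zone configuration for DNS over TLS to Quad9 (the pfSense DNS Resolver's SSL/TLS forwarding mode generates something similar; the cert bundle path is FreeBSD's default and is an assumption here):

          ```
          # unbound.conf fragment: forward all queries to Quad9 over TLS (port 853)
          server:
              tls-cert-bundle: "/etc/ssl/cert.pem"

          forward-zone:
              name: "."
              forward-tls-upstream: yes
              forward-addr: 9.9.9.9@853#dns.quad9.net
              forward-addr: 149.112.112.112@853#dns.quad9.net
          ```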

          Regardless, I'm glad you have it working.

          • luckman212 @dennypage

            I wouldn't consider this current state "working". I really want to know why things break so badly when I don't have a static route to both monitor IPs. As my screenshots and command output above show, routing works as expected without those routes.

            The bug is that when WAN1 loses its gateway, the gateway code for some reason removes BOTH gateways, leaving the firewall without any default route and, apparently, without a route to the next hop on WAN2 either. To me that seems like a regression, not something I would consider ready for wide release.

            I'd like to help debug this by whatever means necessary. As I mentioned in the top post, I will share my configs, invite Netgate to take a direct look via remote access, etc.

            • dennypage @luckman212

              @luckman212 I consider the static routes to be required for correct Multi-WAN monitoring. Unless something doesn't work correctly with the static routes in place, I don't see an issue worth pursuing.

              However, I don't speak for Netgate; perhaps they have a different opinion and will be willing to explore it further.

              • Bob.Dig @luckman212

                @luckman212 said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:

                the gateway code for some reason ends up removing BOTH gateways

                Maybe it thinks it has no working gateways anymore because pings failed for all of them at the same time, since everything got routed through the now-down gateway. In that case it looks like it is working as expected at this point. Maybe that checkbox should be removed. 😉

                • luckman212 @Bob.Dig

                  @Bob.Dig I don't think that's what's happening. If you scroll up a few posts to the section called "Some pings (with source address binding) and routes", you can see that the pings are traversing each separate gateway (you can tell from the vastly different latencies).

                  I just ran a few tcpdumps to confirm as well: without the static routes, the packets are definitely egressing via the correct separate gateways:

                  [25.07-RC][root@r1.lan]/root: tcpdump -ni ix0 dst host 8.8.8.8
                  tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
                  listening on ix0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
                  ^C
                  0 packets captured        <<–– ✅ no packets to the monitor IP seen on the WAN1 interface
                  857 packets received by filter
                  0 packets dropped by kernel
                  
                  [25.07-RC][root@r1.lan]/root: tcpdump -ni ix2 dst host 8.8.8.8
                  tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
                  listening on ix2, link-type EN10MB (Ethernet), snapshot length 262144 bytes
                  06:22:32.463054 IP 192.168.191.2 > 8.8.8.8: ICMP echo request, id 22849, seq 36, length 9
                  06:22:37.497085 IP 192.168.191.2 > 8.8.8.8: ICMP echo request, id 22849, seq 37, length 9
                  06:22:42.500047 IP 192.168.191.2 > 8.8.8.8: ICMP echo request, id 22849, seq 38, length 9
                  ^C
                  3 packets captured        <<–– ✅ packets are being sent via WAN2
                  166 packets received by filter
                  0 packets dropped by kernel
                  
                  • luckman212

                    @stephenw10 @marcosm Since you both seem unable to replicate this (?), would you be able to send me the 25.07-RELEASE image to test with? I see on Redmine (e.g. here, here, and here) that there's a build you're testing tagged -RELEASE (built on 2025-07-22). Maybe there are some small differences in that build that are affecting my results? I've lost a good portion of my weekend on this and am growing more desperate.

                    • marcosm (Netgate)

                      I am able to reproduce the issue by checking the option to not add the automatic route and failing over from a static WAN to a DHCP WAN. Arguably this is not a valid setup when you want to monitor multiple WANs, so the fact that it doesn't work is not in itself necessarily a bug. Note that even if the service is bound to an address or interface, as mentioned, the OS still decides where that traffic will be routed. That's why you see the state with origif for ix0: pf overrides the OS and sends it over ix2.
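
                      One way to observe this directly (an illustrative command, assuming shell access to the firewall; the exact fields shown depend on the pf version) is to list the states verbosely and look at the interface recorded for the monitor traffic:

                      ```
                      # Show verbose state entries for the WAN2 monitor address; with a
                      # route-to policy in play, the state can be created on one interface
                      # (origif) while the traffic actually egresses another
                      pfctl -ss -vv | grep -B1 -A2 8.8.8.8
                      ```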

                      The fact that the system is left without a default gateway does warrant further digging. From looking at the code, I see that it's left without a default gateway because at that moment both gateways have been marked as down. I will need to dig further to understand why they're marked down and whether that's an accurate status at that point.

                      • luckman212 @marcosm

                        @marcosm Thanks very much for looking. I wouldn't really mind leaving the static route IF there were any pingable hosts along the nearby path that I could derive from a traceroute on that WAN2 (T-Mobile 4G) connection. I don't want to use 8.8.8.8, 8.8.4.4, 1.1.1.1, 9.9.9.9, etc., because then ALL traffic to that host will flow over the backup (slow, expensive) connection.

                        I enabled the hidden system/route-debug option and am still trying to track down the chain of events that leads pfSense to mark WAN2 down and remove the default route. Any help would be much appreciated; let me know if I can provide anything more.

                        • Patch @luckman212

                          @luckman212 said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:

                          I don't want to use 8.8.8.8, 8.8.4.4, 1.1.1.1, 9.9.9.9 etc because then ALL traffic to that host will flow over the backup (slow, expensive) connection.

                          Is that really the case?
                          Surely both the main and backup internet connections can reach all internet sites, and the route a given packet takes does not just depend on which route has reached that site in the past.

                          • dennypage @luckman212

                            @luckman212 said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:

                            I don't want to use 8.8.8.8, 8.8.4.4, 1.1.1.1, 9.9.9.9 etc because then ALL traffic to that host will flow over the backup (slow, expensive) connection.

                            If you want to monitor the backup connection, something has to flow over that connection; there's no way around that. If you need a public DNS server as a target, just pick an address that you are not using as an active DNS server. There are plenty to choose from, even among the common DNS hosts (8.8.8.8, 8.8.4.4, 1.1.1.1, 1.0.0.1, and your ISP's DNS servers); you don't need all of them as DNS servers.

                            However, if you absolutely don't want anything going over the backup connection, another option is to disable gateway monitoring on the backup connection altogether. Given your setup, I expect you have already disabled the gateway monitoring action on the backup connection, so monitoring it is really only for human consumption.
