25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down
-
I saw 25.07 release was published. So I guess this is a moot point for now, as the next major release won't be before 25.11 at the earliest. I will keep monkeying around I guess.
@dennypage if what you wrote is true, then how can you explain the tcpdumps above, when both WAN1 and WAN2 are "up", and I have the "don't create static routes for monitor IPs" option enabled on WAN2, and I see no packets to 8.8.8.8 leaving ix0—they are 100% going out on ix2, confirmed with tcpdump and the 50+ms latency indicative of the 4G connection, and at the same time my default route being via the WAN1/FIOS... ?
-
@luckman212 said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:
@dennypage if what you wrote is true, then how can you explain the tcpdumps above, when both WAN1 and WAN2 are "up", and I have the "don't create static routes for monitor IPs" option enabled on WAN2, and I see no packets to 8.8.8.8 leaving ix0—they are 100% going out on ix2, confirmed with tcpdump and the 50+ms latency indicative of the 4G connection, and at the same time my default route being via the WAN1/FIOS... ?
“If what you wrote is true”? Do you think I am lying to you? Really?
Yes, it’s true that Unix uses destination based routing. Yes, it’s true that static routes are required for monitoring Multi-Wan. And monitoring works correctly if you set the static route, yes? QED. I don’t know what else to say.
If it’s important to you to understand the reason for the specific results of the test above, it’s your system so you’ll have to figure it out based on the system state at the time of the test. I’d suggest that you start by examining your routing tables:
netstat -rn
-
@luckman212 It works without the option because pf "catches" the traffic before it leaves ix0 - hence my previous comment "pf overrides the OS and sends it over ix2". The reason why pf can't do its job in your case is because the default route goes away; since there's no route for the OS to use for dpinger, you get the sendto error and pf doesn't get the chance to override the path to send it out of ix2.
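To make the ordering described above concrete, here is a minimal toy model, not FreeBSD or pf code. It assumes a plain longest-prefix-match table and treats the pf `route-to` override as something that can only happen after the OS has found some route. The names `RoutingTable` and `send_probe` are hypothetical and exist only for this illustration; the interface names are the ones used in this thread.

```python
import ipaddress


class RoutingTable:
    """Toy destination-based routing table using longest-prefix match."""

    def __init__(self):
        self.routes = {}  # ip_network -> egress interface name

    def add(self, prefix, ifname):
        self.routes[ipaddress.ip_network(prefix)] = ifname

    def delete(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, dst):
        """Return the egress interface for dst, or None if nothing matches."""
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # in this model, the point where sendto() would fail
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


def send_probe(table, dst, pf_route_to=None):
    """Model one monitor probe: the OS routes first, then pf may override."""
    os_choice = table.lookup(dst)
    if os_choice is None:
        return "sendto error"  # pf never gets a packet to re-route
    return pf_route_to or os_choice


table = RoutingTable()
table.add("0.0.0.0/0", "ix0")  # default route via WAN1

# Default route present: the OS picks ix0, pf overrides to ix2.
before = send_probe(table, "8.8.8.8", pf_route_to="ix2")

# WAN1 goes away and takes the default route with it.
table.delete("0.0.0.0/0")
after = send_probe(table, "8.8.8.8", pf_route_to="ix2")

# With a static host route for the monitor IP, the probe still has a path.
table.add("8.8.8.8/32", "ix2")
with_static = send_probe(table, "8.8.8.8", pf_route_to="ix2")

print(before, "|", after, "|", with_static)
```

In this sketch the probe only fails in the middle case, where neither a default route nor a host route for the monitor address exists, which matches the sendto error described above.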
-
Nobody said anything about lying. I should have phrased it as "Let's assume that FreeBSD routing behaves as you've outlined... in that case, how can I be observing XYZ"
I'm sorry this thread is starting to derail. I appreciate all your help. I do not have, and never claimed to have, all the answers. I'm just looking for explanations for the new, unwanted and somewhat unexplainable behavior I am seeing here.
-
@marcosm said:
pf can't do its job in your case is because the default route goes away
So is that still being considered a bug then? I still can't figure out why WAN1 going down (either by way of physically downing the interface by removing the cable, or by dpinger triggering a down event) should cause pfSense to mark the other gateway down and/or remove the default gateway. Feels wrong.
Is the explanation that, WAN1 goes down, and before the system has a chance to set WAN2 as the default gateway, the pings to 8.8.8.8 start failing because "technically" there's no longer or not yet a valid default route to send those packets (pf ignored) - and this causes WAN2 to then go down leaving the box dead as a doornail?
If that's loosely what's going on here, then what about adding a simple option to the routing page, something like "Do not remove a default gateway if there are no other online gateways in the group"?
-
It seems like a bug to me, because the WAN2 gateway would remain marked up for a while, even if dpinger starts to lose pings, and should be set as default.
If there was any default route then dpinger would use it and pf would catch and reroute that via WAN2.
It's an interesting issue. I don't think I've ever seen anyone using it without the static route set. I've seen numerous issues with conflicting routes for DNS and dpinger though.
But I have always resolved them by simply using a different target or making sure they both use the same gateway.
-
Most of that code is script though so it should be patchable.
-
At least there seem to be improvements to be made. I will dig further.
-
and this causes WAN2 to then go down leaving the box dead as a doornail?
Yes.
what about adding a simple option to the routing page something like "Do not remove a default gateway if there are no other online gateways in the group"
When the WAN1 interface is detached the OS removes the (default) route using the gateway within that interface's subnet since the gateway address is no longer reachable. Hence there's not much that pfSense can do/prevent at that point since the default route has already been removed.
Once the route is removed the packet loss percentage starts climbing. However other processes are triggered as part of the interface event which end up restarting dpinger and hence the gateway immediately shows offline. As a test I spent some time patching the various code paths so that the dpinger process would be kept running and allow the packet loss to slowly build up. That didn't help because 1) regardless of dpinger being restarted or kept running it still has the sendto error due to the default gateway being removed by the OS (and hence cannot be forced with route-to by pf), and 2) by the time the new default gateway would be added, the gateway is already marked offline due to packet loss.
I don't know why it worked for you previously. There are at least a couple of new related changes that were implemented to prevent the monitoring traffic from going out the wrong interface; perhaps that's part of it, or maybe your configuration and environment allow the timing to work out. I did find various ways to trigger the issue while I was testing. Ultimately any workarounds I could think of would be prone to race conditions and hence I don't think it's worth pursuing. That leads me to the conclusion that a correct multi-WAN setup that uses gateway failover/recovery requires the static routes.
-
@marcosm said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:
That leads me to the conclusion that a correct multi-WAN setup that uses gateway failover/recovery requires the static routes.
Thanks @marcosm. What do you think about adding a note regarding this to the help text for the “Do not add static route for gateway monitor IP address via the chosen interface” option?
-
@dennypage Yeah. Or just remove them entirely.
-
@marcosm perhaps if the primary default pathway is removed, a secondary default pathway should be added (ideally until the primary default pathway is active again).
-
@marcosm When you say "interface detached" is that what you meant, or are you saying this occurs even with a simple Link Down event? I figured a link down would be treated differently than an actual interface being removed (i.e. that interface is no longer in the device tree, like yanking the Ethernet card out of a PCI slot).
I guess if you guys say this whole situation is an unsolvable problem I have to accept it. Yes I don't know why it's behaving like this either, when it used to work. I am now working on finding suitable monitor IPs for these WAN interfaces that don't cause other undesirable effects. People (or IoT crap) often use 8.8.8.8, 8.8.4.4 etc as hardcoded DNS servers and so I don't want to statically route those out of either WAN. I can run traceroute on the FIOS connection and get some reasonable targets there (I even wrote a script that does this on a cronjob and updates the monitor IP) but I have yet to find a pingable host anywhere along the route on the T-mobile LTE WAN2. I may just give up on monitoring that and just mark it always "up" as it's my failover anyway so even if it's down, the behavior is effectively the same.
Now, on to a new bug I found where the static routes are not being removed after changing the monitor IPs... will start a new thread / redmine about that. Possibly related to #16343
-
@luckman212 said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:
People (or IoT crap) often use 8.8.8.8, 8.8.4.4 etc as hardcoded DNS servers and so I don't want to statically route those out of either WAN.
Yeah, I have a lot of those as well. To address this, and prevent devices from bypassing the host overrides in the DNS resolver, I redirect all external DNS requests on my internal subnets to the firewall using port forwarding:
-
That's a smart trick, but it makes it impossible to use or test any external DNS servers, which is something I need to be able to do for work. It also won't work for DoT/DoH.
-
@luckman212 said in 25.07 RC - no default gateway being set if default route is set to a gateway group and the Tier 1 member interface is down:
That's a smart trick, but it makes it impossible to use or test any external DNS servers, which is something I need to be able to do for work. It also won't work for DoT/DoH.
The rule above is the device network, which is where the majority of the IoT devices are. The LAN rule looks like this:
host_admin is an alias list of admin hosts, such as my workstation, that are permitted to make direct DNS enquiries outside the network as needed.
Yep, can't stop DoT. But not a lot of IoT devices using that yet.
-
@dennypage Ah, indeed - that's a nice way to handle it
-
Mmm, this still seems painful but I think we need to accept it's not going to change at least in the short term. This must be solvable but the number of interacting pieces here makes it non-trivial!
-
Ok, here's a hacky workaround that works for me and that you might try.
Add a 3rd dummy gateway that always remains up to provide a default route. Add that to the failover group at some high tier.
So in my case I added the LAN interface as a gateway on LAN. It's local so always up and doesn't require a static route. It takes a few loops to come back up but does end up with the tier 2 gateway as default.
So:
[25.07-RELEASE][root@m470-3.stevew.lan]/root: netstat -rn4
Routing tables

Internet:
Destination        Gateway            Flags   Netif   Expire
0.0.0.0            172.21.16.1        UGS     igb0
10.0.5.1           link#14            UHS     lo0
10.0.5.128         link#20            UH      pppoe0
127.0.0.1          link#14            UH      lo0
172.21.16.0/24     link#5             U       igb0
172.21.16.1        link#5             UHS     igb0
172.21.16.182      link#14            UHS     lo0
192.168.182.0/24   link#6             U       igb1
192.168.182.1      link#14            UHS     lo0
Before failover:
[25.07-RELEASE][root@m470-3.stevew.lan]/root: pfSsh.php playback gatewaystatus
Name             Monitor        Source         Delay    StdDev   Loss  Status  Substatus
LAN_GW           192.168.182.1  192.168.182.1  0.059ms  0.02ms   0.0%  online  none
PPPOE_WAN_PPPOE  1.1.1.1        10.0.5.1       5.694ms  0.199ms  0.0%  online  none
WAN_DHCP         1.0.0.1        172.21.16.182  6.011ms  0.15ms   0.0%  online  none
Immediately after disconnecting igb0, the DHCP WAN:
[25.07-RELEASE][root@m470-3.stevew.lan]/root: pfSsh.php playback gatewaystatus
Name             Monitor  Source    Delay  StdDev  Loss  Status  Substatus
PPPOE_WAN_PPPOE  1.1.1.1  10.0.5.1  0ms    0ms     100%  down    highloss
After a few restart loops:
[25.07-RELEASE][root@m470-3.stevew.lan]/root: pfSsh.php playback gatewaystatus
Name             Monitor        Source         Delay    StdDev   Loss  Status  Substatus
LAN_GW           192.168.182.1  192.168.182.1  0.056ms  0.016ms  0.0%  online  none
PPPOE_WAN_PPPOE  1.1.1.1        10.0.5.1       7.242ms  0.164ms  0.0%  online  none
Might be able to improve that behaviour....
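For anyone who wants to watch those restart loops from a script rather than by eye, here is a rough sketch that parses the text printed by `pfSsh.php playback gatewaystatus` as shown above. It assumes the output is eight whitespace-separated columns with no spaces inside any value, which holds for the samples in this post but is not a documented format. `GatewayStatus`, `parse_gatewaystatus` and `online_gateways` are hypothetical helper names, not part of pfSense.

```python
from typing import NamedTuple


class GatewayStatus(NamedTuple):
    name: str
    monitor: str
    source: str
    delay: str
    stddev: str
    loss: str
    status: str
    substatus: str


def parse_gatewaystatus(text):
    """Parse gatewaystatus text into rows, skipping the prompt and header."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: data rows have exactly 8 whitespace-separated fields.
        if len(fields) != 8 or fields[0] == "Name":
            continue
        rows.append(GatewayStatus(*fields))
    return rows


def online_gateways(text):
    """Return the names of gateways whose Status column reads 'online'."""
    return [g.name for g in parse_gatewaystatus(text) if g.status == "online"]


# Sample copied from the "after a few restart loops" output above.
SAMPLE_AFTER = """\
[25.07-RELEASE][root@m470-3.stevew.lan]/root: pfSsh.php playback gatewaystatus
Name             Monitor        Source         Delay    StdDev   Loss  Status  Substatus
LAN_GW           192.168.182.1  192.168.182.1  0.056ms  0.016ms  0.0%  online  none
PPPOE_WAN_PPPOE  1.1.1.1        10.0.5.1       7.242ms  0.164ms  0.0%  online  none
"""

# Sample copied from the "immediately after disconnecting igb0" output above.
SAMPLE_DOWN = """\
Name             Monitor  Source    Delay  StdDev  Loss  Status  Substatus
PPPOE_WAN_PPPOE  1.1.1.1  10.0.5.1  0ms    0ms     100%  down    highloss
"""

print(online_gateways(SAMPLE_AFTER))
print(online_gateways(SAMPLE_DOWN))
```

Feeding it the captured output at intervals would show how long the failover group sits with no online WAN gateway before the tier 2 gateway comes back.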
-
I say "detached" because that's what the system log says when I disconnect the interface on the VM - it results in the interface being "UP" with a status of "no carrier".
Let's keep the following in focus: what exactly is the problem that needs to be solved that necessitates avoiding a route? The checkbox in question removes the static route but I don't see much difference in the traffic being routed by the OS or being routed by pf. One way or another the traffic has to go out the intended interface. I'm not convinced that a pf-only routing solution is necessary.