Multi-WAN + failover: gw not switching back
PREFACE: I know there is a fair amount of discussion on this topic, and the go-to resolution is clearing the states that were created on the backup gateway. However, I have tried this and it still does not work as expected.
My understanding is that after a GW is restored, traffic configured to traverse it will not resume if there are any states for that traffic on the Backup-GW. Seems straightforward, so I used the test described below to see if this actually works. My result: it didn't. So either (a) my understanding is wrong; (b) this test is bad; or (c) the feature doesn't work as expected.
I will also add that there is one thing I have configured that I have not seen in the other threads: NAT - I have outbound NAT configured. This may not matter, but I was thinking maybe there is some kind of binding that occurs at the GW level when NAT is in play. If so, then simply clearing states may not be the ticket for getting traffic flowing back through the Restored-GW.
In any event, can the community please review the following test procedure and let me know if this accurately tests the issue? Thanks in advance for your time and feedback.
1. Begin traffic via the GWUT (gateway under test); verify that a state was created on the GWUT for this IP-pair.
2. Create a failure on the GWUT.
3. Verify all states on the GWUT are cleared.
4. Restart traffic for the same IP-pair in #1 above; observe that a new state for this IP-pair is created on the Backup-GW.
5. Restore the GWUT; verify that the state verified in #4 is still on the Backup-GW (i.e. it did not switch after the GWUT was restored).
6. Manually clear the state created in #4.
7. Restart traffic for the same IP-pair in #1 above; observe that a new state for this IP-pair is now created on the GWUT.
If this is a good test, then #7 never happens. When I restart traffic, all states continue to be created on the Backup-GW despite the GWUT being online.
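As an aside for anyone reproducing this: the state checks and the manual clear in the procedure above can also be done from a shell on the firewall with pfctl instead of the GUI. The client IP below is just a placeholder for whichever IP-pair you are testing.

```shell
# List the state table and filter for the test client
# (192.168.1.50 is a placeholder for your test IP-pair)
pfctl -ss | grep 192.168.1.50

# Kill all states originating from the test client; same effect as
# deleting them under Diagnostics > States in the GUI
pfctl -k 192.168.1.50
```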
One thing I neglected to add - the only way I am able to restore traffic through the Restored-GW is to manually disable & re-enable the interface to which the Restored-GW connects. Nothing else works.
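For completeness, a rough shell analogue of that manual disable/re-enable is bouncing the NIC directly (not identical - the GUI path also re-runs interface configuration; igb1 is a placeholder for whatever interface the Restored-GW sits on):

```shell
# Bounce the interface behind the restored gateway (placeholder: igb1)
ifconfig igb1 down
sleep 2
ifconfig igb1 up
```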
@cyberzeus I was playing around with this setup the other day, in a test environment.
What seemed to do the trick for me was to change "Trigger Level" for the Gateway group to "Packet loss or High Latency". It was set as "Member down" initially. I would have guessed that either setting should work, but for some reason it didn't...
Thanks @gblenn --- Mine is actually set to just Packet Loss but I agree with you that whatever the setting is, the resolution behavior should be the same.
Because I am able to restore the traffic through the Restored-GW by doing a manual interface reset, I suspect that something in the code treats a GW restoration event differently if the fault involved the interface staying up (as with a Packet Loss or Latency issue) vs. the interface actually going offline.
@cyberzeus Well, I did kill the connection completely, so the interface is down. And of course the setting is flush states, which I believe is the default anyway.
@gblenn: Sorry but I'm not understanding...
"I did kill the connection completely"
- Do you mean that with an interface actually down, traffic did not resume when it came back up?
- Also, did the traffic resume on the Restored-GW without further intervention?
"the setting is flush states...default anyway"
- Well, I'm not seeing an issue here - the issue I describe is after the GW is restored.
- In any event, the nomenclature for this setting is a bit off. It says to check the box to have the states flushed when the GW goes down, but the flushing seems to occur even when it isn't checked.
@cyberzeus I did some more testing and I think I have to take back what I said earlier...
It actually works quite well no matter what config I try out...
However, what seems to provide the best experience is the following:
- Gateway group Trigger Level set to 'Member Down'. This turned out to be slightly quicker to switch over in my case, vs 'Packet Loss or High Latency'. Perhaps tuning the thresholds could still make Loss/Latency a better trigger.
- System > Advanced > Networking - Reset All States changed to NOT ticked
- System > Advanced > Miscellaneous - Gateway Monitoring set to 'Kill states for all gateways that are down' instead of 'Flush all states on Gateway failure'.
So I actually ended up making several changes compared to what I had before...
My test setup (22.05) is in my home system where I have one VLAN used to connect WAN2 (from both systems) to my LTE Router. This was maintained throughout the test.
I have another VLAN set up to provide the (Tier1) WAN IP for the test system (double NAT). I can switch this VLAN "on/off" using profiles in the switch, basically mimicking pulling the fiber connection.
So the test system is set up as a replica of my main system, although I run 23.01 there currently.
I have one single PC connected to the test system, where I run a YouTube session, which lets me easily see when the changes happen on the traffic graph.
YouTube is of course buffering, so it's not a constant stream, but I do get 5-8 Mbit spikes at roughly 10-15 second intervals.
When I "pull the fiber" I of course immediately stop seeing traffic on WAN. I also see Packet Loss going up, and eventually the gateway status shifts to Offline and the default gateway symbol moves to WAN2.
Not long after, and before YouTube runs out of buffer, I can see the spikes starting on WAN2 again. I then try whatismyip, which correctly reports my LTE IP (WAN2) address.
When I "reconnect" the fiber I see WAN eventually being set to online again and the default shifts back. Almost immediately after that I see the traffic spikes back on WAN, and whatismyip reports my fiber IP.
Seems like if I was too quick to test whatsmyip, it didn't work and I had to reload it once or twice. And at one point it reported that it wasn't able to resolve my IP. But it doesn't take long for things to be "back to normal".
Clearly, a real-time session like a Teams meeting would be interrupted when the fiber is disconnected. But I guess it would quickly reconnect as soon as the shift has taken place. Not sure how it would look when the fiber is up again, since in that case both WANs are online before it shifts over.
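Side note: instead of reloading whatismyip in the browser, the egress IP can be polled from the test PC, which makes the exact switchover moment easier to pin down. This assumes curl is available and uses api.ipify.org as one example lookup service:

```shell
# Print a timestamp and the current public egress IP every 2 seconds;
# the IP flips between the fiber and LTE addresses at failover/restore
while true; do
  printf '%s ' "$(date +%T)"
  curl -s --max-time 2 https://api.ipify.org || printf 'lookup failed'
  echo
  sleep 2
done
```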
Hi @gblenn - interesting results. This is a slightly different test scenario than the issue I describe so it would be interesting to know what happens if you did the following:
- Change the "Trigger Level" to Packet Loss.
- Rather than a "pull the fiber" event, create a packet loss issue; for example, have the WAN side start filtering ICMP packets, but do it on the device connected to your pfSense box, not on the pfSense box itself.
- Try the above with both settings of the "State Killing on Gateway Failure" feature: checked vs. not checked.
The above would mimic my scenario and while I can reproduce the failure every single time, it would be interesting to learn if you also see the issue. If you don't, then of course either we missed something in test scenario duplication OR there is a setting on your end that prevents the issue from occurring.
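For the ICMP-filtering step, a pf-style rule fragment on the upstream box would look something like this (interface name and subnet are placeholders; in the GUI this is just a block rule on the relevant VLAN matching ICMP):

```
# Drop ICMP from the test VLAN so gateway monitoring sees 100% loss
# while the link itself stays up (vlan30 / 10.30.0.0/24 are placeholders)
block drop quick on vlan30 inet proto icmp from 10.30.0.0/24 to any
```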
@cyberzeus Ok so I did this testing with the same setup as before and then changed the following:
1. A rule on the main pfsense to block all ICMP on the TestVLAN (killing the existing states is required for it to "kick in").
2. Trigger Level set to Packet Loss.
3. State Killing on Gateway failure:
   a. Kill states for all gateways which are down
   b. Flush all states on gateway failure
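Regarding the note that existing states have to be killed before the block rule takes effect: pf evaluates rules only when a new state is created, so established ICMP states keep passing traffic. From a shell on the main pfsense, killing the states toward the monitor target does it (8.8.8.8 is a placeholder for the monitor IP):

```shell
# Kill all states destined to the gateway monitor IP (placeholder) so
# the newly enabled block rule applies to the very next ping
pfctl -k 0.0.0.0/0 -k 8.8.8.8
```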
Regardless of 3a or 3b, I see the exact same behaviour as before. When I invoke the rule on the main pfsense, "Loss" starts to rise, and soon after it passes 20+, it switches over to WAN2.
Spikes now start to show up on the WAN2 graph and whatsmyip shows my correct LTE IP.
When I toggle the rule off, "Loss" goes down again, and seconds after WAN indicates online, traffic shifts back and whatsmyip shows my fiber IP.
The only difference when using "Flush all states" (which affects LAN-side states as well) is that the pfsense GUI appears to freeze for ~15 seconds before that session re-engages. Using "Kill states" isn't noticeable at all from a LAN-to-LAN perspective. This was of course true in my previous testing as well...