Failover doesn't work.

darkcorner

I have configured a dual WAN with LoadBalance and Failover, but the failover doesn't work.
When I unplug the cable of one line, PING (from Win10 to 8.8.8.8 or google.com) stops working until I reconnect it.

In summary these are the operations done:

WAN with static IP
4 DNS (2 from Google and 2 from Open DNS)
2 Gateways, associated with Google DNS and OpenDNS
Monitor with OpenDNS (first and second DNS). I initially used the Google ones, but yesterday, with those as monitor IPs, the Gateway on the optical fiber was marked offline due to too many lost packets.
Gateway group "LoadBalancer" with both at Tier1
2 Gateway groups for FailOver with the Tiers inverted and Trigger Level set to "Member down"
In LAN-DMZ1-DMZ2 I created 3 rules, on top of all other rules, with gateway load-balancer, failover1 and failover2
Default gateway IPv4 = Automatic
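
As a rough illustration of how tiered gateway groups like the ones listed above are generally expected to behave, here is a minimal Python sketch. It is a simplified model, not pfSense's actual code: the gateway names (WAN1_GW, WAN2_GW) and the helper active_gateways are hypothetical, and the only rule modeled is "use the members of the lowest-numbered tier that still has a gateway up".

```python
from dataclasses import dataclass


@dataclass
class Gateway:
    name: str
    up: bool = True


def active_gateways(group, gateways):
    """Return the gateways of the lowest-numbered tier with at least one member up.

    group maps gateway name -> tier number (hypothetical representation).
    """
    for tier in sorted(set(group.values())):
        members = [name for name, t in group.items()
                   if t == tier and gateways[name].up]
        if members:
            return sorted(members)
    return []  # every member of the group is down


# Hypothetical names standing in for the two WAN gateways.
gateways = {"WAN1_GW": Gateway("WAN1_GW"), "WAN2_GW": Gateway("WAN2_GW")}

load_balancer = {"WAN1_GW": 1, "WAN2_GW": 1}  # both at Tier 1
failover1 = {"WAN1_GW": 1, "WAN2_GW": 2}      # WAN1 preferred
failover2 = {"WAN1_GW": 2, "WAN2_GW": 1}      # tiers inverted

print(active_gateways(failover1, gateways))   # ['WAN1_GW']

gateways["WAN1_GW"].up = False                # simulate unplugging WAN1
print(active_gateways(failover1, gateways))   # ['WAN2_GW']
print(active_gateways(load_balancer, gateways))  # ['WAN2_GW']
```

In this model every group falls back to WAN2 once WAN1 is marked down, which is the behavior the setup above is aiming for.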

I did not use weights between the two lines so as not to further complicate things (also because I do not yet know the performance of the two lines when everything is up and running).

Yet as I said, when I unplug one of the two "WAN" cables, the PING on the PC stops working.
I rechecked everything against both the guide and, of course, the official documentation.

The only thing is that in System/Advanced/Miscellaneous/Load Balancing
Load Balancing / Use sticky connections = OFF
But I have not found a hint anywhere as to whether it is mandatory to activate it. In any case, it would be more about Load Balancing than Failover.

darkcorner

I solved it by setting both Gateways to Automatic, although I have not found any indication to do so in the documentation.

bp81

@darkcorner said in Failover doesn't work.:

The only thing is that in System/Advanced/Miscellaneous/Load Balancing
Load Balancing / Use sticky connections = OFF
But I have nowhere found a hint as to whether it is mandatory to activate it. In any case, it would be more about Load Balancing than Failover.

Another setting to look at, also under System -> Advanced -> Miscellaneous, is "State Killing on Gateway Failure". The description of that option reads:

"The monitoring process will flush all states when a gateway goes down if this box is checked."

My observation of WAN failover in pfSense with default settings, in which State Killing on Gateway Failure is NOT checked, is the following scenario:

Client1 is streaming from YouTube and is connected to Google's search engine
Primary WAN gateway goes down
After the Primary WAN goes down and the Secondary WAN picks up, Client1 establishes a connection to Spotify to stream music
Client1's connections to YouTube and Google at first continue to try to use the Primary WAN, and thus don't work. This situation persists until Client1's web browser decides the connections are lost and attempts to establish new ones, for which the router will use the Secondary WAN.
Client1's connection to Spotify works without issues because it established over the Secondary WAN initially.
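
The scenario above can be sketched as a toy state table in Python. This is only a model of the behavior described, under the assumption that each connection stays bound to the gateway it was first established on. The State class, the helper names and the PRIMARY/SECONDARY labels are hypothetical and are not pfSense internals.

```python
from dataclasses import dataclass


@dataclass
class State:
    client: str
    destination: str
    gateway: str  # gateway in use when the connection was first established


def open_connection(states, client, destination, active_gateway):
    """Record a new connection bound to whichever gateway is active right now."""
    state = State(client, destination, active_gateway)
    states.append(state)
    return state


def destinations_via(states, gateways):
    """Destinations whose existing state is bound to one of the given gateways."""
    return [s.destination for s in states if s.gateway in gateways]


states = []
open_connection(states, "Client1", "YouTube", "PRIMARY")
open_connection(states, "Client1", "Google", "PRIMARY")

# Primary WAN goes down; only the secondary is usable from here on.
up = {"SECONDARY"}
down = {"PRIMARY"}

# A connection opened after the failure is bound to the secondary.
open_connection(states, "Client1", "Spotify", "SECONDARY")

working = destinations_via(states, up)
stalled = destinations_via(states, down)
print(working)  # ['Spotify']
print(stalled)  # ['YouTube', 'Google']
```

The two stalled entries stay stalled in this model until the client gives up and reconnects, or until something removes the old states.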

This type of behavior is especially a problem if your primary WAN's failure mode isn't to go hard down, but the router simply considers it down when in reality it's "up" but performing extremely poorly due to excessive latency or packet loss. In this scenario, Client1's web browser, with active connections to Google and YouTube over the Primary WAN, won't try to establish new connections, because the connection via the Primary WAN is "up"; it just works extremely poorly.

This can be addressed by enabling State Killing on Gateway Failure, but use it with caution. When the Primary WAN fails, the monitoring process clears ALL states in the router. This is equivalent to going to Diagnostics -> States -> Reset States, checking the checkbox, and clicking the Reset button. It breaks ALL existing connections and forces all client systems to re-establish them.

This can be good if, in a failover scenario, you want to immediately re-establish all connections over the Secondary WAN. However, if your primary WAN has intermittent connection issues and flips between up and down frequently, it can be extremely disruptive, because this feature clears ALL states when a gateway goes down, NOT just the states associated with the gateway that went down. With an intermittently failing primary WAN, this could cause the router to 'thrash' between Primary and Secondary WAN. As a result, I think this feature is more useful for effecting a quicker failover to Secondary when the Primary WAN is hard down, rather than up with very bad latency / high packet loss. It can be used in a scenario where bad latency / packet loss causes the primary to be marked down, but you have to be very cautious about how you tune the gateway monitoring to prevent thrashing.

Personally I'd like to see a feature where we could only kill states associated with a down gateway if possible, but not knowing the code in the networking stack, I don't know if that's actually possible or not.
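
To make the difference concrete, here is a small Python sketch comparing "flush everything" with a selective kill on a flapping primary WAN. It is a toy model under stated assumptions, not pfSense behavior: the simulate function and its selective flag are hypothetical, connections are assumed to be long-lived, and a killed connection is assumed to reconnect immediately over the secondary.

```python
def simulate(events, connections, selective=False):
    """Replay primary-WAN monitor events and count how many connections get reset.

    events: list of booleans, True = primary up, False = primary down.
    connections: number of long-lived client connections, all starting on the primary.
    selective: hypothetical mode that kills only states bound to the failed gateway.
    """
    primary_up = True
    on_primary = connections
    on_secondary = 0
    resets = 0
    for up in events:
        if primary_up and not up:  # primary just transitioned to down
            killed = on_primary if selective else on_primary + on_secondary
            resets += killed
            # Everything that was on the primary ends up on the secondary.
            on_secondary += on_primary
            on_primary = 0
        primary_up = up
    return resets


# Primary flaps three times: down, up, down, up, down, up.
events = [False, True, False, True, False, True]

print(simulate(events, connections=10))                  # 30
print(simulate(events, connections=10, selective=True))  # 10
```

In this model, flushing all states resets the same ten connections on every flap, even though they already moved to the secondary after the first failure, while the selective variant only disturbs them once.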

In any case, bottom line: for your scenario, you are describing a hard down type event. Killing states on gateway failure should work well in your scenario and effect a faster and smoother transition to Secondary WAN.