Failback from Primary WAN after failover to Secondary WAN
-
I have 2 WAN connections configured with failover using gateway groups. Failing over works very well, but connections opened after the failover do not fail back automatically. My secondary WAN connection is metered, so not failing back is a problem: it creates meaningful expense for no reason.
I noticed that when failover occurs it is actually fairly smooth, with connections restarting themselves very quickly after a failure is detected. I wanted the failback to happen in a similar fashion, vs. taking 1-2 minutes for connections to come back live.
I did not use default gateway switching because I need to make sure only certain users fail over. The rest of the users do not require a high availability connection.
After searching the forums and watching some hangouts, I came to understand that failback is not implemented as of 2.4.3-RELEASE-p1.
I got this working and am sharing the procedure for others.
- Modify the /etc/rc.kill_states script so that it no longer checks to see if a gateway is down.
1a. Copy /etc/rc.kill_states to /etc/my_kill_states, and then edit my_kill_states.
1b. Comment out the if statement in my_kill_states:
#if (isset($config['system']['gw_down_kill_states'])) {
and its closing brace line as well
#}
1c. Now /etc/my_kill_states will gracefully reset the state table when you pass it the backup WAN's interface and IP address. If the primary WAN is up and running, connections will automatically re-establish over the primary WAN. The entire my_kill_states file is duplicated at the end of this post.
- I created a script check_backup_wan that can be run by cron every few minutes. This script checks whether there is live traffic on the backup WAN, and if so, checks whether the primary WAN is functioning. If the primary WAN is functioning while there is traffic on the backup WAN, it uses my_kill_states to kill connections on the backup WAN gracefully. These connections then re-establish over the primary WAN.
2a. My implementation of the check_backup_wan script is below.
2b. Final step is to run the check_backup_wan script automatically via cron. I think a 2 minute interval is best as it gives time to close out connections.
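For step 2b, a cron entry along these lines should do it. This is only a sketch in FreeBSD's system crontab format (the one used by /etc/crontab, which has a user field), and it assumes the script has been made executable with chmod +x /root/check_backup_wan. On pfSense the Cron package is the usual way to add an entry; I am assuming that direct edits to /etc/crontab may not persist.

```
# minute  hour  mday  month  wday  who   command
*/2       *     *     *      *     root  /root/check_backup_wan
```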
===group
/root/check_backup_wan:
#!/bin/sh
# check_backup_wan script
# mvneta0 is the 2nd WAN interface. mvneta2 is the primary WAN
# 8.8.4.4 is set as the monitor IP on the primary WAN interface
# The idea is to get the IP addresses of the primary and secondary WAN interfaces.
# If the primary WAN IP address is not available, assume the primary WAN is still down.
# If the primary WAN has an IP address, check if there are any live TCP connections on the backup WAN.
# If live TCP connections are found on the backup WAN, check that the primary WAN is responding to
# pings on the monitor IP address. If the primary WAN is responding to pings, then kill the states
# on the backup WAN, and they will automatically reconnect over the primary WAN.

check_wan_time=`date "+%Y-%m-%d %H:%M:%S"`
check_wan=8.8.4.4
wan_ipaddress=`ifconfig mvneta2 | grep 'inet ' | awk '{ print $2}' | cut -d'/' -f1`
wan2_ipaddress=`ifconfig mvneta0 | grep 'inet ' | awk '{ print $2}' | cut -d'/' -f1`
echo 'primary, backup WAN IP address ' ${wan_ipaddress} '(primary) ' ${wan2_ipaddress} '(backup)'

# check for valid primary WAN IP address.
if [ -z "${wan_ipaddress}" ]; then
    echo ${check_wan_time} '... primary WAN is still down (no WAN IP)' | tee -a /var/log/check_backup_wan.log
    exit 0
fi

# check for active connections on backup_wan
pfctl -i mvneta0 -ss | grep 'tcp'
wan2_liveconn=`pfctl -i mvneta0 -ss | grep 'tcp'`
if [ -n "${wan2_liveconn}" ]; then
    # found a tcp connection on the backup wan interface
    ping -c 2 -t 2 -S ${wan_ipaddress} ${check_wan} > /dev/null 2>&1
    wan1_resp=$?
    wan_resp=`expr ${wan1_resp}`
    echo 'primary WAN ping check (0 means passed)' ${wan1_resp}
    if [ ${wan_resp} -eq 0 ]; then
        echo ${check_wan_time} '... killing states and resetting connections on backup WAN' | tee -a /var/log/check_backup_wan.log
        /etc/my_kill_states mvneta0 ${wan2_ipaddress}
    else
        echo ${check_wan_time} '... primary WAN is still down (pings failing)' | tee -a /var/log/check_backup_wan.log
    fi
else
    echo ${check_wan_time} '... no active tcp connections found on backup WAN' | tee -a /var/log/check_backup_wan.log
fi
/etc/my_kill_states:
#!/usr/local/bin/php-cgi -f
<?php
/*
 * my_kill_states
 * derived from:
 * part of pfSense (https://www.pfsense.org)
 * Copyright (c) 2004-2018 Rubicon Communications, LLC (Netgate)
 * All rights reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* parse the configuration and include all functions used below */
require_once("globals.inc");
require_once("config.inc");
require_once("interfaces.inc");
require_once("util.inc");

// Do not process while booting
if (platform_booting()) {
    return;
}

/* Interface address to cleanup states */
$interface = str_replace("\n", "", $argv[1]);

/* IP address to cleanup states */
$local_ip = str_replace("\n", "", $argv[2]);

if (empty($interface) || !does_interface_exist($interface)) {
    log_error("rc.kill_states: Invalid interface '{$interface}'");
    return;
}

if (!empty($local_ip)) {
    list($local_ip, $subnet_bits) = explode("/", $local_ip);

    if (empty($subnet_bits)) {
        $subnet_bits = "32";
    }

    if (!is_ipaddr($local_ip)) {
        log_error("rc.kill_states: Invalid IP address '{$local_ip}'");
        return;
    }
}

# for my_kill_states just assume the gateway is down and rebuild
#if (isset($config['system']['gw_down_kill_states'])) {
    if (!empty($local_ip)) {
        log_error("rc.kill_states: Removing states for IP {$local_ip}/{$subnet_bits}");

        $nat_states = exec_command("/sbin/pfctl -i {$interface} -ss | " .
            "/usr/bin/egrep '\-> +{$local_ip}:[0-9]+ +\->'");

        $cleared_states = array();
        foreach (explode("\n", $nat_states) as $nat_state) {
            if (preg_match_all('/([\d\.]+):[\d]+[\s->]+/i', $nat_state, $matches, PREG_SET_ORDER) != 3) {
                continue;
            }

            $src = $matches[0][1];
            $dst = $matches[2][1];

            if (empty($src) || empty($dst) || in_array("{$src},{$dst}", $cleared_states)) {
                continue;
            }

            $cleared_states[] = "{$src},{$dst}";
            pfSense_kill_states($src, $dst);
        }

        pfSense_kill_states("0.0.0.0/0", "{$local_ip}/{$subnet_bits}");
        pfSense_kill_states("{$local_ip}/{$subnet_bits}");
        pfSense_kill_srcstates("{$local_ip}/{$subnet_bits}");
    }
    log_error("rc.kill_states: Removing states for interface {$interface}");
    mwexec("/sbin/pfctl -i {$interface} -Fs", true);
#}
===
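Steps 1a and 1b can also be done from the shell instead of by hand. The following is a minimal sketch, not a tested procedure: make_my_kill_states is a hypothetical helper name, and it assumes the closing brace of the gw_down_kill_states check is the last line in the file that contains only "}". That holds for the file shown above, but check the result after running it.

```sh
#!/bin/sh
# Sketch only: make_my_kill_states is a hypothetical helper, not part of pfSense.
# It writes a copy of SRC to DST with two lines commented out:
#   - the "if (isset(...gw_down_kill_states..." line
#   - the last line holding only "}" (assumed to close that if block)
make_my_kill_states() {
    src="$1"
    dst="$2"

    # Line number of the last line that consists solely of a closing brace.
    last_brace=$(grep -n '^}[[:space:]]*$' "$src" | tail -n 1 | cut -d: -f1)
    if [ -z "$last_brace" ]; then
        echo "no closing brace found in $src" >&2
        return 1
    fi

    # Prefix both lines with '#', leaving every other line untouched.
    sed -e '/^if (isset(.*gw_down_kill_states/s/^/#/' \
        -e "${last_brace}s/^/#/" "$src" > "$dst"
}

# Usage on the firewall (paths taken from the steps above):
#   make_my_kill_states /etc/rc.kill_states /etc/my_kill_states
#   chmod +x /etc/my_kill_states
```

Writing to a new file rather than editing in place keeps /etc/rc.kill_states untouched, which matches step 1a.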
-
Thank you. I just got my failover WAN working. As it is also a metered connection I had noticed a lot of data use after failback and I was just looking into this.
Was easy enough to deploy on my system. Just had to change the interface names to match mine.
I triggered it manually to test and it seems to work fine. Will continue monitoring it.
I was surprised that this was not already supported.
thanks
david
-
Nice scripts. Is this still necessary in newer versions?
Thanks
-
Still required as of 2.4.5. Resetting states on primary WAN recovery is a needed feature in the core. The lack of it is particularly a problem for IPSec connections.
See:
- https://redmine.pfsense.org/issues/855
- https://redmine.pfsense.org/issues/6370
-
Thanks for posting this script.
Two hopefully simple questions:
- Is there anything that needs to be done to rotate the log "/var/log/check_backup_wan.log", or will pfSense take care of that for me?
- Will these scripts survive pfSense upgrades, or do they need to be restored?
Thanks so much,
Craig
-
Nice script, works well if you use the second WAN as failover only.
How would this work with policy-based routing? I use my low-latency WAN for gaming and my other WAN for downloads only.
When the gaming WAN is down, there is a group (WAN1GW) with failover to WAN2, which works fine. When the gaming WAN comes back up, traffic stays on WAN2.
Is there a way to force it back to my gaming WAN?
-
So, this is still needed in 2.5.0. I am surprised. So few people use failover?
The /etc/rc.kill_states script was very slightly modified in 2.5.0 (added utf8_encode to IP variables). I recreated the /etc/my_kill_states and left the /root/check_backup_wan untouched (they weren't wiped out after the upgrade) and everything seems to work as in 2.4.5.
I hope the next build will have this or something better built-in.
-
Yes it is:
https://redmine.pfsense.org/issues/855
-
Can anyone confirm if this is still needed in 2.5.2? It seems so.
-
Failback works as expected for me on 21.05 on an SG-3100.
I'm not 100% sure when it started working, but I believe it was at some point in the past 12 months.
I assume this isn't a planned divergence between Plus and CE?
-
@njacobs Failback worked for me in one test on 2.5.2. But this thread is not about whether failback is working or not. It was working when this thread was created, but it didn't kill states properly. And it seems it still doesn't.
-
@pfpv My understanding of the issue was that any connections which failover - or are established whilst in failover - don’t failback. This appears to work for me. Have I misunderstood the issue?
-
@njacobs If your secondary WAN is truly for backup only, like an LTE connection (expensive/limited bandwidth), then you want your IPSec tunnels to revert to the primary WAN when it is restored. However, with the current behavior as of 2.5.2, established connections remain on the secondary WAN. This is a problem in most scenarios with LTE backup, as it will chew through the data limit and/or incur significant charges when it wasn’t necessary to do so. This thread and the referenced ticket are requesting the capability to automatically kill open connection states when reverting to the primary WAN, to achieve the desired behavior.
-
@mjh_ca Yes. This is my setup and my experience, however I haven’t tried specifically with an IPSec tunnel. Direct traffic fails back to the primary WAN as soon as it is available again.
-
I just hit the same states-issue in 2.5.2.
- Primary WAN (Tier 1) goes offline --> failover to secondary WAN2 (Tier 2, mobile plan).
- WAN connection comes back online --> failover returns to primary WAN as expected.
WAN2 is still online, ready as a backup connection, which seems to not trigger clearing of WAN2 active states. WAN2 states continue to consume data from data plan as described above which is not desired. A "Clear states when returning to higher Tier" would be great for solutions implemented with LTE and limited data plans.
-
@jimmyb said in Failback from Primary WAN after failover to Secondary WAN:
I just hit the same states-issue in 2.5.2.
- Primary WAN (Tier 1) goes offline --> failover to secondary WAN2 (Tier 2, mobile plan).
- WAN connection comes back online --> failover returns to primary WAN as expected.
WAN2 is still online, ready as a backup connection, which seems to not trigger clearing of WAN2 active states. WAN2 states continue to consume data from data plan as described above which is not desired. A "Clear states when returning to higher Tier" would be great for solutions implemented with LTE and limited data plans.
On 21.05 I unknowingly consumed LTE allowance on old connections even though WAN fios was only down for a moment.
-
@njacobs said in Failback from Primary WAN after failover to Secondary WAN:
@mjh_ca Yes. This is my setup and my experience, however I haven’t tried specifically with an IPSec tunnel. Direct traffic fails back to the primary WAN as soon as it is available again.
I stand corrected. On further investigation it appears I was actually seeing traffic from new connections.
-
I know this thread is quite old, but I wonder if anyone who suffer[s/ed] from this issue tried the state killing option in:
System -> Advanced -> Networking:
??
-
@manicmoose That option is responsible for killing states when a particular WAN interface's IP address changes (via DHCP, for example). The option is to kill ALL states when this happens, instead of just those on the old WAN IP. This has absolutely nothing to do with WAN failover, since in that case, the interface IP addresses don't change, just which one is being used for routing.
Another option which seems helpful is the "Flush all states when a gateway goes down" option on the System->Advanced->Miscellaneous tab. However, this is what enables the failover, but not the failback (i.e. this won't do anything when a down gateway comes up).
This issue remains, and I've been correcting it for a few years now using this script (more or less).
-
The scripts in the OP stopped working for me in 22.05. I found that
pfctl -i mvneta0 -ss
stopped outputting anything. I tried
pfctl -i mvneta0 -s states
and still nothing. I wonder what's up as it is a standard command in FreeBSD.
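One thing that could be tried, purely as an untested assumption on my part, is to list all states with pfctl -ss and filter them by the backup WAN's IP address instead of relying on the -i filter. A minimal sketch follows. tcp_states_for_ip is a hypothetical helper name, it handles IPv4 only, and it assumes the usual pfctl -ss line layout (interface, protocol, then address:port fields, with NAT addresses in parentheses).

```sh
#!/bin/sh
# Sketch only: tcp_states_for_ip is a hypothetical helper, not a pfSense command.
# Reads `pfctl -ss` style lines on stdin and prints the TCP states in which
# one of the address:port fields has exactly the given IPv4 address.
tcp_states_for_ip() {
    awk -v ip="$1" '
        $2 == "tcp" {
            for (i = 3; i <= NF; i++) {
                f = $i
                gsub(/[()]/, "", f)          # NAT addresses appear in parentheses
                n = split(f, parts, ":")     # address:port -> parts[1], parts[2]
                if (n == 2 && parts[1] == ip) { print; next }
            }
        }'
}

# Possible use inside check_backup_wan, replacing the "pfctl -i mvneta0 -ss" lines:
#   wan2_liveconn=`pfctl -ss | tcp_states_for_ip "${wan2_ipaddress}"`
```

Comparing the address field exactly, rather than grepping for a substring, avoids matching a longer address that happens to end in the same digits.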