Some connections survive killing all states on Tier 1 gateway recovery
-
Hi.
I have a pfSense 2.5.0 x86 machine with 2 WAN connections.
I set up a gateway group WAN_GWs with failover:
WAN1: Tier 1 (~1 Gbit PPPoE connection over a GPON modem)
WAN2: Tier 2 (100 Mbit Ethernet connection).
NAT rules:
LAN to WAN1
LAN to WAN2
Firewall rule:
PASS LAN to any, gateway: WAN_GWs group
Both "Flush all states when a gateway goes down" and "Reset all states if WAN IP Address changes" checkboxes in the System -> Advanced menu are checked.
In my LAN I have a machine that has a Wireguard connection.
When both WAN links are up, everything works fine.
If I disconnect the fiber link on the WAN1 modem (the Ethernet interface on pfSense stays up), the corresponding gateway changes status to "Offline" and all the traffic from LAN, including the aforementioned Wireguard connection, now goes over WAN2, as it should.
Then I restore the WAN1 link and the WAN1 gateway status becomes "Online" again. All new connections are routed over WAN1, but the existing Wireguard connection is not: it remains on the slow WAN2 gateway (I can see this both from the ping values and from the states dump in the Web GUI).
In the System -> General log I see this:
Mar 31 22:25:00 php-fpm 338 /rc.newwanip: Gateway, switch to: WAN1
Mar 31 22:25:00 php-fpm 338 /rc.newwanip: Default gateway setting Interface WAN1 Gateway as default.
Mar 31 22:25:00 php-fpm 338 /rc.newwanip: IP Address has changed, killing all states (ip_change_kill_states is set).
Mar 31 22:25:00 check_reload_status 376 Reloading filter
and I can confirm that states are flushed indeed - for example, my SSH connection to pfSense breaks at this moment, yet the Wireguard connection somehow survives and its traffic is still going over WAN2.
If, after WAN1 recovery, I kill the states manually by running "pfctl -F states", the Wireguard connection finally switches to WAN1.
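To check from the shell (rather than the Web GUI) which interface a connection's state is bound to, the state table can be dumped with "pfctl -ss" and filtered. Below is a minimal sketch, assuming the Wireguard peer uses UDP port 51820 (the Wireguard default, adjust it to your setup) and IPv4 addresses; states_for_port is just a hypothetical helper name, not something pfSense ships.

```shell
#!/bin/sh
# Hypothetical helper: keep only UDP state lines that mention the given port.
# It reads `pfctl -ss` output on stdin, so it never touches pf itself.
# The first field of each state line is the interface, the second the protocol.
states_for_port() {
    port="$1"
    # Match ":<port>" followed by a space, a closing parenthesis (NAT form)
    # or end of line, so that e.g. 51820 does not also match 518200.
    grep -E "^[^ ]+ udp .*:${port}( |\)|$)"
}

# Usage on the firewall (51820 is an assumed port):
#   pfctl -ss | states_for_port 51820
```

Running it before and after the manual flush should show whether the state moved from the WAN2 interface to the WAN1 one.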
I took a look at the /etc/rc.newwanip, and it actually calls the same "pfctl -F states" inside (line 236):
/* If the IP address changed, kill old states after rules and routing have been updated */
if ($curwanip != $oldip) {
    if (isset($config['system']['ip_change_kill_states'])) {
        log_error("IP Address has changed, killing all states (ip_change_kill_states is set).");
        pfSense_kill_states(utf8_encode($oldip));
        filter_flush_state_table(); // <-- KILLING ALL STATES HERE
    } else {
        log_error("IP Address has changed, killing states on former IP Address $oldip.");
        pfSense_kill_states(utf8_encode($oldip));
    }
}
/etc/inc/filter.inc, line 1344:
function filter_flush_state_table() {
    return mwexec("/sbin/pfctl -F state");
}
So, basically it calls the very same command I execute manually (except I use "-F states" as the man pfctl(8) says, but since there is no ambiguity it should work either way), yet it doesn't produce the same effect.
Maybe there is a race condition somewhere, or am I just missing something here?
-
OK, I implemented a workaround for this problem.
I wrote this little script:
<?php
require_once("interfaces.inc");

$cached_def_gw_if_file = '/tmp/cached_def_gw_if';

$current_gw_ip = route_get_default('inet');
$current_gw_if = get_gateway_interface($current_gw_ip);
$old_gw_if = file_get_contents($cached_def_gw_if_file);

if ($old_gw_if === false) {
    file_put_contents($cached_def_gw_if_file, $current_gw_if);
    exit;
}

// Just in case the file was edited manually for test purposes and contains some whitespace
$old_gw_if = trim($old_gw_if);

if ($current_gw_if != $old_gw_if) {
    log_error("Default gateway interface changed from $old_gw_if to $current_gw_if, killing old states...");
    mwexec("/sbin/pfctl -F states");
    file_put_contents($cached_def_gw_if_file, $current_gw_if);
}

function get_gateway_interface($gateway_ip) {
    $interfaces = get_interfaces_with_gateway();
    foreach ($interfaces as $interface) {
        $interface_gw = get_interface_gateway($interface);
        if ($gateway_ip == $interface_gw) {
            $real_interface = get_real_interface($interface);
            return $real_interface;
        }
    }
}
?>
and created a cron task which executes this script every minute.
I am not sure how reliable this "solution" will be, but about a dozen gateway switchovers have shown that everything works as expected: the Wireguard connection is routed over the Tier 1 gateway after it recovers.
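For anyone who would rather not depend on pfSense's PHP includes, the same change-detection idea can probably be sketched in plain sh. This is only a sketch under assumptions: it reads the interface from the "interface:" line of FreeBSD's "route -n get default" instead of mapping the gateway IP through pfSense functions, and the function names current_default_if and default_if_changed are hypothetical. I have not tested it across pfSense versions.

```shell
#!/bin/sh
# Sketch only: the function names below are hypothetical, not pfSense APIs.

# Print the interface that currently carries the default route.
# Relies on the "interface:" line printed by FreeBSD's `route -n get default`.
current_default_if() {
    route -n get default 2>/dev/null | awk '$1 == "interface:" { print $2 }'
}

# Compare the current interface ($2) with the value cached in file $1.
# Returns 0 if the interface changed since the last run, 1 otherwise.
# The cache is refreshed whenever a non-empty current value is given.
default_if_changed() {
    cache_file="$1"
    current="$2"
    old=""
    # If the lookup failed, report "no change" and leave the cache alone.
    [ -n "$current" ] || return 1
    # Strip whitespace in case the cache file was edited by hand.
    [ -f "$cache_file" ] && old=$(tr -d '[:space:]' < "$cache_file")
    printf '%s\n' "$current" > "$cache_file"
    # A first run (no cached value yet) is not treated as a change.
    [ -n "$old" ] && [ "$old" != "$current" ]
}

# Usage from cron on the firewall:
#   cur=$(current_default_if)
#   if default_if_changed /tmp/cached_def_gw_if "$cur"; then
#       logger "Default gateway interface changed to $cur, flushing states"
#       /sbin/pfctl -F states
#   fi
```

Keeping the comparison in its own function means it can be tried against a temporary file without touching the state table.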
During my little research I also found this post, but unfortunately samtoopid's script didn't work for me: it only kills states on the WAN interface, and in my case that is not enough. Only flushing all states makes the Wireguard connection switch to the recovered Tier 1 gateway.
-
Does the script work on the latest version? It is very annoying that all VPNs remain on the backup line after the main WAN is restored.