WG 0.1.5 / pfS+ 21.05.1 - 2 WAN→1 WAN failover, not "failing back"
-
I followed this video guide to get a site-to-site tunnel up between 2 Netgate appliances (SG3100 / SG-8860). After much hair pulling, the tunnel is up and it's fast. That's the good part.
- Site A (branch) has 2 WANs, Site B (DC) has 1 WAN.
- Static IP on all 3 interfaces involved.
- Since Site A has a gateway group configured (Failover, not load balanced) I set that as a Dynamic Endpoint on the Site B end.
- I tried both with and without a keep-alive set (30s).
Everything "works" — but when I yank the plug on WAN1 at the branch site, and then later plug it back in, it never fails back. The tunnel remains locked on the WAN2 interface. Manually bouncing the WG service at the branch site gets it flowing over WAN1 again.
I naively thought that since WG is sort of stateless / UDP-based, it would follow the system routing table, but it seems not to do that. I thought about writing a hook script to fire on /etc/rc.gateway_alarm etc., but I was hoping that with this new shiny protocol we wouldn't have to monkey patch anymore.
Anyone got any ideas here?
edit: Wanted to add that I have noticed that even with KeepAlive disabled (field blank), the console shows "persistent keepalive: every 30 seconds" when I query wg status... seems odd.
Also, how can we "restart" or "resync" the wg service/daemon/whatever it is? I saw that it's not a real service, so not even sure how to go about bouncing it from the console so it picks up the new default gateway...
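On the question of bouncing WireGuard from the console: the script further down this thread does it through the package's own service include. Below is a minimal sketch of that stop/start pair, assuming the php_wg binary and the wg_service.inc path used in that script exist on your version of the package (later releases may differ). DRY_RUN and the helper names are hypothetical, added here so the commands can be previewed without touching the tunnel.

```shell
#!/bin/sh
# Paths copied from the wgfix script in this thread; verify them on your box.
PHP_WG="${PHP_WG:-/usr/local/bin/php_wg}"
WG_SERVICE_INC="${WG_SERVICE_INC:-/usr/local/pkg/wireguard/includes/wg_service.inc}"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical safety switch: 1 = only print the commands

# Either print the command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop the WireGuard service, then start it again if the stop succeeded.
bounce_wg() {
    run "$PHP_WG" -f "$WG_SERVICE_INC" stop &&
    run "$PHP_WG" -f "$WG_SERVICE_INC" start
}
```

Source the file and call bounce_wg to see what would run; set DRY_RUN=0 first to actually bounce the tunnel.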
-
I came up with what I'm calling the Poor Man's Failback script:
v2.0: https://github.com/luckman212/wgfix
For anyone who wants to give this a try, follow the instructions in the README. Your wg tunnel should now "fail back" when your firewall emerges from a failover state.
Note: if you're using this with a PPPoE WAN, you might need to add a trigger using a custom devd config, see here.

    #!/bin/sh

    # put the line below at the end of /etc/rc.gateway_alarm, just above the final `exit`:
    # /root/wgfix.sh "${GW}" "${alarm_flag}"

    # set the 2 variables below to match the interface name and public key
    # of the wg tunnel that you want to "fail back" when your default gateway changes
    # WG_PEER_PUBLIC_KEY should be the public key from the FAR side (i.e. the one from the PEERS tab)
    WG_IFNAME='tun_wg0'
    WG_PEER_PUBLIC_KEY='Vh+y3uVDnbmJtL7O4tgjaInRmerV3dWq/dM8LeSqbFY='

    acquire_lock() {
      if /bin/pgrep -F "$LOCKFILE" >/dev/null 2>&1; then
        /usr/bin/logger -t wgfix "lockfile present, aborting"
        exit 1
      fi
      /usr/bin/logger -t wgfix "acquiring lockfile"
      echo $$ >"$LOCKFILE"
    }

    die() {
      /usr/bin/logger -t wgfix "done, removing lockfile"
      [ -f "$LOCKFILE" ] && rm "$LOCKFILE"
      exit $1
    }

    LOCKFILE="/tmp/${0##*/}.lock"
    /usr/bin/logger -t wgfix "$0 called, args: $1 $2"

    # the point of this script is "fail back" so we only care about "WAN up" events
    if [ "$2" != "0" ]; then
      /usr/bin/logger -t wgfix "ignoring WAN down event"
      die 0
    fi

    acquire_lock
    /usr/bin/logger -t wgfix "WAN UP: $1"

    # find the Endpoint of the peer whose PublicKey matches WG_PEER_PUBLIC_KEY
    /usr/local/bin/wg showconf $WG_IFNAME |
    /usr/bin/awk -v PK="$WG_PEER_PUBLIC_KEY" '
      BEGIN {FS=" = "}
      ($1 == "PublicKey" && $2 == PK) {f=1}
      /^Endpoint/ && f {e=$2}
      /^$/ {f=""}
      END {if(e) {print e}}' >/tmp/${WG_IFNAME}_endpoint
    IFS=: read -r IP PORT </tmp/${WG_IFNAME}_endpoint

    if [ -n "$IP" ] && [ -n "$PORT" ]; then
      /usr/bin/logger -t wgfix "WG endpoint: $IP:$PORT"
      /usr/bin/logger -t wgfix "pausing 10s to allow gateway change to occur"
      /bin/sleep 10
      DEF_GW=$(/sbin/route -n get "$IP" | /usr/bin/awk '/interface:/ {print $2; exit;}')
      /usr/bin/logger -t wgfix "Default gateway iface: $DEF_GW"
      # states to the endpoint that are NOT on the current default interface
      BAD_STATES=$(/sbin/pfctl -vvss | /usr/bin/grep "$IP:$PORT" | /usr/bin/grep -v "$DEF_GW" | wc -l)
      if [ "$BAD_STATES" -gt 0 ]; then
        /usr/bin/logger -t wgfix "found $BAD_STATES bad states; bouncing wg service"
        /usr/local/bin/php_wg -f /usr/local/pkg/wireguard/includes/wg_service.inc stop
        /sbin/pfctl -vvss |
        /usr/bin/grep -A2 "$IP:$PORT" |
        /usr/bin/awk 'BEGIN {OFS="/"} /id:/ {print $2,$4}' |
        while read -r STATE; do
          /usr/bin/logger -t wgfix "killing state $STATE"
          /sbin/pfctl -k id -k "$STATE"
        done
        /usr/local/bin/php_wg -f /usr/local/pkg/wireguard/includes/wg_service.inc start
      else
        /usr/bin/logger -t wgfix "no bad states found"
      fi
    else
      /usr/bin/logger -t wgfix "WG endpoint could not be determined"
    fi

    die 0

It works but it'd be better to have this handled by the wg service itself...
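The endpoint lookup is the step that decides whether the script proceeds at all, so it can be useful to try it in isolation. Below is a minimal sketch, assuming wg showconf-style text on stdin. extract_endpoint and split_endpoint are hypothetical helper names, and splitting on the last colon is a substitution for the script's IFS=: read, chosen so that a bracketed IPv6 endpoint would also come through intact.

```shell
#!/bin/sh
# Print the Endpoint of the [Peer] section whose PublicKey equals $1.
# Reads `wg showconf`-style text on stdin (same awk logic as the script above).
extract_endpoint() {
    awk -v PK="$1" '
        BEGIN { FS = " = " }
        $1 == "PublicKey" && $2 == PK { f = 1 }   # inside the wanted peer section
        /^Endpoint/ && f { e = $2 }               # remember its endpoint
        /^$/ { f = "" }                           # a blank line ends the section
        END { if (e) print e }'
}

# Split host:port on the LAST colon and print "host port".
# Prints nothing and returns 1 if there is no colon or either half is empty.
split_endpoint() {
    host=${1%:*}
    port=${1##*:}
    [ -n "$host" ] && [ -n "$port" ] && [ "$host" != "$1" ] || return 1
    echo "$host $port"
}
```

On the firewall this would be fed with something like `wg showconf tun_wg0 | extract_endpoint "$WG_PEER_PUBLIC_KEY"`. Empty output corresponds to the script's "WG endpoint could not be determined" branch.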
-
@luckman212 Is this still working for you? Unfortunately, it isn't for me. Something has changed in WireGuard and it is not using the specified route any more.
-
@trumee Haven't explicitly tested it with 22.05, but I have it running successfully on 22.01 with the most recent WG package. I plan to update a couple of them to 22.05 this weekend so I can definitely post back with results. In the meantime, what's in your logs? The script logs a bit of detail, if you filter on "wgfix".
-
@luckman212 I am still on pfSense 2.6.0. Thanks for reminding me about wgfix. I did a cold boot and issued /root/wgfix.sh WAN3_PPPOE 0 and WG used the WAN I wanted.
Here is what the log shows:

    # cat /var/log/system.log | grep wgfix
    Jul 2 02:00:55 pfSense wgfix[4608]: /root/wgfix.sh called, args: WAN3_PPPOE 0
    Jul 2 02:00:55 pfSense wgfix[4994]: acquiring lockfile
    Jul 2 02:00:55 pfSense wgfix[5279]: WAN UP: WAN3_PPPOE
    Jul 2 02:00:55 pfSense wgfix[5819]: WG endpoint could not be determined
    Jul 2 02:00:55 pfSense wgfix[6140]: done, removing lockfile
    Jul 2 02:22:20 pfSense wgfix[44381]: /root/wgfix.sh called, args: WAN3_PPPOE 0
    Jul 2 02:22:20 pfSense wgfix[44775]: acquiring lockfile
    Jul 2 02:22:20 pfSense wgfix[44813]: WAN UP: WAN3_PPPOE
    Jul 2 02:22:20 pfSense wgfix[44974]: WG endpoint: redacted:51823
    Jul 2 02:22:20 pfSense wgfix[44980]: pausing 20s to allow gateway change to occur
    Jul 2 02:22:40 pfSense wgfix[47710]: Default gateway iface: pppoe2
    Jul 2 02:22:40 pfSense wgfix[48920]: found 1 bad states; bouncing wg service
    Jul 2 02:22:47 pfSense wgfix[75223]: killing state 615abf6200000002/801cbc2f
    Jul 2 02:22:57 pfSense wgfix[69630]: done, removing lockfile

Notice that after a cold reboot the message is "WG endpoint could not be determined"; however, once I issue the command manually, the WAN is changed to WAN3_PPPOE.
-
@trumee Ok, I don't have any PPPoE systems to test with, so I'm guessing this is related to that.
Immediately after a fresh boot, what is the output of wg showconf tun_wg0 (or whatever your wg tunnel interface is, from the WG_IFNAME= line in the script)?
-
@luckman212 said in WG 0.1.5 / pfS+ 21.05.1 - 2 WAN→1 WAN failover, not "failing back":
wg showconf tun_wg0
It is as follows,

    #root: wg showconf tun_wg0
    [Interface]
    ListenPort = 51820
    PrivateKey = mykeyredacted

    [Peer]
    PublicKey = mykeyredacted
    AllowedIPs = 0.0.0.0/0
    Endpoint = remotepublicip:51823
    PersistentKeepalive = 25

-
@trumee That looks fine. I read some of the older comments and I saw that you had to use devd to trigger on the WANUP event for PPPoE. Is that custom config still in effect?
-
@luckman212 Yes, the devd trigger is still in place. I am on pfsense+ (22.05) now.
-
@trumee I'm guessing that this is a timing issue; maybe the PPPoE connection comes up too quickly and the lockfile from the previous run is still in place, etc. Can you try this modified version (removes the mutex check) and see if it behaves differently?
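If the timing theory holds, there is a middle ground between the original acquire_lock, which aborts immediately when another run still holds the lock, and dropping the mutex entirely: wait a few seconds for the other run to finish. Below is a minimal sketch of that idea, which is not what the gist does. lock_is_live, acquire_lock_wait and MAX_WAIT are hypothetical names, and kill -0 stands in for the script's pgrep -F check.

```shell
#!/bin/sh
LOCKFILE="${LOCKFILE:-/tmp/wgfix_demo.lock}"
MAX_WAIT="${MAX_WAIT:-15}"   # hypothetical: seconds to wait for a live lock holder

# True only if the lockfile names a PID that is still running.
lock_is_live() {
    [ -f "$LOCKFILE" ] || return 1
    pid=$(cat "$LOCKFILE" 2>/dev/null)
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Wait up to MAX_WAIT seconds for a live holder to finish, then take the lock.
# Returns 1 if the holder is still alive after the wait.
# Note: as in the original, check-then-write is not atomic.
acquire_lock_wait() {
    waited=0
    while lock_is_live; do
        [ "$waited" -ge "$MAX_WAIT" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    echo $$ >"$LOCKFILE"
}
```

A caller would use it as `acquire_lock_wait || die 1`, so a second WAN-up event that arrives close behind the first is delayed rather than discarded.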
gist: wgfix.sh (no locks)
-
@luckman212 Unfortunately, I am seeing a bigger issue right now for this WAN. I will get back to this once that is resolved.
-
Wireguard aside, does failback work for just the WANs at Site A? Once I failover to my LTE, and WAN comes back up, my states on the LTE interface remain.
-
@ddbnj I created this to operate specifically on WireGuard states. If you need generic "fallback" state killing, you can try enabling the Reset all states if WAN IP Address changes option at the bottom of System → Advanced → Networking.
-
Thanks.
Evidently resetting all states works sporadically at best.
There is a long history of pfsense users asking for failback on interfaces. Scripts were written but no longer seem to be working.
https://forum.netgate.com/topic/135614/failback-from-primary-wan-after-failover-to-secondary-wan/19
I was hoping to repurpose your script.
-
@ddbnj Feel free to fork and modify it. I had a "StateKiller" package that I was working on to do more complex rule-based state killing / failback, but I sadly never finished it. Not sure how much interest there is for that now that they have added some more general-purpose state killing options in the recent builds.