Netgate Discussion Forum

    WG 0.1.5 / pfS+ 21.05.1 - 2 WAN→1 WAN failover, not "failing back"

    16 Posts 3 Posters 3.0k Views
    • luckman212L
      luckman212 LAYER 8
      last edited by luckman212

      I followed this video guide to get a site-to-site tunnel up between 2 Netgate appliances (SG3100 / SG-8860). After much hair pulling, the tunnel is up and it's fast. That's the good part.

      • Site A (branch) has 2 WANs, Site B (DC) has 1 WAN.
      • Static IP on all 3 interfaces involved.
      • Since Site A has a gateway group configured (Failover, not load balanced) I set that as a Dynamic Endpoint on the Site B end.
      • I tried both with and without a keep-alive set (30s).

      Everything "works", but when I yank the plug on WAN1 at the branch site and later plug it back in, the tunnel never fails back. It remains locked on the WAN2 interface until I manually bounce the WG service at the branch site, which gets it flowing over WAN1 again.

      I naively thought that since WG was sort of stateless / UDP based that it would follow the system routing table, but it seems not to do that. I thought about writing a hook script to fire on /etc/rc.gateway_alarm etc but I was hoping with this new shiny protocol we wouldn't have to monkey patch anymore.

      Anyone got any ideas here?

      edit: Wanted to add that I have noticed that even with KeepAlive disabled (field blank), the console shows persistent keepalive: every 30 seconds when I query wg status... seems odd.
      [image: console output of wg status]

      Also, how can we "restart" or "resync" the wg service/daemon/whatever it is? I saw that it's not a real service, so not even sure how to go about bouncing it from the console so it picks up the new default gateway...

      • luckman212L
        luckman212 LAYER 8 @luckman212
        last edited by luckman212

        I came up with what I'm calling the Poor Man's Failback script:

        v2.0: https://github.com/luckman212/wgfix

        For anyone who wants to give this a try, follow the instructions in the README. Your wg tunnel should now "fail back" when your firewall emerges from a failover state.

        Note: if you're using this with a PPPoE WAN, you might need to add a trigger using a custom devd config, see here.

        #!/bin/sh
        
        # put the line below at the end of /etc/rc.gateway_alarm, just above the final `exit`:
        # /root/wgfix.sh "${GW}" "${alarm_flag}"
        
        # set the 2 variables below to match the interface name and public key
        # of the wg tunnel that you want to "fail back" when your default gateway changes
        # WG_PEER_PUBLIC_KEY should be the public key from the FAR side (i.e. the one from the PEERS tab)
        WG_IFNAME='tun_wg0'
        WG_PEER_PUBLIC_KEY='Vh+y3uVDnbmJtL7O4tgjaInRmerV3dWq/dM8LeSqbFY='
        
        acquire_lock() {
          if /bin/pgrep -F "$LOCKFILE" >/dev/null 2>&1; then
            /usr/bin/logger -t wgfix "lockfile present, aborting"
            exit 1
          fi
          /usr/bin/logger -t wgfix "acquiring lockfile"
          echo $$ >"$LOCKFILE"
        }
        
        die() {
          /usr/bin/logger -t wgfix "done, removing lockfile"
          [ -f "$LOCKFILE" ] && rm "$LOCKFILE"
          exit $1
        }
        
        LOCKFILE="/tmp/${0##*/}.lock"
        /usr/bin/logger -t wgfix "$0 called, args: $1 $2"
        # the point of this script is "fail back" so we only care about "WAN up" events
        if [ "$2" != "0" ]; then
          /usr/bin/logger -t wgfix "ignoring WAN down event"
          die 0
        fi
        acquire_lock
        /usr/bin/logger -t wgfix "WAN UP: $1"
        
        /usr/local/bin/wg showconf $WG_IFNAME |
        /usr/bin/awk -v PK="$WG_PEER_PUBLIC_KEY" '
          BEGIN {FS=" = "}
          ($1 == "PublicKey" && $2 == PK) {f=1}
          /^Endpoint/ && f {e=$2}
          /^$/ {f=""}
          END {if(e) {print e}}' >/tmp/${WG_IFNAME}_endpoint
        
        IFS=: read -r IP PORT </tmp/${WG_IFNAME}_endpoint
        if [ -n "$IP" ] && [ -n "$PORT" ]; then
          /usr/bin/logger -t wgfix "WG endpoint: $IP:$PORT"
          /usr/bin/logger -t wgfix "pausing 10s to allow gateway change to occur"
          /bin/sleep 10
          DEF_GW=$(/sbin/route -n get "$IP" | /usr/bin/awk '/interface:/ {print $2; exit;}')
          /usr/bin/logger -t wgfix "Default gateway iface: $DEF_GW"
          BAD_STATES=$(/sbin/pfctl -vvss | /usr/bin/grep "$IP:$PORT" | /usr/bin/grep -v "$DEF_GW" | wc -l)
          if [ "$BAD_STATES" -gt 0 ]; then
            /usr/bin/logger -t wgfix "found $BAD_STATES bad states; bouncing wg service"
            /usr/local/bin/php_wg -f /usr/local/pkg/wireguard/includes/wg_service.inc stop
            /sbin/pfctl -vvss |
            /usr/bin/grep -A2 "$IP:$PORT" |
            /usr/bin/awk 'BEGIN {OFS="/"} /id:/ {print $2,$4}' |
            while read -r STATE; do
              /usr/bin/logger -t wgfix "killing state $STATE"
              /sbin/pfctl -k id -k "$STATE"
            done
            /usr/local/bin/php_wg -f /usr/local/pkg/wireguard/includes/wg_service.inc start
          else
            /usr/bin/logger -t wgfix "no bad states found"
          fi
        else
          /usr/bin/logger -t wgfix "WG endpoint could not be determined"
        fi
        
        die 0
        

        It works but it'd be better to have this handled by the wg service itself...

        • T
          trumee @luckman212
          last edited by

          @luckman212 Is this still working for you? Unfortunately, it isn't for me. Something has changed in WireGuard and it is not using the specified route any more.

          • luckman212L
            luckman212 LAYER 8 @trumee
            last edited by

            @trumee Haven't explicitly tested it with 22.05, but I have it running successfully on 22.01 with the most recent WG package. I plan to update a couple of them to 22.05 this weekend so I can definitely post back with results. In the meantime, what's in your logs? The script logs a bit of detail, if you filter on wgfix.
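For anyone unsure how to filter on the tag: a minimal sketch, assuming the script's messages end up in a plain-text syslog file such as /var/log/system.log (the path used later in this thread; it may differ on your install). The function name `wgfix_log` is made up for illustration.

```sh
#!/bin/sh
# wgfix_log: print only the syslog lines written with the tag "wgfix".
# Reads the file(s) given as arguments, or stdin when called without any.
wgfix_log() {
  # Require "wgfix[<pid>]: " so that lines from other programs which only
  # mention "wgfix" somewhere in their message text are left out.
  grep -E ' wgfix\[[0-9]+\]: ' "$@"
}

# Example (on the firewall):
#   wgfix_log /var/log/system.log
```

Matching on the tag plus PID rather than on the bare word keeps the output limited to what the script itself logged.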

            • T
              trumee @luckman212
              last edited by

              @luckman212 I am still on pfSense 2.6.0. Thanks for reminding me about wgfix. I did a cold boot and issued /root/wgfix.sh WAN3_PPPOE 0 and WG used the WAN I wanted.

              Here is what the log shows:

              #cat /var/log/system.log | grep wgfix
              Jul  2 02:00:55 pfSense wgfix[4608]: /root/wgfix.sh called, args: WAN3_PPPOE 0
              Jul  2 02:00:55 pfSense wgfix[4994]: acquiring lockfile
              Jul  2 02:00:55 pfSense wgfix[5279]: WAN UP: WAN3_PPPOE
              Jul  2 02:00:55 pfSense wgfix[5819]: WG endpoint could not be determined
              Jul  2 02:00:55 pfSense wgfix[6140]: done, removing lockfile
              Jul  2 02:22:20 pfSense wgfix[44381]: /root/wgfix.sh called, args: WAN3_PPPOE 0
              Jul  2 02:22:20 pfSense wgfix[44775]: acquiring lockfile
              Jul  2 02:22:20 pfSense wgfix[44813]: WAN UP: WAN3_PPPOE
              Jul  2 02:22:20 pfSense wgfix[44974]: WG endpoint: redacted:51823
              Jul  2 02:22:20 pfSense wgfix[44980]: pausing 20s to allow gateway change to occur
              Jul  2 02:22:40 pfSense wgfix[47710]: Default gateway iface: pppoe2
              Jul  2 02:22:40 pfSense wgfix[48920]: found        1 bad states; bouncing wg service
              Jul  2 02:22:47 pfSense wgfix[75223]: killing state 615abf6200000002/801cbc2f
              Jul  2 02:22:57 pfSense wgfix[69630]: done, removing lockfile
              

              Notice that after a cold reboot the message is "WG endpoint could not be determined"; however, once I issue the command manually, the WAN is changed to WAN3_PPPOE.

              • luckman212L
                luckman212 LAYER 8 @trumee
                last edited by

                @trumee Ok, I don't have any PPPoE systems to test with, so I'm guessing this is related to that.

                Immediately after a fresh boot, what is the output of wg showconf tun_wg0 (or whatever your wg tunnel interface is, from the WG_IFNAME= line in the script)?

                • T
                  trumee @luckman212
                  last edited by

                  @luckman212 said in WG 0.1.5 / pfS+ 21.05.1 - 2 WAN→1 WAN failover, not "failing back":

                  wg showconf tun_wg0

                  It is as follows,

                  #root: wg showconf tun_wg0
                  [Interface]
                  ListenPort = 51820
                  PrivateKey = mykeyredacted
                  
                  [Peer]
                  PublicKey = mykeyredacted
                  AllowedIPs = 0.0.0.0/0
                  Endpoint = remotepublicip:51823
                  PersistentKeepalive = 25
                  
                  • luckman212L
                    luckman212 LAYER 8 @trumee
                    last edited by

                    @trumee That looks fine. I read some of the older comments and I saw that you had to use devd to trigger on the WANUP event for PPPoE. Is that custom config still in effect?
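To check whether the endpoint can be found in output like the one posted above, the parsing step of the script can be run on its own. This is a sketch built around the same awk logic as wgfix.sh; the function name `peer_endpoint` and the tolerance for leading whitespace are additions for illustration, not part of the original script.

```sh
#!/bin/sh
# peer_endpoint: read `wg showconf`-style text on stdin and print the
# Endpoint of the [Peer] whose PublicKey equals $1. Prints nothing when
# no such peer (or no Endpoint line for it) is present.
peer_endpoint() {
  awk -v PK="$1" '
    BEGIN { FS = " = " }
    { sub(/^[ \t]+/, "") }                   # tolerate indented copies of the output
    $1 == "PublicKey" && $2 == PK { f = 1 }  # entered the peer we care about
    /^Endpoint/ && f { e = $2 }
    /^$/ { f = 0 }                           # a blank line ends the current section
    END { if (e) print e }'
}

# Example (on the firewall; the interface name and key variable are the
# ones set at the top of wgfix.sh):
#   wg showconf tun_wg0 | peer_endpoint "$WG_PEER_PUBLIC_KEY"
```

If this prints nothing right after boot but works a few minutes later, that would suggest the tunnel was not fully configured yet when the script ran, rather than a problem with the config itself.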

                    • T
                      trumee @luckman212
                      last edited by

                      @luckman212 Yes, the devd trigger is still in place. I am on pfSense+ (22.05) now.

                      • luckman212L
                        luckman212 LAYER 8 @trumee
                        last edited by

                        @trumee I'm guessing that this is a timing issue; maybe the PPPoE connection comes up too quickly and the lockfile from the previous run is still in place, etc. Can you try this modified version (removes the mutex check) and see if it behaves differently?
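If the lock does turn out to be the culprit, a middle ground between aborting immediately and dropping the mutex entirely would be to wait briefly for it. The following is only a sketch of that idea and is not what the linked gist does; the function names, the use of a lock directory instead of a PID file, and the retry count are all assumptions for illustration.

```sh
#!/bin/sh
# acquire_lock_wait: try to take the lock at path $1, making up to $2
# attempts (default 30) with a one-second pause between attempts.
# mkdir is used because creating a directory either succeeds or fails
# atomically, so two overlapping runs cannot both get the lock.
acquire_lock_wait() {
  lockdir=$1
  tries=${2:-30}
  while ! mkdir "$lockdir" 2>/dev/null; do
    tries=$((tries - 1))
    if [ "$tries" -le 0 ]; then
      return 1          # still held by another run, give up
    fi
    sleep 1
  done
  return 0
}

# release_lock: remove the lock directory created by acquire_lock_wait.
release_lock() {
  rmdir "$1" 2>/dev/null
}

# Example:
#   acquire_lock_wait /tmp/wgfix.lock.d 30 || exit 1
#   ... do the work ...
#   release_lock /tmp/wgfix.lock.d
```

One drawback of this approach compared with the PID-file check is that a lock directory left behind by a crashed run is never detected as stale and would have to be removed by hand.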

                        gist: wgfix.sh (no locks)

                        • T
                          trumee @luckman212
                          last edited by

                          @luckman212 Unfortunately, I am seeing a bigger issue right now for this WAN. I will get back to this once that is resolved.

                          • luckman212L
                            luckman212 LAYER 8 @trumee
                            last edited by

                            2.0: https://github.com/luckman212/wgfix

                            • D
                              ddbnj @luckman212
                              last edited by

                              @luckman212

                              Wireguard aside, does failback work for just the WANs at Site A? Once I failover to my LTE, and WAN comes back up, my states on the LTE interface remain.

                              • luckman212L
                                luckman212 LAYER 8 @ddbnj
                                last edited by

                                @ddbnj I created this to operate specifically on WireGuard states. If you need generic "failback" state killing, you can try enabling the "Reset all states if WAN IP Address changes" option at the bottom of System → Advanced → Networking.

                                • D
                                  ddbnj @luckman212
                                  last edited by ddbnj

                                  @luckman212

                                  Thanks.

                                  Evidently resetting all states works sporadically at best.

                                  There is a long history of pfSense users asking for failback on interfaces. Scripts were written, but they no longer seem to be working.

                                  https://forum.netgate.com/topic/135614/failback-from-primary-wan-after-failover-to-secondary-wan/19

                                  I was hoping to repurpose your script.

                                  • luckman212L
                                    luckman212 LAYER 8 @ddbnj
                                    last edited by

                                    @ddbnj Feel free to fork and modify it. I had a "StateKiller" package that I was working on to do more complex rule-based state killing / failback, but I sadly never finished it. Not sure how much interest there is for that now that they added some more general-purpose state killing options in the recent builds.
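As a possible starting point for such a fork, here is a rough sketch of how the state-listing part of wgfix could be generalized from "states matching one endpoint" to "all states on one interface". The function name `states_on_iface` is hypothetical, and the assumed layout of `pfctl -vvss` output (a header line beginning with the interface name, followed by indented detail lines including one of the form `id: <id> creatorid: <creatorid>`) is inferred from the script and its log output above, so verify it against your own system first.

```sh
#!/bin/sh
# states_on_iface: read `pfctl -vvss`-style text on stdin and print
# "id/creatorid" for every state whose header line starts with the
# interface name given as $1.
states_on_iface() {
  awk -v iface="$1" '
    /^[^ \t]/ { cur = $1 }    # unindented line = new state; remember its interface
    /^[ \t]+id:/ && cur == iface { print $2 "/" $4 }'
}

# Example (on the firewall; "pppoe2" is just the interface name that
# appears in the log earlier in this thread):
#   pfctl -vvss | states_on_iface pppoe2 |
#   while read -r STATE; do
#     pfctl -k id -k "$STATE"
#   done
```

Killing every state on the backup interface is much blunter than what wgfix does, so it is probably best tried first with the `pfctl -k` line replaced by an `echo`.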

                                    Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.