Netgate Discussion Forum

    Multi-WAN gateway failover not switching back to tier 1 gw after back online

    Scheduled Pinned Locked Moved Routing and Multi WAN
    119 Posts 35 Posters 53.7k Views
    • Derelict LAYER 8 Netgate

      @MrD:

      @Derelict:

      I did not see one for this on redmine.pfsense.org.

      https://redmine.pfsense.org/issues/855

      So the final answer is: FAILOVER DOES NOT GO BACK TO THE INITIAL STATE
      This is surprising, but knowing this, I can stop losing time trying different config options…

      There. Feature #855. My redmine searching could obviously use a tuneup.

      Chattanooga, Tennessee, USA
      A comprehensive network diagram is worth 10,000 words and 15 conference calls.
      DO NOT set a source address/port in a port forward or firewall rule unless you KNOW you need it!
      Do Not Chat For Help! NO_WAN_EGRESS(TM)

      • jmonline

        @Derelict:

        Please demonstrate with evidence.

        OK, in very basic terms, since I already have quite a lot of information in this post: https://forum.pfsense.org/index.php?topic=86851.msg632594#msg632594

        • The connection has failed over to the backup WAN when the primary WAN has gone down. (Failover has worked as expected)

        • The primary WAN has come back up (Status > Gateways confirms this is up/online).

        • The states (VoIP sessions for phones) are still showing in the state table 12hrs later going over the backup WAN.

        • No new or refreshed sessions from the phones go over the primary connection.

        • Current state table (filtered by the phone with the IP of 10.10.30.55) looks like this:

          WAN_EFM udp 135.196.xxx.xxx:41809 (10.10.30.55:49679) -> 185.83.xxx.xxx:5060 MULTIPLE:MULTIPLE 201.589 K / 102.513 K 125.60 MiB / 39.52 MiB
          30VOICELAN udp 185.83.xxx.xxx:5060 <- 10.10.30.55:49679 MULTIPLE:MULTIPLE 99.293 K / 99.502 K 61.87 MiB / 38.35 MiB

          To clarify:
          WAN_EFM - is the backup WAN connection
          30VOICELAN - is the LAN network for the phones
          135.196.xxx.xxx - is the IP of my backup WAN connection
          185.83.xxx.xxx - is the IP of my externally hosted VoIP platform

        • I have then "Reset the firewall state table"

        • At this point SOMETIMES the states will clear and obey the correct Gateway fail-over rule and be sent back over the primary WAN.
          SOMETIMES they will stay where they are (on the backup WAN)

        I can understand the argument that it is a feature request to have the states clear on the re-establishment of the primary wan connection.
        However, why have I seen the following…

        • Primary connection has been down for a length of time and has since come back online.

        • A brand new device which has never connected to the network (so therefore has no open states) is connected.

        • This new device's states are sent over the backup WAN, even though the primary WAN is available.

        • After "Reset the firewall state table", the new device has states over the primary WAN (as it should have had when it first connected to the network).

        I also ran a test of this in a virtual environment and simulated the primary WAN connection dropping and re-connecting.
        I was using a Linux machine as a test client and just running a PING and TRACEROUTE to use as example states on the firewall (eliminating the VoIP aspect).
        Sometimes, when I brought the primary WAN connection back online, a new TRACEROUTE to a different IP address would go over the primary WAN, and other times it would remain on the backup WAN.
        I have not been able to prove what causes this - it appears random.

        In my mind, if the primary wan connection is reconnected and online, then any NEW state that hits the firewall should always follow the gateway group rule and go over the Tier1 connection.

        Why does running pfctl against the relevant hosts/network not clear the states for just the VoIP devices (without clearing the whole state table)?

        Another simple way of putting it…

        If your primary connection goes down for an hour and then comes back online, at what point should your traffic start to reuse that connection? What if your "backup" connection has a very high data usage charge?

        Bit of history for you…

        I used to use Draytek equipment for all my client sites. Their old 2830 series of routers had WAN failover options, but the same applied: if the primary went down, everything would fail over to the backup and then never fail back when the primary connection returned.

        On their newer 2860 series of routers, they added one simple check box labelled "Failback" and it moved your sessions/states back to the correct primary connection when it was available again.

        However on the Draytek I never had the issue where a NEW state/session would still go over the backup WAN when the primary was available. If it was a new session it always followed the rules correctly.

        I hope that makes sense to some of you :)

        • Derelict LAYER 8 Netgate

          You still didn't show the Tier 1 being back online and new states still being created on Tier 2. I think if you really take a look at this you will find that is not happening.

          And nothing can "move a state" back to the original connection. All you can do is kill the old state and let a new one be established on a reconnection.

          • jmonline

            OK, let's simplify it even more…

            Take the traceroute facility in Diagnostics > Traceroute

            • My primary wan is back online.

            • I enter a hostname to trace (Google - 8.8.8.8)

            • I pick the Source Address as 30VOICELAN

            • I get the following result:
                 1  135.196.xxx.xxx  7.671 ms  6.869 ms  7.008 ms
                 2  135.196.xxx.xxx  7.016 ms  7.195 ms  7.164 ms
                 3  5.57.80.136  7.218 ms  7.199 ms  11.125 ms
                 4  216.239.54.243  7.922 ms
                 5  216.239.58.95  9.010 ms
                 6  8.8.8.8  8.139 ms  8.010 ms  8.626 ms

            • The first line on the traceroute with the IP starting 135.196 is my backup internet connection. Not my primary.

            How is that possible?

            The firewall rule on the 30VOICELAN has the Gateway set as the Gateway Group named "DSLFirst".
            The Gateway group "DSLFirst" has the (Primary) DSL WAN connection as Tier1 and the (Backup) EFM WAN connection as Tier2.
            Status > Gateways shows both gateways online.

            • Derelict LAYER 8 Netgate

              Show me the states, bro. pfctl -vss

              • jmonline

                @Derelict:

                Show me the states, bro. pfctl -vss

                OK, so perfect time for a test :) Last night it looks like BT did their usual maintenance on the DSL network around 1am, so the ADSL line was down for 5 minutes. This morning I have the following in the state table for the phone on IP 10.10.30.27.

                30VOICELAN tcp 185.83.xxx.xxx:5060 <- 10.10.30.27:55778 ESTABLISHED:ESTABLISHED 8.933 K / 10.417 K 3.13 MiB / 3.43 MiB
                WAN_EFM tcp 185.3.xxx.xxx:40781 (10.10.30.27:55778) -> 185.83.xxx.xxx:5060 ESTABLISHED:ESTABLISHED 8.933 K / 10.417 K 3.13 MiB / 3.43 MiB

                To clarify:
                185.83.xxx.xxx is the external VoIP pbx.
                185.3.xxx.xxx is the IP of the WAN_EFM (backup) connection.
                30VOICELAN is my internal network with a subnet of 10.10.30.0/24

                pfctl -vss shows the following:

                igb1_vlan30 tcp 185.83.xxx.xxx:5060 <- 10.10.30.27:55778      ESTABLISHED:ESTABLISHED
                  [1594456643 + 42272] wscale 8  [1007765254 + 183296] wscale 5
                  age 05:58:04, expires in 119:59:52, 8954:10441 pkts, 3290011:3604569 bytes, rule 119

                igb2 tcp 185.3.xxx.xxx:40781 (10.10.30.27:55778) -> 185.83.xxx.xxx:5060      ESTABLISHED:ESTABLISHED
                  [1007765254 + 183296] wscale 5  [1594456643 + 42272] wscale 8
                  age 05:58:04, expires in 119:59:52, 8954:10441 pkts, 3290011:3604569 bytes, rule 96
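
Editor's sketch: a pair like the one above can be picked out of a full state listing with a small text filter. This assumes state lines start with the interface name and show the pre-NAT source in parentheses, as in the output quoted here. The function name `stuck_states` is made up for illustration, and the prefix match is purely textual rather than a real subnet test.

```shell
#!/bin/sh
# Sketch: list NAT states on one interface whose internal (pre-NAT) source
# address starts with a given textual prefix.
# Reads `pfctl -ss` style text on stdin. `stuck_states` is a hypothetical name.

stuck_states() {
    iface="$1"     # interface to match, e.g. igb2 (WAN_EFM in this thread)
    prefix="$2"    # textual address prefix, e.g. "10.10.30." for 10.10.30.0/24
    awk -v iface="$iface" -v prefix="$prefix" '
        # Keep lines whose first field is the interface and whose
        # parenthesised pre-NAT source begins with the prefix.
        $1 == iface && index($0, "(" prefix) > 0 { print }
    '
}

# Usage on the firewall (not executed here):
#   pfctl -ss | stuck_states igb2 10.10.30.
```

The indented detail lines that `pfctl -vss` prints under each state are ignored, because their first field is never the interface name.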

                Out of interest, what would be the correct pfctl command to force-kill these states (that is, all states on the WAN_EFM connection from the subnet 10.10.30.0/24)?

                If I can get a command that successfully kills these states when they get stuck here, I am happy with that as a workaround until someone can work out how to automate it. I don't want to reset the whole state table every time, since that kills sessions which should legitimately stay open.

                Thanks

                • Derelict LAYER 8 Netgate

                  You probably want to kill all connections to the PBX. That would be:

                  pfctl -k 0.0.0.0/0 -k 185.83.xxx.xxx

                  That will kill everything, even phones that are connected out via Tier 1.

                  You can try just killing one side of the connection that is tied to WAN_EFM with:

                  pfctl -i igb2 -k 0.0.0.0/0 -k 185.83.xxx.xxx

                  If, when the phones reconnect, they use the Tier1 connection, great. In my testing they continued to use the other connection so it doesn't look like you can do that.
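
Editor's sketch: the two invocations above can be wrapped behind a dry-run switch, so the exact command can be reviewed before anything is killed. `DRY_RUN` and `kill_states_to` are names invented for this example, and 203.0.113.10 is a placeholder for the masked PBX address.

```shell
#!/bin/sh
# Sketch: build the pfctl kill command from the post above.
# By default it only prints the command (DRY_RUN=1).
# DRY_RUN and kill_states_to are hypothetical names.

DRY_RUN="${DRY_RUN:-1}"

kill_states_to() {
    dest="$1"     # destination host or network, e.g. the PBX address
    iface="$2"    # optional: restrict the kill to one interface, e.g. igb2
    if [ -n "$iface" ]; then
        set -- pfctl -i "$iface" -k 0.0.0.0/0 -k "$dest"
    else
        set -- pfctl -k 0.0.0.0/0 -k "$dest"
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"      # show what would be run
    else
        "$@"           # actually run pfctl (needs root on the firewall)
    fi
}

# Placeholder address; substitute the real PBX IP.
kill_states_to 203.0.113.10 igb2
```

Setting `DRY_RUN=0` in the environment makes the function execute the command instead of printing it.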

                  • jmonline

                    Ok so the following command cleared the sessions stuck on the failover WAN.

                    pfctl -i igb2 -k 0.0.0.0/0 -k 185.83.xxx.xxx

                    This is a good step forward since I can now manually force the sessions back when I know they haven't moved on their own.

                    I presume I may be able to schedule this via some sort of script that runs a specified period of time after the primary connection comes back online…?

                    Thanks for the help so far, Derelict :)

                    • devmaybe

                      Hello,

                      I know this is an old thread but I have the same problem and now I am able to reliably reproduce the behavior in a test environment.

                      If the "primary" WAN is a PPPoE connection and the secondary WAN is a "standard" connection with a static or DHCP-assigned IP address, then when the primary goes down, failover to the secondary works as expected, but when the primary comes back up, no traffic flows through it.

                      In such cases on my production systems I usually edit my default gateway entry in System->Routing.
                      I uncheck the "Default Gateway" mark, re-check it and then save and apply.
                      Traffic starts flowing again through the PPPoE connection.

                      The same always works in my virtual machines test environment too.

                      I hope this can help in tracking down the source of the problem or at least in finding some solution.

                      Thanks

                      • sandrino

                        Hi all,

                        Same problem here: WAN1 tier 1 (cable, default GW, 2Mb/2Mb), WAN2 tier 1 (WiMAX PPPoE, 12Mb/3Mb), weights WAN1 1 : WAN2 4.

                        If WAN2 goes down, all traffic switches to WAN1.

                        When WAN2 comes back online (gateway groups all online), all connections stay on WAN1.

                        If I reload the filter, everything goes back to normal, with weights WAN1 1 : WAN2 4.

                        I don't use the DNS Forwarder, and for monitoring I use the ISP DNS servers (2 per connection).

                        Please help!!!!

                        Bye
                        Sandro

                        • sandrino

                          Hi

                          In the Miscellaneous settings, under "Gateway Monitoring", there are two options:

                          State Killing on Gateway Failure
                          Flush all states when a gateway goes down: The monitoring process will flush all states when a gateway goes down if this box is checked.

                          Skip rules when gateway is down
                          Do not create rules when gateway is down: By default, when a rule has a gateway specified and this gateway is down, the rule is created omitting the gateway. This option overrides that behavior by omitting the entire rule instead.

                          Could someone explain them?

                          Thanks
                          Bye
                          Sandro

                          • devmaybe

                            I think I have found a solution.

                            I have tested it on the 2.3.2 release. It consists of 2 steps:

                            1. Take note of the name you assigned to your PPPoE connection (WAN2 in this example)
                            2. Add the following lines at the end of "/usr/local/sbin/ppp-linkup" script (between "fi" and "exit 0" lines)

                            fi

                            sleep 5
                            /etc/rc.newwanip wan2

                            exit 0

                            In all my tests traffic switches back correctly.

                            Note: without the "sleep" instruction I was getting mixed results. Maybe it is only a timing problem with PPPoE activation?

                            Bye

                            • SecureIS

                              +1 that failback would be very valuable. I have a deployment where the Tier 2 connection is pay per GB so it would be nice to be able to automate failover AND failback but I have to keep that WAN disconnected to make sure no connections get stuck on it. It's not a PPPoE link so sadly I can't use an up/down script for this :(

                              We need a setting for "Flush all states when a lower tier gateway comes back up. The monitoring process will flush all states when a lower tier gateway comes up if this box is checked"

                              • luckman212 LAYER 8

                                I'm working on a script to kill VOIP states when WAN1 (primary) comes back online.  As mentioned elsewhere in this thread, this is a critical feature in real-world scenarios due to (a) costly metered backup connections as well as (b) SIP interop issues when devices behind the same LAN are seen registering from different public IPs.  So I won't rehash all of that. I am trying to automate pfctl from the rc.gateway_alarm script that gets called on WANUP.  I also see that a PR has been recently merged that might help make this even easier and less hacky.  Has anyone hooked into these new functions yet to make this more reliable?

                                TL;DR— pfctl is not killing all of the related states. Can someone help me to understand something regarding states?

                                • Assume vlan100 is dedicated for voice, with subnet 192.168.20.0/24
                                • WAN1=primary, WAN2=backup
                                • When a "fail back" WAN2->WAN1 event happens, I need to kill all states: (any)->WAN2->vlan100 and vlan100->WAN2->(any)
                                • I try using a command like:

                                pfctl -i igb0_vlan100 -k 0.0.0.0/0

                                But this only seems to kill the states originating from inside the LAN. There are still tracked states via WAN2 that are NATed to internal igb0_vlan100 IPs. Do I also need to run commands like these instead?

                                pfctl -k 192.168.20.0/24 -k 0.0.0.0/0
                                pfctl -k 0.0.0.0/0 -k 192.168.20.0/24

                                Or some other command? Is there a better way?
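
Editor's sketch: with `pfctl -k`, the first `-k` is matched as the source and the second as the destination, so covering both directions for a subnet takes the two commands quoted above. The helper below only prints them. `kill_cmds_for_subnet` is a hypothetical name, and whether this pair also removes the translated WAN2-side entries is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Sketch: print the pair of pfctl commands that target states in both
# directions for one subnet. Nothing is executed here.
# kill_cmds_for_subnet is a hypothetical name.

kill_cmds_for_subnet() {
    subnet="$1"    # e.g. 192.168.20.0/24 (the voice VLAN in the post above)
    # States whose source address is inside the subnet
    echo "pfctl -k ${subnet} -k 0.0.0.0/0"
    # States whose destination address is inside the subnet
    echo "pfctl -k 0.0.0.0/0 -k ${subnet}"
}

kill_cmds_for_subnet 192.168.20.0/24
```

Piping the output to `sh` on the firewall would run both commands. Printing first keeps the step reviewable.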

                                • nemanager

                                  Any news ?  :(

                                  • kimkhan

                                    Failback to default WAN works for me.

                                    I have a Gigabit Fiber connection and a Cable modem connection. I put one of them as Tier1 and the other as Tier2.

                                    I used 8.8.8.8 for one and 8.8.4.4 for the other.

                                    But just following the instructions in the pfSense documentation and the forum postings that suggest creating gateway groups with different tiers will not work unless you have the 'Default gateway switching' box checked. You can find it under System > Advanced > Miscellaneous.

                                    http://prntscr.com/evn3ub

                                    I tested with disconnecting WAN1 and going to whatismyip.com and then plugging WAN1 back and going to a different what is my ip site. Don't go to the first one as it will be cached and will not show your original/default wan IP.

                                    Or you can just do a ping.

                                    Let me know if this helps. I can also post my configurations if you need to see.

                                    KK

                                    Netgate SG-2440
                                    2.3.3-RELEASE-p1

                                    • red_cat1930

                                      2.3.3-RELEASE-p1 (amd64), MultiWAN, VM on Hyper-V

                                      WAN1 ( tier2, monitor ip 8.8.4.4 )
                                      WAN2 ( tier1, monitor ip 8.8.8.8 ).

                                      Today WAN2 raised a latency alarm, but no "Clear latency" event occurred, even though the line became stable again (according to the dashboard).

                                      Usual (System logs->Gateways):
                                      Apr 12 03:29:32 dpinger WAN2_DHCP 8.8.8.8: Clear latency 39052us stddev 2978us loss 5%
                                      Apr 12 03:28:34 dpinger WAN2_DHCP 8.8.8.8: Alarm latency 34409us stddev 429us loss 22%

                                      Today (no clear latency event):
                                      Apr 13 13:19:23 dpinger WAN2_DHCP 8.8.8.8: Alarm latency 34494us stddev 342us loss 21%

                                      All clients on the LAN kept using WAN1 until I manually simulated a WAN2 disconnect (set 1.1.1.1 as the monitor IP for a minute, then reverted to 8.8.8.8).
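
Editor's sketch: a missing "Clear" event like the one described above can be spotted by tracking the last dpinger event per gateway. This assumes log lines in exactly the format quoted in this post, fed oldest first. The GUI lists them newest first, so they would need reversing, for example with FreeBSD's `tail -r`. `gateways_in_alarm` is a name made up for this example.

```shell
#!/bin/sh
# Sketch: report gateways whose most recent dpinger event is an Alarm.
# Expects lines oldest-first on stdin, in the format quoted above:
#   Apr 13 13:19:23 dpinger WAN2_DHCP 8.8.8.8: Alarm latency ...
# gateways_in_alarm is a hypothetical name.

gateways_in_alarm() {
    awk '
        # Field 5 is the gateway name, field 7 is the event type.
        $4 == "dpinger" && ($7 == "Alarm" || $7 == "Clear") {
            last[$5] = $7      # remember the latest event per gateway
        }
        END {
            for (gw in last)
                if (last[gw] == "Alarm") print gw
        }
    ' | sort
}

# Usage (not executed here), with the log saved oldest-first in a file:
#   gateways_in_alarm < gateways.log
```

An empty result means every gateway's latest event was a "Clear".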

                                      • carmico

                                        same problem here

                                        failover is working tier1 to tier2, but when tier1 recovers, monitor says "online" but the traffic doesn't switch back to tier1 , remains on tier2

                                        PFsense ver. 2.3.3-RELEASE-p1

                                        • ronnysa

                                          @carmico:

                                          same problem here

                                          failover is working tier1 to tier2, but when tier1 recovers, monitor says "online" but the traffic doesn't switch back to tier1 , remains on tier2

                                          PFsense ver. 2.3.3-RELEASE-p1

                                          I am having the exact same problem here.

                                          2.3.3-RELEASE-p1 (amd64)
                                          built on Thu Mar 09 07:17:41 CST 2017
                                          FreeBSD 10.3-RELEASE-p17

                                          • jono_white

                                            The failback seems to work provided the PC's connection is left idle for 20 seconds or so, but if there's an active connection after your primary connection goes down (VoIP, video/audio streaming, or even a continuous ping), it seems to remain on the redundant connection.

                                            The following script seems to work for my situation (4G modem failover with limited quota). It's nowhere near perfect, but it shuts the 4G interface down long enough for the states to be killed when the primary WAN is up. It would be better if it exited when there are no active states on 4G, but meh.

                                            (I run it from cron every 5 minutes or so: */5 * * * * root /bin/sh /root/routercheck.sh)

                                            #!/bin/sh

                                            # Monitor targets for the primary (rl0) and backup (rl1) WANs
                                            check_wan1=8.8.8.8
                                            check_wan2=8.8.4.4

                                            # Current IPv4 address of each WAN interface
                                            wan_ipaddress=$(ifconfig rl0 | grep 'inet ' | awk '{ print $2 }' | cut -d'/' -f1)
                                            backupwan_ipaddress=$(ifconfig rl1 | grep 'inet ' | awk '{ print $2 }' | cut -d'/' -f1)

                                            # If the backup WAN does not answer, there is nothing to do
                                            ping -c 2 -S "${backupwan_ipaddress}" "${check_wan2}" > /dev/null 2>&1
                                            backupwan_resp=$?

                                            if [ "${backupwan_resp}" -gt 0 ]; then
                                                exit 1
                                            fi

                                            # If the primary WAN answers, bounce the backup interface so its states are dropped
                                            ping -c 2 -S "${wan_ipaddress}" "${check_wan1}" > /dev/null 2>&1
                                            wan_resp=$?

                                            if [ "${wan_resp}" -eq 0 ]; then
                                                #service netif restart rl1
                                                ifconfig rl1 down; sleep 15; ifconfig rl1 up
                                            fi

                                            #end
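
Editor's sketch: the improvement mentioned above (exit early when there are no active states on the 4G interface) could be approximated with a check on the state listing. This assumes state lines begin with the interface name, as in the `pfctl -vss` output earlier in the thread. `has_states_on` is a hypothetical helper, not part of the original script.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if any state line on stdin belongs to
# the given interface, otherwise fail (exit status 1).
# has_states_on is a hypothetical name.

has_states_on() {
    iface="$1"    # e.g. rl1, the backup interface in the script above
    awk -v iface="$iface" '
        $1 == iface { found = 1; exit }   # first match is enough
        END { exit !found }               # status 0 only when a state was seen
    '
}

# Possible use near the top of the cron script (not executed here):
#   pfctl -ss | has_states_on rl1 || exit 0
```

With that guard in place, the interface would only be bounced when something is actually still using it.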

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.