Netgate Discussion Forum

    Multi-WAN gateway failover not switching back to tier 1 gw after back online

      cheonne

      You should set a working monitor IP for each interface, such as a DNS server IP.

        MrD

        Hello,

        I'm facing the same problem. I've read these (with no solution):

        • https://forum.pfsense.org/index.php?topic=111143.0
        • https://redmine.pfsense.org/issues/5090

        I'm running 2.3.1 with 2 WANs (1 cable/main and 1 DSL-PPPoE/secondary) and 2 groups. Failover works (the trigger is OK), but it does not switch back after the weak connection is back at 100%.

        I'm ready to send screenshots; just ask.

        logs:

        Jun 16 13:49:20 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 9274us stddev 5829us loss 21%
        Jun 16 13:49:42 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 7586us stddev 4056us loss 15%
        Jun 16 13:53:58 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 7362us stddev 4941us loss 21%
        Jun 16 13:54:15 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 6543us stddev 3719us loss 19%
        Jun 16 13:54:39 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 6692us stddev 3840us loss 21%
        Jun 16 13:54:57 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 6644us stddev 3338us loss 15%
        Jun 16 13:56:03 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 8839us stddev 5402us loss 21%
        Jun 16 13:56:19 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 8292us stddev 4864us loss 19%
        Jun 16 13:56:43 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 8431us stddev 5556us loss 22%
        Jun 16 13:57:02 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 7940us stddev 5158us loss 15%
        Jun 16 13:58:35 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 12630us stddev 12111us loss 21%
        Jun 16 13:58:53 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 8282us stddev 4592us loss 15%
        Jun 16 13:59:21 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 8983us stddev 5856us loss 21%
        Jun 16 13:59:32 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 8447us stddev 5473us loss 16%
        Jun 16 13:59:58 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 8206us stddev 5630us loss 21%
        Jun 16 14:00:11 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 7373us stddev 4132us loss 14%
        Jun 16 14:01:14 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 8049us stddev 4691us loss 21%
        Jun 16 14:01:44 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 7842us stddev 3865us loss 18%
        Jun 16 14:01:47 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 7944us stddev 3892us loss 21%
        Jun 16 14:02:18 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 7717us stddev 3673us loss 12%
        Jun 16 14:03:51 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 7952us stddev 4608us loss 21%
        Jun 16 14:04:16 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 7030us stddev 3415us loss 12%
        Jun 16 14:04:28 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 7711us stddev 4555us loss 21%
        Jun 16 14:04:56 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 7538us stddev 4081us loss 14%
        Jun 16 14:05:10 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 8504us stddev 5216us loss 21%
        Jun 16 14:05:32 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 8245us stddev 4794us loss 13%
        Jun 16 14:05:51 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 8544us stddev 5200us loss 21%
        Jun 16 14:06:14 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 7369us stddev 3934us loss 16%
        Jun 16 14:06:26 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 7862us stddev 4613us loss 21%
        Jun 16 14:06:56 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 7151us stddev 3861us loss 13%
        Jun 16 14:11:12 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 7910us stddev 4976us loss 21%
        Jun 16 14:11:26 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 6648us stddev 3553us loss 15%
        Jun 16 14:11:49 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 6910us stddev 4027us loss 21%
        Jun 16 14:12:11 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 6271us stddev 2901us loss 15%
        Jun 16 14:12:28 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 6705us stddev 3698us loss 21%
        Jun 16 14:12:52 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 6371us stddev 2763us loss 11%
        Jun 16 14:13:45 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 8346us stddev 5486us loss 21%
        Jun 16 14:14:57 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 8065us stddev 5624us loss 16%

        dash.jpg
        gw.jpg
        GW-grups.png
        WAN1.png
        WAN2.jpg

          Derelict LAYER 8 Netgate

          I would change the monitor IP in the WAN2CABLEGW to 8.8.8.8 or anything else that responds reliably and see if things improve. You can't expect any multi-WAN routing solution to perform with any semblance of continuity with flapping like that.

          Chattanooga, Tennessee, USA
          A comprehensive network diagram is worth 10,000 words and 15 conference calls.
          DO NOT set a source address/port in a port forward or firewall rule unless you KNOW you need it!
          Do Not Chat For Help! NO_WAN_EGRESS(TM)

            MrD

            Actually, the monitor IP is the default, so it is the gateway of each WAN.

            I will try one like Google (8.8.8.8), but the problem is not the monitor IP or the trigger. The problem is that when the defective WAN is back to normal (ping to the monitor IP shows better quality), the system does not switch back. If I log in to pfSense hours after the problem and look at my "gateway groups" status, they are all green (same for the gateways), but the system does not switch back to the preferred gateway.

            @Derelict:

            I would change the monitor IP in the WAN2CABLEGW to 8.8.8.8 or anything else that responds reliably and see if things improve. You can't expect any multi-WAN routing solution to perform with any semblance of continuity with flapping like that.

              Derelict LAYER 8 Netgate

              No, the problem is your gateway is flapping about every minute due to packet loss to your monitor IP. If that was in my multi-wan group I would disable it until it was fixed. If that's "just the way it is" you will need to increase your monitoring thresholds and consider it up when it sucks like that.
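
The flapping described here can be quantified from the dpinger log posted earlier in the thread. The following is a minimal Python sketch, not part of pfSense: the function names `parse_events` and `flap_summary` are hypothetical, the regular expression only covers the line format shown in this thread, and the year is an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches lines in the format posted in this thread, for example:
# Jun 16 13:49:20 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 9274us stddev 5829us loss 21%
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+dpinger\s+"
    r"(?P<gw>\S+)\s+\S+:\s+(?P<event>Alarm|Clear)\s+"
    r"latency\s+(?P<latency_us>\d+)us\s+stddev\s+(?P<stddev_us>\d+)us\s+"
    r"loss\s+(?P<loss_pct>\d+)%\s*$"
)


def parse_events(lines, year=2016):
    """Parse dpinger Alarm/Clear lines; `year` is assumed (syslog omits it)."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # ignore anything that is not a dpinger Alarm/Clear line
        ts_text = " ".join(m.group("ts").split())
        events.append({
            "time": datetime.strptime(f"{year} {ts_text}", "%Y %b %d %H:%M:%S"),
            "gateway": m.group("gw"),
            "event": m.group("event"),
            "loss_pct": int(m.group("loss_pct")),
        })
    return events


def flap_summary(events):
    """Count alarms and measure how long each Alarm lasted until its Clear."""
    durations = []
    alarm_started = None
    for ev in events:
        if ev["event"] == "Alarm":
            alarm_started = ev["time"]
        elif alarm_started is not None:
            durations.append((ev["time"] - alarm_started).total_seconds())
            alarm_started = None
    alarm_losses = [ev["loss_pct"] for ev in events if ev["event"] == "Alarm"]
    return {
        "alarms": len(alarm_losses),
        "min_alarm_loss_pct": min(alarm_losses, default=None),
        "alarm_durations_s": durations,
    }


if __name__ == "__main__":
    sample = [
        "Jun 16 13:49:20 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Alarm latency 9274us stddev 5829us loss 21%",
        "Jun 16 13:49:42 dpinger WAN2CABLEGW xxx.xxx.xxx.xxx: Clear latency 7586us stddev 4056us loss 15%",
    ]
    print(flap_summary(parse_events(sample)))
```

Run against the full log above, a summary like this makes it easy to see how often the gateway alarms and how quickly it clears. In the posted log every Alarm is raised at 21% or 22% loss, which suggests (as an inference, not something confirmed in the thread) that the loss is hovering right around the alarm threshold.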

                MrD

                OK, my gateway has "troubles" for a few minutes a day (not all the time).

                That's precisely why I want failover.

                And this does not explain why the system does not go back to its first WAN after the system shows the first WAN in green (hours later!).

                If the goal of failover is to work only on connections that never have troubles, it makes no sense to me…

                @jahonix:

                You have one or two Gateway Groups defined? The one with time stamp 02-25-44.

                What you call "WANGROUP" is easier to handle when called " PPPoE 2 UPC"
                Now you need an additional "UPC 2 PPPoE" group with reversed tiers.
                Add another firewall rule for that one as well and it should work.

                And start with setting both "Trigger levels" to "Member Down".

                  jmonline

                  I am pretty sure this is exactly related to my issue and my most recent detailed post here:

                  https://forum.pfsense.org/index.php?topic=86851.msg632594#msg632594

                    MrD

                    Hello JM,

                    Overall it seems to be the same problem reported by most people writing in your thread. I've seen a bug report, but the dev team considers it a misconfiguration rather than a bug, without explaining where the misconfiguration is… strange.

                    @jmonline:

                    I am pretty sure this is exactly related to my issue and my most recent detailed post here:

                    https://forum.pfsense.org/index.php?topic=86851.msg632594#msg632594

                      Derelict LAYER 8 Netgate

                      It is not a bug.

                      A setting that kills all states on a Tier X interface when a Tier < X interface returns to service would be a feature request.

                      I did not see one for this on redmine.pfsense.org.

                        MrD

                        After much reading on this subject, this is the first time I have read that this is normal and would be a feature request. I had read that it was the result of misconfiguration, meaning that the connection should go back to what it was before failover…

                        For example:
                        https://redmine.pfsense.org/issues/5090

                        Chris Buechler
                        …
                        I went through and re-tested multi-WAN in general on 2.2.5 (which is the same as 2.2.4 in that regard) and it fails over and back as it should just fine every time.
                        ...
                        There may be some edge case but nothing here to suggest what that might be.

                        BUT a few lines later, it goes another way:

                        Chris Buechler
                        …
                        that's how it's supposed to work at this point. Sounds like you want state killing on failback, which doesn't exist at this time. feature #855 covers that

                        https://redmine.pfsense.org/issues/855

                        So the final answer is: FAILOVER DOES NOT GO BACK TO THE INITIAL STATE.
                        This is surprising, but knowing this, I can stop losing time trying different config options…

                        @Derelict:

                        It is not a bug.

                        A setting that kills all states on a Tier X interface when a Tier < X interface returns to service would be a feature request.

                        I did not see one for this on redmine.pfsense.org.

                          jmonline

                          @Derelict:

                          It is not a bug.

                          A setting that kills all states on a Tier X interface when a Tier < X interface returns to service would be a feature request.

                          I did not see one for this on redmine.pfsense.org.

                          Right, but if it's not a bug, then how do you get traffic to go back over the original interface when it comes back online?

                          Killing the states does not always work.

                          I have also been able to test that a brand new device connected to the network will still route the same way (onto the failover interface), even if the primary WAN was back online BEFORE the new device was connected.

                          I have also been testing this in a virtual environment and can replicate the issue, although it is not always the same. Sometimes new states will follow the correct route (back over the primary WAN) and other times they will get stuck on the backup WAN. It is not consistent, which doesn't make sense.

                            MrD

                            Let's be clear: to me it is a bug. But if they say no, I have no choice.

                            Currently, I reset all states and sometimes I change the firewall rule (time consuming!!!). If anyone has a better suggestion, I'm interested.

                              Derelict LAYER 8 Netgate

                              @jmonline:

                              Killing the states does not always work.

                              Please demonstrate with evidence.

                                Derelict LAYER 8 Netgate

                                @MrD:

                                @Derelict:

                                I did not see one for this on redmine.pfsense.org.

                                https://redmine.pfsense.org/issues/855

                                So the final answer is: FAILOVER DOES NOT GO BACK TO THE INITIAL STATE.
                                This is surprising, but knowing this, I can stop losing time trying different config options…

                                There. Feature #855. My redmine searching could obviously use a tuneup.

                                  jmonline

                                  @Derelict:

                                  Please demonstrate with evidence.

                                  OK, so in very basic terms, since I already have quite a lot of information in this post: https://forum.pfsense.org/index.php?topic=86851.msg632594#msg632594

                                  • The connection has failed over to the backup WAN when the primary WAN has gone down. (Failover has worked as expected)

                                  • The primary WAN has come back up (Status > Gateways confirms this is up/online).

                                  • The states (VoIP sessions for phones) are still showing in the state table 12hrs later going over the backup WAN.

                                  • No new or refreshed sessions from the phones go over the primary connection.

                                  • Current state table (filtered by the phone with the IP of 10.10.30.55) looks like this:

                                    WAN_EFM udp 135.196.xxx.xxx:41809 (10.10.30.55:49679) -> 185.83.xxx.xxx:5060 MULTIPLE:MULTIPLE 201.589 K / 102.513 K 125.60 MiB / 39.52 MiB
                                    30VOICELAN udp 185.83.xxx.xxx:5060 <- 10.10.30.55:49679 MULTIPLE:MULTIPLE 99.293 K / 99.502 K 61.87 MiB / 38.35 MiB

                                    To clarify:
                                    WAN_EFM - is the backup WAN connection
                                    30VOICELAN - is the LAN network for the phones
                                    135.196.xxx.xxx - is the IP of my backup WAN connection
                                    185.83.xxx.xxx - is the IP of my externally hosted VoIP platform

                                  • I have then "Reset the firewall state table"

                                  • At this point SOMETIMES the states will clear and obey the correct Gateway fail-over rule and be sent back over the primary WAN.
                                    SOMETIMES they will stay where they are (on the backup WAN)

                                  I can understand the argument that it is a feature request to have the states clear on the re-establishment of the primary wan connection.
                                  However, why have I seen the following…

                                  • Primary connection has been down for a length of time and has since come back online.

                                  • A brand new device which has never connected to the network (so therefore has no open states) is connected.

                                  • This new device's states are sent over the backup WAN, even though the primary WAN is available.

                                  • "Reset the firewall state table" and the new device has states over the primary wan (as it should have done when it first connected to the network)

                                  I also ran a test of this in a virtual environment and simulated the primary WAN connection dropping and re-connecting.
                                  I was using a Linux machine as a test client and just running a PING and TRACEROUTE to use as example states on the firewall (eliminating the VoIP aspect).
                                  Sometimes, when you brought the primary WAN connection back online, a new TRACEROUTE to a different IP address would go over the primary WAN, and other times it would remain on the backup WAN.
                                  I have not been able to prove what causes this - it appears random.

                                  In my mind, if the primary wan connection is reconnected and online, then any NEW state that hits the firewall should always follow the gateway group rule and go over the Tier1 connection.

                                  Why does running a pfctl and targeting the relevant hosts/network not force clearing of the states just for the VoIP devices (without clearing the whole state table)?

                                  Another simple way of putting it…

                                  If your primary connection goes down for an hour and then comes back online, at what point should your traffic start to reuse that connection? What if your "backup" connection has a very high data usage charge?

                                  A bit of history for you…

                                  I used to use Draytek equipment for all my client sites, on their old 2830 series of routers, they had the WAN failover options, but the same applied… if the primary went down everything would failover to the backup and then never fail back again when the primary connection returned.

                                  On their newer 2860 series of routers, they added one simple check box labelled "Failback" and it moved your sessions/states back to the correct primary connection when it was available again.

                                  However on the Draytek I never had the issue where a NEW state/session would still go over the backup WAN when the primary was available. If it was a new session it always followed the rules correctly.

                                  I hope that makes sense to some of you :)

                                    Derelict LAYER 8 Netgate

                                    You still didn't show the Tier 1 being back online and new states still being created on Tier 2. I think if you really take a look at this you will find that is not happening.

                                    And nothing can "move a state" back to the original connection. All you can do is kill the old state and let a new one be established on a reconnection.
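
The point about states can be illustrated with a toy model. This is a conceptual Python sketch only, not pfSense or pf code: the classes `GatewayGroup` and `StateTable` are hypothetical, and `WAN_DSL` is an assumed name for the primary gateway (the thread only names `WAN_EFM`). It models a state as remembering the gateway chosen when it was created, so an existing flow stays on the backup WAN until its state is killed.

```python
from dataclasses import dataclass, field


@dataclass
class GatewayGroup:
    """Toy gateway group: maps gateway name -> tier (1 is preferred)."""
    tiers: dict
    online: set = field(default_factory=set)

    def pick(self):
        """Return the online gateway with the lowest tier, or None."""
        candidates = [(tier, gw) for gw, tier in self.tiers.items() if gw in self.online]
        return min(candidates)[1] if candidates else None


@dataclass
class StateTable:
    """Toy state table: each flow is pinned to the gateway picked at creation."""
    group: GatewayGroup
    states: dict = field(default_factory=dict)

    def packet(self, flow):
        # Only a flow without a state consults the gateway group.
        if flow not in self.states:
            self.states[flow] = self.group.pick()
        return self.states[flow]

    def kill(self, flow):
        # Killing the state forces the next packet to pick a gateway again.
        self.states.pop(flow, None)


if __name__ == "__main__":
    group = GatewayGroup({"WAN_DSL": 1, "WAN_EFM": 2}, online={"WAN_EFM"})  # primary down
    table = StateTable(group)
    phone = ("10.10.30.55", "185.83.xxx.xxx", 5060)

    print(table.packet(phone))          # state created while only the backup is online
    group.online.add("WAN_DSL")         # primary comes back
    print(table.packet(phone))          # existing state keeps its original gateway
    table.kill(phone)
    print(table.packet(phone))          # re-created state follows the tier order
```

In this model a flow that keeps sending traffic (such as a phone's registration keepalives) never expires on its own, so it stays on the backup WAN indefinitely. The model also predicts that brand new flows use Tier 1 once it is online, which is the behaviour claimed here; it does not explain the contrary observations reported elsewhere in the thread.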

                                      jmonline

                                      OK, let's simplify it even more…

                                      Take the traceroute facility in Diagnostics > Traceroute

                                      • My primary wan is back online.

                                      • I enter a hostname to trace (Google - 8.8.8.8)

                                      • I pick the Source Address as 30VOICELAN

                                      • I get the following result:
                                          1  135.196.xxx.xxx  7.671 ms  6.869 ms  7.008 ms
                                          2  135.196.xxx.xxx  7.016 ms  7.195 ms  7.164 ms
                                          3  5.57.80.136  7.218 ms  7.199 ms  11.125 ms
                                          4  216.239.54.243  7.922 ms
                                          5  216.239.58.95  9.010 ms
                                          6  8.8.8.8  8.139 ms  8.010 ms  8.626 ms

                                      • The first line on the traceroute with the IP starting 135.196 is my backup internet connection. Not my primary.

                                      How is that possible?

                                      The firewall rule on the 30VOICELAN has the Gateway set as the Gateway Group named "DSLFirst".
                                      The Gateway group "DSLFirst" has the (Primary) DSL WAN connection as Tier1 and the (Backup) EFM WAN connection as Tier2.
                                      Status > Gateways shows both gateways online.

                                        Derelict LAYER 8 Netgate

                                        Show me the states, bro. pfctl -vss

                                          jmonline

                                          @Derelict:

                                          Show me the states, bro. pfctl -vss

                                          OK, so this is a perfect time for a test :) Last night it looks like BT did their usual maintenance on the DSL network around 1am, so the ADSL line was down for 5 minutes. This morning I have the following in the states table for the phone on IP 10.10.30.27.

                                          30VOICELAN tcp 185.83.xxx.xxx:5060 <- 10.10.30.27:55778 ESTABLISHED:ESTABLISHED 8.933 K / 10.417 K 3.13 MiB / 3.43 MiB
                                          WAN_EFM tcp 185.3.xxx.xxx:40781 (10.10.30.27:55778) -> 185.83.xxx.xxx:5060 ESTABLISHED:ESTABLISHED 8.933 K / 10.417 K 3.13 MiB / 3.43 MiB

                                          To clarify:
                                          185.83.xxx.xxx is the external VoIP pbx.
                                          185.3.xxx.xxx is the IP of the WAN_EFM (backup) connection.
                                          30VOICELAN is my internal network with a subnet of 10.10.30.0/24

                                          pfctl -vss shows the following:

                                          igb1_vlan30 tcp 185.83.xxx.xxx:5060 <- 10.10.30.27:55778      ESTABLISHED:ESTABLISHED
                                            [1594456643 + 42272] wscale 8  [1007765254 + 183296] wscale 5
                                            age 05:58:04, expires in 119:59:52, 8954:10441 pkts, 3290011:3604569 bytes, rule 119

                                          igb2 tcp 185.3.xxx.xxx:40781 (10.10.30.27:55778) -> 185.83.xxx.xxx:5060      ESTABLISHED:ESTABLISHED
                                            [1007765254 + 183296] wscale 5  [1594456643 + 42272] wscale 8
                                            age 05:58:04, expires in 119:59:52, 8954:10441 pkts, 3290011:3604569 bytes, rule 96

                                          Out of interest, what would be the correct pfctl command to force killing these states (so all states on the WAN_EFM connection from the subnet 10.10.30.0/24)?

                                          If I can get a command to successfully kill these states when they get stuck here, I am happy to use that as a workaround until someone can work out how to automate it. I don't want to reset the whole state table every time, since that kills sessions which should legitimately stay open.
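
Before killing anything, it can help to list exactly which states are stuck. The following is a minimal Python sketch that reads `pfctl -vss` output and reports NAT states on a given interface whose original (pre-NAT) source address falls inside a subnet. It is an illustration under assumptions: the function name `natted_states` is hypothetical, the regular expression only covers the state-line layout shown in this post, and the script only reads text, it does not kill any state.

```python
import ipaddress
import re

# Header line of a state as shown above, for example:
# igb2 tcp 185.3.xxx.xxx:40781 (10.10.30.27:55778) -> 185.83.xxx.xxx:5060      ESTABLISHED:ESTABLISHED
STATE_RE = re.compile(
    r"^(?P<iface>\S+)\s+(?P<proto>\S+)\s+(?P<left>\S+)"
    r"(?:\s+\((?P<orig>[^)]+)\))?\s+(?P<dir>->|<-)\s+(?P<right>\S+)"
    r"\s+(?P<state>\S+)\s*$"
)


def natted_states(lines, iface, subnet):
    """Return NAT states on `iface` whose pre-NAT address is inside `subnet`."""
    net = ipaddress.ip_network(subnet)
    matches = []
    for line in lines:
        # Detail lines (sequence numbers, age, counters) are indented; skip them.
        if not line.strip() or line[0].isspace():
            continue
        m = STATE_RE.match(line)
        if m is None or m.group("iface") != iface or m.group("orig") is None:
            continue
        host = m.group("orig").rsplit(":", 1)[0]
        try:
            inside = ipaddress.ip_address(host) in net
        except ValueError:
            continue  # redacted or unparsable address
        if inside:
            matches.append(m.groupdict())
    return matches


if __name__ == "__main__":
    output = [
        "igb1_vlan30 tcp 185.83.xxx.xxx:5060 <- 10.10.30.27:55778      ESTABLISHED:ESTABLISHED",
        "  age 05:58:04, expires in 119:59:52, 8954:10441 pkts, 3290011:3604569 bytes, rule 119",
        "igb2 tcp 185.3.xxx.xxx:40781 (10.10.30.27:55778) -> 185.83.xxx.xxx:5060      ESTABLISHED:ESTABLISHED",
        "  age 05:58:04, expires in 119:59:52, 8954:10441 pkts, 3290011:3604569 bytes, rule 96",
    ]
    for st in natted_states(output, "igb2", "10.10.30.0/24"):
        print(st["orig"], "->", st["right"], "via", st["iface"])
```

Here `igb2` is the WAN_EFM interface as shown in the output above. Saving the real `pfctl -vss` output to a file and feeding its lines to this function would give a list of phones still pinned to the backup WAN.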

                                          Thanks

                                            Derelict LAYER 8 Netgate

                                            You probably want to kill all connections to the PBX. That would be:

                                            pfctl -k 0.0.0.0/0 -k 185.83.xxx.xxx

                                            That will kill everything, even phones that are connected out the Tier 1.

                                            You can try just killing one side of the connection that is tied to WAN_EFM with:

                                            pfctl -i igb2 -k 0.0.0.0/0 -k 185.83.xxx.xxx

                                            If, when the phones reconnect, they use the Tier1 connection, great. In my testing they continued to use the other connection so it doesn't look like you can do that.
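
For anyone who wants to script the two commands above, here is a small Python sketch that only builds and prints the command line, so it can be reviewed before being run by hand. The helper name `kill_states_cmd` is hypothetical; it uses only the `-i` and `-k` options exactly as they appear in this post, and the destination is whatever PBX address applies (redacted here).

```python
import shlex


def kill_states_cmd(dst, src="0.0.0.0/0", iface=None):
    """Build the pfctl argument list for killing states from `src` to `dst`.

    If `iface` is given, the kill is limited to that interface, as in the
    second command above. Nothing is executed here.
    """
    argv = ["pfctl"]
    if iface:
        argv += ["-i", iface]
    argv += ["-k", src, "-k", dst]
    return argv


if __name__ == "__main__":
    # Redacted PBX address as written in the thread; substitute the real one.
    print(shlex.join(kill_states_cmd("185.83.xxx.xxx")))
    print(shlex.join(kill_states_cmd("185.83.xxx.xxx", iface="igb2")))
```

Keeping the helper as a dry run is deliberate: as noted above, whether the phones come back on the Tier 1 connection after the kill is something to verify, not assume.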

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.