Netgate Discussion Forum
    Multi-WAN gateway failover not switching back to tier 1 gw after back online

    Routing and Multi WAN
    119 Posts 35 Posters 57.8k Views
    • O
      obstler
      last edited by

      hello,

      I have a problem with pfSense 2.2.2 with Multi-WAN and failover not returning routing to tier 1 gateway after it failed and is back online.

      More details:

      Two WAN connections, WAN1 and WAN2. WAN1 is a reliable but slow DSL connection and also the default gateway. WAN1 has static IP assignment and basically almost never goes down.

      Now I've added a 4G/LTE connection as WAN2. WAN2 is fast, has dynamic IP assignment, disconnects once every 24 hours for 1-2 minutes, to come back up with a new IP. WAN2 also sometimes has longer, unplanned outages. WAN2 connects to LTE in bridge mode, so the dynamic public IP is assigned directly to the WAN2 interface via DHCP.

      I've set up a gateway group with WAN2 as tier 1 and WAN1 as tier 2, and assigned an appropriate rule to use this gateway group. The trigger level for the gateway group is set to "Member down".

      So far so good, this basically works until WAN2 goes down a couple times (usually after 2 or 3 times): traffic fails over to WAN1 automatically, but then after WAN2 returns back online, all traffic is still routed via WAN1 and stays there until I manually change some interface, gateway or other setting on pfSense – this apparently somehow solves the stuck routing via WAN1. This happens after a couple failovers, no matter if WAN2 was down for just one minute (for the new IP every 24h), or one hour or more.

      I can see that both gateways are online again in the status page, and here are the apinger log entries for when WAN2 went down and back online today (GW_OPT1 is WAN2):

      May 16 15:04:49 	apinger: alarm canceled (config reload): GW_OPT1(8.8.8.8) *** down ***
      May 16 15:04:49 	apinger: SIGHUP received, reloading configuration.
      May 16 15:03:23 	apinger: ALARM: GW_OPT1(8.8.8.8) *** down ***
      

      Just after this cycle the connections were stuck on WAN1 again.
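To read entries like these programmatically, here is a minimal Python sketch. It assumes the line layout of the three entries quoted above and is not taken from apinger itself; the function name `parse_apinger_line` and the event labels are made up for illustration. Note that in the quoted log the alarm is cleared with the reason "config reload" rather than by a separate recovery message.

```python
import re

# Assumed line shape, based only on the entries quoted above:
# "<Mon> <day> <hh:mm:ss> apinger: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+apinger:\s+(?P<msg>.*)$"
)
# "<gateway>(<monitor ip>) *** <state> ***" at the end of the message.
TARGET_RE = re.compile(
    r"(?P<gw>\S+)\((?P<monitor>[^)]+)\)\s+\*\*\*\s+(?P<state>\w+)\s+\*\*\*$"
)


def parse_apinger_line(line):
    """Return a dict describing one apinger log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    msg = m.group("msg")
    event = {"ts": m.group("ts"), "kind": "other", "gateway": None,
             "monitor": None, "state": None, "reason": None}
    if msg.startswith("SIGHUP received"):
        event["kind"] = "sighup"
        return event
    target = TARGET_RE.search(msg)
    if target is None:
        return event
    event.update(gateway=target.group("gw"),
                 monitor=target.group("monitor"),
                 state=target.group("state"))
    if msg.startswith("ALARM:"):
        event["kind"] = "alarm"
    elif msg.startswith("alarm canceled"):
        event["kind"] = "alarm_canceled"
        reason = re.search(r"alarm canceled \(([^)]*)\)", msg)
        event["reason"] = reason.group(1) if reason else None
    return event
```

Feeding it the three lines above yields an `alarm_canceled` event with reason `config reload`, a `sighup` event, and an `alarm` event for `GW_OPT1`.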

      Does anyone have some suggestions what I could do to troubleshoot this or how to fix this?

      Are there any other settings that need to be made for failover to work properly?

      Is the SIGHUP log entry normal for apinger? I did not do anything on pfSense at this time.

      thanks.

      • K
        klona
        last edited by

        Hi,
        It's working fine on my pfSense 2.2.2, but I set my main WAN (also WAN2) as the default gateway (System > Routing), and I do get several disconnections every day.

        There are also some parameters tucked away under System > Advanced > Miscellaneous, in the Load Balancing section.

        /Klona

        • O
          obstler
          last edited by

          As the issue happens almost daily now, I was trying different things in the webinterface to figure out workarounds – one method that always seems to help is the filter reload page (status_filter_reload.php) and then clicking the reload filter button.

          Is there some way to do this programmatically in pfsense (cron or such)?

          Of course, a real fix would be the proper way, but a working workaround is better than having to reload manually via the web interface.
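One possible shape for a scheduled workaround, as a hedged Python sketch. Assumptions: Python is available on the firewall, `/etc/rc.filter_configure_sync` is the script that reloads the filter on your install (verify before use), and the caller already knows whether the tier 1 gateway is online (how to determine that is left out here). `should_reload` and `run_once` are hypothetical names.

```python
import subprocess

# ASSUMPTION: path of the filter reload script; verify it exists on your install.
RELOAD_CMD = ["/etc/rc.filter_configure_sync"]


def should_reload(tier1_online, tier1_was_online):
    """True only on the offline -> online transition of the tier 1 gateway."""
    return tier1_online and not tier1_was_online


def run_once(tier1_online, state, reload_cmd=RELOAD_CMD, runner=subprocess.call):
    """One scheduled tick: reload the filter when tier 1 has just come back."""
    fire = should_reload(tier1_online, state.get("tier1_online", True))
    state["tier1_online"] = tier1_online  # remember the status for the next tick
    if fire:
        runner(reload_cmd)
    return fire
```

When run from cron, `state` would have to be persisted between runs (for example in a small file), since each run is a fresh process. Reloading only on the transition avoids reloading the filter on every tick.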

          • A
            arcanos
            last edited by

            Same problem with 2.2.4. In fact, I have two different problems related to this. Similar to your case, I have two WAN connections configured for failover (tier 1 and tier 2).

            • When both are OK, it uses the tier 1 WAN (wan1), which is correct. If I disconnect wan1, it starts using the tier 2 WAN (wan2), which is also correct. After 2-3 minutes I reconnect wan1, but the "Status Gateway" page doesn't recover wan1 and keeps showing it as offline. If I restart the apinger service, wan1 appears as online again.

            • Besides that, as in your case, although wan1 is back online, pfSense keeps using wan2 until I make some change or manually make wan2 fail.

            To sum up, the switchover works fine when a gateway fails, but when the failed gateway recovers, nothing switches back automatically. I've tested with two different machines, and the same thing always happens.

            Any ideas??

            Regards

            • Y
              yanakis
              last edited by

              Hi. I have the same problem as you guys (it never switches back to the main WAN unless I change something in the config), but it seems we also have another thread about this:
              https://forum.pfsense.org/index.php?topic=88723.0

              Should we all post in a single thread?

              Thanks

              • Y
                yanakis
                last edited by

                Update: I read somewhere on the forum that a broken install might be the problem.

                I reinstalled from scratch and it's working, with a simple failover only. If anyone is interested, I will add the full setup later.

                • A
                  arcanos
                  last edited by

                  Hi yanakis

                  And in your case, when it was failing, did pfSense detect that the gateway was online again, or did you have to restart apinger as in my case? On top of that, in my case, even after I restart apinger and the main gateway appears as online again, traffic doesn't go back…

                  After the reinstall, did you restore a backup of the previous configuration or did you reconfigure everything manually?

                  Thanks in advance

                  • Y
                    yanakis
                    last edited by

                    Hi.
                    Yes, the gateway appeared back online (green), no need to restart apinger. I suggest setting the Monitor IP in the gateway settings to the Google DNS servers (8.8.8.8 and 8.8.4.4).

                    After the reinstall I reconfigured everything manually to avoid carrying over any bad setting from the previous config.

                    regards,

                    • Y
                      yanakis
                      last edited by

                      Well, it seems that after more configuration (I just added dynamic DNS and NAT) it stopped working again :(. No idea what's going on, but I feel let down by pfSense.

                      Here are some logs, maybe someone can figure it out:

                      Sep 3 23:44:42 kernel: em0: link state changed to DOWN
                      Sep 3 23:45:01 check_reload_status: updating dyndns WAN_PPPOE
                      Sep 3 23:45:01 check_reload_status: Restarting ipsec tunnels
                      Sep 3 23:45:01 check_reload_status: Restarting OpenVPN tunnels/interfaces
                      Sep 3 23:45:01 check_reload_status: Reloading filter
                      Sep 3 23:45:02 php-fpm[42490]: /rc.dyndns.update: MONITOR: WAN_PPPOE is down, omitting from routing group WANFailover
                      Sep 3 23:45:02 php-fpm[42490]: /rc.dyndns.update: phpDynDNS (xxxxxxxxxxxxxx): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
                      Sep 3 23:45:03 php-fpm[52556]: /rc.filter_configure_sync: MONITOR: WAN_PPPOE is down, omitting from routing group WANFailover
                      Sep 3 23:45:46 check_reload_status: Rewriting resolv.conf
                      Sep 3 23:45:46 check_reload_status: Rewriting resolv.conf
                      ______________________________________________________________________interface up

                      check_reload_status: Linkup starting em0
                      Sep 3 23:47:30 kernel: em0: link state changed to UP
                      Sep 3 23:47:37 php-fpm[18877]: /rc.newwanipv6: rc.newwanipv6: Failed to update WAN[wan] IPv6, restarting…
                      Sep 3 23:47:37 php-fpm[18877]: /rc.newwanip: IP has changed, killing states on former IP 86.xxx.xxx.xxx.
                      Sep 3 23:47:37 php-fpm[18877]: /rc.newwanip: MONITOR: WAN_PPPOE is down, omitting from routing group WANFailover
                      Sep 3 23:47:37 php-fpm[18877]: /rc.newwanip: ROUTING: setting default route to 10.0.0.1
                      Sep 3 23:47:37 php-fpm[18877]: /rc.newwanip: Removing static route for monitor 8.8.4.4 and adding a new route through 192.168.0.1
                      Sep 3 23:47:37 php-fpm[18877]: /rc.newwanip: Removing static route for monitor 8.8.8.8 and adding a new route through 10.0.0.1
                      Sep 3 23:47:40 php-fpm[18877]: /rc.newwanip: phpDynDNS: updating cache file /conf/dyndns_wannoip'yanakis.xxxxxxxxxx'0.cache: 188.xxx.xxx.xxx
                      Sep 3 23:47:40 php-fpm[18877]: /rc.newwanip: phpDynDNS (yanakis.xxxxxxxxxxx): (Success) DNS hostname update successful.
                      Sep 3 23:47:41 php-fpm[18877]: /rc.newwanip: Resyncing OpenVPN instances for interface WAN.
                      Sep 3 23:47:41 php-fpm[18877]: /rc.newwanip: Creating rrd update script
                      Sep 3 23:47:41 php-fpm[73772]: /rc.linkup: Accept router advertisements on interface em0
                      Sep 3 23:47:41 php-fpm[73772]: /rc.linkup: ROUTING: setting default route to 10.0.0.1
                      Sep 3 23:47:41 check_reload_status: Restarting ipsec tunnels
                      Sep 3 23:47:42 rtsold[8762]: <sendpacket>sendmsg on em0: Operation not permitted
                      Sep 3 23:47:43 php-fpm[18877]: /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 86.xxx.xxx.xxx -> 188.26.229.135 - Restarting packages.
                      Sep 3 23:47:43 check_reload_status: Starting packages
                      Sep 3 23:47:44 php-fpm[6347]: /rc.start_packages: Restarting/Starting all packages.
                      Sep 3 23:47:44 check_reload_status: updating dyndns wan
                      Sep 3 23:47:45 php-fpm[6347]: /rc.dyndns.update: phpDynDNS (yanakis.xxxxxxxxxxx): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry.
                      Sep 3 23:47:46 rtsold[8762]: <sendpacket>sendmsg on em0: Operation not permitted
                      Sep 3 23:47:50 rtsold[8762]: <sendpacket>sendmsg on em0: Operation not permitted

                      And that's all in system logs.
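To pull the routing-relevant events out of a log like this one, a small Python sketch may help. The patterns are written against the lines quoted above only and may not cover other pfSense versions; `summarize` is a hypothetical helper. In the quoted log, for instance, `/rc.newwanip` still reports WAN_PPPOE as down at 23:47:37, a few seconds after the kernel logged the link as UP.

```python
import re

# Patterns derived from the log lines quoted in this post (an assumption,
# not an official description of the log format).
LINK_RE = re.compile(r"kernel: (?P<iface>\w+): link state changed to (?P<state>UP|DOWN)")
MONITOR_RE = re.compile(
    r"MONITOR: (?P<gw>\S+) is down, omitting from routing group (?P<group>\S+)"
)
ROUTE_RE = re.compile(r"ROUTING: setting default route to (?P<ip>\S+)")

PATTERNS = (("link", LINK_RE), ("monitor_down", MONITOR_RE), ("default_route", ROUTE_RE))


def summarize(lines):
    """Return (kind, fields) tuples for link, monitor and default-route events."""
    events = []
    for line in lines:
        for kind, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                events.append((kind, match.groupdict()))
                break
    return events
```

Lines that match none of the patterns (dyndns updates, package restarts and so on) are simply skipped, which makes the order of link, monitor and route events easier to follow.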

                      • A
                        arcanos
                        last edited by

                        :( :(

                        That's bad news. Yes, I already use Google DNS servers 8.8.8.8 and 8.8.4.4 as monitor IPs, with the DNS association selected in the General section. I'm pretty sure the configuration is OK (at this point, I've reviewed it a hundred times :) ) and this looks like a pfSense problem.

                        I've opened a bug (https://redmine.pfsense.org/issues/5090#change-20401). It would be good if you added your comments there, so they can investigate the problem.

                        • Y
                          yanakis
                          last edited by

                          I also tend to believe the same, pfsense has an issue. I spent a week trying to figure this out, I will try one more setup from scratch and make snapshots in vmware.


                          • jahonixJ
                            jahonix
                            last edited by

                            @arcanos:

                            …and this looks like a pfsense problem...

                            I cannot second that!
                            I have this working for quite some time now with WAN1 (100Mb cable) and a rather old WAN2 (6Mb DSL).
                            I have failover to W2 if W1 is down and immediately W1 again when available.

                            Show us your System | Routing | Gateway Groups page.

                            • Y
                              yanakis
                              last edited by

                              Hi. Thanks for the reply. Please see the screenshots:

                              Screenshot_2015-09-06-02-24-12.png
                              Screenshot_2015-09-06-02-24-56.png
                              Screenshot_2015-09-06-02-25-44.png
                              Screenshot_2015-09-06-02-26-19.png

                              • jahonixJ
                                jahonix
                                last edited by

                                Do you have one or two gateway groups defined? I mean the one in the screenshot with time stamp 02-25-44.

                                What you call "WANGROUP" is easier to handle when called "PPPoE 2 UPC".
                                Now you need an additional "UPC 2 PPPoE" group with reversed tiers.
                                Add another firewall rule for that one as well and it should work.

                                And start by setting both "Trigger levels" to "Member Down".

                                • A
                                  arcanos
                                  last edited by

                                  Hi again

                                  But is this a temporary workaround you've found to make it work, or is it the normal configuration for failover? I can't understand why we have to create two gateway groups, because we only want one failover direction, not the opposite. I've been reviewing the documentation and online info, and you should only need to create one gateway group.

                                  When you create the second firewall rule with the inverted tier numbers, if that rule comes after the normal rule, the firewall should never reach it: it evaluates the rules sequentially, so when it reaches the first one, it directs the traffic through the main group (the group is online, because wan2 is online). It should never reach the second rule, so if it does in your case, I think something strange is happening.
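The reasoning above can be sketched as first-match rule evaluation. This is a simplified model in Python, not pfSense's actual rule engine; the rule names and the `first_match` helper are hypothetical.

```python
def first_match(rules, packet):
    """Return the name of the first rule whose predicate matches, or None."""
    for rule in rules:
        if rule["matches"](packet):
            return rule["name"]
    return None


# Two rules that match the same LAN traffic but point at different gateway groups.
rules = [
    {"name": "LAN to main group", "matches": lambda p: p["iface"] == "lan"},
    {"name": "LAN to reversed group", "matches": lambda p: p["iface"] == "lan"},
]

print(first_match(rules, {"iface": "lan"}))  # prints "LAN to main group"
```

Under this model the second rule is only ever reached by traffic that the first rule does not match, which is the point being made about the reversed-tier rule.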

                                  I'm probably wrong or missing something, but I just want to clarify whether your configuration works because that's the normal config, or whether some other problem makes it work even though it's not the right configuration.

                                  I can't test it at my client's site right now, because it's a production system, but I'll try to build a demo in our office to see if I can verify your configuration. If it works, it could be a good temporary patch for the problem, but I still think something is not working properly. The group should recover gateways and their order of preference automatically (that's what the tier is for).

                                  Thanks for your help

                                  • Y
                                    yanakis
                                    last edited by

                                    Hi. Indeed, a second rule makes no sense to me either, but I will test it as advised (for the second time, actually). I've even tried with 3 rules, with the same results.

                                    In the meantime I've done more testing and reached a conclusion, but first please let me know: how did you simulate the main WAN failure?

                                    • DerelictD
                                      Derelict LAYER 8 Netgate
                                      last edited by

                                      A second rule and gateway group are not necessary unless you want some traffic to prefer the second route and fail over the other way.

                                      You only need the one Tier 1 to Tier 2 group to fail all traffic over in that direction.

                                      It certainly should recover and "fail back" when the Tier 1 route comes back up.
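As a rough model of the behaviour described here (a sketch in Python, not pfSense code; `select_gateways` is a hypothetical helper): the group uses the lowest-numbered tier that still has an online member, so traffic should return to Tier 1 as soon as that gateway is marked online again.

```python
def select_gateways(group, online):
    """Return the online gateways of the lowest-numbered tier, or [] if none are up.

    group  maps gateway name -> tier number (1 is preferred).
    online is the set of gateway names currently considered up.
    """
    live = [(tier, name) for name, tier in group.items() if name in online]
    if not live:
        return []
    best = min(tier for tier, _ in live)
    return sorted(name for tier, name in live if tier == best)


group = {"WAN1": 1, "WAN2": 2}
print(select_gateways(group, {"WAN1", "WAN2"}))  # prints ['WAN1']
print(select_gateways(group, {"WAN2"}))          # prints ['WAN2'] (failover)
print(select_gateways(group, {"WAN1", "WAN2"}))  # prints ['WAN1'] (fail back)
```

The selection depends only on the current online set, which is why a gateway that is shown as online but not used again points at stale state somewhere rather than at the group definition.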

                                      Chattanooga, Tennessee, USA
                                      A comprehensive network diagram is worth 10,000 words and 15 conference calls.
                                      DO NOT set a source address/port in a port forward or firewall rule unless you KNOW you need it!
                                      Do Not Chat For Help! NO_WAN_EGRESS(TM)

                                      • Y
                                        yanakis
                                        last edited by

                                        Well, in my case it doesn't. WAN1 is a PPPoE connection, and after I re-plug the Ethernet cable into WAN1, all connections still go through OPT1.

                                        For testing purposes, I added another router in front of pfSense so it won't have to use a PPPoE connection, and I assigned WAN1 a static IP like OPT1 has. In this case it works to some extent if I unplug/re-plug the connection on the first router (taking down the ISP interface, the fiber media converter), so both WAN and OPT1 stay up in pfSense. Still, some sites refuse to load in Chrome, with the following error: DNS_PROBE_FINISHED_NXDOMAIN

                                        So, for now my only conclusion is that there is a problem with pfSense when you unplug and re-plug the cable on an interface using a PPPoE connection. The DNS error is still a mystery to me; I still need to figure it out.

                                        • DerelictD
                                          Derelict LAYER 8 Netgate
                                          last edited by

                                          Well you need to fix your DNS.  Sounds like it might not be working right on one or both WANs.  Are you using the forwarder or the resolver?

                                          It shouldn't matter which WAN the resolver uses because it should only be trying to talk to authoritative name servers that should accept queries from everywhere.

                                          The problem lies in forwarders because you usually point the forwarder at ISP caching servers and they might only accept connections from their network so it matters which DNS servers are used out which interface.
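The forwarder problem can be illustrated with a small Python model. The addresses marked as ISP servers are documentation placeholders and the `usable_forwarders` helper is hypothetical; the only point is that a server restricted to one ISP's network is useful only while queries leave through that ISP's WAN.

```python
def usable_forwarders(forwarders, active_wan):
    """Return the DNS servers expected to answer when queries leave via active_wan.

    forwarders maps a DNS server address to the WAN whose ISP operates it,
    or to None when the server accepts queries from anywhere.
    """
    return [ip for ip, wan in forwarders.items() if wan is None or wan == active_wan]


# Placeholder data: two ISP-only caching servers and one open resolver.
forwarders = {
    "192.0.2.53": "WAN1",     # hypothetical ISP 1 cache, ISP 1 customers only
    "198.51.100.53": "WAN2",  # hypothetical ISP 2 cache, ISP 2 customers only
    "8.8.8.8": None,          # public resolver, reachable from either WAN
}

print(usable_forwarders(forwarders, "WAN2"))  # prints ['198.51.100.53', '8.8.8.8']
```

After a failover, any query that is still sent to the other ISP's cache would go unanswered under this model, which would look like some names intermittently failing to resolve.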


                                          • Y
                                            yanakis
                                            last edited by

                                            I tried both the resolver and the forwarder; some sites are just not resolved. Unfortunately I don't think I can use pfSense in a production environment; for me at least, failover is not working with PPPoE :(.

                                            Should "State Killing on Gateway Failure" be on?

                                            Thanks
