Multi-WAN gateway failover not switching back to tier 1 gw after back online
-
After much reading on this subject, this is the first time I have read that this behaviour is normal and that changing it is a feature request. I had read that it was the result of misconfiguration, meaning that the connection should go back to what it was before failover…
For example:
https://redmine.pfsense.org/issues/5090
Chris Buechler
…
I went through and re-tested multi-WAN in general on 2.2.5 (which is the same as 2.2.4 in that regard) and it fails over and back as it should just fine every time.
...
There may be some edge case but nothing here to suggest what that might be.
BUT a few lines later, it goes another way:
Chris Buechler
…
that's how it's supposed to work at this point. Sounds like you want state killing on failback, which doesn't exist at this time. feature #855 covers that
https://redmine.pfsense.org/issues/855
So the final answer is FAILOVER DOES NOT GO BACK TO THE INITIAL STATE
This is surprising, but knowing this, I can stop losing time trying different config options…
-
It is not a bug.
A setting that kills all states on a Tier X interface when a Tier < X interface returns to service would be a feature request.
I did not see one for this on redmine.pfsense.org.
-
It is not a bug.
A setting that kills all states on a Tier X interface when a Tier < X interface returns to service would be a feature request.
I did not see one for this on redmine.pfsense.org.
Right, but if it's not a bug, then how do you get traffic to go back over the original interface when it returns online?
Killing the states does not always work.
I have also been able to confirm that a brand new device connected to the network will still be routed the same way (onto the failover interface), even if the primary WAN was back online BEFORE the new device was connected.
I have also been testing this in a virtual environment and can replicate the issue, although not always in the same way. Sometimes new states will follow the correct route (back over the primary WAN) and other times they will get stuck on the backup WAN. It is not consistent, which doesn't make sense.
-
Let's be clear, to me it is a bug. But if they say no, I have no choice.
For now, I reset all states and sometimes I change the firewall rule (time consuming!!!). If anyone has a better suggestion, I'm interested.
-
Killing the states does not always work.
Please demonstrate with evidence.
-
@MrD:
I did not see one for this on redmine.pfsense.org.
https://redmine.pfsense.org/issues/855
So the final answer is FAILOVER DOES NOT GO BACK TO THE INITIAL STATE
This is surprising, but knowing this, I can stop losing time trying different config options…
There. Feature #855. My redmine searching could obviously use a tuneup.
-
Please demonstrate with evidence.
Ok so in very basic terms since I already have quite a lot of information on this post here https://forum.pfsense.org/index.php?topic=86851.msg632594#msg632594
-
The connection has failed over to the backup WAN when the primary WAN has gone down. (Failover has worked as expected)
-
The primary WAN has come back up (Status > Gateways confirms this is up/online).
-
The states (VoIP sessions for phones) are still showing in the state table 12hrs later going over the backup WAN.
-
No new or refreshed sessions from the phones go over the primary connection.
-
Current state table (filtered by the phone with the IP of 10.10.30.55) looks like this:
WAN_EFM udp 135.196.xxx.xxx:41809 (10.10.30.55:49679) -> 185.83.xxx.xxx:5060 MULTIPLE:MULTIPLE 201.589 K / 102.513 K 125.60 MiB / 39.52 MiB
30VOICELAN udp 185.83.xxx.xxx:5060 <- 10.10.30.55:49679 MULTIPLE:MULTIPLE 99.293 K / 99.502 K 61.87 MiB / 38.35 MiB
To clarify:
WAN_EFM - is the backup WAN connection
30VOICELAN - is the LAN network for the phones
135.196.xxx.xxx - is the IP of my backup WAN connection
185.83.xxx.xxx - is the IP of my externally hosted VoIP platform
-
I have then "Reset the firewall state table"
-
At this point SOMETIMES the states will clear and obey the correct Gateway fail-over rule and be sent back over the primary WAN.
SOMETIMES they will stay where they are (on the backup WAN)
I can understand the argument that it is a feature request to have the states clear on the re-establishment of the primary wan connection.
However, why have I seen the following…
-
Primary connection has been down for a length of time and has since come back online.
-
A brand new device which has never connected to the network (so therefore has no open states) is connected.
-
This new device states are sent over the backup WAN - even though the primary wan is available
-
"Reset the firewall state table" and the new device has states over the primary wan (as it should have done when it first connected to the network)
I also ran a test of this in a virtual environment and simulated the primary WAN connection dropping and re-connecting.
I was using a Linux machine as a test client and just running a PING and TRACEROUTE to use as example states on the firewall (eliminating the VoIP aspect).
Sometimes, when you brought the Primary WAN connection back online, a new TRACEROUTE to a different IP address would go over the primary WAN, and other times it would remain on the backup WAN.
I have not been able to prove what causes this - it appears random.
In my mind, if the primary WAN connection is reconnected and online, then any NEW state that hits the firewall should always follow the gateway group rule and go over the Tier1 connection.
Why does running pfctl against the relevant hosts/network not force clearing of the states just for the VoIP devices (without clearing the whole state table)?
Another simple way of putting it…
If your primary connection goes down for an hour and then comes back online, at what point should your traffic start to reuse that connection? What if your "backup" connection has a very high data usage charge?
Bit of history for you…
I used to use Draytek equipment for all my client sites, on their old 2830 series of routers, they had the WAN failover options, but the same applied… if the primary went down everything would failover to the backup and then never fail back again when the primary connection returned.
On their newer 2860 series of routers, they added one simple check box labelled "Failback" and it moved your sessions/states back to the correct primary connection when it was available again.
However on the Draytek I never had the issue where a NEW state/session would still go over the backup WAN when the primary was available. If it was a new session it always followed the rules correctly.
I hope that makes sense to some of you :)
-
-
You still didn't show the Tier 1 being back online and new states still being created on Tier 2. I think if you really take a look at this you will find that is not happening.
And nothing can "move a state" back to the original connection. All you can do is kill the old state and let a new one be established on a reconnection.
-
Ok let's simplify it even more…
Take the traceroute facility in Diagnostics > Traceroute
-
My primary wan is back online.
-
I enter a hostname to trace (Google - 8.8.8.8)
-
I pick the Source Address as 30VOICELAN
-
I get the following result:
1 135.196.xxx.xxx 7.671 ms 6.869 ms 7.008 ms
2 135.196.xxx.xxx 7.016 ms 7.195 ms 7.164 ms
3 5.57.80.136 7.218 ms 7.199 ms 11.125 ms
4 216.239.54.243 7.922 ms
5 216.239.58.95 9.010 ms
6 8.8.8.8 8.139 ms 8.010 ms 8.626 ms
-
The first line on the traceroute with the IP starting 135.196 is my backup internet connection. Not my primary.
How is that possible?
The firewall rule on the 30VOICELAN has the Gateway set as the Gateway Group named "DSLFirst".
The Gateway group "DSLFirst" has the (Primary) DSL WAN connection as Tier1 and the (Backup) EFM WAN connection as Tier2.
Status > Gateways shows both gateways online.
-
-
Show me the states, bro. pfctl -vss
-
Show me the states, bro. pfctl -vss
Ok so perfect time for a test :) Last night it looks like BT did their usual maintenance on the DSL network around 1am, so the ADSL line was down for 5 mins. This morning I have the following in the states table for the phone on IP 10.10.30.27.
30VOICELAN tcp 185.83.xxx.xxx:5060 <- 10.10.30.27:55778 ESTABLISHED:ESTABLISHED 8.933 K / 10.417 K 3.13 MiB / 3.43 MiB
WAN_EFM tcp 185.3.xxx.xxx:40781 (10.10.30.27:55778) -> 185.83.xxx.xxx:5060 ESTABLISHED:ESTABLISHED 8.933 K / 10.417 K 3.13 MiB / 3.43 MiB
To clarify:
185.83.xxx.xxx is the external VoIP pbx.
185.3.xxx.xxx is the IP of the WAN_EFM (backup) connection.
30VOICELAN is my internal network with a subnet of 10.10.30.0/24
pfctl -vss shows the following:
igb1_vlan30 tcp 185.83.xxx.xxx:5060 <- 10.10.30.27:55778 ESTABLISHED:ESTABLISHED
[1594456643 + 42272] wscale 8 [1007765254 + 183296] wscale 5
age 05:58:04, expires in 119:59:52, 8954:10441 pkts, 3290011:3604569 bytes, rule 119
igb2 tcp 185.3.xxx.xxx:40781 (10.10.30.27:55778) -> 185.83.xxx.xxx:5060 ESTABLISHED:ESTABLISHED
[1007765254 + 183296] wscale 5 [1594456643 + 42272] wscale 8
age 05:58:04, expires in 119:59:52, 8954:10441 pkts, 3290011:3604569 bytes, rule 96
Out of interest, what would be the correct pfctl command to run to force killing these states (so all states on the WAN_EFM connection from the subnet 10.10.30.0/24)?
If I can get a command to successfully kill these states when they get stuck here, I am happy with that as a workaround until someone can work out how to automate it. I don't want to be resetting the whole state table every time, since that kills sessions which should legitimately stay open.
Thanks
-
You probably want to kill all connections to the PBX. That would be:
pfctl -k 0.0.0.0/0 -k 185.83.xxx.xxx
That will kill everything even phones that are connected out the Tier 1.
You can try just killing one side of the connection that is tied to WAN_EFM with:
pfctl -i igb2 -k 0.0.0.0/0 -k 185.83.xxx.xxx
If, when the phones reconnect, they use the Tier1 connection, great. In my testing they continued to use the other connection so it doesn't look like you can do that.
-
Ok so the following command cleared the sessions stuck on the failover WAN.
pfctl -i igb2 -k 0.0.0.0/0 -k 185.83.xxx.xxx
This is a good step forward since I can now manually force the sessions back when I know they haven't moved on their own.
I presume I may be able to schedule this via some sort of script to run a specified period of time after the primary connection comes back online…?
Thanks for help so far Derelict :)
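A minimal sketch of what such a scheduled job could look like, assuming the same targeted kill that worked above. The variable names, the DRY_RUN switch and the default PBX address (a documentation-range placeholder) are made up for illustration and are not part of pfSense; the script only prints the command unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch only: build (and optionally run) the targeted state kill used above.
# BACKUP_IF, PBX_IP and DRY_RUN are hypothetical names chosen for this example.
BACKUP_IF="${BACKUP_IF:-igb2}"      # physical interface of the backup WAN
PBX_IP="${PBX_IP:-203.0.113.10}"    # placeholder, replace with the PBX address
DRY_RUN="${DRY_RUN:-1}"             # 1 = only print the command

# Print the pfctl command that kills states on one interface
# from any source (0.0.0.0/0) to a single destination host.
build_kill_cmd() {
    printf 'pfctl -i %s -k 0.0.0.0/0 -k %s\n' "$1" "$2"
}

cmd="$(build_kill_cmd "$BACKUP_IF" "$PBX_IP")"
if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $cmd"
else
    # Word splitting is intended here: run the command built above.
    $cmd
fi
```

Run it by hand with the defaults first to check the printed command, and only then call it from cron with DRY_RUN=0. It does not check whether the primary WAN is actually back online, so that test would still have to be added in front of the kill.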
- 5 months later
-
Hello,
I know this is an old thread but I have the same problem and now I am able to reliably reproduce the behavior in a test environment.
If the "primary" WAN is a PPPoE connection and the secondary WAN is a "standard" connection with a static or DHCP-assigned IP address, then when the primary goes down, failover to the secondary works as expected, but when the primary comes back up no traffic will flow through it.
In such cases on my production systems I usually edit my default gateway entry in System->Routing.
I uncheck the "Default Gateway" mark, re-check it and then save and apply.
Traffic starts flowing again through the PPPoE connection.
The same always works in my virtual machine test environment too.
I hope this can help in tracking down the source of the problem or at least in finding some solution.
Thanks
-
Hi all,
same problem: WAN1 tier 1 (cable - default GW - 2Mb/2Mb), WAN2 tier 1 (WiMAX PPPoE 12Mb/3Mb), weight 1 WAN1 : 4 WAN2
If WAN2 goes down, all traffic switches to WAN1.
When WAN2 returns online (Gateway Groups all online), all connections still stay on WAN1.
If I reload the filter everything turns out all right, with WAN1 1 : WAN2 4 as weights.
I don't use the DNS Forwarder, and for monitoring I use the ISPs' DNS servers (2 per connection).
Please help!!!!
Bye
Sandro
-
Hi
in "Miscellaneous" under "Gateway Monitoring" there are:
Gateway Monitoring
State Killing on Gateway Failure
Flush all states when a gateway goes down
The monitoring process will flush all states when a gateway goes down if this box is checked.
Skip rules when gateway is down
Do not create rules when gateway is down
By default, when a rule has a gateway specified and this gateway is down, the rule is created omitting the gateway. This option overrides that behavior by omitting the entire rule instead.
Could someone explain it?
Thanks
Bye
Sandro
- 19 days later
-
I think I have found a solution.
I have tested it on the 2.3.2 release; it consists of 2 steps:
- Take note of the name you assigned to your PPPoE connection (WAN2 in this example)
- Add the following lines at the end of the "/usr/local/sbin/ppp-linkup" script (between the "fi" and "exit 0" lines)
------------------------
fi
sleep 5
/etc/rc.newwanip wan2
exit 0
In all my tests traffic switches back correctly.
Note: without the "sleep" instruction I was getting mixed results; maybe it is only a timing problem with PPPoE activation?
Bye
- 21 days later
-
+1 that failback would be very valuable. I have a deployment where the Tier 2 connection is pay per GB so it would be nice to be able to automate failover AND failback but I have to keep that WAN disconnected to make sure no connections get stuck on it. It's not a PPPoE link so sadly I can't use an up/down script for this :(
We need a setting for "Flush all states when a lower tier gateway comes back up. The monitoring process will flush all states when a lower tier gateway comes up if this box is checked"
-
I'm working on a script to kill VoIP states when WAN1 (primary) comes back online. As mentioned elsewhere in this thread, this is a critical feature in real-world scenarios due to (a) costly metered backup connections as well as (b) SIP interop issues when devices behind the same LAN are seen registering from different public IPs. So I won't rehash all of that. I am trying to automate pfctl from the rc.gateway_alarm script that gets called on WANUP. I also see that a PR has been recently merged that might help make this even easier and less hacky. Has anyone hooked into these new functions yet to make this more reliable?
TL;DR: pfctl is not killing all of the related states. Can someone help me to understand something regarding states?
• Assume vlan100 is dedicated for voice, with subnet 192.168.20.0/24
• WAN1=primary, WAN2=backup
• When a "fail back" WAN2 -> WAN1 event happens, I need to kill all states: (any)->WAN2->vlan100 and vlan100->WAN2->(any)
• I try using a command like:
pfctl -i igb0_vlan100 -k 0.0.0.0/0
But this only seems to kill the states originating from inside the LAN. There are still tracked states via WAN2 that are NAT'ted to internal igb0_vlan100 IPs. Do I also need to run commands like these instead?
pfctl -k 192.168.20.0/24 -k 0.0.0.0/0
pfctl -k 0.0.0.0/0 -k 192.168.20.0/24
Or some other command? Is there a better way?
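Before choosing between kill commands, it may help to see exactly which NAT'ed states are left on the backup WAN. The following is a rough sketch that filters state listings in the format shown earlier in this thread (interface name first, inside address in parentheses). The function name, the interface name and the prefix are placeholders, and the prefix is matched as plain text rather than as a real subnet.

```shell
#!/bin/sh
# Sketch only: list NAT'ed states on one interface whose inside address
# starts with a given text prefix. Reads `pfctl -ss`-style lines on stdin.
# $1 = interface name, $2 = literal address prefix such as "192.168.20."
nat_states_on_if() {
    awk -v ifname="$1" -v prefix="$2" '
        # Field 1 is the interface; NAT entries carry the inside
        # address in parentheses, e.g. "(192.168.20.5:5060)".
        $1 == ifname && index($0, "(" prefix) > 0 { print }
    '
}
```

On the firewall this could be used as `pfctl -ss | nat_states_on_if igb1 192.168.20. | wc -l`, where igb1 stands in for the physical WAN2 interface. Running it before and after each kill command shows which variant actually removes the WAN2 side of the connections.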
- 3 months later
-
Any news ? :(
-
Failback to default WAN works for me.
I have a Gigabit Fiber connection and a Cable modem connection. I put one of them as Tier1 and the other as Tier2.
I used 8.8.8.8 for one and 8.8.4.4 for the other.
But just following the instructions in the pfSense documentation and the postings here in the forum that suggest creating gateway groups with different tiers will not work unless you have the 'Default gateway switching' box checked. You can find it under System > Advanced > Miscellaneous.
http://prntscr.com/evn3ub
I tested by disconnecting WAN1 and going to whatismyip.com, then plugging WAN1 back in and going to a different "what is my IP" site. Don't go to the first one again, as it will be cached and will not show your original/default WAN IP.
Or you can just do a ping.
Let me know if this helps. I can also post my configurations if you need to see.
KK
-
2.3.3-RELEASE-p1 (amd64), MultiWAN, VM on Hyper-V
WAN1 ( tier2, monitor ip 8.8.4.4 )
WAN2 ( tier1, monitor ip 8.8.8.8 ).
Today WAN2 raised a latency alarm, but no "Clear latency" event occurred even though the line became stable again (according to the dashboard).
Usual (System logs->Gateways):
Apr 12 03:29:32 dpinger WAN2_DHCP 8.8.8.8: Clear latency 39052us stddev 2978us loss 5%
Apr 12 03:28:34 dpinger WAN2_DHCP 8.8.8.8: Alarm latency 34409us stddev 429us loss 22%
Today (no clear latency event):
Apr 13 13:19:23 dpinger WAN2_DHCP 8.8.8.8: Alarm latency 34494us stddev 342us loss 21%
All clients from the LAN were using WAN1 until I manually simulated a WAN2 disconnect (set 1.1.1.1 as monitor IP for a minute, then reverted back to 8.8.8.8).
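A missing "Clear" like this can be spotted mechanically. Below is a small sketch, assuming log lines laid out exactly as in the excerpt above (date, time, "dpinger", gateway name, monitor IP, event) and fed in oldest first; raw syslog lines on the firewall may be laid out differently. The function name is made up for this example.

```shell
#!/bin/sh
# Sketch only: report gateways whose most recent dpinger event is "Alarm".
# Reads log lines on stdin, oldest first, in the format quoted above:
#   "Apr 12 03:28:34 dpinger WAN2_DHCP 8.8.8.8: Alarm latency ..."
stuck_alarms() {
    awk '
        # Field 4 is the process, field 5 the gateway, field 7 the event.
        $4 == "dpinger" && ($7 == "Alarm" || $7 == "Clear") { last[$5] = $7 }
        END { for (gw in last) if (last[gw] == "Alarm") print gw }
    '
}
```

The excerpt above is listed newest first, so it would need reversing before being piped in (on FreeBSD, `tail -r` reverses the line order). Any gateway name printed has an alarm that was never followed by a clear.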
- 12 days later
-
same problem here
failover is working tier1 to tier2, but when tier1 recovers, monitor says "online" but the traffic doesn't switch back to tier1 , remains on tier2
PFsense ver. 2.3.3-RELEASE-p1
- 9 days later
-
same problem here
failover is working tier1 to tier2, but when tier1 recovers, monitor says "online" but the traffic doesn't switch back to tier1 , remains on tier2
PFsense ver. 2.3.3-RELEASE-p1
I am having the exact same problem here.
2.3.3-RELEASE-p1 (amd64)
built on Thu Mar 09 07:17:41 CST 2017
FreeBSD 10.3-RELEASE-p17
- 10 days later
-
The fail back seems to work provided the PC's connection is left idle for 20 seconds or so, but if there's an active connection after your primary connection goes down (VoIP, video/audio streaming or even a continuous ping), it seems to remain on the redundant connection.
The following script seems to work for my situation (4G modem failover with limited quota). It's nowhere near perfect, but it'll shut the 4G interface down long enough for the states to be killed when the primary WAN is up. It would be better if it exited when there were no active states on 4G, but meh.
(Using cron to run every 5 minutes or so, */5 * * * * root /bin/sh /root/routercheck.sh)
#!/bin/sh
# Monitor targets for the primary (WAN1) and backup (WAN2) links
check_wan1=8.8.8.8
check_wan2=8.8.4.4
# Current IPv4 addresses of the primary (rl0) and backup (rl1) interfaces
wan_ipaddress=$(ifconfig rl0 | grep 'inet ' | awk '{ print $2}' | cut -d'/' -f1)
backupwan_ipaddress=$(ifconfig rl1 | grep 'inet ' | awk '{ print $2}' | cut -d'/' -f1)
# If the backup link cannot reach its target, there is nothing to do
ping -c 2 -S ${backupwan_ipaddress} ${check_wan2} > /dev/null 2>&1
wan2_resp=$?
backupwan_resp=$(expr ${wan2_resp})
if [ ${backupwan_resp} -gt 0 ]; then
exit 1
fi
# If the primary link answers, bounce the backup interface
ping -c 2 -S ${wan_ipaddress} ${check_wan1} > /dev/null 2>&1
wan1_resp=$?
wan_resp=$(expr ${wan1_resp})
if [ ${wan_resp} -eq 0 ]; then
#service netif restart rl1
ifconfig rl1 down;sleep 15;ifconfig rl1 up
fi
#end
-
Thank you for this…
I am not a script writer, but it would appear I need to change rl0 and rl1 to my specific interfaces. Any other changes necessary?
Also, I have searched for a couple of hours and still cannot find what directory to install the script to, and what command to run at CLI to test. I see that the "Filer" pkg was the preferred way, but is no longer available on my version, 2.3.4.
- 8 days later
-
Yeah, it needs to be changed to the physical interface names, not the names assigned in pfSense. The script location can be anywhere; I just saved mine under /root/failback.sh. You'll need to make it executable after saving; chmod 775 scriptname.sh should do it. As long as the path in your cron entry points to the script, it can go anywhere.
Thinking about it, it may be better to just leave the 4G interface down until the WAN stops responding; it may have a better outcome. But this still seems to do the job.
-
I've changed it so the 4G is down until the primary WAN stops working. This time cron is set to every minute; the most time you should lose connection is maybe 70 or 80 seconds or so, as it takes some time for the gateway to register as online again.
#!/bin/sh
check_wan1=8.8.8.8
#check_wan2=8.8.4.4
# Current IPv4 address of the primary (rl0) interface
wan_ipaddress=$(ifconfig rl0 | grep 'inet ' | awk '{ print $2}' | cut -d'/' -f1)
#backupwan_ipaddress=$(ifconfig rl1 | grep 'inet ' | awk '{ print $2}' | cut -d'/' -f1)
#ping -c 2 -S ${backupwan_ipaddress} ${check_wan2} > /dev/null 2>&1
#wan2_resp=$?
#backupwan_resp=$(expr ${wan2_resp})
#if [ ${backupwan_resp} -eq 1 ]; then
# exit 1
#fi
ping -c 2 -S ${wan_ipaddress} ${check_wan1} > /dev/null 2>&1
wan1_resp=$?
wan_resp=$(expr ${wan1_resp})
# Primary answers: keep the backup (rl1) interface down
if [ ${wan_resp} -eq 0 ]; then
ifconfig rl1 down
fi
# Primary does not answer: bring the backup interface up
if [ ${wan_resp} -gt 0 ]; then
#service netif restart rl1
ifconfig rl1 up
fi
#end
- 12 days later
-
Hello.
I have a similar problem with failover. I use one openvpn client as tier1 and the second openvpn client as tier2.
After tier1 is online, pfSense does not switch back from tier2 to tier1.
Is the solution from kimkhan suitable for me? Any other solutions?
-
Dear pfSense Staff, this is a very important issue. Can we find a solution? :-)
-
I have perhaps found my problem in pfSense 2.3.4.
My setup in System > Routing > Gateway Groups:
OpenVPN Client1 = tier1
OpenVPN Client2 = tier2
I have not set 'Default gateway switching' or anything else.
In System > Routing > Gateways I have set:
Monitor IP of OpenVPN Client1 = 8.8.8.8
Monitor IP of OpenVPN Client2 = 8.8.4.4
If I disable OpenVPN Client1, then pfSense switches to OpenVPN Client2 correctly.
But only after I have clicked 'Apply Settings' in "System > Routing > Gateways".
If I activate OpenVPN Client1, then pfSense switches back to OpenVPN Client1.
But only after I have clicked 'Apply Settings' in "System > Routing > Gateways" again.
Does anybody have any idea which settings I should make?
- 9 months later
-
I have the same issue with my failover WAN. If WAN1 goes down (offline), switching to WAN2 works correctly, but it doesn't switch back to WAN1 when it is available again. I have to deactivate and activate WAN1 manually.
Does anyone have a solution?
I've set up multi-WAN as in the screenshots below:
13:37 WAN1 goes offline
14:00 WAN1 is available again, but traffic doesn't switch back from WAN2
System Logs:
Feb 27 13:37:48 rc.gateway_alarm 91471 >>> Gateway alarm: WAN1GW (Addr:8.8.4.4 Alarm:1 RTT:18713ms RTTsd:4408ms Loss:22%)
Feb 27 13:37:48 check_reload_status updating dyndns WAN1GW
Feb 27 13:37:48 check_reload_status Restarting ipsec tunnels
Feb 27 13:37:48 check_reload_status Restarting OpenVPN tunnels/interfaces
Feb 27 13:37:48 check_reload_status Reloading filter
Feb 27 13:37:50 php-fpm 5341 /rc.dyndns.update: Default gateway down setting WAN2_PPPOE as default!
Feb 27 13:37:50 php-fpm 5341 /rc.dyndns.update: MONITOR: WAN1GW is down, omitting from routing group DualWAN 8.8.4.4|192.168.100.2|WAN1GW|18.747ms|4.479ms|25%|down
Feb 27 13:37:50 php-fpm 5341 /rc.dyndns.update: Default gateway down setting WAN2_PPPOE as default!
Feb 27 13:37:50 php-fpm 65471 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN1GW.
Feb 27 13:37:50 php-fpm 65471 /rc.openvpn: Default gateway down setting WAN2_PPPOE as default!
Feb 27 13:37:50 php-fpm 65471 /rc.filter_configure_sync: Default gateway down setting WAN2_PPPOE as default!
Feb 27 13:40:20 php-fpm 11774 /services_dyndns.php: Default gateway down setting WAN2_PPPOE as default!
Feb 27 13:40:24 check_reload_status Syncing firewall
Feb 27 13:40:24 php-fpm 11697 /services_dyndns.php: Default gateway down setting WAN2_PPPOE as default!
Feb 27 14:11:47 check_reload_status Syncing firewall
Feb 27 14:11:47 check_reload_status Reloading filter
Feb 27 14:11:48 php-fpm 23963 /rc.filter_configure_sync: Default gateway down setting WAN2_PPPOE as default!
Feb 27 14:20:59 check_reload_status Syncing firewall
Feb 27 14:20:59 check_reload_status Reloading filter
Feb 27 14:21:00 php-fpm 34668 /rc.filter_configure_sync: Default gateway down setting WAN2_PPPOE as default!
Feb 27 14:24:10 check_reload_status Syncing firewall
Feb 27 14:24:10 check_reload_status Reloading filter
Feb 27 14:24:11 php-fpm 5833 /rc.filter_configure_sync: Default gateway down setting WAN2_PPPOE as default!
Thanks
-
I am having a similar issue. However, I am unable to get even simple failover from WAN1 to WAN2 to work. The logs show that when WAN1 goes down, the default gateway appears to switch. However, traffic does not flow and the pfSense UI hangs (actually becomes very slow). Bringing WAN1 back online then does not resume traffic flow. The quickest way to get things going again is to restart the box. I have configured dual WAN in the simplest way, similar to yours. I have also tried the suggested configurations that do not use automatic gateway switching, building gateway groups in accordance with the configuration suggestions. I have tried using different hardware and rebuilding pfSense from scratch. The frustrating part for me is that I can take a commercial firewall that supports multi-WAN, configure things in a similar fashion, and it works perfectly every time. I apologize for not having a proven solution. You getting it to work this far is great. The only difference I can see is that many of the multi-WAN configuration recommendations have an additional gateway group that handles flipping connections back the other way.
This is one of many configuration examples out there: https://www.cyberciti.biz/faq/howto-configure-dual-wan-load-balance-failover-pfsense-router/. I found this one to be helpful, as most are. I think in my case it is just my limited experience, or perhaps I have a glaringly simple issue preventing mine from working that I am just missing.
- 4 months later
-
@markn455 Did you ever find a working configuration for failing back to a primary connection once it comes back up?
- 28 days later
-
Goal: Have auto fail-over to 2nd ISP when 1st ISP is down. When 1st ISP comes back, re-enable as primary in routes.
My ISP setups all use static IP config; I have not tested with a dynamic interface.
Here is what works for me, at various sites with multiple ISP fail-overs.
Step 1:
Navigate to: System -> Advanced -> Miscellaneous
Make sure "Default gateway switching" is UNCHECKED.
Step 2:
Configure your gateway group accordingly. Tier1 is highest priority.
I use Member Down as trigger.
Step 3:
Choose the gateway group in your firewall rules setup.
You find this under Advanced Options for each rule you want to make use of the gateway group.
Simple test to see if it is working: unplug the Tier1 cable from the firewall. It should fail over to Tier2.
Plug the Tier1 cable back in; it should become the default route almost instantly.
-
Just for clarification:
-
Default gateway switching is unchecked.
-
A single gateway group with Tier 1 gateway being highest priority, and Tier 2 being lower priority, and member down is the trigger.
-
Firewall rules use that gateway group.
And that works for failover? If you pull the cord for gateway 1 it switches to gateway 2? And if you reconnect gateway 1 it switches back?
-
-
@satadru
Yes that is correct. The switch back and forth between tiers is fully automatic.
- about a year later
-
Although this is an older thread, I have the same issue happening with the very latest version, as of September 2019. I have three WAN connections, and one of the gateway groups I have configured has two of the gateways in it. My pfSense will fail over to my Tier 2 connection automatically, but when Tier 1 comes back up, it will not go back to Tier 1. I even tried clearing the states - no change. I tried changing which gateway is set as Tier 2, and it just routed all the traffic through that gateway instead of Tier 1. All gateways are up, and show as up.
What more can I do to debug this? I did not find the "Default Gateway Switching" option where indicated. Indeed, my "default" gateway is the Tier 1 gateway that seems not to be being used by the Gateway group.
My config is a bit complex, but I'm happy to try to debug this. Just need direction. Thanks.
Bob
-
I ended up writing a script and running it via cron to achieve the "switch." Yes, it is not elegant, but it gets the job done.
Here's what I have and I run this as a 5-minute cron job.
#!/bin/sh
# get active gateway and current time
CURRENT_TIME="$(date +"%c")"
CURRENT_GW="$(netstat -rn | grep default | awk '{print $4}')"
if [ "$CURRENT_GW" = "em2" ]; then
  # check if WAN1 is up or not
  WAN1_STATUS="$(pfSsh.php playback gatewaystatus brief | grep WANGW | awk '{print $2}')"
  if [ "$WAN1_STATUS" = "none" ]; then
    # WAN1 is back online, stop/start WAN2
    echo "$CURRENT_TIME: Bringing down WAN2"
    ifconfig em2 down
    echo "$CURRENT_TIME: Sleeping for 30s"
    sleep 30
    echo "$CURRENT_TIME: Bringing up WAN2"
    ifconfig em2 up
  else
    echo "$CURRENT_TIME: WAN1 is still down"
  fi
else
  echo "$CURRENT_TIME: Nothing to do!"
fi
- 25 days later
-
Hey. Thanks @ibbetsion for the script.
Here is a slightly modified version that kills firewall states when there are connections remaining on WAN2 and WAN1 is back online.
Works great for my needs ( LTE failover ).
I set it as a cron, every minute:
*/1 * * * * /root/clear_state_back_from_failover_cron.sh >> /root/clear_state_back_from_failover_cron.log
- I also checked "Flush all states when a gateway goes down" in System / Advanced / Miscellaneous.
- The LTE gateway has monitoring disabled "Disable Gateway Monitoring" in System / Routing / Gateways. Otherwise states will be created on the interface and the script becomes wrong. Also, monitoring would consume data and I did not want that.
Code:
#!/bin/sh
# *** kills firewall states on failover WAN when WAN1 is up ***
WAN1_NAME="WAN_DHCP"
WAN2_IF=ue0
WAN2_GW_IP=192.168.3.1
CURRENT_TIME="$(date +"%c")"
WAN1_STATUS=`pfSsh.php playback gatewaystatus brief | grep "$WAN1_NAME" | awk '{print $2}'`
if [ "$WAN1_STATUS" = "none" ]; then
  # the following line may need to be tweaked depending on your needs
  WAN2_NSTATES=`pfctl -s state | grep "$WAN2_IF" | grep -v " -> $WAN2_GW_IP" | wc -l`
  if [ "$WAN2_NSTATES" -gt 0 ]; then
    echo "$CURRENT_TIME: WAN1 is online, but connections remain on $WAN2_IF. Killing states."
    pfctl -F state
  fi
else
  echo "$CURRENT_TIME: WAN1 is down"
fi
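Both scripts find the WAN1 status with grep, which also matches any gateway whose name merely contains the search text (for example a second gateway named with the same prefix). A possible refinement, sketched here on the assumption that the status command prints one "name status" pair per line as the scripts above imply, is to match the gateway name exactly. The function name is made up for this example, and treating "none" as online simply follows the scripts above.

```shell
#!/bin/sh
# Sketch only: succeed (exit 0) if the named gateway reports status "none".
# Reads "name status" lines on stdin; that layout is an assumption taken
# from how the scripts above parse the status output.
# $1 = exact gateway name, e.g. WAN_DHCP
gateway_is_online() {
    awk -v gw="$1" '
        $1 == gw { s = $2 }               # exact name match, not a substring
        END { if (s == "none") exit 0; exit 1 }
    '
}
```

In the script above it could replace the grep/awk pair, for example `if pfSsh.php playback gatewaystatus brief | gateway_is_online "$WAN1_NAME"; then ...`. A gateway that is missing from the output is treated as not online.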
- 2 months later
-
I'm really surprised pfSense has nothing built in to handle this yet. This has been ongoing since 2017. In my case, my LTE modem (unlimited data) is still in gateway monitoring mode, so I'll be using @ibbetsion's script. Thanks @ibbetsion