Multi-WAN gateway failover not switching back to tier 1 gw after back online
-
Hey. Thanks @ibbetsion for the script.
Here is a slightly modified version that kills firewall states when there are connections remaining on WAN2 and WAN1 is back online.
Works great for my needs ( LTE failover ).
I set it as a cron, every minute:
*/1 * * * * /root/clear_state_back_from_failover_cron.sh >> /root/clear_state_back_from_failover_cron.log
- I also checked "Flush all states when a gateway goes down" in System / Advanced / Miscellaneous.
- The LTE gateway has monitoring disabled ("Disable Gateway Monitoring" in System / Routing / Gateways). Otherwise monitoring creates states on the interface and the script's state count is no longer reliable. Also, monitoring would consume data and I did not want that.
Code:
#!/bin/sh
# *** kills firewall states on failover WAN when WAN1 is up ***

WAN1_NAME="WAN_DHCP"
WAN2_IF=ue0
WAN2_GW_IP=192.168.3.1
CURRENT_TIME="$(date +"%c")"

WAN1_STATUS=`pfSsh.php playback gatewaystatus brief | grep "$WAN1_NAME" | awk '{print $2}'`

if [ "$WAN1_STATUS" = "none" ]; then
    # the following line may need to be tweaked depending on your needs
    WAN2_NSTATES=`pfctl -s state | grep "$WAN2_IF" | grep -v " -> $WAN2_GW_IP" | wc -l`
    if [ "$WAN2_NSTATES" -gt 0 ]; then
        echo "$CURRENT_TIME: WAN1 is online, but connections remain on $WAN2_IF. Killing states."
        pfctl -F state
    fi
else
    echo "$CURRENT_TIME: WAN1 is down"
fi
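A possible refinement of the state check, as a sketch only. The `grep` patterns in the script match substrings, so `192.168.3.1` would also hide states going to `192.168.3.10`, and `ue0` matches anywhere on the line. The helper `count_if_states` below is hypothetical (not part of the original script) and compares the interface and destination fields exactly. It assumes `pfctl -s state` prints the interface in the first column and the destination right after `->`.

```sh
#!/bin/sh
# Hypothetical helper: count states on one interface, ignoring states whose
# destination is the gateway itself. Reads `pfctl -s state` output on stdin.
#   $1 = interface name (e.g. ue0)
#   $2 = gateway IP to ignore (e.g. 192.168.3.1)
count_if_states() {
    awk -v ifname="$1" -v gw="$2" '
        $1 == ifname {
            skip = 0
            for (i = 2; i < NF; i++) {
                if ($i == "->") {
                    dst = $(i + 1)
                    sub(/:[0-9]+$/, "", dst)   # strip the trailing :port
                    if (dst == gw) skip = 1
                }
            }
            if (!skip) n++
        }
        END { print n + 0 }
    '
}
```

In the script, the counting line could then become `WAN2_NSTATES=$(pfctl -s state | count_if_states "$WAN2_IF" "$WAN2_GW_IP")`. If flushing every state is too disruptive, `pfctl -i "$WAN2_IF" -Fs` should limit the flush to the failover interface, but treat that as untested on pfSense.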
- 2 months later
-
I'm really surprised pfSense has nothing built in to handle this yet. This has been ongoing since 2017. In my case, my LTE modem (unlimited data) still has gateway monitoring enabled, so I'll be using @ibbetsion's script. Thanks @ibbetsion
- 6 months later
-
EDIT2:
Issues not fixed. If I pull the cable and put it right back in, it messes up Multi-WAN. It will not switch back correctly.
EDIT:
Resetting to defaults and setting everything up again seems to have fixed my issues.
Old:
I think I found the cause: it seems Multi-WAN (or maybe dpinger) does not work properly if the interface goes down and back up (unplugging and replugging).
In my tests I was just unplugging the cable on the WAN port. I think the same happens with PPPoE or any time the link physically goes down and up again.
This should not happen, in my opinion. If a modem reboots, for example, Multi-WAN would stop working.
I use "Packet Loss" as the trigger level on the gateway group.
Would love to hear from you, thanks.
-
Hello!
I have several sites using multi wan and gateway groups with a mixture of static, dhcp, and pppoe. They all behave as expected.
Are you policy routing all of your WAN bound traffic?
"Defining gateway groups is only part of the story. Traffic must be assigned to these gateways using the Gateway setting on firewall rules."
https://docs.netgate.com/pfsense/en/latest/routing/multi-wan.html#firewall-rules
My experience is that you can't depend on the system routing table having your "preferred" (tier1) default route.
John
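To see which gateway the system routing table is actually using as default (as opposed to what policy routing does), something like the following could help. It is a minimal sketch: `default_route_gw` and `default_route_is` are hypothetical helpers, and they assume the usual `netstat -rn` layout with `default` in the first column and the gateway in the second.

```sh
#!/bin/sh
# Hypothetical helper: print the gateway of the first default route found in
# `netstat -rn` style output read from stdin. Prints nothing if there is none.
default_route_gw() {
    awk '$1 == "default" { print $2; exit }'
}

# Hypothetical helper: succeed if the default route on stdin points at $1.
default_route_is() {
    [ "$(default_route_gw)" = "$1" ]
}
```

For example, `netstat -rn -f inet | default_route_is 192.168.1.1 || echo "default route is not on WAN1"`, where `192.168.1.1` stands in for your tier1 gateway address.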
-
@serbus said in Multi-WAN gateway failover not switching back to tier 1 gw after back online:
Hello!
I have several sites using multi wan and gateway groups with a mixture of static, dhcp, and pppoe. They all behave as expected.
Are you policy routing all of your WAN bound traffic?
"Defining gateway groups is only part of the story. Traffic must be assigned to these gateways using the Gateway setting on firewall rules."
https://docs.netgate.com/pfsense/en/latest/routing/multi-wan.html#firewall-rules
My experience is that you can't depend on the system routing table having your "preferred" (tier1) default route.
John
Thank you,
The strange thing is: if I don't unplug a cable when testing and instead cut the internet connection without pulling the cable, everything works just as expected. Every time. As soon as I unplug and replug, I have to re-save the interface settings, for example, to get switching back to the default tier working again.
I checked almost every configuration before and nothing really helped.
-
Hello!
Ahhhh, gotcha.
I am having problems following the thread. It is long and old, and seems to cover different (resolved?) problems. Yours could be yet another issue. Maybe a new thread?
John
-
As far as I know, this is still problematic! Some pfSense boxes I have on dual-WAN setups will switch from tier 1 to tier 2 connections without issue, but going back when tier 1 is restored does not always work... At least not in a timeframe I would consider usable.
Bob
-
@nleaudio said in Multi-WAN gateway failover not switching back to tier 1 gw after back online:
As far as I know, this is still problematic! Some pfSense boxes I have on dual-WAN setups will switch from tier 1 to tier 2 connections without issue, but going back when tier 1 is restored does not always work... At least not in a timeframe I would consider usable.
Bob
I think I had those issues because I imported a configuration onto different hardware. Did you do the same?
After I did a reset to defaults and set everything up again, it is now switching back fine.
-
@mo10 to be clear... in your dual-WAN setup, if WAN1 (default gateway) goes down and pfsense ends up making WAN2 the default, then upon recovery of WAN1, pfsense automatically marks WAN1 as default?
-
@ibbetsion
This was never a problem for me. It marked the gateway as default fine but was still sending traffic the wrong way. Saving an interface fixed it until I physically unplugged a cable again.
Now, after resetting everything, it all runs as expected.
Do you have problems with Multi-WAN? What exactly?
-
@mo10 my problem is that post recovery, WAN1 never goes back to being default. I have to use a script to bring down WAN2 so that WAN1 becomes default again. Not an ideal solution but it works.
-
@ibbetsion
Would you be able to run a test: reset your pfSense (save the configuration first), set up only the multi-WAN, and try again?
-
Hello!
Assuming a pretty standard multiwan: WAN1 -> tier1, WAN2 -> tier2, PREFWAN1/PREFWAN2/BALANCE gwgroups.
Whether you have states left open on WAN2 after WAN1 comes back up (sticky connections?), or the default in the system routing table doesn't switch back to WAN1 after it recovers (make sure you don't have BALANCE as the default gateway), I believe the best approach is to policy route everything.
After WAN1 comes back, does traffic routed to a PREFWAN1 gwgroup still go out WAN2?
John
-
Yes, it's quite possible that I moved the config to a new box.
And yes, when WAN1 comes back, new connections still go out WAN2.
It does appear to recover some time later though - maybe by the following day? I've not looked into it carefully.
Bob
-
Are you using DHCP on the WANs or what are you using?
-
I have what looks like the same problem. Gateway group with three gateways. When the tier1 goes down (packet loss) tier2 is used. When tier1 comes back, it does not get used and requires manual reconfigure or reboot. No changes I'm aware of to trigger this behaviour. No hints in the logs.
-
@idiotzoo
Are you using DHCP on the WANs or what are you using on each WAN?
Please don't use DHCP, use static instead and report back. Set your main WAN as upstream Gateway.
-
@mo10 The tier2 link is using PPPoE. Correct me if I'm wrong, but I can't use PPPoE with a static IPv4 config.
I'm not sure what you mean by "Set your main WAN as upstream Gateway".
The main WAN link is static. This is a WISP link with a local NAT gateway connected via a VLAN, so the physical link never goes down from pfSense's point of view. The gateway (a Ubiquiti radio) is also no use as an indicator of connection health, so I have to ping something beyond it and use packet loss to determine the link's state.
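For illustration only, the idea of judging a link by packet loss to a remote target can be sketched like this. It is not how pfSense does it (that is dpinger's job). `ping_loss_pct` and `link_is_down` are hypothetical helpers, and they assume the usual `ping` summary line containing "packet loss" with a percentage field such as `40.0%`.

```sh
#!/bin/sh
# Hypothetical helper: extract the packet loss percentage (as an integer)
# from `ping -c N host` output read on stdin.
ping_loss_pct() {
    awk '/packet loss/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/) { sub(/%/, "", $i); print int($i); exit }
    }'
}

# Hypothetical helper: succeed if loss ($1) is at or above the threshold ($2).
link_is_down() {
    [ "$1" -ge "$2" ]
}
```

Something like `loss=$(ping -c 5 192.0.2.1 | ping_loss_pct)` followed by `link_is_down "$loss" 20 && echo "link considered down"` would apply it, with `192.0.2.1` and the 20% threshold as placeholder values.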
This was working. The only change is the tier3 link appears to have failed entirely, so this is sitting in a pending state. I'm wondering if this is causing the gateway group to behave incorrectly. Next time the issue occurs I'll remove it and see what happens.
-
This sounds like a setup error (pending). Do as you say: delete tier 3 from the group and delete the tier 3 interface. Then add everything again.
I was asking about DHCP because this was the reason I had problems. I heard dual PPPoE can cause problems as well, but I am not sure.
-
@mo10 Someone on site has verified the tier3 connection is borked at layer1 so removing that isn't going to hurt anything. Certainly having a dead link on a lower priority (higher tier) shouldn't cause any issues with the gateway group behaviour and if it does, this is a bug.... but it would be nice to know why a functioning system has stopped working. As this line failure is the only change I'm hopeful that at least explains the issue.