Multi-WAN gateway failover not switching back to tier 1 gw after back online

jmonline

OK, let's simplify it even more…

Take the traceroute facility in Diagnostics > Traceroute

My primary wan is back online.
I enter a hostname to trace (Google - 8.8.8.8)
I pick the Source Address as 30VOICELAN
I get the following result:
1 135.196.xxx.xxx 7.671 ms 6.869 ms 7.008 ms
2 135.196.xxx.xxx 7.016 ms 7.195 ms 7.164 ms
3 5.57.80.136 7.218 ms 7.199 ms 11.125 ms
4 216.239.54.243 7.922 ms
5 216.239.58.95 9.010 ms
6 8.8.8.8 8.139 ms 8.010 ms 8.626 ms
The first hop in the traceroute, the IP starting 135.196, is my backup internet connection, not my primary.

How is that possible?

The firewall rule on the 30VOICELAN has the Gateway set as the Gateway Group named "DSLFirst".
The Gateway group "DSLFirst" has the (Primary) DSL WAN connection as Tier1 and the (Backup) EFM WAN connection as Tier2.
Status > Gateways shows both gateways online.

Derelict

Show me the states, bro. pfctl -vss

jmonline

@Derelict:

Show me the states, bro. pfctl -vss

OK, so the perfect time for a test :) Last night it looks like BT did their usual maintenance on the DSL network around 1am, so the ADSL line was down for 5 minutes. This morning I have the following in the states table for the phone on IP 10.10.30.27:

30VOICELAN tcp 185.83.xxx.xxx:5060 <- 10.10.30.27:55778 ESTABLISHED:ESTABLISHED 8.933 K / 10.417 K 3.13 MiB / 3.43 MiB
WAN_EFM tcp 185.3.xxx.xxx:40781 (10.10.30.27:55778) -> 185.83.xxx.xxx:5060 ESTABLISHED:ESTABLISHED 8.933 K / 10.417 K 3.13 MiB / 3.43 MiB

To clarify:
185.83.xxx.xxx is the external VoIP pbx.
185.3.xxx.xxx is the IP of the WAN_EFM (backup) connection.
30VOICELAN is my internal network with a subnet of 10.10.30.0/24

pfctl -vss shows the following:

igb1_vlan30 tcp 185.83.xxx.xxx:5060 <- 10.10.30.27:55778 ESTABLISHED:ESTABLISHED
[1594456643 + 42272] wscale 8 [1007765254 + 183296] wscale 5
age 05:58:04, expires in 119:59:52, 8954:10441 pkts, 3290011:3604569 bytes, rule 119

igb2 tcp 185.3.xxx.xxx:40781 (10.10.30.27:55778) -> 185.83.xxx.xxx:5060 ESTABLISHED:ESTABLISHED
[1007765254 + 183296] wscale 5 [1594456643 + 42272] wscale 8
age 05:58:04, expires in 119:59:52, 8954:10441 pkts, 3290011:3604569 bytes, rule 96

Out of interest, what would be the correct pfctl command to force-kill these states (that is, all states on the WAN_EFM connection from the subnet 10.10.30.0/24)?

If I can get a command that successfully kills these states when they get stuck here, I am happy with that as a workaround until someone can work out how to automate it. I don't want to be resetting the whole state table every time, since that kills sessions which should legitimately stay open.

Thanks

Derelict

You probably want to kill all connections to the PBX. That would be:

pfctl -k 0.0.0.0/0 -k 185.83.xxx.xxx

That will kill everything, even phones that are connected out via Tier 1.

You can try just killing one side of the connection that is tied to WAN_EFM with:

pfctl -i igb2 -k 0.0.0.0/0 -k 185.83.xxx.xxx

If, when the phones reconnect, they use the Tier1 connection, great. In my testing they continued to use the other connection so it doesn't look like you can do that.
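
The two variants above can be wrapped in a small helper so the command is reviewed before it is run. This is only a sketch: the function name kill_states and the DRY_RUN switch are made up for illustration, and 198.51.100.10 is a placeholder for the masked PBX address. It uses only the -i and -k options already shown in this thread.

```shell
#!/bin/sh
# Build (and optionally run) a pfctl state-kill command.
# Usage: kill_states SRC DST [IFACE]
# DRY_RUN=1 (the default here) only prints the command.
DRY_RUN=${DRY_RUN:-1}

kill_states() {
    src=$1
    dst=$2
    iface=${3:-}

    if [ -z "$src" ] || [ -z "$dst" ]; then
        echo "usage: kill_states SRC DST [IFACE]" >&2
        return 2
    fi

    # With an interface given, only states bound to it are targeted
    # (for example the backup WAN).
    if [ -n "$iface" ]; then
        set -- pfctl -i "$iface" -k "$src" -k "$dst"
    else
        set -- pfctl -k "$src" -k "$dst"
    fi

    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Calling kill_states 0.0.0.0/0 198.51.100.10 igb2 prints the interface-limited command; running the script with DRY_RUN=0 executes it instead.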

jmonline

Ok so the following command cleared the sessions stuck on the failover WAN.

pfctl -i igb2 -k 0.0.0.0/0 -k 185.83.xxx.xxx

This is a good step forward since I can now manually force the sessions back when I know they haven't moved on their own.

I presume I may be able to schedule this via some sort of script to run a specified period of time after the primary connection comes back online…?
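
A rough sketch of what such a script could look like, assuming it is run periodically (for example from cron): it probes through the primary WAN's address and only then kills the states left on the backup interface. Everything named here is an assumption for illustration: the variable names, the placeholder addresses 192.0.2.1 and 198.51.100.10, and the PROBE_CMD override used for testing. The -S source-address option is the FreeBSD ping flag.

```shell
#!/bin/sh
# Fail-back helper sketch: if the primary WAN answers pings, print (or run)
# the kill command for states still bound to the backup interface.
PRIMARY_SRC=${PRIMARY_SRC:-192.0.2.1}   # placeholder: address of the primary WAN
PROBE_TARGET=${PROBE_TARGET:-8.8.8.8}
BACKUP_IF=${BACKUP_IF:-igb2}
PBX=${PBX:-198.51.100.10}               # placeholder: external PBX address
DRY_RUN=${DRY_RUN:-1}

# Probe via the primary WAN's source address.
# Set PROBE_CMD (e.g. to "true" or "false") to bypass ping when testing.
probe_primary() {
    if [ -n "${PROBE_CMD:-}" ]; then
        $PROBE_CMD
    else
        ping -c 2 -S "$PRIMARY_SRC" "$PROBE_TARGET" >/dev/null 2>&1
    fi
}

fail_back() {
    if ! probe_primary; then
        echo "primary down, leaving states alone"
        return 1
    fi
    cmd="pfctl -i $BACKUP_IF -k 0.0.0.0/0 -k $PBX"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        $cmd
    fi
}
```

With DRY_RUN left at 1, calling fail_back only prints what would be run, which makes it safe to try before scheduling it.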

Thanks for help so far Derelict :)

devmaybe

Hello,

I know this is an old thread but I have the same problem and now I am able to reliably reproduce the behavior in a test environment.

If the "primary" WAN is a PPPoE connection and the secondary WAN is a "standard" static or DHCP-assigned IP address connection, then failover to the secondary works as expected when the primary goes down, but when the primary comes back up no traffic flows through it.

In such cases on my production systems I usually edit my default gateway entry in System->Routing.
I uncheck the "Default Gateway" mark, re-check it and then save and apply.
Traffic starts flowing again through the PPPoE connection.

The same always works in my virtual machines test environment too.

I hope this can help in tracking down the source of the problem or at least in finding some solution.

Thanks

sandrino

Hi all,

Same problem here: WAN1 is tier 1 (cable, default GW, 2Mb/2Mb) and WAN2 is also tier 1 (WiMAX PPPoE, 12Mb/3Mb), with weights 1 (WAN1) : 4 (WAN2).

If WAN2 goes down, all traffic switches to WAN1.

When WAN2 comes back online (the gateway groups show everything online), all connections stay on WAN1.

If I reload the filter, everything turns out all right again, with the 1 (WAN1) : 4 (WAN2) weighting.
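
As a quick illustration of what a 1 : 4 weighting means, assuming same-tier gateways share new connections in proportion to their weights (that proportionality is an assumption here, and the helper name share is made up):

```shell
#!/bin/sh
# share WEIGHT TOTAL -> integer percentage of new connections
share() {
    echo $(( $1 * 100 / $2 ))
}

w1=1   # WAN1 weight
w2=4   # WAN2 weight
total=$(( w1 + w2 ))

echo "WAN1: $(share "$w1" "$total")%  WAN2: $(share "$w2" "$total")%"
```

So roughly one in five new connections would be expected on WAN1 once the weighting is applied again.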

I don't use the DNS Forwarder, and for monitoring I use the ISP DNS servers (2 per connection).

Please help!!!!

Bye
Sandro

sandrino

Hi

In the miscellaneous config, under "Gateway Monitoring", there are these options:

Gateway Monitoring

State Killing on Gateway Failure
Flush all states when a gateway goes down: The monitoring process will flush all states when a gateway goes down if this box is checked.

Skip rules when gateway is down
Do not create rules when gateway is down: By default, when a rule has a gateway specified and this gateway is down, the rule is created omitting the gateway. This option overrides that behavior by omitting the entire rule instead.

Could someone explain them?

Thanks
Bye
Sandro

devmaybe

I think I have found a solution.

I have tested it on the 2.3.2 release. It consists of 2 steps:

1. Take note of the name you assigned to your PPPoE connection (WAN2 in this example).
2. Add the following lines at the end of the "/usr/local/sbin/ppp-linkup" script (between the "fi" and "exit 0" lines):

------------------------
fi

sleep 5
/etc/rc.newwanip wan2

exit 0

In all my tests traffic switches back correctly.

Note: without the "sleep" instruction I was getting mixed results; maybe it is only a timing problem with PPPoE activation?
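
If it really is a timing issue, one alternative to a fixed sleep is to poll until the link looks ready. The sketch below is a different technique from the tested patch above, not a replacement that has been verified on pfSense; wait_for and iface_has_inet are hypothetical helper names, and pppoe0 is only an example interface name.

```shell
#!/bin/sh
# wait_for TRIES DELAY COMMAND [ARGS...]
# Re-run COMMAND until it succeeds, at most TRIES times, sleeping DELAY
# seconds between attempts. Returns 0 on success, 1 if it never succeeded.
wait_for() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$(( tries - 1 ))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical readiness check: does the interface have an IPv4 address yet?
iface_has_inet() {
    ifconfig "$1" 2>/dev/null | grep -q 'inet '
}

# Example (not run here): wait up to about 10 s for pppoe0, then re-run the hook.
# wait_for 10 1 iface_has_inet pppoe0 && /etc/rc.newwanip wan2
```

The upper bound keeps the link-up script from hanging forever if the address never appears.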

Bye

SecureIS

+1 that failback would be very valuable. I have a deployment where the Tier 2 connection is pay per GB so it would be nice to be able to automate failover AND failback but I have to keep that WAN disconnected to make sure no connections get stuck on it. It's not a PPPoE link so sadly I can't use an up/down script for this :(

We need a setting for "Flush all states when a lower tier gateway comes back up. The monitoring process will flush all states when a lower tier gateway comes up if this box is checked"

luckman212

I'm working on a script to kill VOIP states when WAN1 (primary) comes back online. As mentioned elsewhere in this thread, this is a critical feature in real-world scenarios due to (a) costly metered backup connections as well as (b) SIP interop issues when devices behind the same LAN are seen registering from different public IPs. So I won't rehash all of that. I am trying to automate pfctl from the rc.gateway_alarm script that gets called on WANUP. I also see that a PR has been recently merged that might help make this even easier and less hacky. Has anyone hooked into these new functions yet to make this more reliable?

TL;DR: pfctl is not killing all of the related states. Can someone help me to understand something regarding states?

• Assume vlan100 is dedicated for voice, with subnet 192.168.20.0/24
• WAN1=primary, WAN2=backup
• When a "fail back" WAN2 -> WAN1 event happens, I need to kill all states: (any)->WAN2->vlan100 and vlan100->WAN2->(any)
• I try using a command like:

pfctl -i igb0_vlan100 -k 0.0.0.0/0

But this only seems to kill the states originating from inside the LAN. There are still tracked states via WAN2 that are NAT'ted to internal igb0_vlan100 IPs. Do I need to run commands like these instead?

pfctl -k 192.168.20.0/24 -k 0.0.0.0/0
pfctl -k 0.0.0.0/0 -k 192.168.20.0/24

Or, some other command? Is there a better way…. ???
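
One way to see which states survive a given kill command is to count them per interface before and after. The sketch below assumes the single-line `pfctl -ss` format shown earlier in this thread, where the first field is the interface and NAT'ted states carry the internal address in parentheses. The function name is made up, and the match is a plain text prefix (such as 192.168.20.), not real CIDR arithmetic.

```shell
#!/bin/sh
# count_states_by_iface PREFIX
# Reads `pfctl -ss`-style lines on stdin and prints "IFACE COUNT" for every
# interface that has a state mentioning an address starting with PREFIX.
count_states_by_iface() {
    awk -v prefix="$1" '
        {
            hit = 0
            for (i = 3; i <= NF; i++) {
                addr = $i
                sub(/^\(/, "", addr)   # NAT states show the inner address as (ip:port)
                if (index(addr, prefix) == 1) { hit = 1; break }
            }
            if (hit) count[$1]++
        }
        END { for (ifc in count) print ifc, count[ifc] }
    ' | sort
}

# Example (not run here):
# pfctl -ss | count_states_by_iface 192.168.20.
```

If the WAN2 interface still shows a count after a kill, that particular command did not match those states.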

nemanager

Any news ? :(

kimkhan

Failback to default WAN works for me.

I have a Gigabit Fiber connection and a Cable modem connection. I put one of them as Tier1 and the other as Tier2.

I used 8.8.8.8 for one and 8.8.4.4 for the other.

But just following all the instructions in the pfSense documentation and the postings here in the forum, which suggest creating groups with different tier levels and so on, will not work unless you have the 'Default gateway switching' box checked. You can find it under System > Advanced > Miscellaneous.

http://prntscr.com/evn3ub

I tested by disconnecting WAN1 and going to whatismyip.com, then plugging WAN1 back in and going to a different "what is my IP" site. Don't go to the first one again, as it will be cached and will not show your original/default WAN IP.

Or you can just do a ping.

Let me know if this helps. I can also post my configurations if you need to see.

KK

red_cat1930

2.3.3-RELEASE-p1 (amd64), MultiWAN, VM on Hyper-V

WAN1 ( tier2, monitor ip 8.8.4.4 )
WAN2 ( tier1, monitor ip 8.8.8.8 ).

Today WAN2 raised a latency alarm, but no "Clear latency" event followed even though the line became stable again (according to the dashboard).

Usual (System logs->Gateways):
Apr 12 03:29:32 dpinger WAN2_DHCP 8.8.8.8: Clear latency 39052us stddev 2978us loss 5%
Apr 12 03:28:34 dpinger WAN2_DHCP 8.8.8.8: Alarm latency 34409us stddev 429us loss 22%

Today (no clear latency event):
Apr 13 13:19:23 dpinger WAN2_DHCP 8.8.8.8: Alarm latency 34494us stddev 342us loss 21%

All clients on the LAN kept using WAN1 until I manually simulated a WAN2 disconnect (set 1.1.1.1 as the monitor IP for a minute, then reverted back to 8.8.8.8).
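
A small sketch for spotting this situation from the gateway log: it reports every gateway whose most recent dpinger event is an Alarm with no later Clear. It assumes the line layout shown above (the word "dpinger", then the gateway name, then the monitor IP followed by a colon, then the event word) and input ordered oldest first, so a newest-first listing like the one above would need reversing first (for example with tail -r on FreeBSD). The function name stuck_alarms is made up.

```shell
#!/bin/sh
# stuck_alarms: reads dpinger log lines (oldest first) on stdin and prints
# the name of every gateway whose most recent event is "Alarm".
stuck_alarms() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == "dpinger") {
                    gw = $(i + 1)
                    # The event word follows the "monitor-ip:" field.
                    last[gw] = $(i + 3)
                    break
                }
            }
        }
        END { for (gw in last) if (last[gw] == "Alarm") print gw }
    ' | sort
}
```

Fed the "usual" pair of lines it prints nothing; fed only today's Alarm line it prints WAN2_DHCP.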

carmico

same problem here

failover is working tier1 to tier2, but when tier1 recovers, monitor says "online" but the traffic doesn't switch back to tier1 , remains on tier2

PFsense ver. 2.3.3-RELEASE-p1

ronnysa

@carmico:

same problem here

failover is working tier1 to tier2, but when tier1 recovers, monitor says "online" but the traffic doesn't switch back to tier1 , remains on tier2

PFsense ver. 2.3.3-RELEASE-p1

I am having the exact same problem here.

2.3.3-RELEASE-p1 (amd64)
built on Thu Mar 09 07:17:41 CST 2017
FreeBSD 10.3-RELEASE-p17

jono_white

The fail back seems to work providing the PC's connection is left idle for 20 seconds or so, but if there's an active connection after your primary connection goes down (VoIP, video/audio streaming or even a continuous ping), it seems to remain on the redundant connection.

The following script seems to work for my situation (4G modem failover with limited quota). It's nowhere near perfect, but it'll shut the 4G interface down long enough for the states to be killed when the primary WAN is up. It would be better if it exited when there are no active states on 4G, but meh..

(Using cron to run every 5 minutes or so, */5 * * * * root /bin/sh /root/routercheck.sh)

#!/bin/sh

# Monitor targets, one per WAN
check_wan1=8.8.8.8
check_wan2=8.8.4.4

# IPv4 address currently assigned to each physical interface
wan_ipaddress=$(ifconfig rl0 | grep 'inet ' | awk '{ print $2 }' | cut -d'/' -f1)
backupwan_ipaddress=$(ifconfig rl1 | grep 'inet ' | awk '{ print $2 }' | cut -d'/' -f1)

# Probe the backup WAN; give up if it is not responding
ping -c 2 -S "${backupwan_ipaddress}" "${check_wan2}" > /dev/null 2>&1
backupwan_resp=$?

if [ "${backupwan_resp}" -gt 0 ]; then
    exit 1
fi

# Probe the primary WAN
ping -c 2 -S "${wan_ipaddress}" "${check_wan1}" > /dev/null 2>&1
wan_resp=$?

# Primary is up: bounce the backup interface so its states are dropped
if [ "${wan_resp}" -eq 0 ]; then
    #service netif restart rl1
    ifconfig rl1 down; sleep 15; ifconfig rl1 up
fi

#end

eng1tx

@jono_white:

Thank you for this…

I am not a script writer, but it would appear I need to change rl0 and rl1 to my specific interfaces. Any other changes necessary?

Also, I have searched for a couple of hours and still cannot find what directory to install the script to, and what command to run at CLI to test. I see that the "Filer" pkg was the preferred way, but is no longer available on my version, 2.3.4.

jono_white

Yeah, it needs to be changed to the physical interface names, not the names assigned in pfSense. The script location can be anywhere; I just saved mine under /root/failback.sh. You'll need to make it executable after saving (chmod 775 scriptname.sh should do it). As long as the path in your cron entry points to the script, it can go anywhere.
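
For testing from the CLI, a minimal sketch along these lines may help. The helper name check_script is made up and the paths in the comments are examples only; sh -n parses a script without executing it, and sh -x runs it with each command traced.

```shell
#!/bin/sh
# check_script PATH: confirm the file is executable and parses as sh.
check_script() {
    if [ ! -x "$1" ]; then
        echo "not executable: $1 (try chmod 755)" >&2
        return 1
    fi
    # sh -n reads the script and reports syntax errors without running it.
    sh -n "$1"
}

# Typical manual steps (paths are examples only):
#   chmod 755 /root/routercheck.sh
#   check_script /root/routercheck.sh
#   sh -x /root/routercheck.sh      # run once with command tracing
```

Running the script once by hand this way shows whether the interface names and ping targets behave as expected before cron takes over.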

Thinking it may be better to just leave the 4g interface down until the wan stops responding though, it may have a better outcome, but it still seems to do the job

jono_white

I've changed it so the 4G interface stays down until the primary WAN stops working. This time cron is set to run every minute; the most time you should lose connection is maybe 70 or 80 seconds or so, as it takes some time for the gateway to register as online again.

#!/bin/sh

check_wan1=8.8.8.8
#check_wan2=8.8.4.4

# IPv4 address currently assigned to the primary interface
wan_ipaddress=$(ifconfig rl0 | grep 'inet ' | awk '{ print $2 }' | cut -d'/' -f1)
#backupwan_ipaddress=$(ifconfig rl1 | grep 'inet ' | awk '{ print $2 }' | cut -d'/' -f1)

#ping -c 2 -S "${backupwan_ipaddress}" "${check_wan2}" > /dev/null 2>&1
#backupwan_resp=$?

#if [ "${backupwan_resp}" -eq 1 ]; then
#    exit 1
#fi

# Probe the primary WAN
ping -c 2 -S "${wan_ipaddress}" "${check_wan1}" > /dev/null 2>&1
wan_resp=$?

# Primary is up: keep the backup (4G) interface down
if [ "${wan_resp}" -eq 0 ]; then
    ifconfig rl1 down
fi

# Primary is not responding: bring the backup interface up
if [ "${wan_resp}" -gt 0 ]; then
    #service netif restart rl1
    ifconfig rl1 up
fi

#end