Reset States on Recovery of Tier 1 WAN in Gateway Group

Ximulate

In my slow journey in trying to get a working Failover Gateway group, I just experienced the issue mentioned here:
https://redmine.pfsense.org/issues/8#note-50

I have a Gateway Group set up for failover using one Tier 1 WAN and one Tier 2 WAN. When the Tier 1 WAN fails, the group switches to Tier 2 just fine. However, after the primary (Tier 1) WAN recovers, VoIP/SIP devices maintain their connection through the backup (Tier 2) WAN. The Tier 2 WAN is a metered connection, so maintaining the connection eats up a lot of unnecessary bandwidth. Manually resetting states brings the SIP connections back to the primary WAN.

Since this issue has been brought up before, has it been resolved, and if so, am I just missing a setting somewhere? If not, is there a "hack" that can automate killing the states when the Tier 1 WAN comes back up?

Ximulate

Posting this as a possible solution:
https://github.com/mk-fg/pfsense-scripts

I'm not particularly proficient at bash, so I haven't studied this enough to determine if it works for my situation.

Ximulate

Based on a comment in this thread:
https://forum.netgate.com/topic/33246/kill-state/2

I've added the following code:

d=192.168.XXX; for i in YY1 YY2 YY3 YY4; do pfctl -k $d.$i; done

as a CRON job that executes twice a day, where YY1 thru YY4 are the VoIP devices on the 192.168.XXX network. Not particularly elegant, but simple.
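For anyone who prefers a script file over a one-liner, the same loop could be written roughly like this. It is only a sketch: the prefix 192.168.1 and the host octets 21 to 24 are made-up stand-ins for XXX and YY1 thru YY4, and the function names are mine.

```shell
#!/bin/sh
# Hypothetical values: replace with the real network prefix and the
# last octets of the VoIP devices.
SUBNET_PREFIX="192.168.1"
VOIP_HOSTS="21 22 23 24"

# Print one full address per line for each host octet given.
voip_addresses() {
    prefix="$1"
    shift
    for host in "$@"; do
        printf '%s.%s\n' "$prefix" "$host"
    done
}

# Kill all states originating from each VoIP device.
kill_voip_states() {
    # VOIP_HOSTS is left unquoted on purpose so it splits into words.
    voip_addresses "$SUBNET_PREFIX" $VOIP_HOSTS | while read -r addr; do
        pfctl -k "$addr"
    done
}
```

Add a call to kill_voip_states at the end of the file and point the cron job at the script. A twice-a-day schedule would be something like "0 6,18 * * *" (the hours are arbitrary).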

dragoangel

@Ximulate For a more elegant way to do this, I think you can create a cron job that runs every minute, checks your public IP via curl and, if the IP is NOT the Tier 1 address, writes a temp file. It is a good idea to validate the curl output with a regex and exit the script if it is not a valid IP (internet loss, etc.). The second part of the script checks whether this temp file exists and your IP IS the Tier 1 address; if so, it resets the states and removes the temp file. With such a script you will not reset states unnecessarily, and your states will return to Tier 1 more quickly, within 2 minutes for example.

dragoangel

You can also use /tmp/xxx_defaultgw instead of curl (but then check the WAN gateway, not the WAN IPs). Either way, the temp file is needed to capture the change in routing for the second stage of the script: states are reset only if routing has changed.
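
A rough sketch of the curl variant, to make the two stages concrete. Everything named here is an assumption: TIER1_IP presumes the primary WAN has a static public address, MARKER is an arbitrary temp file path, IP_CHECK_URL is just one service that returns the caller's IP as plain text, and the function names are made up. It has not been tested on a pfSense box.

```shell
#!/bin/sh
# Sketch only. All three values below are assumptions to adjust.
TIER1_IP="203.0.113.10"                       # static public IP of the Tier 1 WAN
MARKER="/tmp/wan_on_backup"                   # temp file that records "we failed over"
IP_CHECK_URL="https://checkip.amazonaws.com"  # any plain-text "what is my IP" service

# Succeed only if the argument looks like a dotted-quad IPv4 address.
is_valid_ipv4() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# decide_action CURRENT_IP TIER1_IP MARKER_PRESENT(yes|no)
# Prints: skip  (lookup failed, do nothing)
#         mark  (traffic leaves via another WAN, remember it)
#         reset (back on Tier 1 after a failover, reset states)
#         none  (on Tier 1 and no failover was recorded)
decide_action() {
    is_valid_ipv4 "$1" || { echo skip; return 0; }
    if [ "$1" != "$2" ]; then
        echo mark
    elif [ "$3" = "yes" ]; then
        echo reset
    else
        echo none
    fi
}

# Intended to be called from a cron job every minute.
check_failback() {
    current_ip="$(curl -s --max-time 10 "$IP_CHECK_URL")"
    if [ -e "$MARKER" ]; then marker=yes; else marker=no; fi
    case "$(decide_action "$current_ip" "$TIER1_IP" "$marker")" in
        mark)  touch "$MARKER" ;;
        # Flushes ALL states, which also drops any other open connections.
        reset) pfctl -F states && rm -f "$MARKER" ;;
    esac
}
```

Keeping the decision in decide_action, separate from the curl and pfctl calls, means the logic can be checked by hand with made-up addresses before the script goes anywhere near cron.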

renat_kaa

@Ximulate Did you specify a DNS server for each of the multi-WAN members? This is very important for correct switching. https://docs.netgate.com/pfsense/en/latest/routing/multi-wan.html#dns-considerations

dragoangel

@Renat He doesn't have a problem on pfSense itself, so what would DNS potentially fix in his case? Did you see any issues with domain resolution mentioned somewhere?
P.S. Your recommendation is wrong in most cases: anybody can use any DNS, whether the system is multi-WAN or not. The main goal is to use DNS servers that are reachable from both WANs, or ones specific to each WAN. I never like to use ISP resolvers; they are mostly not as good as public ones: 1.1.1.1 (Cloudflare), OpenDNS (Cisco), Quad9, etc.

renat_kaa

@dragoangel, I got it, thanks! Anyway, this worked for me and a number of pfSense users. And I didn't say anything about ISP DNS. Public DNS is a good point.

Ximulate

@dragoangel Your suggestions look good, thank you. I'm barely proficient at bash scripting, so I'll make small improvements as necessary or time permits.

Ximulate

@Renat Yes, I did. Thank you for the suggestion anyway.

venix91

@Ximulate The link you provided ( https://github.com/mk-fg/pfsense-scripts ) worked great for my needs. I have a Netgate SG3100. My primary WAN is a cable modem connection that is flaky in the evening, so I got a Netgear LB1121 LTE modem and a Ting GSM SIM to use for failover. It works great, except that when the gateway group fails back to the cable modem gateway, many states are left alive on the metered LTE connection. I was able to band-aid this by manually killing the states, but I wanted it to be automatic. I struggled to find a way to automatically kill the states on primary gateway fallback, and until I came across your post I thought it was hopeless. The gateway_change_conn_reset.sh script from that GitHub page did it for me. Now I have a script that works perfectly across reboots and everything.

Thanks.

Ximulate

@venix91 Great to hear! I have basically the same failover set-up: Netgear with Ting. I haven't gone any further than the CRON job I posted above, so I'm glad to know the script will work when I'm ready to move on, and glad to know this post helped out others.

Ximulate

Here's my latest script. It runs as a cron job every hour.
// Checks the WAN IP as reported by an external service (opendns)
// Grabs the IPs of the primary and backup WAN interfaces
// Compares the reported IP to the primary WAN IP, and if they are the same it kills the states on the backup IP

Code is executing, but need time to see if it actually behaves as expected (kills the backup wan states).

# Public IP as seen from outside, reported by OpenDNS.
reported_ip="$(drill myip.opendns.com @resolver1.opendns.com | grep 'myip.opendns.com.' | grep -w -E -o "([0-9]{1,3}[.]){3}[0-9]{1,3}")"
# Addresses on the primary (igb0) and backup (igb2) WAN interfaces.
primary_ip="$(ifconfig igb0 | grep -w -E -o "inet ([0-9]{1,3}[.]){3}[0-9]{1,3}" | grep -w -E -o "([0-9]{1,3}[.]){3}[0-9]{1,3}")"
backup_ip="$(ifconfig igb2 | grep -w -E -o "inet ([0-9]{1,3}[.]){3}[0-9]{1,3}" | grep -w -E -o "([0-9]{1,3}[.]){3}[0-9]{1,3}")"
# Kill the backup WAN states only when both lookups succeeded and
# traffic is confirmed to be leaving through the primary WAN.
if [ -n "$reported_ip" ] && [ -n "$backup_ip" ] && [ "$reported_ip" = "$primary_ip" ]; then
    pfctl -k "$backup_ip"
fi

Edit: replaced the dig command with drill to correct an issue that occurs when the script is scheduled in crontab.
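
To see whether the script really kills the backup WAN states, one option is to count the states that still mention the backup address before and after it runs. This is only a sketch: count_states_for_ip is a made-up helper name, and it assumes the backup WAN address shows up as text in the state listing.

```shell
#!/bin/sh
# Count the state-table lines on stdin that mention the given address.
# -F treats the address as a literal string; -w avoids counting
# 198.51.100.70 when looking for 198.51.100.7.
count_states_for_ip() {
    grep -c -w -F -- "$1" || true
}

# Intended use on the firewall (not run here):
#   pfctl -s states | count_states_for_ip "$backup_ip"
```

A count that drops to zero right after the cron job fires suggests the kill worked. A count that stays the same suggests the states are keyed on a different address than expected.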

Ximulate

The problem with the scripts is that they will kill an active phone conversation. Not sure how to resolve that.

Ximulate

Before trying the scripts, you may want to check whether "Firewall Optimization" is set to Normal or Aggressive. The VoIP configuration docs suggest Conservative, which could be aggravating this particular problem. I've bumped mine to Aggressive, but have no idea if this will cause other issues.

https://docs.netgate.com/pfsense/en/latest/book/config/advanced-firewall-nat.html#config-advanced-firewall-optimization
https://docs.netgate.com/pfsense/en/latest/book/config/advanced-firewall-nat.html#state-timeouts