Dual WAN failover, PAP2T and asterisk won't register unless states are reset

ilko

Hi,

First of all- hats off for all your work!

This is my first time using pfSense (1.2 RC4). Dual WAN with static and DHCP addresses; failover and load balancing are set up as per the MultiWAN guide for 1.2. Monitor IPs are fine.
WAN- ISP with dynamic address.
OPT1- another ISP with static address, with a temporarily placed Linksys WRT54GR with DMZ to OPT1.
LAN

Failover and load balancing seem to work fine: I can see the speed increase, and in the load balancer status, pulling one or the other WAN cable changes the status accordingly.
We have 4 Linksys PAP2-T devices and an X-Lite softphone connecting to a remote asterisk server on a public IP.
I created a rule for LAN ports 5060-5080 and 10000-21000 to use WAN2FailsToWAN1. Ports are properly forwarded on both WAN and OPT1, and the same is done on each SIP device. Using static ports.
I see all SIP and RTP traffic going via OPT1, as it is supposed to. Then I pull out the OPT1 cable and watch the registrations on the remote asterisk server.
X-Lite re-registers with the new IP (the one from WAN), but none of the PAP2 devices do.
Looking at the states, I see all of them still using the OPT1 gateway. Reset all states- most of the PAP2 devices re-register. Reset again- all are registered with the new IP and go via the WAN interface.
Tried with and without a STUN server, played with VIA options, registration expiry values, DNS etc.- no go. I just have to reset states a couple of times. Tried the advanced options in the LAN firewall rules for the SIP and RTP ports (State type- NONE): in this case even X-Lite won't re-register and resetting states doesn't help; neither did changing the Firewall Optimization Options.
Tried "Use sticky connections"- no go again.

Questions:
1. Any ideas how to overcome this problem with the PAP2-T? Is the problem the combination of PAP2-T and pfSense, or is it a PAP2 problem? If the latter, why does resetting states fix it?
2. Once OPT1 is up again, how can I force all SIP devices or other preferred protocols to go back to using only OPT1?
3. Is one rule with gateway WAN2FailsToWAN1 enough for protocols that don't like load balancing, or is a second one needed? See screenshot- SIP registrations 2 and RTP 2.

Thanks again for all your work on pfSense.

ilko

Ok, I will try the questions again-

1. Why, when WAN or OPT1 fails, does pfSense keep the existing states and create new ones to the non-working gateway? Bug or feature?
2. When WAN or OPT1 is back online, how can certain connections be switched back to their preferred gateway?

I've read almost every single thread in this forum, as well as many pages from Google, so hopefully you won't blame me for not doing my homework :)

ilko

It seems it's a bug/non-implemented feature in pfSense- on failover, states to the failed gateway are NOT flushed. If an asterisk/VoIP device behind pfSense has successfully used the failed gateway before and keeps sending packets, they are still routed through the FAILED gateway.
http://forum.pfsense.org/index.php/topic,5778.0.html
http://forum.pfsense.org/index.php/topic,4704.0.html

For anyone following the MultiWanVersion1.2 doc- the pool order/naming is WRONG.
WAN1FailsToWAN2 pool order should be WAN then WAN1/OPT1
WAN2FailsToWAN1 pool order should be WAN1/OPT1 then WAN.

Now I am looking at how to use this idea in a failover situation:
http://forum.pfsense.org/index.php/topic,6531.0.html

/etc/rc.newanip

/* if everything normal is done, reset SIP states (until bug-fix comes up) */

if($old_ip <> $curwanip) {
        log_error("bug-fix for sip and iax: ip changed $old_ip ->  $curwanip ... killing states to sipgate and dusnet.");
        /* statereset: asterisk internal to sip sipgate external */
        exec("/sbin/pfctl -k 192.168.15.90 -k 217.10.79.9 2>/dev/null");
        /* statereset: asterisk internal to iax2 dusnet external */
        exec("/sbin/pfctl -k 192.168.15.90 -k 83.125.8.46 2>/dev/null");
}

How can the same idea be used on failover to reset the states to the failed gateway?
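
A minimal shell sketch of that idea, under stated assumptions: rather than flushing the whole table, kill only the states from each SIP device to the remote server with pfctl -k. The function name, the PFCTL override and all addresses are hypothetical (documentation ranges, not from this thread), and how the function would get triggered on failover is left open.

```shell
#!/bin/sh
# Sketch only: kill pf states between each local SIP device and one remote
# server, instead of flushing the whole state table.
# PFCTL can be overridden for a dry run.
PFCTL="${PFCTL:-/sbin/pfctl}"

# kill_sip_states REMOTE_SERVER LOCAL_HOST...
kill_sip_states() {
    remote="$1"
    shift
    for host in "$@"; do
        # "-k src -k dst" kills the states going from src to dst
        $PFCTL -k "$host" -k "$remote" 2>/dev/null
    done
}

# Hypothetical call; these addresses are placeholders, not real devices:
# kill_sip_states 203.0.113.10 192.0.2.21 192.0.2.22
```

This keeps unrelated connections on the surviving WAN alive, at the cost of having to list every SIP device and server by hand.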

edit: edited second WAN21FailsToWAN2 to the proper WAN2FailsToWAN1

cmb

This is a known limitation of the system that won't be fixed in 1.2. I'll move the ticket you opened to a feature request for 1.3.

ilko

Thanks for your reply, you made my day, I've almost given up :)

Is it possible to use a script similar to the one above? Where does it have to be placed? Or is it way more than that?
With this limitation it makes no sense at all to use failover with VoIP/SIP. Maybe other applications are affected too?

sullrich

I have been seeing this problem more and more as well as I migrate in more connections at my work place. I'll be addressing this in 1.3.

ilko

Thanks for that, I do appreciate it.

Until 1.3 is out, isn't there a quick workaround? Something like the script above.

hoba

I think you should be able to run scripts on filter reload by using hidden config.xml magic. There was a tag introduced to run shellcommands on filter reloads quite some time ago (see http://blog.pfsense.org/?p=31 ). You also can search the forum for how to use this.

ilko

hoba, thanks for that, I'll give it a try.
I am thinking of resetting all states using an external script, and I read that this will be executed on any filter change:

New system->afterfilterchangeshellcmd xml tag which is executed on the system after each filter change (or other networking related changes)

http://www.pfsense.com/index.php?id=26

As I read it, unless I change something in the filters, the only event that would trigger "afterfilterchangeshellcmd" is when either or both WANs go DOWN or UP, is that correct?

hoba

That is correct, however keep in mind that you will kill other already established connections on other interfaces (for example connections already running from lan to wan2 if wan1 fails) if you just kill ALL the states. However depending on how reliable your wans are usually this should not happen too often. Just have a look at status>systemlogs, loadbalancer tab to see how reliable your wans have been in the past.

ilko

OPT1 (ADSL) is playing up once per day, usually at the end of office hours (office next to us turning on their alarm…who knows?). It happens for a minute or two, but that's enough for all our VoIP phones to go crazy for that period. For now I have just moved them to WAN1FailsToWAN2, but resetting all states would be a problem.

How do I find out in a script which interface has gone DOWN, so I can reset its states only?

Another thing- could this indicate a problem?

check_reload_status.log

02-18-2008_at_091500 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-18-2008_at_130000 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-18-2008_at_153000 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-18-2008_at_204000 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-19-2008_at_002500 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-19-2008_at_090001 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-19-2008_at_161500 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-19-2008_at_165500 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-20-2008_at_150000 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-21-2008_at_043000 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-21-2008_at_152500 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-21-2008_at_193500 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-21-2008_at_211001 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-21-2008_at_215501 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...
02-22-2008_at_053000 There appears to be 2 or more check_reload_status processes. Forcing kill and restart of all now...

parrotscience

I'd like a solution for this too… When my DSL connection (my main one) goes down, my SIP connections stay stuck on the DSL line instead of moving to my cable connection (the backup). I was using the reset states as well, but would like something a little more automatic.

ilko

This is what I tried yesterday. Mind you, linux/bsd/pfSense is pretty new to me, and I am not a programmer at all :)

/conf/config.xml

 <system>...
....
<afterfilterchangeshellcmd>/usr/local/bin/reset_states.sh</afterfilterchangeshellcmd></system>

/usr/local/bin/reset_states.sh
chmod 755

#!/bin/sh
sleep 70
/sbin/pfctl -F state
sleep 40
/sbin/pfctl -F state

I had to reset twice, because those PAP2 devices register every 30 seconds; if the next registration does not come right after resetting states, it uses the old gateway again, or for some reason some of them won't re-register. Resetting twice seems to work for now. Hmm, is there anything else to be flushed? NAT rules? Why are connections established on the failed gateway even after resetting states?

What I saw- all are registered via WAN, which fails and comes up again in 2 seconds. Loss of audio is only for these 2 seconds. Resetting states later seems NOT to break audio, to be reconfirmed.

If WAN fails for longer, then it is a 1:50 min wait until all devices are re-registered. I prefer it that way, instead of resetting states and moving to another gateway for every 1-2 second line break. Also- resetting states makes the firewall rules about preferred gateway order get reapplied.
I didn't have much time to test yesterday; during the week I will play more and report. At least I can start with something, thanks for the clues :)

eri--

Try pfctl -F all -i {$interface_that_goes_down}
Is better and should avoid running it twice.
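
A hedged sketch of this suggestion as a small shell function. It uses -F states rather than -F all, since -F all also flushes the loaded rules, NAT, queues and tables, not only the states. The function name, the PFCTL override and the example device name are assumptions, and whether a per-interface flush catches every relevant state on a given setup was not verified in this thread.

```shell
#!/bin/sh
# Sketch only: flush states on a single interface rather than everything.
PFCTL="${PFCTL:-/sbin/pfctl}"

# flush_iface_states IFACE
# IFACE is a real device name (for example an assumed "fxp1"),
# not a pfSense label such as OPT1.
flush_iface_states() {
    iface="$1"
    if [ -z "$iface" ]; then
        echo "usage: flush_iface_states IFACE" >&2
        return 1
    fi
    # -i restricts the operation to the given interface;
    # -F states touches only the state table.
    $PFCTL -i "$iface" -F states
}
```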

ilko

Umm, is $interface_that_goes_down a system variable, or do I need to replace it with something? If the latter, how do I find out which interface went down?

sullrich

Look in /var/db/pingstatus

Each monitored item will appear there. Simply look for DOWN in the files. You could easily parse each file looking for DOWN and then resolve the IP back to the interface.
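
The parsing step could look roughly like the sketch below. The file layout is an assumption that this thread does not confirm: one file per monitor IP, named after that IP, containing the word DOWN while the monitor is failing. The function name and the STATUS_DIR override are hypothetical, and mapping the IP back to an interface is left out.

```shell
#!/bin/sh
# Sketch only. Assumed layout (not confirmed in the thread): one file per
# monitor IP, named after the IP, containing DOWN when that monitor fails.
STATUS_DIR="${STATUS_DIR:-/var/db/pingstatus}"

# list_down_monitors [DIR]
# Prints the name of every regular file in DIR that contains DOWN.
list_down_monitors() {
    dir="${1:-$STATUS_DIR}"
    for f in "$dir"/*; do
        # Skips the unexpanded glob when the directory is empty
        [ -f "$f" ] || continue
        if grep -q "DOWN" "$f"; then
            basename "$f"
        fi
    done
}
```

With an empty directory the function simply prints nothing.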

ilko

That directory is empty, same as pingmsstatus. No such files in /var/db.

ilko

@ermal:

Try pfctl -F all -i {$interface_that_goes_down}
Is better and should avoid running it twice.

Using
#!/bin/sh
sleep 5
/sbin/pfctl -F all

results in no new states being created- Diagnostics: Show States- "No states were found."

back to

#!/bin/sh
sleep 60
/sbin/pfctl -F state
sleep 40
/sbin/pfctl -F state

This also makes all connections use their preferred gateway again when WAN or OPT1 is back online, which is good.
If we reset states on the failed gateway only, the above will not happen.
I need more time to study the negative effects of resetting states.

ilko

It has been working fine in the 3 weeks since I added this, although OPT1 has failed only a few times during office hours. No complaints about loss of internet or a failed PAP2T device so far. I will stick with this workaround until a better solution comes up.

In short:

Change /conf/config.xml

 <system>...
....
<afterfilterchangeshellcmd>/usr/local/bin/reset_states.sh</afterfilterchangeshellcmd></system>

Create /usr/local/bin/reset_states.sh

#!/bin/sh
sleep 60
/sbin/pfctl -F state
sleep 40
/sbin/pfctl -F state

chmod 755 /usr/local/bin/reset_states.sh
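
A possible variation, offered only as a sketch: because the hook runs after each filter change, a second copy could start while the first is still sleeping. The version below wraps the same two delayed flushes in a function and uses mkdir as a simple lock. The lock path, the variable names and the PFCTL override are assumptions, not part of the workaround tested above.

```shell
#!/bin/sh
# Sketch only: same two delayed flushes, but skip the run if another copy
# is still sleeping.
PFCTL="${PFCTL:-/sbin/pfctl}"
LOCKDIR="${LOCKDIR:-/tmp/reset_states.lock}"   # hypothetical lock path
FIRST_DELAY="${FIRST_DELAY:-60}"
SECOND_DELAY="${SECOND_DELAY:-40}"

reset_states() {
    # mkdir either creates the directory or fails, so it works as a lock
    if ! mkdir "$LOCKDIR" 2>/dev/null; then
        return 0    # another run is already in progress
    fi
    sleep "$FIRST_DELAY"
    $PFCTL -F state
    sleep "$SECOND_DELAY"
    $PFCTL -F state
    rmdir "$LOCKDIR"
}

# Uncomment when installing this as the hook script:
# reset_states
```

If the script is killed mid-run the lock directory stays behind and has to be removed by hand.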