Fail-back - primary WAN fails to route UDP from SIP trunk

blistovmhz

Couldn't find anyone on IRC with an answer so trying here now.

I've run into a wall with my WAN fail-back: our SIP trunk refuses to fail back to the primary IP. I'm beginning to think it's either a pfSense issue with UDP routing, or perhaps I'm just exposing my ignorance of how routing really works.

2 WANs, 2 public IPs: WAN0 and WAN1. WAN0 is primary; WAN1 only comes up in the event that WAN0 fails.

Failover from 0 to 1 works flawlessly and all services stay up, including VOIP.
Our SIP trunk provider has two bindings configured for us. The primary connects to WAN0 and the secondary to WAN1. We've tried two failover methods, round robin and lowest available, and both have the same problem.
The trunk bindings work like so:

By default, the trunk provider tries to connect to WAN0 (UDP:5060). It determines gateway status from the response to a SIP OPTIONS request. If WAN0 is established, WAN0 is always used. If WAN0 stops responding to SIP OPTIONS, they re-establish the connection via WAN1 (UDP:5060). While WAN1 is up, they continue sending SIP OPTIONS requests to WAN0. When WAN0 responds, the trunk to WAN0 is re-established, and if there are no active calls on WAN1, the WAN1 trunk is dropped.
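
To make the intended behaviour concrete, here is a rough sketch of the provider's binding logic as I understand it from the description above. The function name and the "WAN0"/"WAN1" labels are just mine for illustration; I have no idea how the provider actually implements this.

```python
def select_bindings(wan0_answers_options, active_calls_on_wan1):
    """Return the set of trunk bindings that should be up after one OPTIONS probe.

    Hypothetical model of the provider-side logic described in the post,
    not the provider's real implementation.
    """
    if not wan0_answers_options:
        # WAN0 is not answering SIP OPTIONS: the trunk runs on the secondary.
        return {"WAN1"}

    # WAN0 answers OPTIONS: the primary binding is always (re-)established.
    bindings = {"WAN0"}
    if active_calls_on_wan1 > 0:
        # Keep the secondary up until its active calls have finished.
        bindings.add("WAN1")
    return bindings
```

The case that matters for fail-back is the middle one: WAN0 has to answer OPTIONS while WAN1 is still up, so both bindings can overlap until the calls on WAN1 drain.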

What actually happens, though, is that when the secondary binding comes up (after a failover) we're all good, but when WAN0 comes back up, their OPTIONS requests never get routed through WAN0 to the PBX server.
I've confirmed the OPTIONS requests come in on both WAN0 and WAN1 every 4 seconds, but pfSense only forwards them to the PBX from the interface with the active UDP connection.
Thus, when calls come in, they come in on the secondary, but when our RTP goes out, it leaves via WAN0, which results in the calls dropping after 30 seconds.

Looking at the states on pfSense, I can see that we have an open UDP connection on port 5060 from the trunk host to our secondary, WAN1. When I bring WAN0 back up, I can also see OPTIONS requests on both WAN0 and WAN1, but those requests are only forwarded to the PBX from WAN1. WAN0 requests are dropped at the firewall. When we are in this post-fail-back state (the trunk is still active on WAN1 and WAN0 isn't routing OPTIONS to the PBX, so the trunk host thinks WAN0 is still down), I can force the SIP trunk to re-establish on WAN0 by killing the state containing the UDP:5060 connection using pfkill, or by bringing WAN1 down. Neither of these is the right answer though, as there may be active calls on WAN1. We want WAN0 to respond to OPTIONS BEFORE the WAN1 connection is dropped.

Looking at the options on pfSense, I found the following which I believe may be related:

System: Advanced: Miscellaneous: IP Security: IPsec Reload on Failover.
I suspect, as IPsec is going to be UDP as well, that someone ran into the same issue: if a UDP connection is actively routing through one WAN interface, pfSense will not respond to a UDP request on the same port of the other WAN interface. If so, this option is a workaround where, during fail-back, the UDP connection is closed so that new connections can be established on the now-active interface.

Anyone know what I'm talking about here?

Voxis

Check your PBX; I bet your device is still registered from the IP of WAN1 (in Asterisk, run sip show peer {Extension}).

When WAN0 comes back, WAN1 is also still live, so the registration is still bound to the IP of WAN1. Since that's where it's registered, that's where it's going to send the RTP stream.

The OPTIONS are completely separate and are used for keep-alive, BLF and other stuff. Since they have nothing to do with SIP registration, pfSense directs them out WAN0 now that WAN0 is back. Your PBX is still registered from WAN1, so it's sending OPTIONS to WAN1 and your devices are responding; that's why you see OPTIONS on both WANs.

Usually you can clear the state tables in pfSense after WAN0 is back; this will usually nullify the registration on WAN1 and force the device to re-register on the now-up WAN0.

https://forum.pfsense.org/index.php?topic=65004.0

There are probably ways to clear just the states that have to do with your PBX, but in none of our setups has it been an issue to clear the entire state table.
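
One way to clear only the PBX-related states might be pfctl's -k option, which kills states from one host to another when given twice. Below is a minimal sketch that only builds the command line (it does not run anything). The helper name and the addresses in the usage note are placeholders, and which addresses actually appear in the state depends on your NAT setup, so check with pfctl -ss first.

```python
import ipaddress


def build_state_kill_command(src_ip, dst_ip):
    """Build a `pfctl -k <src> -k <dst>` argument list for killing states
    from src_ip to dst_ip. Hypothetical helper; it only validates the
    addresses and returns the argv list, it does not execute pfctl."""
    # ip_address() raises ValueError on anything that is not a valid IP.
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    return ["pfctl", "-k", str(src), "-k", str(dst)]
```

For example, build_state_kill_command("198.51.100.7", "192.168.1.10") would target states from a (made-up) trunk host address to a (made-up) PBX address. Keep in mind this would also drop any active call states between those two hosts.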

You can also set the re-registration interval in your devices really low, e.g.:

Grandstream phones have the following settings
Register Expiration = 60 minutes (default)
Reregister before Expiration = 60 seconds (default)

By changing Reregister before Expiration to 3540 seconds, your device will do a full re-registration every minute (3600 seconds of expiration minus 3540), so it should only be on WAN1 for a maximum of 2 minutes after WAN0 is back.
Those kinds of registration times can be resource intensive on large deployments, so be careful and only do this if clearing out the state tables will not work for your setup.
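
The arithmetic behind those numbers, as a quick sketch. It assumes the phone re-registers exactly "Reregister before Expiration" seconds before the registration expires, as described above; the function name is made up for illustration.

```python
def reregister_interval(expiration_s, reregister_before_s):
    """Seconds between full re-registrations, assuming the device
    re-registers reregister_before_s seconds before expiry."""
    if not 0 <= reregister_before_s < expiration_s:
        raise ValueError("reregister_before_s must be in [0, expiration_s)")
    return expiration_s - reregister_before_s


# Grandstream defaults: 60 minute expiration, re-register 60 s before expiry.
default_interval = reregister_interval(60 * 60, 60)

# Suggested setting: re-register 3540 s before expiry.
tuned_interval = reregister_interval(60 * 60, 3540)
```

With the defaults the device re-registers roughly every 3540 seconds; with the tuned value, every 60 seconds.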

If you are connecting a PBX to a Switch over SIP, the principle of the issue is the same.

good luck

blistovmhz

Turns out it was either a bug in 2.1.x, or the upgrade to 2.2.x fixed a corrupt config or something. After the upgrade, both bindings are able to connect to the PBX simultaneously (as they should), and fail-back now works (with some additional SIP config).