CARP seems to be interfering with SIP/RTP voice traffic somehow
Okay this is an odd one.
I have a phone system, Allworx. The PBX has its own WAN interface, and I use the WAN interface for its internet connection. The phone system and all of the phones are on their own VLAN. They are also on their own physical switch. That switch connects via a trunk to the other switch, mainly so that I can reach the web interface of the phone system, but I could disconnect the switch from the rest of the network and all functions would work fine.
The problem I'm having is that sometimes, I will have an audio issue, internally. It's a slightly garbled, but audible and annoying blip that happens throughout a call. The weird thing is, this happens on internal calls, and even calls to voicemail so it isn't related to our SIP providers. When the problem is happening, it is reproducible. But sometimes it will just stop, and then nothing I do can get it to come back.
To make a long story a bit shorter, I have struggled with this problem for a long time. A few weeks ago I finally narrowed it down to pfSense, and from there, I think I've narrowed it to CARP.
I am running pfSense 1.2.3-Release Full on VMware ESXi 4. Two VMware hosts, and each has a pfSense virtual machine, and they are using CARP for failover. During testing, I shut down one of the virtual machines, but it made no difference; sounded the same as with both running, but I left the secondary off anyway.
We first noticed that if we disconnected the switch that the phones are connected to (I'll call it the phone switch) from the rest of the network, the problem went away instantly. From there, I narrowed it down to one particular switch, and then from there, one port.
It was the virtual host. I then shut down every VM except for pfSense, and it didn't help. Once I shut down pfSense (so we had no routers), the problem went away. It was reproducible: if I have pfSense running, I hear the problem; if I shut it down, it goes away.
I thought it might be related to it running in a VM, so I backed up the config, shut down pfSense, and fired up an Alix. I restored the config from the VM, and let the Alix be our router. Same problem, so it also happens on a completely different physical instance.
Finally today, back in the VM, I decided to try disabling CARP. That seemed to work. It also brought down the router, effectively, so I decided to remove all the CARP VIPs, and then set the interface IPs to what the CARP IPs were. This fixed the problem. I took one VM snapshot before I removed the CARP setup (old config), and another after (new config). This lets me go back and forth between the two without lengthy reboots and such.
It's pretty clear: I sit there on a call to the voicemail, and just let it repeat the message over and over. Old config: garbled. New config: perfect. I can switch back and forth and hear the difference immediately.
So my question is: what the hell is going on??? The traffic that is being affected here isn't even going through pfSense! It's traffic between a single phone and the PBX, on the same VLAN of the same physical switch. How is this possible?
Please someone help me..
Is anyone else able to reproduce this, or offer any insight? During testing, I tried to upgrade to 2.0 but something went wrong with the upgrade and it just kept rebooting. When I get some time, I will try to do a clean 2.0 install and then restore the config instead; that should help me figure out if 2.0 is affected too. At the moment though, I am just running without CARP to keep the phones functional.
Never heard of anything like that. I set up at least a couple of CARP installs every week with VoIP behind them, and have done some for VoIP providers with thousands of simultaneous calls going through.
You have to uninstall open-vm-tools package before upgrading or you'll end up in a panic loop.
Could be a problem with a conflicting VHID, or any number of other things. Hard to say without digging into packet captures.
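For anyone wanting to rule out the VHID theory before digging into captures, here is a rough sketch of the check in Python. It relies on the fact that each CARP VHID maps to a virtual MAC of the form 00:00:5e:00:01:XX (XX being the VHID in hex), so two different virtual IPs using the same VHID on one layer 2 segment end up claiming the same MAC. The function names, the segment labels, and the sample addresses are all made up for illustration; none of them come from the poster's config, and nothing here says this is the actual cause.

```python
from collections import defaultdict


def carp_virtual_mac(vhid: int) -> str:
    """Return the virtual MAC address that CARP derives from a VHID."""
    if not 1 <= vhid <= 255:
        raise ValueError(f"VHID must be between 1 and 255, got {vhid}")
    return f"00:00:5e:00:01:{vhid:02x}"


def find_vhid_conflicts(vips):
    """Find VHIDs shared by more than one virtual IP on the same segment.

    vips is an iterable of (segment, virtual_ip, vhid) tuples, where
    segment is a hypothetical label for one broadcast domain (one VLAN).
    The two firewalls of a failover pair share a VHID for the same
    virtual IP on purpose, so only distinct virtual IPs count as a clash.
    """
    groups = defaultdict(set)
    for segment, virtual_ip, vhid in vips:
        groups[(segment, vhid)].add(virtual_ip)
    return {
        (segment, carp_virtual_mac(vhid)): sorted(addresses)
        for (segment, vhid), addresses in groups.items()
        if len(addresses) > 1
    }


# Made-up example data: two LAN virtual IPs were given the same VHID.
example_vips = [
    ("lan", "192.168.1.1", 1),
    ("lan", "192.168.1.2", 1),
    ("phones", "10.0.10.1", 2),
    ("wan", "203.0.113.1", 3),
]
conflicts = find_vhid_conflicts(example_vips)
for (segment, mac), addresses in conflicts.items():
    print(f"{segment}: {mac} is claimed by {', '.join(addresses)}")
```

Feed it the VHIDs from every CARP VIP on both firewalls, plus any other CARP or VRRP device you know of on the same VLANs, since VRRP uses the same MAC range. An empty result only rules out this one cause; a packet capture is still the way to see what is really on the wire.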