Wifi calling issue

MagneticMuffin

Hello,

I put this discussion in General because it may cover a few different topics. Please bear in mind that I am not a network engineer and that I am learning as I am going. Some terminology may not be quite the correct one.

The problem
Both my wife and I are experiencing issues with wifi calling and texting since changing our entire setup to pfSense. We are on Verizon Wireless. The issues are unfortunately intermittent. I believe I understand how the phones establish a secure connection for wifi calling. What baffles me, however, is that everything should work out of the box.

Setup
I have Verizon FiOS 50/50 as my ISP and decided it was time to retire the provided Actiontec MI424WR router. I have replaced the Verizon router with a combination of Netgate SG-1100, Ubiquiti Flex Mini Switch, and Ubiquiti AP-AC-Lite.

pfSense
Version 2.4.5. The network is set up with a trusted LAN and three VLANs that are separated from the rest. The firewall rules are very simple: allow everything from LAN, block everything inbound on WAN that is unsolicited (all default). I also added pfBlockerNG with the default IPv4 blocking rules and DNS blocking.

Ubiquiti
The Unifi controller is running off a Raspberry Pi connected to the Flex Mini switch. Nothing fancy there, just enough to keep things going. The AP broadcasts one SSID and has two hidden SSIDs for two of the VLANs. Nothing is connected to the hidden SSIDs at the moment.

The AP has a single SSID for both 2.4 GHz (Channel 1) and 5 GHz (Channel 161). No interference on Channel 161, and Channel 1 is not too bad.

I have a Roku connected to one of the switch ports, which I reserved for a “streaming” VLAN.

Cell phones
Neither phone is unlocked: we both use the “Verizon” flavor of Android. The phones are a Galaxy S7 (Android 8) and Galaxy A50 (Android 9). Both phones have a static IP address (although I tried DHCP originally).

Diagnostic attempts
Wireshark
I have used tcpdump (displayed through Wireshark) on (1) the SG-1100 on both the WAN and the LAN interfaces and (2) the Ubiquiti AP. When things work I can see the handshake that enables the IPSec tunnel (ports 500, 4500) between the phone and the trusted server and then the series of ESP packets that follows. The trusted Verizon servers are at 141.207.x.233 (where x = 137, 151, 177, 193, 229 as far as I have seen; more here: https://community.verizonwireless.com/t5/Verizon-Wireless-Services/Redux-2-0-iPhone-Wi-Fi-Calling-Firewall-rules/td-p/1080151). Here is a screencap of a working session (WAN interface, public IP removed):

When things do not work the phone gets stuck on “Calling...” but nothing related to the call shows up on the WAN or LAN interfaces, or the AP. However, NAT-keepalive packets still show up, along with ESP packets sent by the phone to the Verizon servers. Here is a screencap of a non-working session (WAN interface, public IP removed):

The ESP packets do not correspond to an actual call in this instance (i.e. the phone was not trying to dial out). It seems that there is still some exchange with the server through "INFORMATIONAL messages" (https://tools.ietf.org/html/rfc7296#section-1.4). Don't know what it contains, though. Let me know if you'd like to see the actual capture and how I can anonymize it.
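For anyone sorting through a similar capture, here is a minimal sketch of how the three kinds of UDP port 4500 traffic can be told apart from the payload alone. It relies on the framing in RFC 3948 (a NAT-keepalive is a single 0xFF byte; IKE messages on port 4500 start with a four-byte all-zero non-ESP marker; anything else is ESP, whose SPI is non-zero). The function name and the sample payloads are made up for illustration.

```python
def classify_udp4500_payload(payload: bytes) -> str:
    """Classify a UDP/4500 payload using the RFC 3948 framing rules."""
    if payload == b"\xff":
        # NAT-keepalive: exactly one byte with value 0xFF.
        return "nat-keepalive"
    if len(payload) >= 4 and payload[:4] == b"\x00\x00\x00\x00":
        # Non-ESP marker: an IKE message (e.g. INFORMATIONAL) follows.
        return "ike"
    if len(payload) >= 4:
        # First four bytes are a non-zero SPI: ESP-in-UDP.
        return "esp"
    return "unknown"


# Hypothetical payloads, not taken from the capture above.
samples = {
    "keepalive": b"\xff",
    "ike": b"\x00\x00\x00\x00" + b"\x01" * 28,
    "esp": bytes.fromhex("c0ffee01") + b"\x00" * 16,
}
for name, data in samples.items():
    print(name, "->", classify_udp4500_payload(data))
```

Wireshark applies the same distinction when it labels packets, so this is mostly useful for scripting over a large tcpdump export.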

Firewall logs
There are on occasion some 141.207.x.233 packets being blocked, but it is not consistent. The timestamps do not correspond to the failed calls. The dropped packets that I have seen have one of the Verizon Wireless servers as source (IP:4500) and my public IP address:[some random port] as destination. For example (xxx for the IP address, yyyyy for the port number):

5,,,1000000103,mvneta0.4090,match,block,in,4,0x0,,57,36820,0,DF,17,udp,108,141.207.193.233,xxx.xxx.xxx.xxx,4500,yyyyy,88

The corresponding rule that blocks the traffic from WAN is "rule 5". pfctl -vvsr tells me that it is the "block inbound ipv4" rule:

@5(1000000103) block drop in log inet all label "Default deny rule IPv4"

I have not been able to see those requests on Wireshark.
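To make the log entry above easier to read, here is a small sketch that splits it into named fields. The field names follow my understanding of the pfSense filterlog layout for an IPv4 UDP entry (common fields, then IPv4 fields, then UDP fields); treat the names as an assumption and check them against the pfSense documentation before relying on them.

```python
# Assumed field order for an IPv4 + UDP filterlog entry.
FIELDS = [
    "rule", "sub_rule", "anchor", "tracker", "interface", "reason",
    "action", "direction", "ip_version", "tos", "ecn", "ttl", "id",
    "offset", "flags", "proto_id", "proto", "length",
    "src", "dst", "src_port", "dst_port", "data_length",
]


def parse_filterlog_udp4(line: str) -> dict:
    """Split one IPv4/UDP filterlog line into a dict of named fields."""
    values = line.strip().split(",")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# The redacted entry quoted above.
sample = ("5,,,1000000103,mvneta0.4090,match,block,in,4,0x0,,57,36820,0,DF,"
          "17,udp,108,141.207.193.233,xxx.xxx.xxx.xxx,4500,yyyyy,88")
entry = parse_filterlog_udp4(sample)
print(entry["action"], entry["direction"], entry["proto"],
      entry["src"], entry["src_port"], "->", entry["dst"], entry["dst_port"])
```

Read this way, the entry is an inbound UDP packet from port 4500 of the Verizon server that matched no existing state and was dropped by the default deny rule (tracker 1000000103).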

States
The states do show that the phones have an established connection with one of Verizon’s servers. I can see the ports 500 and 4500 showing up with pfTop. The port 500 state stays only for about 30 seconds:

The state on port 4500 gets renewed every 60 seconds with a NAT-Keepalive packet.

Sometimes, however, the state on port 4500 gets dropped. What happens then? Here is what I have observed:

the state was dropped for good. The phone re-establishes an IPSec tunnel with one of the Verizon servers (through port 500, then communication on 4500).
the state is dropped “temporarily”. The state shows up again with the same server and the same ports, and the keep-alive packets are still being sent. However, calls sometimes go through and sometimes do not. When they don’t, the phone just stays on “Calling” forever. Wireshark does not even register that a phone call is happening (the phone keeps exchanging NAT keep-alive and ESP packets, see capture above). It seems that the state just becomes stale after a while. It can be after 5 minutes or an hour; I have not found a pattern. The only clue I have is that if the phone is in airplane mode with wifi enabled, the state keeps getting renewed for a long time (at least a few hours).

It does not matter if the two phones have the same destination. Both can be on 141.207.193.233 and still work for example. Or one can work while the other one does not.

Differences with Actiontec router
The Actiontec router has the same default firewall rules as pfSense (block inbound, enable outbound). I used to have a pretty strict PiHole (with about 1M blacklisted domains) before switching to pfSense. pfBlockerNG now has about 150k blacklisted domains. I did not see the Verizon IP addresses in the default IPv4 lists.

Main differences (old vs. new):

2.4 GHz AP vs 2.4 - 5 GHz combo AP
No DoT or DoH vs. DoT for pfSense
PiHole vs. pfBlockerNG
Maybe some NAT rules

Solution attempts
It is not an issue with the firewall rules as far as I can tell. The LAN interface is open for business and the WAN interface blocks unsolicited inbound traffic. I have tried to replicate the Actiontec setup as closely as possible.

Changed some internal frequency of the Ubiquiti AP to not create harmonics that may interfere with the Verizon LTE band
→ no change
Turned off 5 GHz and put all clients on 2.4 GHz
→ no change
Bypass DoT or pfBlockerNG
→ I have forced one of the phones to use Google’s DNS directly, no change.
Matched DNS servers
→ I have tried the following DNS servers: Cloudflare, Quad9, Google, Default ISP. No luck.
Firewall rules, port forwarding, NAT rules
The default Actiontec firewall rules are the same default firewall rules as pfSense. The Actiontec only has two port forwarding rules (for ports 4567 and 25372) that are not used and not relevant to the problem at hand. No special NAT rules on either device (although pfSense has UDP port 500 static; but the state stays on only for about 30s.). The Actiontec router has a "Port triggering" setting. It does cover the port 4500 on UDP.

I am not sure as to how I can replicate this in pfSense, if that is even necessary? I am unsure how to interpret it.

Other hypotheses
DNS issue?
Looking at DNS traffic, the domain name for establishing the IPsec tunnel for wifi calling seems to be “wo.vzwwo.com”. This resolves to a variety of IP addresses (141.207.x.233) depending on the DNS provider. That domain name is hardcoded in the phone. Some of those servers, interestingly, do not ping (137, 151, 177, 183, 229 do not, 193 does).

NAT issue?
I left everything default in pfSense. pfSense has the automatic “static port rule” at UDP 500. However, the port 500 connection only stays up for a short time (about 30 seconds). The port 4500 does not have any particular NAT rule and should not need one. See comparison with the Actiontec router above.

Verizon Wireless / Samsung doing strange things?
From other posts I have read it seems that the combination of Samsung phones and Verizon Wireless is not great. Other posters managed to solve their problems one way or another. This solution seems to hit closest to home (https://forum.netgate.com/topic/143639/at-times-wifi-calling-and-sending-sms-doesn-t-work/53), but should not be needed. One of the affected people swapped their Note 9 for a Pixel 4 and everything was fine. iOS seems to be unaffected.

The Samsung flavor of Android seems to have had issues with IPSec (https://eu.community.samsung.com/t5/galaxy-s9-s9/vpn-is-broken-in-galaxy-s9/m-p/721041#M6052), but why would it be a problem on the SG-1100 router only?

Unfortunately I do not know how things are handled on the Verizon Wireless-side of things, so your guess is as good as mine for that.

If you made it this far, thank you for reading. Any suggestions or pointers are appreciated! We do get good cell coverage at home so I can disable Wifi calling, but I'd like to understand why this is an issue.

tman222

Hi @MagneticMuffin - I have a similar setup to yours (Verizon FiOS, pfSense, and Ubiquiti wireless APs) and also experienced issues with WiFi calling (phones are using Verizon Wireless). Apple devices worked fine, but I received complaints from a family member about calls not connecting and problems texting using their Samsung Galaxy Note 10 when WiFi calling was enabled. One change I did make that seems to have helped is to change "Firewall Optimization Options" to "Conservative" under System > Advanced > Firewall & NAT. I have not gotten any more complaints since. Granted, maybe this was fixed with a software update on the phone, but for now I'll take the benefit of the doubt :). Hope this helps.

stephenw10

Since it looks like the phone is using a random source port for the NAT-T (4500) connection I would try adding a static outbound NAT rule for any traffic with destination port 4500.

We can't actually see what ports are being used there, but I suspect that after the rekey one side is still trying to use the old state, and the port will have been re-randomised by pfSense by default. Hence the firewall is blocking the traffic from the server.

A big difference between pfSense in default config and most other soho devices is the source port randomisation.
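As a rough illustration of that hypothesis, here is a toy model of an outbound NAT table. It is not how pf is implemented: the class, the addresses and the port numbers are invented, and a simple counter stands in for the random port choice so the result is reproducible. It only shows why a server that keeps replying to the old translated port gets blocked once the state has been recreated on a different port, and why a static-port rule avoids that.

```python
import itertools


class ToyNat:
    """Toy outbound NAT table (illustrative only, not pf's algorithm)."""

    def __init__(self, static_port: bool):
        self.static_port = static_port
        # Counter standing in for randomised source-port selection.
        self._next_port = itertools.count(50000)
        self.states = {}  # external port -> (internal ip, internal port)

    def outbound(self, src_ip: str, src_port: int) -> int:
        """Create a state for an outgoing flow and return the external port."""
        ext_port = src_port if self.static_port else next(self._next_port)
        self.states[ext_port] = (src_ip, src_port)
        return ext_port

    def expire(self, ext_port: int) -> None:
        """Drop a state, e.g. after its UDP timeout has elapsed."""
        self.states.pop(ext_port, None)

    def inbound(self, ext_port: int) -> str:
        """Pass traffic that matches a state, block everything else."""
        return "pass" if ext_port in self.states else "block"


def server_reply_after_state_loss(static_port: bool) -> str:
    nat = ToyNat(static_port)
    old_port = nat.outbound("192.168.1.50", 40000)  # hypothetical phone
    nat.expire(old_port)                             # state times out
    nat.outbound("192.168.1.50", 40000)              # phone sends again
    return nat.inbound(old_port)                     # server uses old port


print("rewritten source port:", server_reply_after_state_loss(False))
print("static source port:   ", server_reply_after_state_loss(True))
```

In this model the reply is blocked when the source port is rewritten and passed when it is kept static, which would match inbound packets from port 4500 hitting the default deny rule.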

Steve

netblues

I had a similar use case with a Vodafone femtocell (which also uses IPsec) and serves the same function as wifi calling.
I had to use a floating rule of

MagneticMuffin

Thank you all for your replies.

@tman222 I'll try that first, as it seems to be the least disruptive. I was wondering why this would be working so I took a look at the pfSense repo and checked what the "conservative" option was about. The file filter.inc has the corresponding setting. The variable limitrules is set, then written to a file (/tmp/rules.limits), and then loaded by pfctl:

if ($config['system']['optimization'] == "conservative") {
	$limitrules .= "set timeout { udp.first 300, udp.single 150, udp.multiple 900 }\n";
}
...
@file_put_contents("{$g['tmp_path']}/rules.limits", $limitrules);
mwexec("/sbin/pfctl -Of {$g['tmp_path']}/rules.limits");

This changes the timeout of a UDP connection from the default values of 60s for udp.first and udp.multiple and 30s for udp.single (see defaults here: https://man.openbsd.org/i386/pf.conf#udp.first)

That would explain why it works. The phone sends a NAT keep-alive every 60 seconds on the UDP state on port 4500. Because the keep-alive interval equals the default timeout for that state, there is no margin: a keep-alive that arrives slightly late may find the state already expired, and the connection is dropped. The phone keeps trying to get to the Verizon server but the server is telling the phone to pound sand.
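The timing argument can be sketched with a few lines of Python. This is a simplified model and the assumptions are mine: a state survives as long as no gap between consecutive packets exceeds the timeout, and the 0.5 s delay is an arbitrary example. Real pf purges expired states on a periodic sweep, so the exact moment of expiry will differ.

```python
def keepalive_times(interval, count, delays=()):
    """Nominal send times in seconds, with optional per-packet delays."""
    times = []
    for i in range(count):
        delay = delays[i] if i < len(delays) else 0.0
        times.append(i * interval + delay)
    return times


def state_survives(send_times, timeout):
    """True if no gap between consecutive packets exceeds the timeout."""
    gaps = (later - earlier
            for earlier, later in zip(send_times, send_times[1:]))
    return all(gap <= timeout for gap in gaps)


on_time = keepalive_times(60, 5)
one_late = keepalive_times(60, 5, delays=(0, 0, 0.5))  # third packet 0.5 s late

print("60 s keep-alive, 60 s timeout, on time: ", state_survives(on_time, 60))
print("60 s keep-alive, 60 s timeout, one late:", state_survives(one_late, 60))
print("60 s keep-alive, 900 s timeout, one late:", state_survives(one_late, 900))
print("10 s keep-alive, 60 s timeout, one late:",
      state_survives(keepalive_times(10, 5, delays=(0, 0, 0.5)), 60))
```

With a 60 s interval and a 60 s timeout, a single late keep-alive is enough to lose the state in this model, while either a longer timeout (the "conservative" 900 s for udp.multiple) or a shorter keep-alive interval leaves plenty of margin.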

Reasons why it would work with the Actiontec router, for people with an iOS phone, or other providers:

Actiontec router does not drop those UDP connections that quickly
iOS sends keep-alive packets at a faster rate than the default timeout
Not sure for other providers? Maybe other providers have a different way to deal with a dropped connection?

I can't verify any of this, unfortunately: there is no pftop equivalent on the Actiontec router, I don't have an Apple device, and I don't have access to the details of the wifi calling implementation of other providers.

@stephenw10 @netblues I'd like to avoid using a static outbound NAT rule if possible. NAT-T should work just fine, as demonstrated by iOS devices being able to use the same setup. Wouldn't a static rule also limit the network to a single cell phone for wifi calling (https://forum.netgate.com/topic/143639/at-times-wifi-calling-and-sending-sms-doesn-t-work/54)?

stephenw10

It would only limit it to one connection if the phones were all using the same source port, which they could have been. But your state table there seems to be showing they are using a random source port which is why I mentioned that makes it possible to try.

The state for the NAT-T traffic is udp.multiple, but that is also 60s by default, so it's possible it just times out. See Diagnostics > pfInfo for the running values. However, if it did, it should fail the DPD pings and then reconnect. It seems like something more is happening there. It's an easy thing to test though, as you say.

Steve

MagneticMuffin

Update: I made the change suggested by @tman222 last week and have not had a single issue since then. Both phones now work fine, and it did not require any new NAT rule. The value of udp.multiple can probably be tuned as the "conservative" mode keeps connections open for a while.

I also took a look at the internals of Android to figure out the default time between NAT-T keepalive packets. The constant of interest is (aptly) named NATT_KEEPALIVE_DELAY_SECONDS. Stock Android shows that it has a value of 10 seconds (probably why a Pixel phone works immediately), so either Samsung or all the US carriers are changing its value to something different. The constant is defined in the file IkeSessionStateMachine.java under com.android.internal.net.ipsec.ike.

Thank you all!