WAN interface performance issue, likely bug
-
I will try to keep this short and expand if anyone has questions. I have a Netgate 5100 onsite and three remote sites. Two of those sites use Netgate 5100s and the third is running pfSense on an Azure VM. We use that for our poor man's hybrid cloud. My notes here are taken from the Azure-to-home connection but the results are repeatable from all three remote sites.
Here are the steps I follow to reproduce the problem (the exact commands are sketched after the list):
- Restart both routers and wait for the OpenVPN site-to-site tunnel to connect. In this case onsite is the OVPN server and the remote router is the client.
- Start the iperf server onsite with defaults
- Log in to the remote router
- Run the iperf client from the remote site to the local site's WAN address (public IP), taking all defaults
- Check the results: ~500 Mbps, which is essentially full speed
- Run the iperf client again, this time using the in-tunnel gateway address (192.168.114.1)
- Check the results: ~24 Mbps... this is very bad... One might say it's an OpenVPN problem... not so fast... don't contact the OpenVPN project just yet
- Run the iperf client one more time, going back to the WAN address (public IP), outside the tunnel
- Check the results: ~77 Mbps outside the tunnel, 84.6% slower than the original 500 Mbps
- Reboot the remote router; leave the onsite router as-is
- Rerun the iperf client against the WAN address (public IP) and it is back to full speed, 500 Mbps
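For clarity, the commands behind those steps look roughly like this. I'm running the iperf3 package from the pfSense shell, so treat this as a sketch with defaults rather than a transcript; the public IP is a placeholder:

```
# Onsite router: start the iperf3 server with defaults (listens on TCP 5201)
iperf3 -s

# Remote router, run 1: out-of-tunnel test to the onsite WAN address
iperf3 -c <onsite-public-ip>    # ~500 Mbps, essentially full speed

# Remote router, run 2: in-tunnel test to the onsite tunnel gateway
iperf3 -c 192.168.114.1         # ~24 Mbps

# Remote router, run 3: out-of-tunnel test again, same target as run 1
iperf3 -c <onsite-public-ip>    # ~77 Mbps, and it stays there until a reboot
```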
Additional notes:
- Restarting the OpenVPN client on the remote router has no effect. Once the WAN interface is broken, it remains broken
- All routers are up-to-date on maintenance
- The on-premise router (the OVPN server) is never altered, nor is it rebooted. The errant behavior is at the remote site.
- Changing values at System ==> Advanced ==> Networking (Network Interfaces section) has no effect on either side (see the ifconfig sketch after this list)
- Rebooting the on-premise server has no effect on the broken remote router
- The tunnel cannot be used for FTP, SMB, or backups
- The tunnel works for low-usage activities: RDP, SSH terminal sessions, etc.
- This has been recreated from two remote sites: the Azure pfSense VM and a Netgate 5100 running pfSense+
- I posted this as a general question because I don't think it's an OpenVPN bug per se; it's most likely a bug in pfSense's adaptation of OpenVPN.
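On that Network Interfaces note: as I understand it, those checkboxes control the NIC offload features (checksum, TSO, LRO), and the active set can be inspected from the shell. A minimal sketch, assuming igb0 is the WAN NIC on the 5100 (adjust the interface name to match your hardware):

```
# Show the active offload options (TXCSUM, RXCSUM, TSO4, LRO, etc.) on the WAN NIC
ifconfig igb0 | grep -i options
```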
The bottom line:
There's a serious bug somewhere and it is negatively affecting our business. High volumes inside the site-to-site OpenVPN tunnel corrupt the WAN interface in some way on the client side. It could be an OpenVPN issue or bug, but it feels to me like OpenVPN is the victim, and pfSense, or its adaptation of OpenVPN, has a serious problem. I also see many poorly answered or unanswered questions regarding OpenVPN performance in the forums and think this may be the root cause behind many of those observations.
Netgate...please help...
-
I assume you are running 21.05 in all locations?
It's unclear from your description whether you're running iperf on pfSense itself or on hosts behind the two firewalls (the latter is the correct way to test).
Though even if you were running it on pfSense, 24 Mbps is far lower than I'd expect from a 5100 at least. How do you have OpenVPN configured: UDP, site-to-site?
What is the latency across the tunnel? Are you able to replicate this between two local devices?
There must be some difference in how the traffic is handled between the two scenarios.
Check the state tables whilst testing; do you see states open on the correct interfaces? Do you see this only on OpenVPN? IPsec tunnels are not affected?
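From Diagnostics > Command Prompt or an SSH shell, something like this will show the states for the test traffic while iperf runs (5201 is the default iperf3 port; 5001 for classic iperf):

```
# List states matching the iperf port and note which interface each is on
pfctl -ss | grep 5201
```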
Steve
-
@stephenw10
Answers, hopefully in order... Version is 2.5.2 on the Azure VM and 21.05-RELEASE (amd64) on the 5100s.
OVPN is site-to-site, pre-shared key, UDP on IPv4 only, Layer 3 (a rough sketch of the settings follows below). On the remote router there is also a point-to-site server (for use as a remote internet gateway). It's for travel use, but nobody's travelling, so there are no connections.
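For reference, my understanding is those GUI settings boil down to OpenVPN options roughly like the following. This is a sketch from memory, not a dump of the generated config; the port and key path are placeholders:

```
# Server side (onsite): peer-to-peer shared key, UDP/IPv4, tun (Layer 3)
dev tun
proto udp4
port 1194
secret <shared.key>
ifconfig 192.168.114.1 192.168.114.2

# Client side (remote)
dev tun
proto udp4
remote <onsite-public-ip> 1194
secret <shared.key>
ifconfig 192.168.114.2 192.168.114.1
```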
Latency is 27-32 ms, WAN Azure to WAN local; 100-130 ms to the other sites from WAN local.
I only have one local device, so I haven't tried to replicate here. I could spin up a Hyper-V guest, but not now; I am currently working on an alternative method, most likely a Linux server on the local LAN running OpenVPN as a server, with a NAT port forward to that Linux server. We are up interactively, but backups through the tunnels are an issue.
Not an expert regarding state tables, so I wouldn't know what to look for. I can try clearing the state tables after the trouble begins to see if that reset avoids a reboot to restore WAN performance (my guess at the commands is below). Would that provide useful information?
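From what I've read, the reset would be something like the commands below, or Diagnostics > States > Reset States in the GUI; please correct me if I have this wrong:

```
# Kill states whose source matches the onsite peer
pfctl -k <onsite-public-ip>

# Or flush the entire state table (briefly drops all active connections)
pfctl -F states
```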
We're not running IPsec now. We were, but IPsec failed after a recent upgrade, so we switched to OpenVPN. I have read that the IPsec issue has been resolved but haven't switched back.
One more observation: we do have a point-to-site server running locally. There is one user, a Synology RAID device that phones home and stays connected 24x7. It is used as an off-site backup device accepting snapshot replication and file share backups. It's been running without issues. It seems to be the site-to-site tunnels that are tripping us up, on the client side.