[solved] 50% of flows destined to OpenVPN are lost… Driving me crazy!

wawawawa

Hi Everyone,

I need some help…

+++ CAUTION - Home Networking issue ahead! There be dragons! - CAUTION +++

I'm having an issue with my home pfSense -> Remote Lab VPN.

pfSense is set up as a VM (via KVM) on an Intel NUC. I have one onboard interface (LAN) and one USB3 Gbps interface (WAN). There are about 4 cheap, generic, unmanaged Gbps switches around the house (star topology). Everything is working well... except:

The SW VPN client on my laptop (Viscosity) works brilliantly connecting to the Remote Lab but the Site2Site (pfSense to pfSense) seems very "spotty".

Using iPerf (client local, server on feportal in the Remote Lab), I can get consistent throughput (~19 Mbps) to the Lab from both the SW client and the local pfSense. For traffic going through pfSense, though, it sometimes seems to take ages for initial requests to go through. Then it will work for a few minutes, and then it's really slow / patchy again.

Let's go back to basics. I noticed that if I ping 10.100.1.1 (a server in the Remote Lab) from my laptop at home, it works only half the time. I don't mean 50% packet loss; I mean each instance of ping either works 100% or 0%. If it does work, then it's fine for 1000s of packets.

From any of my LAN hosts (VMs and real machines, including VMs on the same host OS as the pfSense VM):

(Every 3 sec, spawn a new ping process that sends a group of 3 consecutive ICMP packets to 10.100.1.1)

while true ; do ping -q -c 3 -t 4 -W 1000 10.100.1.1 | grep "packet loss" ; sleep 3 ; done

3 packets transmitted, 0 packets received, 100.0% packet loss
3 packets transmitted, 3 packets received, 0.0% packet loss
3 packets transmitted, 0 packets received, 100.0% packet loss
3 packets transmitted, 3 packets received, 0.0% packet loss
3 packets transmitted, 0 packets received, 100.0% packet loss
3 packets transmitted, 3 packets received, 0.0% packet loss
3 packets transmitted, 0 packets received, 100.0% packet loss
3 packets transmitted, 3 packets received, 0.0% packet loss
... and on ...

And strangely, if I run the same command from the pfSense VM itself, everything is fine:

[2.1.3-RELEASE][root@pakora.home.mullis.co.uk]/root(3): sh

while true ; do ping -q -c 3 -t 4 -W 1000 10.100.1.1 | grep "packet loss" ; sleep 3 ; done

3 packets transmitted, 3 packets received, 0.0% packet loss
3 packets transmitted, 3 packets received, 0.0% packet loss
3 packets transmitted, 3 packets received, 0.0% packet loss
3 packets transmitted, 3 packets received, 0.0% packet loss
3 packets transmitted, 3 packets received, 0.0% packet loss
3 packets transmitted, 3 packets received, 0.0% packet loss
3 packets transmitted, 3 packets received, 0.0% packet loss
… and on ...

I've used mtu-test in my pfSense OpenVPN config and it seems that MTU is not the problem.

Jun 25 13:53:01 openvpn[15895]: NOTE: Empirical MTU test completed [Tried,Actual] local->remote=[1557,1557] remote->local=[1557,1557]

Pinging with the DF bit set, the symptoms are the same regardless of packet size. I've also tried TCP (with nping) and the results are the same: only 50% of the flows are successfully established, but those that are established are good for any amount of data.
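For anyone wanting to repeat the DF test, here is a minimal sketch of a size sweep. It assumes BSD/macOS ping syntax (-D sets Don't Fragment, -W is a wait time in milliseconds); Linux ping uses different flags. The function name df_sweep and the sizes in the usage line are made up for illustration.

```shell
#!/bin/sh
# Sketch only: ping a target at several ICMP payload sizes with DF set.
# Assumes BSD/macOS ping flags (-D = don't fragment, -W = wait in ms).
# df_sweep is a hypothetical helper name, not part of any tool.
df_sweep() {
    target=$1
    shift
    for size in "$@"; do
        # -q: summary only, -c 3: three probes, -s: payload size in bytes
        if ping -q -D -c 3 -W 1000 -s "$size" "$target" >/dev/null 2>&1; then
            echo "$size ok"
        else
            echo "$size FAIL"
        fi
    done
}

# Example (sizes are arbitrary): df_sweep 10.100.1.1 64 512 1200 1400 1472
```

If failures cluster above a certain size, MTU is a suspect. If ok/FAIL is scattered regardless of size, as it was here, the cause is probably elsewhere.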

I thought it might be an L2 problem with some funky duplicate MAC addresses floating around, so I rebooted all of the switches and forcibly changed the MAC addresses on my FW interfaces to random values. The problem persists.

So, to re-cap:

(1) - From FW to any host (including Remote Lab via OpenVPN tunnel): All good.
(2) - From FW to any external host: All good.
(3) - From LAN host to any external host: All good.
(4) - From LAN host to Remote Lab via host-based OpenVPN SW Client: All good.
(5) - From LAN host to Remote Lab via OpenVPN tunnel: 50% of flows fail.

I think the key points are (1) and (5). For (5), this even includes the Proxmox server hosting the pfSense instance.

Any ideas on where to look next?

To be honest, I've spent the time writing this as much to get my thoughts in order about the problem… but if anyone has any suggestions on where to look next then please share.

Thanks in advance

heper

Are you running NAT on your OpenVPN instances, by any chance?

wawawawa

Yes I am.

I found the problem. I also had an OpenVPN server configured for incoming traffic, and ended up with equal-cost routes for the same subnet via each tunnel!
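For anyone who lands here with the same "each flow is either 100% or 0%" symptom, here is a rough sketch of one way to spot this. It scans a routing-table dump for destinations listed more than once. It assumes netstat -rn style output with the destination in column 1 and the gateway in column 2, and that the duplicate routes are actually visible in the table on the box you run it on. The function name find_duplicate_routes is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print every destination that appears more than once in a
# routing table read from stdin (netstat -rn style: destination in
# column 1, gateway in column 2). find_duplicate_routes is a
# hypothetical helper name.
find_duplicate_routes() {
    awk '
        # Keep rows whose first field looks like a route destination:
        # it starts with a digit or is the literal word "default".
        $1 ~ /^[0-9]/ || $1 == "default" {
            count[$1]++
            gateways[$1] = gateways[$1] " " $2
        }
        END {
            for (dest in count)
                if (count[dest] > 1)
                    printf "%s via%s\n", dest, gateways[dest]
        }
    '
}

# Example on a FreeBSD-based firewall:
#   netstat -rn -f inet | find_duplicate_routes
```

Any line it prints is a subnet reachable via more than one gateway, which is worth checking against the tunnels you expect to carry it.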

You live and learn….

Thanks