Strange slowdown between 2 sites.



  • I have given up trying to figure this one out on my own, and I haven't been able to find anything similar when searching.

    I have 2 pfSense 2.2.4 systems that experience a slowdown of transfers (HTTP, SFTP, SMB) between sites, both inside and outside the OpenVPN tunnel, after a random amount of time (hours to days).
    Transfers go from maxing the upload speeds (40 Mbit and 20 Mbit) with a single transfer to being limited to about 6 Mbit per transfer - yet full bandwidth is still achievable with multiple simultaneous transfers.
    Speed tests (inside and outside the ISP network) at both sites remain at full speed; only transfers between the sites are slowed.
    I experienced it last week, last night and again this afternoon. At first I thought it might be peak-time congestion, but simply reconnecting the PPPoE session on one box returned the transfers to full speed.

    I have checked:

    • All NIC hardware offloading is disabled
    • No shaping or limiting
    • Nothing obvious appears in the logs
    • The route between sites doesn't appear to change
    • Manual MTU discovery doesn't reveal any change in the path MTU
    • RTT remains the same
    • No packet loss
    • VMware NICs are all negotiated to 1000/full duplex
    • Direct 1 m cable from the VMware external NIC to the NBN NTD
    • No VLANs
    • Tested with and without VMware Tools installed

    I have not yet tried e1000 NICs, and I don't have the spare hardware to try bare-metal installs.
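One test that might help characterise the single-flow ceiling is comparing one stream against several between the two sites; iperf3 makes that easy (the server address below is a placeholder for a host at the far end - adjust to suit):

```shell
# At the far site, run a server:
iperf3 -s

# From this site, a single TCP stream for 30 seconds:
iperf3 -c 192.168.2.10 -t 30

# The same test with 4 parallel streams. If one stream is capped at
# ~6 Mbit while 4 streams together reach line rate, the path is
# limiting per-flow throughput rather than total bandwidth - which
# points at policing/shaping rather than raw capacity.
iperf3 -c 192.168.2.10 -t 30 -P 4
```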

    Setup


    System 1
    ----------
    ESXi 5.5.0
    VM version 8
    Intel S1200RPL MB - dual Intel I210
    Xeon E3-1231 v3 (AES-NI)
    2 vCPU
    1 GB RAM
    2 VMXNET3 NICs

    2.2.4-RELEASE (amd64)
    WAN - TPG.com.au NBN 100/40 fibre, PPPoE, MTU 1492, dynamic IP
    OpenVPN site-to-site - UDP tun, AES-256-CBC, SHA256, Cryptodev, comp-lzo no preference, default MTU/MSS/fragment etc.

    System 2
    ----------
    ESXi 6.0.0
    VM version 11
    ASRock B85M-ITX MB - Intel PRO/1000 PT dual port
    i3-4170 (AES-NI)
    2 vCPU
    1 GB RAM
    2 VMXNET3 NICs

    2.2.4-RELEASE (amd64)
    WAN - TPG.com.au NBN 50/20 fibre, PPPoE, MTU 1492, dynamic IP
    OpenVPN site-to-site - UDP tun, AES-256-CBC, SHA256, Cryptodev, comp-lzo no preference, default MTU/MSS/fragment etc.


    RTT between sites is about 8 ms.
    1 shared hop between the WAN IPs.
    Up to ping -D -c 1 -s 1464 x.x.x.x works between the hosts' WAN IPs.
    Up to ping -D -c 1 -s 1472 x.x.x.x works between the hosts' VPN tunnel IPs - I guess it's being fragmented, but performance doesn't appear to suffer.
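The two payload sizes above line up with the standard header overheads. A quick sanity check of the arithmetic (nothing pfSense-specific here - just IPv4/ICMP header sizes and the PPPoE MTU from the WAN config):

```shell
#!/bin/sh
# Largest ICMP payload that fits an MTU = MTU - IPv4 header (20) - ICMP header (8).
IP_HDR=20
ICMP_HDR=8

PPPOE_MTU=1492                                 # 1500 minus 8 bytes of PPPoE framing
WAN_PAYLOAD=$((PPPOE_MTU - IP_HDR - ICMP_HDR))

TUN_MTU=1500                                   # OpenVPN's default tun MTU
TUN_PAYLOAD=$((TUN_MTU - IP_HDR - ICMP_HDR))

echo "wan=$WAN_PAYLOAD tun=$TUN_PAYLOAD"       # wan=1464 tun=1472
```

Since a full 1500-byte tunnel packet plus OpenVPN/UDP/IP encapsulation can't fit inside the 1492-byte PPPoE MTU, the guess about fragmentation looks right.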

    Transfers between sites via WAN = max bandwidth.
    Transfers between sites via VPN tunnel = max bandwidth.
    VPN RTT remains stable while maxing out bandwidth, 8-12 ms.
    CPU max 10%
    RAM 15%

    Anyone got some ideas on how I might be able to pinpoint the cause?

    Is it likely ISP related? The PPPoE reconnect fix hints at some skulduggery, but why they would shape customer-to-customer traffic and nothing else seems odd.

    I will probably build 2 new VMs with e1000 NICs tomorrow and gradually add features week by week to see if I can pin it on something in pfSense.

    Other than a slight annoyance with this bug - https://redmine.pfsense.org/issues/5053 - which I have overcome with a cron job that runs setup_gateways_monitor() every 15 minutes, I have thoroughly enjoyed using pfSense for the last year and a bit, and I look forward to squashing this issue so I can keep using it.
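For reference, the workaround is just a cron entry; something like the following (the exact PHP call is my reconstruction of what pfSense 2.2 exposes in gwlb.inc, so treat it as a sketch rather than a recipe):

```shell
# Added via the Cron package - every 15 minutes, rebuild the gateway
# monitoring that bug #5053 leaves stale after a PPPoE reconnect:
*/15 * * * * root /usr/local/bin/php -q -r 'require_once("gwlb.inc"); setup_gateways_monitor();'
```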

    Thanks,
    Andrew



  • You'll need to eliminate the hardware at either end before you can look at the ISP infrastructure. Do you spot any patterns when you experience the slowdown - an excessive number of states in the state table, high RAM usage, swap being used, or anything else that seems unusual? It might even be worth checking the workload on each core to see if there is a problem with the FreeBSD scheduler, as it's quite easy for various programs to end up running on one particular core, which then gets overloaded and slows everything else down.
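For anyone who wants concrete commands for those checks, something like the following from a pfSense/FreeBSD shell should cover them (a sketch for a stock install - output formats vary between versions):

```shell
# Current state count and state-table limits
pfctl -si | grep -i state
pfctl -sm

# Memory and swap usage
vmstat -s | grep -i free
swapinfo -h

# Per-CPU load: watch whether a single core is pegged while others idle
top -P -S -d 2
```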

    If you can't find anything wrong with your hardware, then looking at the Internet infrastructure seems like the only option left. And yes, ISPs can throttle bandwidth quite easily even if you have an unlimited data package at either end; it's also why market forces didn't win out in that rigged game, as there is little technical difference between ADSL and SDSL modems other than upload speed.

    I believe it's harder to brute-force large amounts of SSL data than short bursts, but given that the ISP/government has complete oversight of the entire communication, from the TLS handshake to the goodbye, getting hold of your certs would make it easier to crack the transmission and see what you were sending - which is why having so much functionality on your firewall increases the risk.

    One way to eliminate the firewall hardware as the fault is to shift the OpenVPN functionality onto separate machines at either end and use pfSense purely for routing and firewalling. There's also nothing stopping you from using pfSense again to manage OpenVPN on your separate VPN boxes.
    Where you create and manage the certs for your VPN is up to you; personally, I prefer to isolate functionality onto individual machines, since a zero-day could give complete access to a machine, and with so many eggs in one basket it's easy pickings for hackers.
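If you do split OpenVPN out, a minimal shared-key site-to-site config matching the parameters listed above would look something like this (addresses and key path are placeholders, not taken from the original setup):

```
# Server side; the client mirrors it with "remote <server-ip>" and the
# ifconfig addresses swapped.
dev tun
proto udp
port 1194
ifconfig 10.8.0.1 10.8.0.2           # placeholder tunnel endpoints
secret /usr/local/etc/openvpn/site.key
cipher AES-256-CBC
auth SHA256
engine cryptodev                     # hardware crypto, as in the pfSense config
comp-lzo no                          # match this to the compression setting in use
keepalive 10 60
```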

    When looking for hardware changes, also keep an eye on other devices in your network. Just this morning I caught my TalkTalk ISP-supplied set-top TV box exploring the network looking for other network services because it couldn't get online, despite all its network settings being correct.

    It's interesting to watch how devices react when different aspects of network functionality become unavailable. I'd like to suggest it's harmless, but as most of the traffic is encrypted or uses an algorithm that makes the plaintext hard to decipher, one can't help but be increasingly suspicious, especially as it's quickest to hack from a rogue device inside your network.

