Strange packet loss on OpenVPN client



  • Hi All,

    I’m currently troubleshooting a strange issue I’ve had for quite a while regarding hosts that are routed through my OpenVPN tunnel - it seems I’m receiving packet loss/failure to resolve hostnames intermittently.

    Some general info to start with, will post screenshots if required:
    VPN provider: ExpressVPN
    Connection type: tun
    Host types: Debian-based, in this case an ESXi VM, but in the past a QNAP with the same behaviour.
    Policy route method: a LAN rule matching by source IP sets the gateway to the OpenVPN interface. The same rule tags the traffic, and a floating rule on WAN matches that tag and drops it (to act as a killswitch). Outbound NAT for the 10.1.1.0/24 subnet goes to the address of the OVPN interface.

    DNS etc: DNS is pushed to VPN clients via DHCP, currently using FreeDNS - DNS Resolver also has a deny ACL rule for misbehaving clients.

    Custom Options and VPN Config:
    ```
    tun-mtu 1500;fragment 1300;mssfix 1450;sndbuf 524288;rcvbuf 524288;fast-io;persist-key;persist-tun;verify-x509-name Server name-prefix;ns-cert-type server
    ```

    See the following link regarding setup with ExpressVPN. I omitted some of their custom options, as it seems their staff just copy-pasted them, but the rest of the VPN client settings are identical. My changes are mostly routing and policy, as I do not route my whole LAN.

    https://www.expressvpn.com/support/vpn-setup/pfsense-with-expressvpn-openvpn/

    **Behaviour:**
    Any hosts that route through the VPN seem to intermittently drop packets, and sometimes fail to resolve hostnames. However, wget for files and anything with transmission seem to work with no issues.

    Pings to hostnames and 8.8.8.8 will usually take 5 seconds to get started, then respond with around 60ms, then drop more often than not.

    Browsing sucks, pages will not resolve like 60-70% of the time.
    So I’m not sure what’s going on here.

    Off the VPN no packet loss, VPN is not dropping, nothing abnormal in logs.
    Initially had a compression mismatch or something in the logs but after restarting VPN this seemed to have disappeared.

    Note I am using no compression - comp-lzo no

    **Troubleshooting:**

    - Tried multiple hosts, same behaviour.
    - Tried different DNS servers, same behaviour.
    - Restored configs.
    - Changed tunnel MTU in steps down to 1200, no change.
    - Changed fragment size, changed mssfix.
    - Changed client MTU on ESXi, down to 1300.
    - Changed MTU on Debian host, probably did it wrong with the vmnic, but the above should’ve enacted that change (afaik).
    - Changed ESXi load balancing route method to IP hash (from port-id).
    - Used the exact ExpressVPN custom options (slows data rate and still same symptoms).
    - Tried multiple server locations for VPN provider, including a local one.
    - Contacted ExpressVPN support; they sent me the guide and that was it, besides a suggestion to try TCP.
    - Removed fast-io, no changes.
    - Tried multiple compression settings, no changes.
    - Tried TCP, couldn’t even get throughput, probably a config problem nonetheless.
    - As per the logs, one of the pushed options is net30; set this manually.

    Need some ideas guys.
    Cheers.

    EDIT: Checked logs, compression error has come back
    ```
    Bad compression stub decompression header byte: 102
    ```
    


  • Almost 100% certain this is an MTU issue.

    If I ping from my desktop (win 10) - 1472 is the largest packet size I can send without fragmentation.

    I've tried setting the tun-mtu to 1472, 1400, 1200, etc, no improvements.

    I've tried fragment and mssfix at 1400, 1300 and 1200. There were some improvements, but I can't find a good setting, so I need advice here.
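
    For reference, a rough sketch of the arithmetic behind the 1472 figure, assuming IPv4 with no IP options: the ICMP payload plus the 8-byte ICMP header plus the 20-byte IPv4 header gives the size of the packet on the wire. The function names are made up for illustration and are not part of any tool.

    ```shell
    # Assumes IPv4 without options: 20-byte IP header + 8-byte ICMP header.
    icmp_payload_to_mtu() {
      echo $(( $1 + 20 + 8 ))
    }

    # Inverse: the largest ping payload that fits in a given MTU unfragmented.
    mtu_to_icmp_payload() {
      echo $(( $1 - 20 - 8 ))
    }

    icmp_payload_to_mtu 1472    # prints 1500
    mtu_to_icmp_payload 1400    # prints 1372
    ```

    In other words, 1472 corresponds to a standard 1500-byte MTU on whatever path that ping took.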

    What's best practice for this sorta thing guys?

    Cheers.


  • Netgate

    > Pings to hostnames and 8.8.8.8 will usually take 5 seconds to get started, then respond with around 60ms, then drop more often than not.

    This has zero to do with MSS so you are probably chasing a red herring. Fix that and all of your problems will probably go away.

    Packet capture those pings on the OpenVPN interface. If they are going out and not coming back, getting a new VPN provider is probably easier than getting them to fix it. There are plenty of choices.
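
    A minimal sketch of how such a capture could be tallied, assuming tcpdump is run as root on the firewall with its default one-line output, and that the OpenVPN client interface is called `ovpnc1` (an assumption; check the real name on your box). `summarize_icmp` is a hypothetical helper, not an existing tool.

    ```shell
    # Capture step (needs root; ovpnc1 is an assumed interface name):
    #   tcpdump -n -i ovpnc1 -c 200 icmp > /tmp/vpn_icmp.txt
    #
    # summarize_icmp reads tcpdump text output on stdin and counts
    # echo requests against echo replies.
    summarize_icmp() {
      awk '
        /ICMP echo request/ { req++ }
        /ICMP echo reply/   { rep++ }
        END { printf "requests=%d replies=%d unanswered=%d\n", req, rep, req - rep }
      '
    }

    # Usage:
    #   summarize_icmp < /tmp/vpn_icmp.txt
    ```

    If the requests show up on the tunnel interface and the unanswered count keeps climbing, the replies are most likely being lost past that interface (the tunnel or the provider) rather than in the LAN rules.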



  • @Derelict:

    > > Pings to hostnames and 8.8.8.8 will usually take 5 seconds to get started, then respond with around 60ms, then drop more often than not.
    >
    > This has zero to do with MSS so you are probably chasing a red herring. Fix that and all of your problems will probably go away.
    >
    > Packet capture those pings on the OpenVPN interface. If they are going out and not coming back, getting a new VPN provider is probably easier than getting them to fix it. There are plenty of choices.

    I’ll do a packet capture first thing and test this.

    Any reason I noticed improvement with ping response when messing with those 3 values?
    I’m really quite convinced it is MTU related, but only because this issue has been occurring for 2 years straight. It hasn’t affected me all that much, but it’s something I’d like to fix.
    I’m not 100% ready just yet to try a different VPN provider. I’m pretty convinced it’s something else, as it doesn’t happen on the desktop VPN client - will confirm again however.

    Let’s just say it was MTU related, how would one rule that out?
    I also ran the MTU test with OpenVPN, which returned 1201. I tried that for tun-mtu as well, but it really just increased ping times. I probably did something wrong though.


  • Netgate

    Unless you are deliberately setting the packet size high, the packets will be too small to display MTU issues.

    An MTU issue for something like a ping will not be intermittent like that in almost all cases.
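
    One way to deliberately set the packet size high is to send pings with the Don't Fragment bit set and search for the largest payload that still gets a reply. A hedged sketch, assuming Linux iputils `ping` (where `-M do` sets DF and `-s` sets the payload size; on Windows the rough equivalent is `ping -f -l <size>`). `find_max_payload` and `df_ping` are illustrative names, and 8.8.8.8 is just the target already used in this thread.

    ```shell
    # find_max_payload PROBE LO HI
    # Binary search for the largest size in [LO, HI] for which "PROBE size"
    # succeeds. Assumes PROBE succeeds at LO and that success is monotonic.
    find_max_payload() {
      fmp_probe=$1 fmp_lo=$2 fmp_hi=$3
      while [ "$fmp_lo" -lt "$fmp_hi" ]; do
        fmp_mid=$(( (fmp_lo + fmp_hi + 1) / 2 ))
        if "$fmp_probe" "$fmp_mid"; then
          fmp_lo=$fmp_mid
        else
          fmp_hi=$(( fmp_mid - 1 ))
        fi
      done
      echo "$fmp_lo"
    }

    # Probe: three DF pings of the given payload size. ping exits 0 if at
    # least one reply arrives, which tolerates some unrelated packet loss.
    df_ping() {
      ping -c 3 -W 2 -M do -s "$1" "${TARGET:-8.8.8.8}" >/dev/null 2>&1
    }

    # Example (run from a host routed through the VPN):
    #   find_max_payload df_ping 0 1472
    ```

    Add 28 to the result to get the usable IPv4 MTU on that path. A default-sized ping sits far below any of the MTU values tried in this thread, which is why it would not expose an MTU problem.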



  • I have the same behavior on my setup. What I have noticed is that it's actually related to how frequently you issue the ICMP requests. Interestingly enough, the sweet spot seems to be 1000ms between ICMP requests (I tried numerous times); you actually get more packet loss if you use 2000ms.

    for example:

    ping -i 0.2 google.com
    103 packets transmitted, 24 received, 76% packet loss, time 21443ms

    ping -i 0.25 google.com
    55 packets transmitted, 17 received, 69% packet loss, time 13744ms

    ping -i 0.5 google.com
    49 packets transmitted, 23 received, 53% packet loss, time 24100ms

    ping -i 0.75 google.com
    51 packets transmitted, 29 received, 43% packet loss, time 37550ms

    ping -i 1 google.com
    20 packets transmitted, 20 received, 0% packet loss, time 19026ms

    ping -i 2 google.com
    20 packets transmitted, 17 received, 15% packet loss, time 38014ms
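
    For anyone who wants to repeat this comparison, here is a rough sketch of a loop over the same intervals, assuming Linux iputils `ping` (`-i` interval in seconds, `-c` packet count). `sweep_intervals` and `loss_pct` are made-up helper names. `loss_pct` uses integer division, which reproduces the percentages quoted above.

    ```shell
    # loss_pct TRANSMITTED RECEIVED
    # Integer packet-loss percentage, truncated toward zero.
    loss_pct() {
      echo $(( ($1 - $2) * 100 / $1 ))
    }

    # sweep_intervals TARGET [COUNT]
    # Runs ping at each interval from the post above and prints the
    # summary line for each run. COUNT defaults to 20 packets.
    sweep_intervals() {
      sw_target=$1
      sw_count=${2:-20}
      for sw_interval in 0.2 0.25 0.5 0.75 1 2; do
        printf '%s s: ' "$sw_interval"
        ping -i "$sw_interval" -c "$sw_count" "$sw_target" | grep 'packets transmitted'
      done
    }

    # Example:
    #   sweep_intervals google.com 50
    loss_pct 103 24    # prints 76
    ```

    Using the same packet count for every interval makes the runs easier to compare than the mixed counts above.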