OpenVPN dropout due to apinger latency detection

jdp0418

I am having a problem where my OpenVPN interfaces restart on a regular basis at the main site of an OpenVPN hub-and-spoke multi-site configuration.

I believe I am seeing latency on my WAN, likely due to usage, and that this leads to these issues mainly because the apinger monitor service flags the gateway with delay. I have had many instances where the gateway gets marked as down and then immediately comes back up, but many more where delay and loss are indicated on the gateway and then clear. My logs only go back so far at this point, and the last time everything reset, the gateway had been marked as down. While that makes sense, I have played with my thresholds, increasing latency tolerance to 1000 ms, and it still doesn't seem to help. I think the VPNs only drop when the gateway gets marked down, but I am not ruling out that it has occurred when the gateway was marked with delay.

I have temporarily disabled apinger on the gateways and the connections have not dropped for the last 2 days. So while the latency is undesirable, it isn't substantial enough to actually affect the tunnels or traffic, just the apinger service, which in turn starts flapping the gateway and reloads services and states in the firewall.

So I have several questions to pose:

1. Can I traffic shape or prioritize apinger traffic specifically so that any bandwidth spikes in or out of the WAN don't step all over the monitor traffic?
2. Can I traffic shape my downstream traffic before it affects or kills the apinger traffic? It's kind of hard to stop download traffic if it's already getting to the device, but I haven't worked with pfSense traffic shaping enough to know how well it performs.
3. Can I disable Latency as a monitor type and only flag a gateway down when there is packet loss?

More about the configuration:
I am running OpenVPN between multiple sites in a hub-and-spoke configuration. There are 7 spoke sites and the HUB site. The HUB site is a dual WAN setup with basic failover between a primary circuit and a backup circuit. All traffic will use one or the other. Each spoke site has 2 OpenVPN tunnel connections into the HUB site, one over each WAN. I am running OSPF to distribute routes and handle failover between the OpenVPN tunnel connections. Each of these tunnel connections is a separate "server" configuration on the HUB firewall. There is also an OpenVPN server configured for remote (dial-in style) clients. All tunnels are UDP and each uses a different port on the firewall.

I wasn't sure if this would best be posed in the OpenVPN sub, or the traffic shaping sub.

stephenw10

1. Yes, set a rule on WAN to catch the ping traffic to and from the monitor IP and prioritise it.
2. Sort of. Like you say you can't really stop that traffic arriving at the interface but you can drop it in favour of other traffic. Of course that may have other adverse effects!
3. You can tune apinger to better suit your line conditions. Set the delay thresholds sufficiently high and the delay alarm will never be triggered. You can also increase the number of delayed packets required to trigger an alarm.
However even with all that apinger can be a tricky beast. ;)
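
To illustrate why raising both the threshold and the number of samples matters, here is a minimal sketch of an averaged-delay alarm with separate trigger and clear thresholds. It is a simplified model for illustration only, not apinger's actual code; the function name, parameters, and sample values are all made up.

```python
from collections import deque


def delay_alarm_events(rtts_ms, delay_high_ms, delay_low_ms, window):
    """Return (sample_index, event) pairs for a simplified delay alarm.

    The alarm fires when the average of the last `window` round-trip
    times exceeds `delay_high_ms`, and is canceled once that average
    falls below `delay_low_ms`. This only models the idea; it is not
    how apinger is implemented.
    """
    recent = deque(maxlen=window)
    alarmed = False
    events = []
    for i, rtt in enumerate(rtts_ms):
        recent.append(rtt)
        if len(recent) < window:
            continue  # wait until a full window of samples exists
        avg = sum(recent) / len(recent)
        if not alarmed and avg > delay_high_ms:
            alarmed = True
            events.append((i, "ALARM"))
        elif alarmed and avg < delay_low_ms:
            alarmed = False
            events.append((i, "canceled"))
    return events


# Hypothetical trace: a short burst of 900 ms replies on an otherwise quiet line.
samples = [20] * 10 + [900] * 4 + [20] * 10

# Low threshold, few samples: the burst raises and then clears an alarm.
tight = delay_alarm_events(samples, delay_high_ms=200, delay_low_ms=100, window=3)

# Higher threshold, more samples: the same burst never raises an alarm.
loose = delay_alarm_events(samples, delay_high_ms=500, delay_low_ms=200, window=10)

print(tight)
print(loose)
```

With the short window the single burst is enough to flap the alarm, while the longer window averages it away. Whether that trade-off suits a given line depends on how long real outages should go undetected.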

Steve

jdp0418

Thanks for your input.

So I can just set a rule up on the WAN to tag and prioritize the ICMP monitor traffic? Do you think a shaper for this traffic would be helpful at all?

I am considering a traffic shaper on the LAN side which could help control the download speed to client PCs. Although that won't stop a burst of download traffic, it would shape it immediately and hopefully reduce the amount of time latency is being seen on the WAN.

I actually had another team member who worked on the apinger settings. I am seeing now that while he adjusted the latency thresholds, he still left the timers at default. So this goes hand in hand with the above, and hopefully extending the timers on the monitor to a longer period (and increasing the number of failed tests) will also help reduce reports of latency from apinger.

AhnHEL

Try Traffic Shaping using CoDel. Set it and forget it, not much easier than that. See if it helps in your situation, couple of clicks and you can test it out.

https://forum.pfsense.org/index.php?topic=88162.msg487235#msg487235

jdp0418

Awesome tip, thanks. I haven't played much with shaping on pfSense and I didn't know about that option in the shaper. I will post back with results. Thanks again!

AhnHEL

You could also try disabling "State Killing On Gateway Failure" in System/Advanced/Miscellaneous within the GUI. This should keep your VPN up when Apinger reports a Loss.

jdp0418

@stephenw10:

1. Yes, set a rule on WAN to catch the ping traffic to and from the monitor IP and prioritise it.

Do you have any tips on doing this? I attempted to make a rule on the WAN interface and it doesn't appear to work. I don't see packets logging (logging is enabled in the rule) and I got this error in the log leading me to believe the rule isn't even working.

php-fpm[35382]: /rc.filter_configure_sync: New alert found: There were error(s) loading the rules: /tmp/rules.debug:270: syntax error - The line in question reads [270]: pass in log quick on $WAN1 reply-to ( em0 xx.xx.xx.xx ) inet proto icmp from yy.yy.yy.yy to $monitor_ips dscp 56 tracker 1427123720 keep state label "USER_RULE: ICMP Monitor Prioritize"

The rule was set up as follows:

Allow ICMP -> Source WAN IP -> Destination (match alias list - 8.8.8.8) -> DiffServ setting = cs7
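
For reference, the `dscp 56` in the rejected rule line is the numeric form of the cs7 DiffServ setting: class selector CSn has DSCP value 8 × n, and the DSCP occupies the upper six bits of the IP DS (former TOS) byte. A small sketch of that arithmetic follows; the function names are hypothetical, and it does not explain why the rule failed to load.

```python
def class_selector_dscp(n):
    """Return the DSCP value for class selector CSn (CS0..CS7)."""
    if not 0 <= n <= 7:
        raise ValueError("class selector must be between 0 and 7")
    return n << 3  # CSn sets only the top three bits of the 6-bit DSCP


def dscp_to_ds_byte(dscp):
    """Return the DS/TOS byte with the DSCP in its upper six bits."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be between 0 and 63")
    return dscp << 2  # the low two bits are left as zero (ECN field)


cs7 = class_selector_dscp(7)
print(cs7, hex(dscp_to_ds_byte(cs7)))
```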

I did play with the CODELQ options and yes, it is very straightforward. It's hard to tell what it is doing for my firewall WAN and apinger latency though.

doktornotor

@jdp0418:

Do you have any tips on doing this? I attempted to make a rule on the WAN interface and it doesn't appear to work. I don't see packets logging (logging is enabled in the rule) and I got this error in the log leading me to believe the rule isn't even working.

Make two quick floating rules for your WAN(s) like this:

The first one is direction out, the second one is direction in.
The Monitoring_IPs alias contains the Monitor IP entries from System - Routing. Or just stick the IP there directly if you have just a single WAN.
Set the Ackqueue/Queue as none/qACK or something else high priority. I made a qICMP queue for this so that you can see that something really gets matched in there.

Note: the above assumes you have some shaper in place, produced by the wizard or manually. If you just stick Codel on WAN, well… there's nothing to configure with queues or anything.

jdp0418

doktornotor - I gave your suggestion a shot. It seems I have a lot to learn about the traffic shaping capabilities of the pfSense firewall. I wasn't able to get it to work as you described, but I think my error is primarily in how I currently have the shaping set up.

I enabled CBQ shaping on my WAN and LAN. I created 2 queues - qICMP and qTraffic. I set qTraffic as my default queue. Under the shaper statistics, I see traffic being reported in the qTraffic queue (although I didn't attach it to any rules), but I do not see any traffic in the qICMP queue, despite having created the floating rules as you suggested.

I attached a screenshot of my queue setting in the rule. Is that right?

Otherwise, not sure what I am missing other than just not doing the shaping right to begin with. In your test setup, you saw matches in the queue under the shaper stats?

[Attachment: ackqICMP.PNG]

jdp0418

@AhnHEL:

You could also try disabling "State Killing On Gateway Failure" in System/Advanced/Miscellaneous within the GUI. This should keep your VPN up when Apinger reports a Loss.

Actually I found that this isn't a state clearing issue.

When delay occurs:
Mar 24 12:14:36 apinger: alarm canceled: AWAN(x.x.x.x) *** AWANdelay ***
Mar 24 12:14:08 apinger: ALARM: AWAN(x.x.x.x) *** AWANdelay ***

The firewall is RESTARTING services altogether!
Mar 24 12:14:52 php-fpm[15338]: /rc.start_packages: Restarting/Starting all packages.
Mar 24 12:14:51 check_reload_status: Starting packages
Mar 24 12:14:51 php-fpm[15338]: /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - -> 10.1.1.1 - Restarting packages.
Mar 24 12:14:51 check_reload_status: Reloading filter
Mar 24 12:14:51 php-fpm[15338]: /rc.newwanip: rc.newwanip: on (IP address: 10.1.1.1) (interface: []) (real interface: ovpns4).
Mar 24 12:14:51 php-fpm[15338]: /rc.newwanip: rc.newwanip: Info: starting on ovpns4.
Mar 24 12:14:50 check_reload_status: rc.newwanip starting ovpns4
Mar 24 12:14:50 kernel: ovpns4: link state changed to UP
Mar 24 12:14:47 check_reload_status: Reloading filter
Mar 24 12:14:47 kernel: ovpns4: link state changed to DOWN
Mar 24 12:14:47 php-fpm[65238]: /rc.openvpn: OpenVPN: Resync server4 Remote Access VPN
Mar 24 12:14:47 php-fpm[65238]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use AWAN.
Mar 24 12:14:46 check_reload_status: Reloading filter
Mar 24 12:14:46 check_reload_status: Restarting OpenVPN tunnels/interfaces
Mar 24 12:14:46 check_reload_status: Restarting ipsec tunnels
Mar 24 12:14:46 check_reload_status: updating dyndns AWAN
Mar 24 12:14:33 nrpe[3162]: There's already an NRPE server running (PID 88503). Bailing out…
Mar 24 12:14:33 nrpe[3162]: Starting up daemon
Mar 24 12:14:31 php-fpm[25418]: /rc.filter_configure_sync: MONITOR: AWAN has high latency, omitting from routing group WAN1toWAN2
Mar 24 12:14:30 nrpe[73937]: There's already an NRPE server running (PID 88503). Bailing out…
Mar 24 12:14:30 nrpe[73937]: Starting up daemon
Mar 24 12:14:29 php-fpm[56975]: /rc.start_packages: [filer] filer_xmlrpc_sync.php is starting.
Mar 24 12:14:29 php-fpm[56975]: /rc.start_packages: [filer] filer_xmlrpc_sync.php is starting.
Mar 24 12:14:28 php-fpm[56975]: /rc.start_packages: Restarting/Starting all packages.
Mar 24 12:14:27 check_reload_status: Starting packages
Mar 24 12:14:27 php-fpm[39957]: /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - -> 10.1.1.1 - Restarting packages.
Mar 24 12:14:27 check_reload_status: Reloading filter
Mar 24 12:14:27 php-fpm[39957]: /rc.newwanip: rc.newwanip: on (IP address: 10.1.1.1) (interface: []) (real interface: ovpns4).
Mar 24 12:14:27 php-fpm[39957]: /rc.newwanip: rc.newwanip: Info: starting on ovpns4.
Mar 24 12:14:26 check_reload_status: rc.newwanip starting ovpns4
Mar 24 12:14:26 kernel: ovpns4: link state changed to UP
Mar 24 12:14:21 php-fpm[34202]: /rc.filter_configure_sync: MONITOR: AWAN has high latency, omitting from routing group WAN1toWAN2
Mar 24 12:14:20 php-fpm[25418]: /rc.openvpn: MONITOR: AWAN has high latency, omitting from routing group WAN1toWAN2
Mar 24 12:14:20 check_reload_status: Reloading filter
Mar 24 12:14:20 kernel: ovpns4: link state changed to DOWN
Mar 24 12:14:20 php-fpm[25418]: /rc.openvpn: MONITOR: AWAN has high latency, omitting from routing group WAN1toWAN2
Mar 24 12:14:20 php-fpm[25418]: /rc.openvpn: MONITOR: AWAN has high latency, omitting from routing group WAN1toWAN2
Mar 24 12:14:20 php-fpm[25418]: /rc.openvpn: OpenVPN: Resync server4 Remote Access VPN
Mar 24 12:14:20 php-fpm[25418]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use AWAN.
Mar 24 12:14:20 php-fpm[25418]: /rc.dyndns.update: MONITOR: AWAN has high latency, omitting from routing group WAN1toWAN2
Mar 24 12:14:19 check_reload_status: Reloading filter
Mar 24 12:14:19 check_reload_status: Restarting OpenVPN tunnels/interfaces
Mar 24 12:14:19 check_reload_status: Restarting ipsec tunnels
Mar 24 12:14:19 check_reload_status: updating dyndns AWAN

I know I need to educate myself on the traffic shaper in pfSense. However, it seems to me that services shouldn't be restarted just because apinger detects delay or removes a gateway from a group. And I do currently have state killing disabled.
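
The log excerpt above is listed newest first, which makes the chain of events harder to follow. A rough sketch for putting such lines into chronological order is shown here, assuming the `Mon DD HH:MM:SS process[pid]: message` shape seen in the excerpt. The regular expression and helper names are illustrative and not part of pfSense; the sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Timestamp, process name, optional [pid], then the message.
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) ([^:\[\s]+)(?:\[\d+\])?: (.*)$")


def parse_line(line):
    """Split one log line into (timestamp, process, message), or None."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    stamp, process, message = match.groups()
    # The log carries no year, so strptime falls back to its default year.
    # That is good enough for ordering lines from the same day.
    when = datetime.strptime(stamp, "%b %d %H:%M:%S")
    return when, process, message


def chronological(lines):
    """Return parsed lines oldest first.

    The input is assumed to be newest first, so it is reversed before
    the stable sort to keep same-second entries in their original order.
    """
    parsed = [p for p in map(parse_line, lines) if p is not None]
    return sorted(reversed(parsed), key=lambda entry: entry[0])


sample = [
    "Mar 24 12:14:36 apinger: alarm canceled: AWAN(x.x.x.x) *** AWANdelay ***",
    "Mar 24 12:14:20 kernel: ovpns4: link state changed to DOWN",
    "Mar 24 12:14:19 check_reload_status: Restarting OpenVPN tunnels/interfaces",
    "Mar 24 12:14:08 apinger: ALARM: AWAN(x.x.x.x) *** AWANdelay ***",
]

events = chronological(sample)
for when, process, message in events:
    print(when.strftime("%H:%M:%S"), process, message)
```

Read oldest first, the sample shows the delay alarm at 12:14:08 preceding the OpenVPN restart at 12:14:19 and the ovpns4 link going down at 12:14:20.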