Peer to peer (shared key) OpenVpn issues

cmb

The traffic that's getting blocked is a symptom of the VPN timing out and reconnecting. The question is why is it doing that. My first guess, your WAN gateway monitor IP isn't replying reliably, it's going into and out of alarm on occasion, which will restart the VPN when it comes out of alarm.

coolnicklas

Hi, That seems reasonable.

I have been on the wrong track here thinking it was a firewall issue. A did see in a thread in the Firewall section that "chemlud" had a similar problem with blocked valid traffic on his OpenVpn interface but it is was FA blocks. I guess this is what I see here.

So the issue here is that one of my WAN connections are unstable causing the VPN-connection to go down? Or at least the GW monitoring detects that that the WAN GW is down and restarts the VPN connection (and other connections). I should see this in the syslog right?

Yesterday, I did setup a Site to site OpenVpn connection using two Linux servers with static routes through the pfsense boxes, and actually I got the same symptoms. Worked for a short while and then the link died + a lot of valid packets was blocked on the LAN interface this time. Adding that info, it makes sense that the WAN connection is the root cause of the problem.

I have a symmetric 8/8 mbit SDSL connection on one end and a 100/10 fibre connection in the other, one of them must be Fubar.

After having 4-5 hours of sleep three nights in a row because of trubleshhoting this issue my brain doesn't function as it should. Do yoy have any hints for troubleshooting further, except looking in the syslog?
Any additional logging that I should enable? I send all logs to syslog servers so log space is not an issue.

Guest

If it's the gateway: Status -> System Logs -> Gateways and see if you can match the time when the VPN goes down to an apinger alarm…

I had such an alarm yesterday when the firewall went crazy on the FA packages...

This morning I changed the server to ping to for gateway monitoring (the address I used was apparently causing troubles from time to time and the gateway of the PPPoE side of the tunnel is not responding to pings...).

Hope now the tunnel is more stable...

coolnicklas

OK, I think I nailed it this time. It pretty much confirms the theory, right?

System->Gateway log:

May 17 03:48:18 	apinger: alarm canceled: WANGW(193.12.110.9) *** delay ***
May 17 03:48:01 	apinger: ALARM: WANGW(193.12.110.9) *** delay ***

System->General log:

May 17 03:49:02 	php: rc.start_packages: Restarting/Starting all packages.
May 17 03:49:02 	php: rc.start_packages: Restarting/Starting all packages.
May 17 03:48:57 	php: rc.filter_configure_sync: Not installing NAT reflection rules for a port range > 500
May 17 03:48:57 	php: rc.filter_configure_sync: Not installing NAT reflection rules for a port range > 500
May 17 03:48:52 	check_reload_status: Starting packages
May 17 03:48:52 	php: rc.newwanip: pfSense package system has detected an ip change -> 192.168.101.2 ... Restarting packages.
May 17 03:48:52 	check_reload_status: Starting packages
May 17 03:48:52 	php: rc.newwanip: pfSense package system has detected an ip change -> 192.168.50.1 ... Restarting packages.
May 17 03:48:52 	php: rc.newwanip: rc.newwanip: on (IP address: 192.168.101.2) (interface: []) (real interface: ovpnc2).
May 17 03:48:52 	php: rc.newwanip: rc.newwanip: on (IP address: 192.168.50.1) (interface: []) (real interface: ovpns1).
May 17 03:48:52 	php: rc.newwanip: rc.newwanip: Informational is starting ovpnc2.
May 17 03:48:52 	php: rc.newwanip: rc.newwanip: Informational is starting ovpns1.
May 17 03:48:44 	php: rc.start_packages: Restarting/Starting all packages.
May 17 03:48:43 	check_reload_status: rc.newwanip starting ovpnc2
May 17 03:48:43 	kernel: ovpnc2: link state changed to UP
May 17 03:48:43 	php: rc.start_packages: Restarting/Starting all packages.
May 17 03:48:42 	check_reload_status: rc.newwanip starting ovpns1
May 17 03:48:42 	kernel: ovpnc2: link state changed to DOWN
May 17 03:48:42 	kernel: ovpns1: link state changed to UP
May 17 03:48:40 	php: rc.openvpn: OpenVPN: Resync client2
May 17 03:48:40 	kernel: ovpns1: link state changed to DOWN
May 17 03:48:37 	php: rc.openvpn: OpenVPN: Resync server1 RoadWarriors
May 17 03:48:37 	php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WANGW.
May 17 03:48:32 	check_reload_status: Starting packages
May 17 03:48:32 	php: rc.newwanip: pfSense package system has detected an ip change -> 192.168.101.2 ... Restarting packages.
May 17 03:48:32 	php: rc.newwanip: rc.newwanip: on (IP address: 192.168.101.2) (interface: []) (real interface: ovpnc2).
May 17 03:48:32 	php: rc.newwanip: rc.newwanip: Informational is starting ovpnc2.
May 17 03:48:30 	check_reload_status: Starting packages
May 17 03:48:30 	php: rc.newwanip: pfSense package system has detected an ip change -> 192.168.50.1 ... Restarting packages.
May 17 03:48:30 	php: rc.newwanip: rc.newwanip: on (IP address: 192.168.50.1) (interface: []) (real interface: ovpns1).
May 17 03:48:30 	php: rc.newwanip: rc.newwanip: Informational is starting ovpns1.
May 17 03:48:28 	check_reload_status: Restarting OpenVPN tunnels/interfaces
May 17 03:48:28 	check_reload_status: Restarting ipsec tunnels
May 17 03:48:28 	check_reload_status: updating dyndns WANGW
May 17 03:48:24 	php: rc.filter_configure_sync: Not installing NAT reflection rules for a port range > 500
May 17 03:48:23 	php: rc.filter_configure_sync: Not installing NAT reflection rules for a port range > 500
May 17 03:48:23 	check_reload_status: rc.newwanip starting ovpnc2
May 17 03:48:23 	kernel: ovpnc2: link state changed to UP
May 17 03:48:22 	check_reload_status: rc.newwanip starting ovpns1
May 17 03:48:22 	kernel: ovpns1: link state changed to UP
May 17 03:48:22 	kernel: ovpnc2: link state changed to DOWN
May 17 03:48:21 	php: rc.openvpn: OpenVPN: Resync client2
May 17 03:48:21 	check_reload_status: Reloading filter
May 17 03:48:21 	kernel: ovpns1: link state changed to DOWN
May 17 03:48:20 	php: rc.openvpn: OpenVPN: Resync server1 RoadWarriors
May 17 03:48:20 	php: rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WANGW.
May 17 03:48:11 	check_reload_status: Reloading filter
May 17 03:48:11 	check_reload_status: Restarting OpenVPN tunnels/interfaces
May 17 03:48:11 	check_reload_status: Restarting ipsec tunnels
May 17 03:48:11 	check_reload_status: updating dyndns WANGW

And it correlates to when Rsync fails (after a timeout of 15 minutes). I see the same thing happen again at 04:48 and 08:58. I have submitted this error to our ISP so I'll get back when they have looked in to it. I have also used Googles public DNS server, 8.8.8.8, to monitor the GW and I get the same behaviour….

Guest

Where did I write that a I changed to 8.8.8.8? :o …you can see my config online? ;D

Maybe the NSA doesn't like openVPN tunnels... ;)

btw: If you are running snort, have an eye on that funny little beast, too. It took me some weeks to figure out, that under heavy load some P2P rules kill tunnels (IPsec or openVPN does not matter)... ;D

coolnicklas

;D Google…
Update:
My ISP does not see anything special on their side so I did changed the parameters on the GW monitor to,Probe Interval: 10and```
Down: 60


With this setting the GW Alarm does not trigger and the VPN connection and Backups run fine.

However, I think I have found a bug related to OpenVPN and GW monitoring: Can anyone try this:

You must have a OpenVpn tunnel with PSK.

1\. Print the routes in pfsense, Diagnostics->Routes and notice the route to your LAN network on other side of OpenVpn PSK tunnel
2\. In Systems->Routing: edit default GW and change monitor IP (you can use Google's 8.8.8.8 ), save
3\. Print the routes in pfsense, Diagnostics->Routes: The route to the other side of the OpenVpn tunnel is gone
4\. Even a reeboot won't help
5\. In VPN->OpenVpn: edit the PSK tunnel, don't change anything, just save.
6.  Print the routes in pfsense, Diagnostics->Routes: the route is back

NetBandit

@coolnicklas:

My ISP does not see anything special on their side so I did changed the parameters on the GW monitor to,Probe Interval: 10and```
Down: 60

I don't think this has anything to do with the monitoring.

From my experience this happens when you saturate (max out) the bandwidth of the VPN connection.

I get the same problem when I initiate a file transfer in Remote Desktop, which works ok until the bandwidth goes up to maximum.

You can try throttling the bandwidth of RSYNC to 50% of the max capacity of the WAN link it uses to see if that prevents it. You might want to play around with that speed to see where the threshold is.

It looks like this is definitely a bug because there are several other threads for the same root issue; all of which characterized by the same thing:
event_wait : Interrupted system call (code=4)

See these threads:
https://forum.pfsense.org/index.php?topic=75989.0
https://forum.pfsense.org/index.php?topic=76735.0
https://forum.pfsense.org/index.php?topic=77169.0
and possibly
https://forum.pfsense.org/index.php?topic=77214.0

-nb

Andyrue

This seems to be happening to me too. I just switched from Untangle and have been experiencing problems with the pfSense OpenVPN staying stable at full bandwidth. I limited my Rsync bandwidth and it is stable now….I hope they find a fix for this bug soon.

cmb

The "detected IP change" part is the root of the issue. 2.1.4 (should be released this week) has a fix for that. Under maximum upload, your latency increases enough to put the gateway into alarm, and when it comes out of alarm it incorrectly thinks your WAN IP has changed (if it actually did, it would have to restart the OpenVPN instance, which is why it's doing that).

Two things.

a one-letter fix in /etc/rc.newwanip should be enough to take care of the "detected IP change" issue. Find curnwanip in that file and remove the n, so that line shows:
if (is_ipaddr($curwanip))
under the Gateway advanced properties (System>Routing, edit gateway), if your connection's latency under load goes up enough to trigger an alarm, increase the latency thresholds so normal high latency when your connection is under heavy load won't trigger an alarm.

Andyrue

Thank you for the suggestion. I have increased the latency thresholds to 500/1000 but I don't seem to have any typos of curwanip in that file. I'm on 2.1.2. Will do some testing today. Thanks!

Andyrue

Adjusting the threshold seems to have fixed it. Thank you!