OpenVPN process(es) die sporadically

JeGr

pfSense 1.2, embedded version, 256 MB RAM, 1 GHz VIA Eden CPU

Hi folks,

After quite a while with this problem I definitely need some help solving it. I don't know if it's a configuration issue or an OpenVPN bug. Here we go:

We have the mentioned device running as our (multi-WAN) firewall gateway at our company's main location. It works very well apart from some very, very strange OpenVPN issues.
Historically we had an OpenVPN site2site connection to our datacenter and an OpenVPN server at work that our guys use for remote working. After some reorganization, I merged the VPN tunnel server and the VPN worker server, together with the old and badly working firewall, into one strong pfSense installation. Yay! That worked for quite some time, and the site2site tunnel is very solid (the mentioned device is configured as client for the tunnel). But the dial-in OpenVPN service has gone bad.

For our team, I merged the old configurations into pfSense by creating 11 server configs within pfSense, one for each team member (as the old OpenVPN server was configured). The problem now is that every now and then some of these server processes simply die. Yesterday I checked after I got a bug report and found server #10 was not running, so I restarted it and checked the process list:


??  Ss     1:09.45 openvpn --config /var/etc/openvpn_server0.conf
??  Ss     1:07.35 openvpn --config /var/etc/openvpn_server1.conf
??  Ss     1:02.15 openvpn --config /var/etc/openvpn_server2.conf
??  Ss     1:03.42 openvpn --config /var/etc/openvpn_server3.conf
??  Ss     1:01.38 openvpn --config /var/etc/openvpn_server4.conf
??  Ss     1:06.45 openvpn --config /var/etc/openvpn_server5.conf
??  Ss     1:10.11 openvpn --config /var/etc/openvpn_server6.conf
??  Ss     0:43.88 openvpn --config /var/etc/openvpn_server7.conf
??  Ss     0:43.12 openvpn --config /var/etc/openvpn_server8.conf
??  Ss     0:41.23 openvpn --config /var/etc/openvpn_server9.conf
??  Ss     0:40.20 openvpn --config /var/etc/openvpn_server10.conf
??  Ss    40:51.03 openvpn --config /var/etc/openvpn_client0.conf

So as you can see, all servers and the client (running the s2s tunnel) are up and running. No errors in the logs, and the team can dial in (via dynamic DSL lines).

This morning I checked again:


  553  ??  Ss     1:09.45 openvpn --config /var/etc/openvpn_server0.conf
  792  ??  Ss     0:43.88 openvpn --config /var/etc/openvpn_server7.conf
  857  ??  Ss    40:51.03 openvpn --config /var/etc/openvpn_client0.conf

Besides the s2s tunnel, only 2 (!) servers are running; all other server processes died overnight (including the 3 bosses' links, which was quite … stunning :()
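
For anyone who wants to catch this earlier than "next morning", here is a minimal sketch of a check that compares a process listing against the configs that are supposed to be running. The function name `missing_configs` is made up for illustration, and it assumes each daemon shows up as `openvpn --config <path>` exactly as in the listings above.

```shell
#!/bin/sh
# Print every expected config path that has no matching openvpn process.
# The process listing is read from stdin (e.g. the output of `ps ax`);
# the expected config paths are passed as arguments.
missing_configs() {
    listing=$(cat)
    for conf in "$@"; do
        # Compare the last field of each line exactly, so that
        # openvpn_server1.conf does not match openvpn_server10.conf.
        if ! printf '%s\n' "$listing" |
            awk -v c="$conf" '/openvpn --config/ && $NF == c { found = 1 } END { exit !found }'
        then
            printf '%s\n' "$conf"
        fi
    done
}
```

It could be called as `ps ax | missing_configs /var/etc/openvpn_server*.conf /var/etc/openvpn_client0.conf`; an empty result would mean every listed config has a live process.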

What's wrong with openvpn in this scenario? Isn't it possible to constantly run the processes?

I see some errors in the logs like:


Dec  2 19:13:47 gate23 openvpn[696]: event_wait : Interrupted system call (code=4)
Dec  2 19:13:47 gate23 openvpn[696]: /etc/rc.filter_configure tun5 1500 1545 10.0.1.17 10.0.1.18 init
Dec  2 19:14:07 gate23 openvpn[696]: SIGHUP[hard,] received, process restarting
Dec  2 19:14:07 gate23 openvpn[696]: OpenVPN 2.0.6 i386-portbld-freebsd6.2 [SSL] [LZO] built on Sep 13 2007
Dec  2 19:14:10 gate23 openvpn[696]: LZO compression initialized
Dec  2 19:14:10 gate23 openvpn[696]: TUN/TAP device /dev/tun6 opened
Dec  2 19:14:10 gate23 openvpn[696]: /sbin/ifconfig tun6 10.0.1.17 10.0.1.18 mtu 1500 netmask 255.255.255.255 up
Dec  2 19:14:11 gate23 openvpn[696]: /etc/rc.filter_configure tun6 1500 1545 10.0.1.17 10.0.1.18 init
Dec  2 19:14:27 gate23 openvpn[696]: UDPv4 link local (bound): [undef]:5005
Dec  2 19:14:27 gate23 openvpn[696]: UDPv4 link remote: [undef]
Dec  2 23:22:26 gate23 openvpn[696]: event_wait : Interrupted system call (code=4)
Dec  2 23:22:26 gate23 openvpn[696]: /etc/rc.filter_configure tun6 1500 1545 10.0.1.17 10.0.1.18 init
Dec  2 23:22:52 gate23 openvpn[696]: SIGHUP[hard,] received, process restarting
Dec  2 23:22:52 gate23 openvpn[696]: OpenVPN 2.0.6 i386-portbld-freebsd6.2 [SSL] [LZO] built on Sep 13 2007
Dec  2 23:22:55 gate23 openvpn[696]: WARNING: file '/var/etc/openvpn_server6.secret' is group or others accessible
Dec  2 23:22:55 gate23 openvpn[696]: LZO compression initialized
Dec  2 23:22:55 gate23 openvpn[696]: TUN/TAP device /dev/tun6 opened
Dec  2 23:22:55 gate23 openvpn[696]: /sbin/ifconfig tun6 10.0.1.17 10.0.1.18 mtu 1500 netmask 255.255.255.255 up
Dec  2 23:22:55 gate23 openvpn[696]: /etc/rc.filter_configure tun6 1500 1545 10.0.1.17 10.0.1.18 init
Dec  2 23:22:58 gate23 openvpn[696]: UDPv4 link local (bound): [undef]:5005
Dec  2 23:22:58 gate23 openvpn[696]: UDPv4 link remote: [undef]
Dec  3 00:16:44 gate23 openvpn[696]: event_wait : Interrupted system call (code=4)
Dec  3 00:16:44 gate23 openvpn[696]: /etc/rc.filter_configure tun6 1500 1545 10.0.1.17 10.0.1.18 init
Dec  3 09:12:05 gate23 openvpn[696]: SIGHUP[hard,] received, process restarting
Dec  3 09:12:05 gate23 openvpn[696]: OpenVPN 2.0.6 i386-portbld-freebsd6.2 [SSL] [LZO] built on Sep 13 2007
Dec  3 09:12:07 gate23 openvpn[696]: WARNING: file '/var/etc/openvpn_server6.secret' is group or others accessible
Dec  3 09:12:07 gate23 openvpn[696]: LZO compression initialized
Dec  3 09:12:07 gate23 openvpn[696]: TUN/TAP device /dev/tun3 opened
Dec  3 09:12:07 gate23 openvpn[696]: /sbin/ifconfig tun3 10.0.1.17 10.0.1.18 mtu 1500 netmask 255.255.255.255 up
Dec  3 09:12:07 gate23 openvpn[696]: FreeBSD ifconfig failed: shell command exited with error status: 1
Dec  3 09:12:07 gate23 openvpn[696]: Exiting

I suspect that the load balancer changing the routing or switching lines (UP state) near the "Exiting" timestamps of OpenVPN has something to do with it, but I'm not sure whether that is possible, or whether it can have the side effect that my OpenVPN server processes die when the load balancing changes. As we have a few problems with our second WAN line at the moment (connected to OPT1; the VPN processes are running on the "good one", WAN), slbd changes the weighting quite often, so if that is the problem I'm doomed ;) I can restart the processes manually, but that's no long-term solution.

Anyone? Any ideas to help?

JeGr

There seems to be some sort of connection with our bandwidth problems on WAN2: the load balancer (slbd) very often has to realign routing and bring the second interface up and down. All VPN daemon kills are logged shortly after slbd switches the outgoing connection from WAN2 to WAN and back. It seems the VPN servers are restarted (or something like that) and don't re-use their former tunnel interface correctly. So the OpenVPN server daemon that was formerly using configuration #3 and tun3 as its interface tries to restart with tun5, for example, and that fails because tun5 cannot be configured with the same interface settings as tun3 (duplicate IP etc.). So the daemon terminates. And so on. After many (or most) daemons have terminated, the remaining ones can be configured correctly (because there are not that many other interfaces left to collide with).
I tried that manually today after I had again lost 6 daemons (that were restarted this morning). After starting a few of them I ran into the "tunX IP already configured" issue. After I restarted them in the correct order, so that the tun interfaces don't collide, all was up again.
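
The interface drift can be traced from the syslog lines themselves. This is only a sketch: the helper name `tun_history` is made up, and it assumes the log format shown in the first post (month, day, time, host, `openvpn[pid]:`, message).

```shell
#!/bin/sh
# Read syslog text on stdin and print one line per
# "TUN/TAP device /dev/tunN opened" event:
#   month day time openvpn[pid]: /dev/tunN
tun_history() {
    awk '/TUN\/TAP device \/dev\/tun[0-9]+ opened/ {
        # Find the field that holds the device path.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^\/dev\/tun[0-9]+$/) dev = $i
        print $1, $2, $3, $5, dev
    }'
}
```

Fed with the log excerpt from the first post, it would list tun6, tun6 and then tun3 for openvpn[696], which is the same daemon landing on a different device right before the failed ifconfig.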

But I don't fully understand the correlation between slbd and the daemons restarting. Perhaps some dev has a few spare minutes to look at the issue? I suppose it doesn't happen that often when only a few OpenVPN daemons are configured as servers.
If someone needs further information on that to help, let me know.

Greets
Grey

JeGr

Another update:

I disabled the line failover via slbd completely last night, and this morning all OpenVPN server processes were still there. But after the service guy from our cable company worked on our bad WAN line 2, we disconnected that line and plugged it back in later. After the cable modem started working again I checked the pfSense device and: 5 server daemons were down, including the client one (to our datacenter). Again the interface binding problem!

Is there a way to bind each OpenVPN configuration to a specific tunXY interface? I think that would solve the issue.

Edit: It seems this command is causing the whole issue:

 sh -c killall -HUP openvpn 2>/dev/null

JeGr

Sorry if it seems like I'm talking to myself. I modified all server configurations and added a custom "dev tunX" parameter to all OpenVPN configurations so they have to use their assigned tunnel interface. But after the SIGHUP of all openvpn processes ran, 2 were missing afterwards. So it seems the "SIGHUP kill" does not restart all OpenVPN tunnels/servers correctly.
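
To keep the device assignment consistent across a dozen configs, the `dev tunN` line can be derived from the config file name. This is a sketch under assumptions: the helper name `dev_line_for` and the numbering scheme (serverN gets tunN, the client gets an offset past the servers) are made up for illustration, and how the line actually gets into each pfSense-generated config is not covered here.

```shell
#!/bin/sh
# Print the "dev tunN" directive for a config file, using the number
# embedded in the file name (openvpn_server3.conf -> dev tun3).
# The optional second argument is an offset, e.g. to place the client
# config after all servers. The numbering scheme is an assumption.
# File names without a trailing number are not handled.
dev_line_for() {
    conf=$1
    offset=${2:-0}
    # Strip directory and ".conf", then keep only the trailing digits.
    idx=$(basename "$conf" .conf | sed 's/^.*[^0-9]\([0-9][0-9]*\)$/\1/')
    printf 'dev tun%s\n' "$((idx + offset))"
}
```

For example, `dev_line_for /var/etc/openvpn_client0.conf 11` would place the client on tun11, after the 11 servers on tun0 to tun10.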

GruensFroeschli

Hmmm. I replied to this thread suggesting "dev tunX", but it seems my post got lost somewhere ^^"

About the SIGHUP: what exactly do you mean by 2 processes were missing afterwards?
You mean they just died and didn't restart?

JeGr

Exactly. I saw the "killall" in the process list and waited for ~5 min; afterwards only 8 out of 10 OpenVPN daemons were still alive. And this was after I added the "dev tunX" to the configs. I don't get it...

GruensFroeschli

Well, a kind of workaround would be to kill the processes via SIGTERM and then restart them manually in the correct order.

(8 processes, sounds like a 3 bit counter to me…)
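
A rough sketch of that SIGTERM-and-restart workaround, not tested on pfSense 1.2. Assumptions: the configs live in /var/etc with the names seen in the process list, "correct order" means servers by number followed by the client, and each config daemonizes on its own (the process list shows only `--config` on the command line). The function names and the `SETTLE_SECONDS` variable are made up.

```shell
#!/bin/sh
# Hypothetical defaults; both can be overridden from the environment.
OPENVPN_BIN=${OPENVPN_BIN:-openvpn}
CONF_DIR=${CONF_DIR:-/var/etc}

# List the configs sorted by the number in the file name, servers first,
# then clients, so that server2 comes before server10.
ordered_configs() {
    for kind in server client; do
        ls "$CONF_DIR" 2>/dev/null |
            sed -n "s/^openvpn_${kind}\([0-9][0-9]*\)\.conf\$/\1/p" |
            sort -n |
            while read -r n; do
                printf '%s/openvpn_%s%s.conf\n' "$CONF_DIR" "$kind" "$n"
            done
    done
}

# Stop all daemons with SIGTERM, give the tun devices a moment to be
# released, then start the daemons one by one in the order above.
restart_in_order() {
    killall -TERM openvpn 2>/dev/null
    sleep "${SETTLE_SECONDS:-5}"
    ordered_configs | while read -r conf; do
        "$OPENVPN_BIN" --config "$conf"
    done
}
```

Running `ordered_configs` on its own first shows the start order without touching any process; `restart_in_order` is the part that actually kills and restarts.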

JeGr

There are even more. It's 12 altogether: 1 client (tunnel to datacenter) and 11 servers, one per co-worker.

JeGr

OK, I re-enabled slbd today and it works: even after slbd's ICMP poll reports DOWN and the filters are reloaded, the daemons stay alive. I think the problem is related to two things:

an interface changing (DHCP, dis-/enabling)
reloading the openvpn daemons via the stated command (sh -c killall -HUP openvpn)

The SIGHUP seems to kill a random number of daemons while restarting them (for whatever reason). At the moment I'm ordering new CF cards to try a clean installation on one of them and make some modifications. If anyone knows more about this "restart phenomenon" or has similar problems, I would be glad to hear some comments.