Cron based OpenVPN watchdog



  • In my losing battle to keep OpenVPN running consistently without crashing, I have thrown in the towel and created yet another cron-based restart script, this time for OpenVPN.

    I run this every 7 minutes.

    Note that this will not restart the service if the process is already running but OpenVPN is incorrectly shown as not connected on the server side. That is another problem I am having with OpenVPN p-2-p tunnels.

    
    #!/usr/local/bin/php -f
    <?php
    /* $Id$ */
    /*
        /etc/watchdog_openvpn
        written for pfSense (http://www.pfSense.com)
        Shahid Sheikh
    
        This is a quick hack to restart any stopped
        or crashed openvpn client or server instances.
    */
    
    require_once("service-utils.inc");
    require_once("system.inc");
    
    // Walk every registered service and look only at OpenVPN instances.
    $services = get_services();
    foreach ($services as $service) {
    	if ($service['name'] == "openvpn") {
    		// Restart only instances that pfSense reports as not running.
    		if (!get_service_status($service)) {
    			log_error($service['description'] . " was found dead.");
    			$settings = openvpn_get_settings($service['mode'], $service['vpnid']);
    			openvpn_restart($service['mode'], $settings);
    		}
    	}
    }
    ?>
    
    

    For those interested, my OpenVPN client is the one that mostly dumps. At least once in 24 hours. Here is the log sorted in reverse:

    Sep 9 22:26:12 openvpn[52037]: Exiting due to fatal error
    Sep 9 22:26:12 openvpn[52037]: FreeBSD ifconfig failed: external program exited with error status: 1
    Sep 9 22:26:12 openvpn[52037]: /sbin/ifconfig ovpnc1 192.168.5.2 192.168.5.1 mtu 1500 netmask 255.255.255.255 up
    Sep 9 22:26:12 openvpn[52037]: do_ifconfig, tt->ipv6=1, tt->did_ifconfig_ipv6_setup=0
    Sep 9 22:26:12 openvpn[52037]: TUN/TAP device /dev/tun1 opened
    Sep 9 22:26:12 openvpn[52037]: TUN/TAP device ovpnc1 exists previously, keep at program end
    Sep 9 22:26:10 openvpn[52037]: [<CN of my cert>] Peer Connection Initiated with [AF_INET]<openvpn_server_ip>:16001
    Sep 9 22:26:10 openvpn[52037]: WARNING: 'ifconfig' is present in remote config but missing in local config, remote='ifconfig 192.168.5.2 192.168.5.1'
    Sep 9 22:26:10 openvpn[52037]: WARNING: 'tun-ipv6' is present in remote config but missing in local config, remote='tun-ipv6'
    Sep 9 22:26:09 openvpn[52037]: UDPv4 link remote: [AF_INET]<openvpn_server_ip>:16001
    Sep 9 22:26:09 openvpn[52037]: UDPv4 link local (bound): [AF_INET]
    


  • The new Service Watchdog package should do that for you - it works for me.
    Of course, it doesn't work in the odd cases where the OpenVPN instance is still running in a process somewhere, but the pid file does not point to that pid. That happens sometimes, somehow, with all the "killed" memory problems. Then the OpenVPN link is working but the dashboard can't find it. Service Watchdog, like the good puppy it is, faithfully tries to restart the OpenVPN instance every minute, without success since the port is already in use by the working but "lost" process. (I don't think your script will be any better at detecting this condition, since it uses the built-in pfSense routine get_service_status the same as the dashboard…)
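
    A minimal sketch of how the "lost" process case could be told apart from a truly dead instance, assuming the watchdog can obtain both the pid recorded in the pid file and the list of OpenVPN pids that are actually running. The function names here are hypothetical and not part of pfSense.

```php
<?php
// Hypothetical helpers, not part of pfSense.

// Parse pid file contents; returns null when empty or not a positive integer.
function parse_pid(string $contents): ?int
{
    $trimmed = trim($contents);
    return preg_match('/^[1-9][0-9]*$/', $trimmed) === 1 ? (int)$trimmed : null;
}

// Classify one OpenVPN instance from the pid in its pid file and the
// pids of the processes that are actually running for that instance.
function classify_openvpn_instance(?int $pidFromFile, array $runningPids): string
{
    if ($pidFromFile !== null && in_array($pidFromFile, $runningPids, true)) {
        return "running"; // pid file and process table agree
    }
    if (count($runningPids) > 0) {
        return "lost";    // a process exists, but the pid file does not point to it
    }
    return "dead";        // nothing is running, so a restart is safe
}
```

    In the "lost" case a plain restart will keep failing because the port is still held, so a watchdog would have to either rewrite the pid file or stop the stray process first. How the running pids are collected (for example by matching the instance's config file name in the process list) is left open here.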



  • It is too brute force, in that it attempts to restart the service as soon as it goes down. That causes loops, race conditions and all sorts of other headaches. A backoff algorithm needs to be built into that watchdog service to make it a little less intrusive.
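
    As a rough illustration of what such a backoff could look like, here is a sketch of an exponential delay between restart attempts. The function names, the 60 second base and the one hour cap are assumptions made for the example, not anything Service Watchdog provides.

```php
<?php
// Hypothetical helpers, not part of pfSense or Service Watchdog.

// Seconds to wait before the next restart attempt: doubles with each
// consecutive failure and is capped at $maxDelay.
function backoff_delay(int $failures, int $baseDelay = 60, int $maxDelay = 3600): int
{
    if ($failures <= 0) {
        return 0; // no failures yet, restart immediately
    }
    $exponent = min($failures - 1, 20); // bound the shift to avoid overflow
    return min($baseDelay * (1 << $exponent), $maxDelay);
}

// Decide whether a restart may be attempted at time $now, given the
// number of consecutive failures and the time of the last attempt.
function may_restart(int $failures, int $lastAttempt, int $now,
                     int $baseDelay = 60, int $maxDelay = 3600): bool
{
    return ($now - $lastAttempt) >= backoff_delay($failures, $baseDelay, $maxDelay);
}
```

    The failure count and last attempt time would have to be stored somewhere between runs (a small state file, for instance) and reset once the instance stays up.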



  • Have you been able to diagnose why your OpenVPN crashes all the time? That's strange behavior. It is not something that happens for me here.



  • @phil.davis:

    …but the pid file does not point to that pid... That happens sometimes, somehow, with all the "killed" memory problems.

    In my case that wasn't happening because of low memory conditions. It was happening because PHP was attempting to restart openvpn faster than the disk could commit the .pid file, and it would eventually lose track of which pid was running and which one had crashed.

    There need to be proper dynamic wait states for forks of processes to finish processing before a loop re-iterates or carries on.
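
    A sketch of one such wait state, assuming the restart code can simply poll for the .pid file with a timeout before moving on. The function name, the 10 second timeout and the 100 ms poll interval are hypothetical.

```php
<?php
// Hypothetical helper, not part of pfSense: poll until $pidFile holds a
// valid pid, or give up after $timeout seconds.
// Returns the pid, or null on timeout.
function wait_for_pid_file(string $pidFile, float $timeout = 10.0, int $pollMicros = 100000): ?int
{
    $deadline = microtime(true) + $timeout;
    do {
        clearstatcache(true, $pidFile); // avoid PHP's cached stat results
        if (is_readable($pidFile)) {
            $contents = trim((string)file_get_contents($pidFile));
            if (preg_match('/^[1-9][0-9]*$/', $contents) === 1) {
                return (int)$contents;
            }
        }
        usleep($pollMicros); // sleep briefly before checking again
    } while (microtime(true) < $deadline);
    return null;
}
```

    A restart loop could call this after starting each instance and only carry on to the next one once a pid comes back, treating null as a failed start.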

    Ever since I simplified my setup to the point that OpenVPN was completely CARP independent even when running on a CARP cluster and every client always had one or two unique server IPs to connect to, things have been fairly stable. Both my primary and backup firewalls have OpenVPN connections established to primary and backup firewalls at the other sides. Now I am trying to get OSPF to be stable enough to correctly make all routing decisions.

    The setup is a full cross mesh of firewall pairs, one pair at each of the three sites, with each site having a primary and a backup internet connection.



  • @kejianshi:

    Have you been able to diagnose why your OpenVPN crashes all the time? That's strange behavior. It is not something that happens for me here.

    It's a combination of using the firewalls in CARP clusters, having OSPF running and having a full cross mesh of OpenVPN connections between three sites.

    The single biggest reason for OpenVPN to dump with a fatal error is not being able to bring up the ovpn tunnel interface or not being able to inject the route into the kernel's routing table.

    The next thing I am working on is to make the start script resilient to such problems, so that it tries to recover from them, fixes the issue and restarts the openvpn service. I am probably not going to be able to finish it, since I am already 2 weeks behind in delivering this overall solution to a client.

