OpenVPN HA Sync failover
-
I'll cross-post this in a couple of places, since I'm casting a wide net in hopes of getting some advice.
We've got two pfSense boxes (currently running 2.4.4-RELEASE-p3) configured with HA Sync and sharing a CARP interface between them. I've got OpenVPN listening on the public CARP address, and it works great. However, if I initiate a CARP failover (by doing something as innocuous as unplugging a completely unrelated Ethernet cable), users get knocked off the VPN, and it takes about 30-60 seconds to fail over to the secondary pfSense box, then another 30-60 seconds when it fails back to the primary. For comparison, I also have these boxes terminating an IPsec site-to-site tunnel, and that only misses a ping or two when CARP fails over.
Does anyone know of any way to make this less disruptive to my remote users? If, for example, I reboot the primary box to update the firmware, I get a bunch of messages from users saying they got disconnected from the VPN, then another bunch of messages two minutes later saying they got disconnected again. It's the only imperfection in an otherwise perfect setup, so of course its significance to me is magnified.
I'm aware that the OpenVPN service isn't running on the backup server until a failure of the primary server is detected, so I assume part of the delay is waiting for a few heartbeats to be missed, and for the service to start up and accept connections. IPsec runs in the kernel, so maybe that's why it fails over so seamlessly. There may also be some delay from the ARP cache, but again, IPsec would have those same issues, and its failover is really fast. I'm running on relatively powerful, dedicated hardware with fast SSDs, so I would imagine services could start up a lot faster than 30-60 seconds.
I've seen a couple of posts that suggested tweaking some of the keepalive settings that are sent out to the client. I experimented a little with a few of those, but it didn't seem to have a significant impact on the failover time. I'm also wondering if there are tweaks to encourage the secondary server to detect the failure of the primary faster, or maybe to keep the service running in a sort of hot standby. The two servers' sync network is a crossover cable on a dedicated NIC, so I don't have a problem increasing the heartbeat rate, but I don't know how to do that, nor whether it would decrease the failover time.
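To be concrete, the kind of keepalive tweak I mean goes in the OpenVPN server's advanced custom options; the values here are purely illustrative, not a recommendation:
keepalive 5 30;
As I read the OpenVPN manual, keepalive n m expands into ping and ping-restart directives (with the restart interval doubled on the server side), so a smaller second value should get a dead peer declared dead sooner.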
It seems to me that handing off the VPN session without interruption is probably impossible, so I expect the client will have to renegotiate the session. Most of our users are Windows users running Viscosity, which is capable of auto-reconnecting when a tunnel drops, but that application (which is built on the OpenVPN client) doesn't seem to be doing a great job of detecting the tunnel failure. I'm also hoping I can push some settings out to them without having to configure each user's client individually.
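If I understand the docs correctly, the ping timers are also among the options a server can push, which would let me tighten every client's dead-peer detection centrally instead of editing each Viscosity profile. Something like this in the server's custom options (values illustrative again):
push "ping 5";
push "ping-restart 20";
That should have clients notice a dead tunnel within roughly 20 seconds and kick off their auto-reconnect, without me touching any client configs.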
Anyhow, suggestions would be greatly appreciated.
-
@cmhddti said in OpenVPN HA Sync failover:
I'm aware that the OpenVPN service isn't running on the backup server until a failure of the primary server is detected, so I assume part of the delay is waiting for a few heartbeats to be missed, and for the service to start up and accept connections.
Nope. Even if there is a slight delay in firing up the OVPN server process on the standby node, the problem is not the OVPN server not running; it's the change of servers and the routing. With the failover, your CARP IP on WAN stays the same, but the OVPN server and the dial-up transit network were only running on node 1. Now they are on node 2, and node 2's server has no clue how many and which VPN road warriors were connected. They need to reconnect so the new VPN server instance can hand them IPs via its internal, DHCP-like address pool and route correctly.
The IPsec comparison is flawed, as you have a single tunnel terminating on the CARP WAN side with nearly static routing over there. You don't have clients connecting with dynamic addresses that may change depending on their order of dial-in, so nothing there has to change on failover.
Same goes for OVPN site-to-site tunnels. We have all three running on our big cluster, and S2S tunnels almost always work flawlessly in failover scenarios, be it IPsec or OVPN. Tunnels are very, very easy to deal with and to sync. We had an exchange of hardware once, and I drove to the remote site with completely new hardware preconfigured for an OVPN S2S tunnel. Powered it up, had it ready, and between "two heartbeats" I switched the network cables from the old to the new hardware. The tunnel came up so fast that, with active pings running, we never missed a single ping on that line. So you see, S2S is no problem however you configure it.
Clients are different because of the way their DHCP-like IP addressing and routing works. Even if you run your OVPN server on localhost instead of on the CARP VIP and just port forward the OVPN connection to localhost (that way the OVPN server can run on BOTH nodes all the time without problems, because only the active node gets the traffic on the WAN VIP for the port forward), you get slightly shorter switchover times, but the clients will have to reconnect nevertheless.
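Roughly sketched, just as an illustration (interface, VIP, and port are made up here, and in pfSense you'd click this together as a NAT port forward rather than write raw pf), the resulting OpenVPN directive and pf rule would look about like this:
# OpenVPN server bound to localhost instead of the CARP VIP
local 127.0.0.1
# pf equivalent of the port forward: WAN CARP VIP -> localhost
rdr on em0 proto udp from any to 203.0.113.10 port 1194 -> 127.0.0.1 port 1194
Both nodes carry the same server config, but only the CARP master receives packets for the VIP, so only its instance ever sees the clients.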
Using that together with smaller keepalive and dead-peer-detection settings, so the client learns faster that its peer is down, can make the situation better (the client should also be able to reconnect automatically). But your users should (and have to!) know that a failover on a cluster/firewall is not a "no-loss" event. A VPN connection should always be allowed to fail and re-dial for a short amount of time. From your initial paragraph it sounds like your fail timings are already far below 5 minutes, and 1-2 minutes for failover and failback is within a very normal and healthy range. Perhaps send out a quick e-mail to all VPN users and inform them about failover/failback behavior - but in all the years we've been running a similar setup, I've never had anyone complain in earnest about the ~1-2 minutes where they had to reconnect to the VPN.
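To give an idea of the client-side knobs I mean (values illustrative; in Viscosity they can go into a connection's advanced OpenVPN commands):
ping 5
ping-restart 30
persist-tun
persist-key
connect-retry 2
server-poll-timeout 10
ping/ping-restart make the client declare its peer dead quickly, persist-tun/persist-key keep the restart lightweight, and connect-retry/server-poll-timeout keep the re-dial loop tight. Keep in mind that timers pushed by the server (e.g. via keepalive) should override the local ping values.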
After all, you don't fail over a firewall every day just for the fun of it.
-
Try adding this to the server config advanced custom options:
explicit-exit-notify 2;
That should tell the OpenVPN server to instruct all clients to disconnect when the OpenVPN server process on the primary is shut down. Otherwise they need to wait for the timeout to happen.
They should immediately try to reconnect, and they should land on the secondary.
-
@JeGr We're putting two new pfSense servers in our datacenter, and I'm going to try binding the OpenVPN interface to the WAN instead of the CARP interface. It definitely makes sense in my head, and seems like it might make the failover happen faster. I'm interested to see if the clients that connect to the secondary server stay connected to that one when CARP fails back over to the primary server. My guess is that they will, but it'll be interesting to look at.
-
@Derelict Fantastic! This is what I came here looking for. I'm already resigned to the fact that the connection is going to drop. But if I can figure out a way for the client to know quickly and initiate a reconnect (to a hot-running server, using @JeGr's method above), I could make the switchover time almost nonexistent.
And if I'm not failing over my primary firewall at least once a day, what kind of Network Engineer would I be? Gotta keep those users on their toes!
-
It is definitely going to drop. There is no session sync between nodes. Though you want to bind to the CARP VIP. That will stop the server instance on the "failed" node (triggering explicit-exit-notify) and start it on the new master.
I used to be a fan of binding the OpenVPN server to localhost and port forwarding the CARP VIP to it (the server is always running on both nodes in that case), but explicit-exit-notify has me leaning toward killing the server on the failed node.
-
@Derelict said in OpenVPN HA Sync failover:
That will stop the server instance on the "failed" node (triggering explicit-exit-notify) and start it on the new master.
Wouldn't that only happen if you manually shut down the primary node? Just curious, since in a failure case (e.g. uplink down) no one would get the exit notify, as the line is down? But yeah, in an update scenario where you shut down the primary for updating, it would definitely cut links quicker than a timeout.
@cmhddti said in OpenVPN HA Sync failover:
I'm going to try binding the OpenVPN interface to the WAN instead of the CARP interface.
Naah! Not the WAN. Do it either on the CARP VIP - so it fails over to the standby node AND sends out the exit notify like @Derelict said - or set it up on localhost and port-forward the OpenVPN port to it from the WAN(s). That's how you normally do it with Multi-WAN, but as I wrote, it can also help a bit in switchover scenarios even though it doesn't send out the exit notify (we use a relatively short reconnect timeout in the client settings to counter that).
@cmhddti said in OpenVPN HA Sync failover:
I'm already resigned to the fact that the connection is going to drop.
It sure will :)
@cmhddti said in OpenVPN HA Sync failover:
Gotta keep those users on their toes!
Yeah... no ;) That angry mob would be too much for my taste. But seriously, every one of our VPN users, clients, and tunnel endpoints (companies) knows that a VPN tunnel may go down for a few seconds or minutes, and that this is completely normal and OK. No warranty for 100% connection service 24/7/365 ;)
@Derelict said in OpenVPN HA Sync failover:
but explicit-exit-notify has me leaning toward killing the server on the failed node.
Would be nice to read a comparison! Have you measured the clients' reconnect time, or do you at least have a feel for whether it's faster? Otherwise I've got something to play with in our lab ;)
-
@JeGr said in OpenVPN HA Sync failover:
Wouldn't that only happen if you manually shut down the primary node? Just curious, since in a failure case (e.g. uplink down) no one would get the exit notify, as the line is down? But yeah, in an update scenario where you shut down the primary for updating, it would definitely cut links quicker than a timeout.
Of course a hard failure will probably not send anything from the OpenVPN process, but those are actually pretty rare compared to setting maintenance mode. An interface down event would send the notifications (unless the down interface was the one needed to send the disconnect advisories).