OpenVPN CPU spikes to 100% after upgrade to 2.5
I had 2.4.5 running pretty happily for a good long while, but there are some features in 2.5 that I was really longing for. Since upgrading to 2.5, OpenVPN has misbehaved twice. The first time, I wasn't specifically watching CPU usage, but one of our ~40 remote users dropped from the VPN and couldn't get logged back on. The OpenVPN log said that we were out of IP addresses, which I knew wasn't right. I looked at the connections, and that user had 100+ connections. His client kept reconnecting every 30 seconds, and his old connections weren't being dropped, even after his VPN client stopped. (In other words, he wasn't really connected.) The CPU was pegged at 100%, with OpenVPN accounting for around 95% of it. So I restarted the OpenVPN service and everything settled down.
Then, tonight, I logged on, and the CPU was at 100% again, with OpenVPN hogging most of it. This time, however, there wasn't a rogue user connecting over 100 times. There were only 15 or so idle connections. This leads me to believe that whatever caused the CPU usage to spike is what caused the remote user to connect 100 times, not the other way around. I checked the system monitor, and the CPU spike is sudden, not gradual. Usage goes from around 10% to 100% within a 5-minute window.
It seems like something is happening within the OpenVPN process that is pegging the CPU, and it's only been happening since the upgrade. Admittedly, we are doing a lot of heavy lifting with our OpenVPN configuration. We've got two pfSense boxes in HA, and two ISPs, with OpenVPN listening on the loopback address, and the OpenVPN ports from each ISP forwarding to 127.0.0.1. This configuration allows for nearly instantaneous CARP failover, and has been working well for over a year on the same hardware.
Anything that anyone can think of to check? Or any more information that would be helpful? I'm hoping to catch "The Incident" the next time it happens so that I can look at the OpenVPN logs to see if there is a corresponding entry. If I catch it, I'll post here. I'd appreciate any insight that anyone might have on it.
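In case it helps anyone trying to catch the same thing, here is a rough sketch of the kind of watcher I have in mind. It is not tested on pfSense. `read_cpu` is a hypothetical callable that you would have to supply yourself (for example, something that returns the OpenVPN process's current CPU percentage), and the threshold, interval, and "three readings in a row" rule are arbitrary assumptions on my part.

```python
import time


def watch_for_spike(read_cpu, threshold=90.0, consecutive=3,
                    interval=10.0, max_samples=None, sleep=time.sleep):
    """Poll read_cpu() until `consecutive` readings in a row are >= threshold.

    read_cpu is a hypothetical, caller-supplied function returning a CPU
    percentage as a float. Returns the zero-based index of the first reading
    in the high run, or None if max_samples is reached without a spike.
    """
    run = 0      # length of the current run of high readings
    taken = 0    # total readings taken so far
    while max_samples is None or taken < max_samples:
        value = read_cpu()
        taken += 1
        run = run + 1 if value >= threshold else 0
        if run >= consecutive:
            return taken - consecutive
        sleep(interval)
    return None
```

Requiring a few high readings in a row is only there so that a single busy sample doesn't count as "The Incident". Once it returns, that is the moment to note the time and go look at the OpenVPN log.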
It happened again this morning. The CPU spiked between 7:29 am and 7:30 am. I checked the OpenVPN logs. There is absolutely no activity between 7:31:10 and 7:36:40, which is unusual; typically someone is reconnecting every few minutes. Then, at 7:36:40, all 13 connected users hit an inactivity timeout at exactly the same time. After that, all remote users start aggressively reconnecting, every few seconds. By this point, the CPU is already at 100%. Starting at 8:16:59, the log starts throwing errors that no more IP addresses are available.
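For spotting quiet stretches like that 7:31:10 to 7:36:40 gap without scrolling through the log by hand, something along these lines might work. It is only a sketch: it assumes every log line contains an HH:MM:SS timestamp and that all lines are from the same day, which may not match your log format. The function name and the 3-minute default are my own choices.

```python
import re
from datetime import datetime, timedelta

# Assumption: each log line carries a time of day written as HH:MM:SS.
TIME_RE = re.compile(r"\b(\d{2}:\d{2}:\d{2})\b")


def find_quiet_gaps(lines, min_gap=timedelta(minutes=3)):
    """Return (start, end) time pairs where consecutive entries are >= min_gap apart."""
    times = []
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S"))
    times.sort()
    return [(a.time(), b.time())
            for a, b in zip(times, times[1:])
            if b - a >= min_gap]


if __name__ == "__main__":
    # Made-up placeholder lines, not real OpenVPN output.
    sample = [
        "07:29:50 placeholder entry",
        "07:31:10 placeholder entry",
        "07:36:40 placeholder entry",
        "07:36:41 placeholder entry",
    ]
    for start, end in find_quiet_gaps(sample):
        print(f"no log activity between {start} and {end}")
```

Lining up the gaps this reports against the time of the CPU spike should show whether the silence always starts a minute or two after the spike, as it did this morning.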