OpenVPN instability since upgrade from pfSense Plus 23.09.1 to 24.11
-
I manage 9 Netgate routers (two 7100s and seven 2100s), with site-to-site OpenVPN connections from 8 of the sites to the main site (a 7100). I previously ran the site-to-site connections with shared keys for years, and they were incredibly stable. Back in December, while still on 23.09.1, I set up new site-to-site OpenVPN connections with SSL/TLS following the recipe in the docs. Those connections worked perfectly for months on 23.09.1.
To make things a little more complicated (but possibly not relevant), I also have OpenVPN client connections between these 9 sites and a pfSense CE instance running on DigitalOcean, plus an Ubuntu 22.04 server running as an OpenVPN server with client connections to the 9 routers. (I opened a discussion about that a long time ago and never did figure out why I couldn't get traffic from that server into my sites' LANs through the pfSense instance on DO, so I ended up setting it up this way.) I also have a couple of remote access OpenVPN servers set up on the main router for IT staff. None of this probably matters, but I figure I should fully disclose my somewhat tangled VPN web. All of it was working perfectly until recently.
Each OpenVPN server has its own port, its own unique IP range for the tunnel, and its own set of certs. Each site-to-site server is configured to allow only one connection, and I have firewall rules that allow only each site's IP to connect to its assigned port.
Over the course of a few weeks, I've upgraded all the routers to 24.11. Initially I noticed a few OpenVPN connections dropping off for a while, but they seemed to eventually reconnect. Just one here or there. Yesterday, however, while investigating one of the disconnects, I discovered that most of them had dropped their connections, in a kind of cascade starting around 12:30. I restarted a few, and eventually it seemed to resolve itself.
It happened again at 12pm today. The first failure had errors (on the main router) such as:
TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
TLS Error: TLS handshake failed
[UNDEF] Inactivity timeout (--ping-restart), restarting
TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
TLS Error: TLS handshake failed
MULTI: new incoming connection would exceed maximum number of clients (1)
Over the course of maybe an hour, most of the others did the same. I also occasionally see errors like:
TLS Error: Unroutable control packet received from [AF_INET]one of my IP addresses:46175 (si=3 op=P_CONTROL_V1)
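For anyone wanting to see how these messages are distributed over time or across servers, here is a rough Python sketch that counts the OpenVPN log messages quoted above by type. The category labels and the function names (classify, count_errors) are made up for illustration, and the matching is plain substring search on the message text shown in this post, so it assumes your log lines contain the same wording.

```python
from collections import Counter

# Substrings taken from the log messages quoted in this post.
# The labels on the left are hypothetical names chosen for this sketch.
PATTERNS = {
    "tls_negotiation_timeout": "TLS key negotiation failed to occur",
    "tls_handshake_failed": "TLS handshake failed",
    "ping_restart": "Inactivity timeout (--ping-restart)",
    "max_clients": "would exceed maximum number of clients",
    "unroutable_control": "Unroutable control packet received",
}


def classify(line):
    """Return the label of the first pattern found in the line, or None."""
    for label, needle in PATTERNS.items():
        if needle in line:
            return label
    return None


def count_errors(lines):
    """Count how many lines fall into each known error category."""
    counts = Counter()
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Reads log text from standard input and prints one line per category.
    for label, n in count_errors(sys.stdin).most_common():
        print(f"{n:6d}  {label}")
```

Feeding it a saved copy of the OpenVPN log on standard input prints a count per category, which makes it easier to tell whether a cascade starts with handshake timeouts or with the maximum-clients rejection.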
I also noticed a bunch of errors in dmesg like so:
sonewconn: pcb 0xfffff80025248200 (local:/var/etc/openvpn/server10/sock): Listen queue overflow: 2 already in queue awaiting acceptance (1 occurrences), euid 0, rgid 0, jail 0
sonewconn: pcb 0xfffff80025270900 (local:/var/etc/openvpn/server14/sock): Listen queue overflow: 2 already in queue awaiting acceptance (1 occurrences), euid 0, rgid 0, jail 0
sonewconn: pcb 0xfffff80025139100 (local:/var/etc/openvpn/server15/sock): Listen queue overflow: 2 already in queue awaiting acceptance (1 occurrences), euid 0, rgid 0, jail 0
sonewconn: pcb 0xfffff800252c8d00 (local:/var/etc/openvpn/server20/sock): Listen queue overflow: 2 already in queue awaiting acceptance (1 occurrences), euid 0, rgid 0, jail 0
sonewconn: pcb 0xfffff80025270d00 (local:/var/etc/openvpn/server12/sock): Listen queue overflow: 2 already in queue awaiting acceptance (1 occurrences), euid 0, rgid 0, jail 0
sonewconn: pcb 0xfffff80025248200 (local:/var/etc/openvpn/server10/sock): Listen queue overflow: 2 already in queue awaiting acceptance (9 occurrences), euid 0, rgid 0, jail 0
sonewconn: pcb 0xfffff80025270900 (local:/var/etc/openvpn/server14/sock): Listen queue overflow: 2 already in queue awaiting acceptance (9 occurrences), euid 0, rgid 0, jail 0
Googling that led me to suggestions to increase kern.ipc.soacceptqueue. I changed it from 128 to 2048, figuring at least it might buy me some time. Since I did that, I haven't seen any additional errors in dmesg.
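To see which OpenVPN server sockets are overflowing most, the sonewconn lines can be tallied per socket path. This is a minimal Python sketch, not a definitive tool: the function name tally_overflows is hypothetical, the regular expression is written against the exact line format pasted above, and summing the "(N occurrences)" figures is my assumption about how those counts should be combined.

```python
import re
from collections import Counter

# Written against the "sonewconn ... Listen queue overflow" lines quoted above.
OVERFLOW_RE = re.compile(
    r"sonewconn: pcb \S+ \(local:(?P<sock>[^)]+)\): "
    r"Listen queue overflow: \d+ already in queue awaiting acceptance "
    r"\((?P<count>\d+) occurrences\)"
)


def tally_overflows(lines):
    """Sum the reported occurrence counts per local socket path.

    Assumption: the "(N occurrences)" values can simply be added together.
    """
    totals = Counter()
    for line in lines:
        match = OVERFLOW_RE.search(line)
        if match:
            totals[match.group("sock")] += int(match.group("count"))
    return totals


if __name__ == "__main__":
    import sys

    # Reads dmesg text from standard input, busiest socket first.
    for sock, n in tally_overflows(sys.stdin).most_common():
        print(f"{n:6d}  {sock}")
```

Run it against saved dmesg output wherever Python 3 is available. The serverNN part of each socket path identifies the OpenVPN server instance, which should show whether the overflows are spread evenly or concentrated on a few tunnels.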
Anyway, I suspect there's some settings tweak I need to make to get this stable again. Does anybody have suggestions?
Thanks!
-
Just a follow-up, since there's been no reply. I've concluded that it's related to, or at least severely exacerbated by, this issue in 24.11 with the dashboard impacting the system load: https://redmine.pfsense.org/issues/15969
It's kind of like the observer effect: it seems most prone to happening when I'm investigating it, or more particularly, when I've accidentally left a tab open with the dashboard running. Earlier this week I logged in to get some info on a DHCP lease, forgot to log out, and went on my merry way. I came back a couple of hours later to find that most of the OpenVPN connections had gone down again, and there were a bunch of entries in system.log relating to php-fpm and connections being refused while loading the dashboard widgets. I restarted php-fpm and the GUI from the console menu, and the VPN connections all came back online within a short period of time.
I'll be glad when 25.03 comes out so this problem is fixed!