IPSec P2 stability problems with 20.02
-
After upgrading to 21.02, I am having major issues with IPSec. Tunnels come up but stop transmitting data after a while. If I kill the P2, it rebuilds and works fine for a while longer. The IPSec IDs are also much higher than they used to be: they used to look like con1000, and now they look like con1000000 or con5000000 (I don't really think that's relevant, just including as much info as possible). The tunnels also don't show the description in that column, and at the bottom of the list, all tunnels show with their description as offline. These tunnels have a variety of endpoints: WatchGuards, Junipers, ASAs, Meraki, even a few pfSense boxes. Right now, as a workaround, I've had to set the P1 lifetime to a few minutes at most.
output of: swanctl --load-all --file /var/etc/ipsec/swanctl.conf --debug 1
no authorities found, 0 unloaded
no pools found, 0 unloaded
loaded ike secret 'ike-0'
loaded ike secret 'ike-1'
loaded ike secret 'ike-2'
loaded ike secret 'ike-3'
loaded ike secret 'ike-4'
loaded ike secret 'ike-5'
loaded ike secret 'ike-6'
loaded ike secret 'ike-7'
loaded ike secret 'ike-8'
loaded ike secret 'ike-9'
loaded connection 'bypass'
loaded connection 'con200000'
loaded connection 'con300000'
loaded connection 'con400000'
loaded connection 'con500000'
loaded connection 'con600000'
loaded connection 'con700000'
loaded connection 'con800000'
loaded connection 'con900000'
loaded connection 'con1000000'
loaded connection 'con1100000'
successfully loaded 11 connections, 0 unloaded
output of: swanctl --list-conns
bypass: IKEv1/2, no reauthentication, rekeying every 14400s
  local:  %any
  remote: 127.0.0.1
  local unspecified authentication:
  remote unspecified authentication:
con400000: IKEv2, no reauthentication, rekeying every 25920s, dpd delay 10s
  local:  142.166.5.144
  remote: 142.177.205.170
  local pre-shared key authentication:
    id: 142.166.5.144
  remote pre-shared key authentication:
    id: 142.177.205.170
  con400000: TUNNEL, rekeying every 25920s, dpd action is hold
    local:  10.221.28.0/24|/0
    remote: 10.234.5.0/24|/0
con200000: IKEv2, no reauthentication, rekeying every 25920s, dpd delay 10s
  local:  142.166.5.143
  remote: 72.138.96.114
  local pre-shared key authentication:
    id: 142.166.5.143
  remote pre-shared key authentication:
    id: 72.138.96.114
  con200000: TUNNEL, rekeying every 12960s, dpd action is hold
    local:  10.221.28.0/24|/0
    remote: 10.234.1.0/24|/0
con300000: IKEv2, no reauthentication, rekeying every 25920s, dpd delay 10s
  local:  142.166.5.143
  remote: 142.163.178.178
  local pre-shared key authentication:
    id: 142.166.5.143
  remote pre-shared key authentication:
    id: 142.163.178.178
  con300000: TUNNEL, rekeying every 12960s, dpd action is hold
    local:  10.221.28.0/24|/0
    remote: 10.234.4.0/24|/0
con500000: IKEv2, no reauthentication, rekeying every 25920s, dpd delay 10s
  local:  142.166.5.143
  remote: 209.128.21.50
  local pre-shared key authentication:
    id: 142.166.5.143
  remote pre-shared key authentication:
    id: 209.128.21.50
  con500000: TUNNEL, rekeying every 12960s, dpd action is hold
    local:  10.221.28.0/24|/0
    remote: 10.234.2.0/24|/0
con600000: IKEv1, reauthentication every 25920s, dpd delay 10s
  local:  142.166.5.141
  remote: 206.45.20.106
  local pre-shared key authentication:
    id: 142.166.5.141
  remote pre-shared key authentication:
    id: 206.45.20.106
  con0: TUNNEL, rekeying every 12960s, dpd action is hold
    local:  10.221.28.0/24|/0
    remote: 10.234.8.0/24|/0
con700000: IKEv1, reauthentication every 25920s, dpd delay 10s
  local:  142.166.5.144
  remote: 24.222.54.138
  local pre-shared key authentication:
    id: 142.166.5.144
  remote pre-shared key authentication:
    id: 24.222.54.138
  con0: TUNNEL, rekeying every 12960s, dpd action is hold
    local:  10.221.28.0/24|/0
    remote: 192.168.0.0/24|/0
con800000: IKEv1, reauthentication every 25920s, dpd delay 10s
  local:  142.166.5.144
  remote: 24.222.51.228
  local pre-shared key authentication:
    id: 142.166.5.144
  remote pre-shared key authentication:
    id: 24.222.51.228
  con0: TUNNEL, rekeying every 12960s, dpd action is hold
    local:  10.221.28.0/24|/0
    remote: 10.217.1.0/24|/0
con900000: IKEv2, no reauthentication, rekeying every 25920s, dpd delay 10s
  local:  142.166.5.144
  remote: 142.177.143.178
  local pre-shared key authentication:
    id: 142.166.5.144
  remote pre-shared key authentication:
    id: 142.177.143.178
  con900000: TUNNEL, rekeying every 12960s, dpd action is hold
    local:  10.221.28.0/24|/0
    remote: 10.234.3.0/24|/0
con1000000: IKEv1/2, no reauthentication, rekeying every 25920s, dpd delay 10s
  local:  142.166.5.141
  remote: 162.253.21.18
  local pre-shared key authentication:
    id: 142.166.5.141
  remote pre-shared key authentication:
    id: 162.253.21.18
  con1000000: TUNNEL, rekeying every 12960s, dpd action is hold
    local:  10.221.28.0/24|/0
    remote: 10.234.6.13/32|/0
con1100000: IKEv2, no reauthentication, rekeying every 25920s, dpd delay 10s
  local:  142.166.5.141
  remote: 205.200.239.116
  local pre-shared key authentication:
    id: 142.166.5.141
  remote pre-shared key authentication:
    id: 205.200.239.116
  con1100000: TUNNEL, rekeying every 12960s, dpd action is hold
    local:  10.221.28.0/24|/0
    remote: 10.10.20.232/32|/0
Any help would be appreciated.
Of note, I've also had all my VTI tunnels between 3x SG-1100 go down, but that's a battle for another day, and honestly, I'll probably be switching them to WireGuard this weekend.
-
Figured I'd add a bit more about what I've done. I took a backup of the box, removed the entire <ipsec> section from the config, restored it to the appliance, and manually recreated the tunnels: no change. I've also disabled all tunnels and tried them one at a time, and each one shows the same issue on its own.
I'm sure I'll think of more things, or even have already done some things suggested.
-
Another note: the patches from @jimp resolved part of the issue (the names, and the widget, which I only noticed after reading another thread). However, the new P2 instability remains.
-
@mmapplebeck I'm thankful for those patches. But.... why are they missing from the distro? I'm a bit concerned about the QA on this release.
-
@gtoger I definitely think there was a lack of QA. It sounds like a lot of it is affecting 21.02 strictly, which concerns me greatly since there was no beta/RC process for this release (I had 2.5 running on my home SG-5100, and these IPSec problems only cropped up for me once I installed 21.02). I am hoping this is just a one-time issue with the split between 2.5 and 21.02.
-
@mmapplebeck said in IPSec P2 stability problems with 20.02:
I definitely think there was a lack of QA
There was no lack of QA. There were some gaps in the scenarios we could test, but there is no way for us to test every possible combination of parameters.
it sounds like a lot of it is affecting 21.02 strictly
That is not the case. With the exception of different hardware (like QAT, safeXcel, CESA, etc) on Netgate hardware, the IPsec code is all the same on both CE and pfSense Plus software.
stop transmitting data after a while
Can you quantify "a while"? Maybe it fails on P2 rekey or reauth? There were some behavior changes there to avoid duplicate SA problems that many were seeing. But with a typical P2 lifetime of an hour, most things wouldn't rekey until about 90% of that time (~50 minutes).
Anything shorter than that is unlikely to be related to the IPsec (re)negotiation, but maybe there is something else happening.
The exact timeout could give us a lead about other things which may be affecting it, like maybe state table entries, rules matching the traffic, maybe somehow asymmetric routing of the ESP packets is happening... Hard to say without more detail.
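As a rough check on that timing, here is a minimal Python sketch, assuming rekeying is scheduled at about 90% of the lifetime and may be pulled earlier by a random margin of up to 10%. The function name and both percentages are illustrative assumptions for the arithmetic, not values read from the pfSense or strongSwan code.

```python
def rekey_window(lifetime_s, rekey_pct=90, rand_pct=10):
    """Return (earliest, latest) rekey time in seconds for a given lifetime.

    Assumption: the rekey fires at rekey_pct of the lifetime, minus a random
    amount of up to rand_pct of the lifetime. Integer math avoids float noise.
    """
    latest = lifetime_s * rekey_pct // 100
    earliest = latest - lifetime_s * rand_pct // 100
    return earliest, latest


# Typical one-hour P2 lifetime, plus a 28800s lifetime for comparison.
for lifetime in (3600, 28800):
    earliest, latest = rekey_window(lifetime)
    print(f"lifetime {lifetime}s: rekey between {earliest}s and {latest}s")
```

For a one-hour lifetime this gives a window of 48 to 54 minutes, in line with the ~50 minutes above. The 25920s interval in the swanctl --list-conns output is also what 90% of a 28800s lifetime works out to.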
-
A while is less than 10 minutes, sometimes under 5 (I've seen it last as little as 3 minutes). As a stopgap, I decreased the lifetime to 600 seconds with 300 random, and I am still having disruptions before the tunnel gets rebuilt. When this happens, P1 and P2 stay up, but no traffic flows. If I kill the P2 and try to reconnect children, nothing happens; I have to disconnect the P1 and build an entirely new P1 to fix this.
-
Check the detail of the firewall states (pfctl -vvss) between the IPsec endpoints (so the WAN/public addresses) and see if there is any change in the states from when it works vs when it doesn't.
~5 minutes is suspiciously around a default state timeout for a state which only has traffic in one direction, which sounds sort of like asymmetric routing somehow.
Also are these IPsec tunnels all on the WAN with the default gateway? Or are they on an alternate WAN?
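If comparing the state dumps by eye gets tedious, something like the following Python sketch could flag states that only carry packets in one direction. It assumes each state in the verbose output is a non-indented header line followed by indented detail lines, one of which contains a counter pair of the form `N:M pkts`. The function name is made up for illustration, and the patterns may need adjusting if your output looks different.

```python
import re

# Assumed detail-line fragment, e.g. "55:0 pkts" (packet counters for the
# two directions of the state).
PKTS_RE = re.compile(r"(\d+):(\d+) pkts")


def one_way_states(text, endpoints):
    """Return (header, pkts_a, pkts_b) for states with traffic in only one direction.

    text      -- captured output of `pfctl -vvss`
    endpoints -- address strings to look for in the state header line
                 (plain substring match); an empty list matches every state
    """
    results = []
    header = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # A non-indented line starts a new state entry.
            header = line.strip()
            continue
        match = PKTS_RE.search(line)
        if header is None or match is None:
            continue
        if endpoints and not any(ep in header for ep in endpoints):
            continue
        pkts_a, pkts_b = int(match.group(1)), int(match.group(2))
        # Exactly one counter is zero: packets flowed one way only.
        if (pkts_a == 0) != (pkts_b == 0):
            results.append((header, pkts_a, pkts_b))
    return results
```

One way to use it: save the output once while the tunnel is passing traffic and again after it stalls, call `one_way_states(open(path).read(), ["142.177.205.170"])` on each file, and compare what gets flagged.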