I have a problem with an IPsec tunnel between a pfSense 2.3.2 ("router on a stick" running on a first generation NUC) and a Sophos XG virtual appliance (SFOS 16.01.1).
The tunnel comes up nicely, no problem with that.
However, sometimes (not on every rekey) a second SA connection appears: instead of two lines (one per direction) in /status_ipsec_sad.php, there are three.
When this happens, a kind of loop starts too: the Sophos decides the connection needs to come down, closes it, creates a new one, and so on.
The result for users is packet loss every 7-8 seconds (because the tunnel goes down and up).
Some users even reported "we can not reach anything" (from branch office to datacenter) while we (admins) were able to connect to the pfSense webUI from the datacenter.
In order to fix things, we have to either disable the tunnel on the Sophos side and wait a couple of minutes (so all the SAs are dropped) before re-enabling it, or find the bad SAD entry on the pfSense side and delete it (though I'm not sure this really solves the issue Sophos-side).
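For what it's worth, the "three lines instead of two" check can be scripted. This is only a minimal sketch: it assumes the SAD entries have already been collected into a list of dicts (the field names, addresses and SPIs below are made up for illustration, not taken from the real boxes), and it simply counts SAs per direction.

```python
from collections import Counter


def find_duplicate_directions(sad_entries):
    """Return the (src, dst) pairs that have more than one SA installed."""
    counts = Counter((entry["src"], entry["dst"]) for entry in sad_entries)
    return {pair: n for pair, n in counts.items() if n > 1}


# Hypothetical SAD contents: documentation addresses and invented SPIs.
sad = [
    {"src": "198.51.100.10", "dst": "203.0.113.20", "spi": "c1a2b3d4"},
    {"src": "203.0.113.20", "dst": "198.51.100.10", "spi": "0e5f6a7b"},
    {"src": "203.0.113.20", "dst": "198.51.100.10", "spi": "0d9c8b7a"},
]

duplicates = find_duplicate_directions(sad)
for (src, dst), n in duplicates.items():
    print(f"{n} SAs from {src} to {dst}")
```

With a single phase 2, one SA per direction is the expected state. Two in the same direction can be normal for a short while around a rekey, so a check like this is only meaningful if the extra entry stays there.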
The problem happened 5-6 times yesterday morning (between 8am and 1pm)…
But it has not happened since 1pm yesterday (nearly 24 hours running OK).
I've found several threads from people with the same problem. The usual answer seems to be "not the same parameters on each side of the tunnel" (or sometimes "the two sides are using different versions of IPsec daemons that don't get along well", mostly about DPD or "Prefer old IPsec SAs", which I didn't find in 2.3).
https://forum.pfsense.org/index.php?topic=48259.0 (older pfSense version)
https://forum.pfsense.org/index.php?topic=32385.0 (older pfSense version)
https://forum.pfsense.org/index.php?topic=35889.0 (older pfSense version)
Both sides use "main" mode (not "aggressive"), with PSK authentication.
The Sophos side (static IP, in the datacenter) acts as "respond only" and the pfSense side (branch office) is supposed to start the tunnel.
Here are the parameters on the Sophos side:
Phase 1:
DH Group 2
Key Life 28800 seconds
Re-Key margin 360 seconds
Randomize Re-Keying Margin by 100%
Check Peer After Every 30 seconds
Wait for Response Up to 120 seconds
Phase 2:
DH Group 2
Key life 3600 seconds
And on the pfSense side:
Phase 1:
DH Group 2
Lifetime 28800 seconds
Delay between requesting peer acknowledgement 10 seconds
Phase 2:
DH Group 2
Lifetime 3600 seconds
Automatic ping host enabled to an IP on the other subnet
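To see when a rekey would actually be attempted with the Sophos values above, here is a rough calculation. It assumes the Sophos "Re-Key margin" and "Randomize Re-Keying Margin" settings behave like the margin and fuzz options documented for strongSwan's ipsec.conf, where rekeytime = lifetime - (margintime + random(0, margintime * rekeyfuzz)). I have not confirmed that SFOS 16 follows this formula, or that the margin applies to phase 2 as well, so treat the numbers as an estimate.

```python
def rekey_window(lifetime, margin, fuzz_percent):
    """Earliest and latest rekey time, in seconds after the SA comes up.

    Assumed formula (strongSwan-style, not confirmed for Sophos):
    rekeytime = lifetime - (margin + random(0, margin * fuzz))
    """
    max_extra = margin * fuzz_percent // 100
    earliest = lifetime - (margin + max_extra)
    latest = lifetime - margin
    return earliest, latest


# Values from the Sophos side of this tunnel.
phase1 = rekey_window(28800, 360, 100)
phase2 = rekey_window(3600, 360, 100)

print(phase1)  # (28080, 28440)
print(phase2)  # (2880, 3240)
```

Under that assumption the Sophos would try to rekey somewhere between 6 and 12 minutes before expiry, at a different moment each time, which would fit with the problem not showing up on every rekey.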
Do you have any hints on what I could try?
Have you tried playing with the "Make before Break" checkbox in the advanced settings?
Unfortunately, we're running IKEv1 (as Sophos only handles IKEv1).
This parameter seems to be IKEv2-related.
Resolved - Sophos suxxx. ;D There's also the "Configure Unique IDs" setting to play with. Otherwise, post the logs and maybe someone can decipher some useful info from that mess (certainly not me, I'd have the guys who designed strongSwan logging executed instantly).
In the remote branch there was another device (Sophos XG105) connected to the internet with a buggy 4G connection…
This device was set up with the same parameters (IPsec initiator) as the pfSense box and was, sometimes (no idea when/why), connecting to the main XG appliance.
The message in the log (main XG appliance) is: "System received a P2 connection request whose Localsubnet-Remotesubnet configuration conflicts with that of an already established connection "XXXX-1". System is terminate connection "XXXX-1" to honor the incoming request."
That message led me to think there was an issue on the pfSense box, trying to start several tunnels.
It was (obviously) not the case, it was another device...
Once that other device was shut down, the problem was solved.
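In hindsight, this kind of conflict can be spotted from an inventory of the devices that are allowed to initiate a tunnel to the main appliance. A minimal sketch, assuming such an inventory exists as a dict of peer name to (local subnet, remote subnet) as seen from the responder; the peer names and subnets below are invented for the example and are not our real addressing.

```python
import ipaddress
from itertools import combinations


def conflicting_selectors(peers):
    """Yield pairs of peer names whose local/remote subnet pairs overlap."""
    for (name_a, nets_a), (name_b, nets_b) in combinations(peers.items(), 2):
        local_a, remote_a = (ipaddress.ip_network(n) for n in nets_a)
        local_b, remote_b = (ipaddress.ip_network(n) for n in nets_b)
        # A conflict needs both ends to overlap, not just one of them.
        if local_a.overlaps(local_b) and remote_a.overlaps(remote_b):
            yield name_a, name_b


# Hypothetical inventory: (local subnet, remote subnet) per initiator.
peers = {
    "branch-pfsense": ("10.0.0.0/16", "192.168.10.0/24"),
    "branch-xg105": ("10.0.0.0/16", "192.168.10.0/24"),
    "other-branch": ("10.0.0.0/16", "192.168.20.0/24"),
}

conflicts = list(conflicting_selectors(peers))
print(conflicts)  # [('branch-pfsense', 'branch-xg105')]
```

Two initiators proposing the same subnet pair is exactly the situation the XG log message complains about, so any pair reported here is worth checking before blaming the daemon on either side.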