Rekey of Child SA for Subnet A creates a Child SA for Subnet B, rendering the tunnel unusable
-
Short version
Site-to-site VPN. The (expected) rekeying of one Child SA for Subnet A creates a new Child SA for Subnet B (instead of refreshing the Child SA for the very same Subnet A!) and then deletes the Child SA for Subnet A. That leaves two Child SAs for Subnet B and, after a few seconds, a new Child SA for Subnet A. This goes on randomly and creates Child SA duplicates (once I saw over 20 Child SAs). Impact: the connection with the duplicates doesn't work anymore. What goes wrong?
Correct process of Child SA rekeying in theory, according to my understanding (desired):
- con5 #413 is the old Child SA for 172.17.0.0/23 and will be replaced by con5 #441, the new Child SA for 172.17.0.0/23
- con5 #299 is the old Child SA for 172.20.0.0/16 and will be replaced by con5 #443, the new Child SA for 172.20.0.0/16
Result: There is one Child SA for each of our Subnets (OK)
Process according to the log (not desired):
- con5 #299 is the old Child SA for 172.20.0.0/16 and will be replaced by #441, the new Child SA for 172.17.0.0/23
- con5 #443 is the new Child SA for 172.20.0.0/16 and is created a few seconds later, because otherwise the Child SA for 172.20.0.0/16 would be missing
- con5 #413 is the old Child SA for 172.17.0.0/23 and remains
Result: There is one Child SA too many. One of the subnets got a duplicate (not OK)
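(Side note: a quick way to compare the configured traffic selectors with the Child SAs actually installed is the swanctl CLI. This is just a sketch and assumes shell access to the pfSense box; the same information is visible on the Status > IPsec page.)

```sh
# Show the configured connections/children and their traffic selectors
swanctl --list-conns
# Show the Child SAs currently installed; duplicates show up as multiple
# entries with identical traffic selectors
swanctl --list-sas
```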
Commented details with log:
A rekey job is initiated (probably by pfSense) for Child SA con5{299} (172.20.0.0/16). Some data is collected and sent to the peer, and a new Child SA con5{441} is created. But here is the problem: this new Child SA is for the wrong subnet (172.17.0.0/23).
2023-07-11 08:56:51.130151+02:00 charon 54934 08[KNL] creating rekey job for CHILD_SA ESP/0x911b1be3/178.174.xx.xx
2023-07-11 08:56:51.130216+02:00 charon 54934 06[IKE] <con5|1300> queueing CHILD_REKEY task
2023-07-11 08:56:51.130236+02:00 charon 54934 06[IKE] <con5|1300> activating new tasks
2023-07-11 08:56:51.130245+02:00 charon 54934 06[IKE] <con5|1300> activating CHILD_REKEY task
2023-07-11 08:56:51.130300+02:00 charon 54934 06[CFG] <con5|1300> proposing traffic selectors for us:
2023-07-11 08:56:51.130309+02:00 charon 54934 06[CFG] <con5|1300> 172.17.0.0/23|/0
2023-07-11 08:56:51.130316+02:00 charon 54934 06[CFG] <con5|1300> 172.20.0.0/16|/0
2023-07-11 08:56:51.130328+02:00 charon 54934 06[CFG] <con5|1300> proposing traffic selectors for other:
2023-07-11 08:56:51.130335+02:00 charon 54934 06[CFG] <con5|1300> 192.168.200.0/24|/0
2023-07-11 08:56:51.130341+02:00 charon 54934 06[CFG] <con5|1300> 192.168.200.0/24|/0
2023-07-11 08:56:51.130368+02:00 charon 54934 06[CFG] <con5|1300> configured proposals: ESP:AES_CBC_256/HMAC_SHA2_512_256/MODP_2048/NO_EXT_SEQ
2023-07-11 08:56:51.130380+02:00 charon 54934 06[IKE] <con5|1300> establishing CHILD_SA con5{441} reqid 13
2023-07-11 08:56:51.132612+02:00 charon 54934 06[CHD] <con5|1300> CHILD_SA con5{299} state change: INSTALLED => REKEYING
2023-07-11 08:56:51.132659+02:00 charon 54934 06[ENC] <con5|1300> generating CREATE_CHILD_SA request 62 [ N(REKEY_SA) N(ESP_TFC_PAD_N) SA No KE TSi TSr ]
2023-07-11 08:56:51.132733+02:00 charon 54934 06[NET] <con5|1300> sending packet: from 194.165.xx.xx[500] to 178.174.xx.xx[500] (528 bytes)
With this (wrong) choice made, con5{441} (172.17.0.0/23) is established, basically correctly. Afterwards con5{299} (172.20.0.0/16) is deleted.
2023-07-11 08:56:51.758356+02:00 charon 54934 06[CHD] <con5|1300> CHILD_SA con5{441} state change: INSTALLING => INSTALLED
2023-07-11 08:56:51.758373+02:00 charon 54934 06[CHD] <con5|1300> adding outbound ESP SA
2023-07-11 08:56:51.758380+02:00 charon 54934 06[CHD] <con5|1300> SPI 0xe04c955d, src 194.165.xx.xx dst 178.174.xx.xx
2023-07-11 08:56:51.758425+02:00 charon 54934 06[IKE] <con5|1300> outbound CHILD_SA con5{441} established with SPIs c6e325fc_i e04c955d_o and TS 172.17.0.0/23|/0 === 192.168.200.0/24|/0
2023-07-11 08:56:51.758434+02:00 charon 54934 06[CHD] <con5|1300> CHILD_SA con5{299} state change: REKEYING => REKEYED
2023-07-11 08:56:51.758479+02:00 charon 54934 06[IKE] <con5|1300> reinitiating already active tasks
2023-07-11 08:56:51.758485+02:00 charon 54934 06[IKE] <con5|1300> CHILD_REKEY task
2023-07-11 08:56:51.758509+02:00 charon 54934 06[IKE] <con5|1300> closing CHILD_SA con5{299} with SPIs c10cf427_i (12538 bytes) 911b1be3_o (1418520 bytes) and TS 172.20.0.0/16|/0 === 192.168.200.0/24|/0
2023-07-11 08:56:51.758518+02:00 charon 54934 06[IKE] <con5|1300> sending DELETE for ESP CHILD_SA with SPI c10cf427
2023-07-11 08:56:51.758525+02:00 charon 54934 06[CHD] <con5|1300> CHILD_SA con5{299} state change: REKEYED => DELETING
2023-07-11 08:56:51.758540+02:00 charon 54934 06[ENC] <con5|1300> generating INFORMATIONAL request 63 [ D ]
2023-07-11 08:56:51.758589+02:00 charon 54934 06[NET] <con5|1300> sending packet: from 194.165.xx.xx[500] to 178.174.xx.xx[500] (96 bytes)
2023-07-11 08:56:51.781184+02:00 charon 54934 13[NET] <con5|1300> received packet: from 178.174.xx.xx[500] to 194.165.xx.xx[500] (96 bytes)
2023-07-11 08:56:51.781276+02:00 charon 54934 13[ENC] <con5|1300> parsed INFORMATIONAL response 63 [ ]
2023-07-11 08:56:51.781309+02:00 charon 54934 13[IKE] <con5|1300> CHILD_SA closed
2023-07-11 08:56:51.781337+02:00 charon 54934 13[CHD] <con5|1300> CHILD_SA con5{299} state change: DELETING => DELETED
2023-07-11 08:56:51.781401+02:00 charon 54934 13[IKE] <con5|1300> activating new tasks
2023-07-11 08:56:51.781414+02:00 charon 54934 13[IKE] <con5|1300> nothing to initiate
2023-07-11 08:56:56.781498+02:00 charon 54934 01[IKE] <con5|1300> queueing CHILD_DELETE task
2023-07-11 08:56:56.781546+02:00 charon 54934 01[IKE] <con5|1300> activating new tasks
2023-07-11 08:56:56.781562+02:00 charon 54934 01[IKE] <con5|1300> activating CHILD_DELETE task
2023-07-11 08:56:56.781595+02:00 charon 54934 01[CHD] <con5|1300> CHILD_SA con5{299} state change: DELETED => DESTROYING
2023-07-11 08:56:56.781745+02:00 charon 54934 01[IKE] <con5|1300> activating new tasks
2023-07-11 08:56:56.781761+02:00 charon 54934 01[IKE] <con5|1300> nothing to initiate
Last but not least: at some point IPsec seems to realize that something is missing and establishes a new Child SA for 172.20.0.0/16.
2023-07-11 08:57:15.114638+02:00 charon 54934 06[CHD] <con5|1300> CHILD_SA con5{443} state change: CREATED => INSTALLING
2023-07-11 08:57:15.117260+02:00 charon 54934 06[CHD] <con5|1300> CHILD_SA con5{443} state change: INSTALLING => INSTALLED
2023-07-11 08:57:15.117252+02:00 charon 54934 06[IKE] <con5|1300> CHILD_SA con5{443} established with SPIs cd8810b8_i 9d002767_o and TS 172.20.0.0/16|/0 === 192.168.200.0/24|/0
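(To reproduce this without waiting for the lifetime to expire, a rekey can be triggered by hand with swanctl. A sketch, assuming shell access and that the child is named con5 as in the log above:)

```sh
# Manually trigger a Child SA rekey, then watch the log / swanctl --list-sas
# to see which traffic selector the new Child SA ends up with
swanctl --rekey --child con5
```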
Workarounds
- Remove/disconnect all Child SAs. But the duplicates will be created again after some time
- I've found that setting the option "Split connections" helps. (Still verifying! It seems that I have to rebuild the tunnel completely; I can't just set the option and save.) See the config sketch below for what this option changes.
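For context, this is roughly what the option changes in the strongSwan configuration that pfSense generates. The layout and child names below are assumptions for illustration, not the literal generated file; the grounded part is that without "Split connections" a single child proposes both local subnets (as visible in the log above), while with it every Phase 2 entry gets its own child with a single traffic selector pair.

```
# Without "Split connections" (sketch): one child, two local traffic selectors
connections {
    con5 {
        children {
            con5 {
                local_ts  = 172.17.0.0/23, 172.20.0.0/16
                remote_ts = 192.168.200.0/24
            }
        }
    }
}

# With "Split connections" (sketch): one child per Phase 2 entry
connections {
    con5 {
        children {
            con5_1 {
                local_ts  = 172.17.0.0/23
                remote_ts = 192.168.200.0/24
            }
            con5_2 {
                local_ts  = 172.20.0.0/16
                remote_ts = 192.168.200.0/24
            }
        }
    }
}
```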
Conclusion
So this is no rekey conflict (I think). Lifetime, timing, or other Child SA settings (like responder only etc.) do not matter. Rekeying seems to proceed as expected, well, basically; it simply seems to make a wrong decision by randomly choosing the wrong combination of local subnet/remote subnet. I know that sometimes it is necessary to set "Split connections" so IPsec works at all (Cisco ASA as peer). But in my use case it seems that I have to set this option in general, no matter what vendor sits on the other side.
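(In case someone wants to dig deeper: raising the log level of the cfg and chd subsystems to 2 should make the traffic selector selection visible in the log. On pfSense this is set under VPN > IPsec > Advanced Settings; in plain strongSwan terms it corresponds roughly to this strongswan.conf sketch:)

```
# strongswan.conf sketch: more verbose logging for configuration matching
# (cfg) and Child SA handling (chd)
charon {
    syslog {
        daemon {
            default = 1
            cfg = 2
            chd = 2
        }
    }
}
```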
My questions: Am I missing something obvious? Is this a bug? If this looks like a bug: where would I register it (strongSwan or pfSense)?
Thanks for your opinion on this!
-
TL;DR
If you see Child SA duplicates, check "Split connections" in the Phase 1 options. Reason: pfSense may create a Child SA with the wrong traffic selector when rekeying.
Test scenarios
As a test I've divided all IPsec tunnels into three groups. For each group I've implemented one of these measures:
- Set the Child SA behaviour to best practice (responder only on one side etc.; see the sketch after this list)
- Disable DPD (desperate act)
- Set "Split connections" (SC) in Phase 1
Test report
On day one all groups had duplicates during the day. I again removed all duplicates. On day two (overnight) only the first two groups had duplicates; the SC group was fine. So I changed some IPsec tunnels of the other groups to SC as well. On day three I could confirm that the new SC tunnels were duplicate-free as well. On the other hand, the ones without SC created 2-5 duplicates overnight.
Other conclusions/observations
I think which Child SA gets created is random, because
- duplicates within one IPsec tunnel are mostly for the same network, but not always
- the Child SA ID and its position in the configuration have no relation to which network gets a duplicate
I say there is no rekey conflict because
- pfSense's rekey (and other) randomization is working well (I can see this on the status page); there is no way a conflict could happen this often
- I see the duplicates in tunnels with huge lifetimes (24 h/8 h) as well as very short lifetimes (15 min/15 min)
- There are duplicates even if I'm strict and am responder only.
It doesn't matter what vendor the peer is. There are Fortigate, ZyWALL, Sophos, and Cisco peers. Only Cisco worked from the beginning, because Cisco ASA doesn't work without "Split connections" anyway :)
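To keep an eye on duplicates building up overnight, a rough count per connection can be taken from the pfSense shell. A heuristic sketch (the exact swanctl output format may differ between versions):

```sh
# Count installed Child SAs per connection; with two Phase 2 entries per
# tunnel, counts above 2 hint at duplicates
swanctl --list-sas | grep 'reqid' | awk -F: '{print $1}' | sort | uniq -c
```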
-
No duplicates appeared over the weekend. It seems "Split connections" is the way to go.