IPSEC + VTI + IKEV2 - will not auto-reconnect

bbrendon

I'm using IKEv2 because I had issues with IKEv1. IKEv2 has been working better. Today the internet went down for a bit at one end and the tunnel won't come back automatically. I had to manually click "Connect VPN" in the GUI. Does anyone know what the problem might be with this?

Settings: All default except IKEv2.
Version: 2.4.4-p3

# While it was down...
# ipsec status
Shunted Connections:
   bypasslan:  10.11.0.0/24|/0 === 10.11.0.0/24|/0 PASS
Security Associations (0 up, 0 connecting):
  none

# ipsec.log shows this over and over on both sides...

Aug 29 13:54:24 gw charon: 05[KNL] received an SADB_ACQUIRE with policy id 38 but no matching policy found
Aug 29 13:54:24 gw charon: 05[KNL] creating acquire job for policy {WANIP}/32|/0 === {WANIP}/32|/0 with reqid {0}
Aug 29 13:54:24 gw charon: 05[CFG] trap not found, unable to acquire reqid 0

Derelict

It will not initiate until there is interesting traffic. Was there interesting traffic?

bbrendon

@Derelict if icmp counts, then yes.

I think there was DNS traffic as well but I didn’t actually verify that.

Derelict

Then it would have initiated. If it did not connect you would need to look in the IPsec logs to see why.

You also need to be 100% sure the traffic was interesting. For instance, if you ping something across the tunnel from the firewall itself, you have to set the source address to something in the local networks defined in the IPsec phase 2.

bbrendon

@Derelict
Well, VTI is routed, so I'm not sure what you mean by source address. Would the traffic have to be coming from the numbered VTI interface? Maybe I need to set up a cron job to ping from the VTI interface on the pfSense box itself?

These are the logs from the two sides (links below). I'm not sure what is relevant so I posted the whole thing. The internet connection was down at the manuf site from 11:40 to 12:01. After 12:01 I wasn't able to get the VPN to reconnect by itself.

office: https://pastebin.com/2vqEF2Fp
manuf: https://pastebin.com/jHXpAiwt

bbrendon

@Derelict
This is still a constant battle for me. The tunnel is down now, and I tried pinging the remote tunnel IP in pfsense using the VTI interface as the source address.

The log is below.

May 31 19:30:12	charon	65059	14[CFG] trap not found, unable to acquire reqid 1000
May 31 19:30:12	charon	65059	01[KNL] creating acquire job for policy 1.1.123.153/32|/0 === 2.2.142.61/32|/0 with reqid {1000}
May 31 19:30:09	charon	65059	01[KNL] <con3000|181> querying policy 0.0.0.0/0|/0 === 0.0.0.0/0|/0 in failed, not found
May 31 19:30:09	charon	65059	01[NET] <con3000|181> sending packet: from 1.1.123.153[500] to 50.196.146.217[500] (80 bytes)
May 31 19:30:09	charon	65059	01[ENC] <con3000|181> generating INFORMATIONAL response 1040 [ ]
May 31 19:30:09	charon	65059	01[ENC] <con3000|181> parsed INFORMATIONAL request 1040 [ ]
May 31 19:30:09	charon	65059	01[NET] <con3000|181> received packet: from 50.196.146.217[500] to 1.1.123.153[500] (80 bytes)
May 31 19:30:08	charon	65059	09[CFG] vici client 2831 disconnected
May 31 19:30:08	charon	65059	09[KNL] <con3000|181> querying policy 0.0.0.0/0|/0 === 0.0.0.0/0|/0 out failed, not found
May 31 19:30:08	charon	65059	09[KNL] <con3000|181> querying policy 0.0.0.0/0|/0 === 0.0.0.0/0|/0 in failed, not found
May 31 19:30:08	charon	65059	09[CFG] vici client 2831 requests: list-sas
May 31 19:30:08	charon	65059	12[CFG] vici client 2831 registered for: list-sa
May 31 19:30:08	charon	65059	09[CFG] vici client 2831 connected
May 31 19:30:06	charon	65059	15[CFG] trap not found, unable to acquire reqid 1000
May 31 19:30:06	charon	65059	12[KNL] creating acquire job for policy 1.1.123.153/32|/0 === 2.2.142.61/32|/0 with reqid {1000}
May 31 19:30:02	charon	65059	12[CFG] vici client 2830 disconnected
May 31 19:30:02	charon	65059	12[KNL] <con3000|181> querying policy 0.0.0.0/0|/0 === 0.0.0.0/0|/0 out failed, not found
May 31 19:30:02	charon	65059	12[KNL] <con3000|181> querying policy 0.0.0.0/0|/0 === 0.0.0.0/0|/0 in failed, not found
May 31 19:30:02	charon	65059	12[CFG] vici client 2830 requests: list-sas
May 31 19:30:02	charon	65059	12[CFG] vici client 2830 registered for: list-sa
May 31 19:30:02	charon	65059	15[CFG] vici client 2830 connected
May 31 19:30:00	charon	65059	06[CFG] vici client 2829 disconnected
May 31 19:30:00	charon	65059	11[KNL] <con3000|181> querying policy 0.0.0.0/0|/0 === 0.0.0.0/0|/0 out failed, not found
May 31 19:30:00	charon	65059	11[KNL] <con3000|181> querying policy 0.0.0.0/0|/0 === 0.0.0.0/0|/0 in failed, not found
May 31 19:30:00	charon	65059	11[CFG] vici client 2829 requests: list-sas
May 31 19:30:00	charon	65059	11[CFG] vici client 2829 registered for: list-sa
May 31 19:30:00	charon	65059	06[CFG] vici client 2829 connected
May 31 19:30:00	charon	65059	09[CFG] trap not found, unable to acquire reqid 1000
May 31 19:30:00	charon	65059	11[KNL] creating acquire job for policy 1.1.123.153/32|/0 === 2.2.142.61/32|/0 with reqid {1000}
May 31 19:29:59	charon	65059	11[KNL] <con3000|181> querying policy 0.0.0.0/0|/0 === 0.0.0.0/0|/0 in failed, not found

Any help would be great.
tia.

bbrendon

The only thing I could find is the keyingtries setting [0], which maybe should be set to forever? This problem seems to occur mostly when there is a connectivity issue lasting more than 5-15 minutes or so.

The issue though is I don't see a way to set it in pfsense.

[0] https://wiki.strongswan.org/projects/strongswan/wiki/connsection
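For a rough sense of where a window of this size could come from, here is a small sketch of the arithmetic. It assumes strongSwan's documented default retransmission settings (retransmit_timeout 4.0 s, retransmit_base 1.8, retransmit_tries 5) and treats keyingtries as a simple multiplier on the number of attempts. Whether the configuration that pfSense generates uses these values is an assumption, not something confirmed in this thread.

```python
def give_up_after(timeout: float = 4.0, base: float = 1.8, tries: int = 5) -> float:
    """Seconds until one unanswered exchange is abandoned.

    The initial send and each of the `tries` retransmits wait
    timeout * base**n seconds, for n = 0 .. tries.
    """
    return sum(timeout * base ** n for n in range(tries + 1))


per_attempt = give_up_after()
print(round(per_attempt))  # 165

# Illustrative keyingtries values only; the value in use on a given box may differ.
for keyingtries in (1, 3):
    minutes = keyingtries * per_attempt / 60
    print(keyingtries, round(minutes, 1))  # 1 -> 2.8, 3 -> 8.3
```

Under those assumptions an initiator would stop trying somewhere between roughly 3 and 8 minutes into an outage, which is in the same range as the behaviour described above. It does not prove that this is the cause.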

BarronC

Was there ever any resolution to this? I have the same problem when the Internet drops. This usually happens when Starlink does a firmware upgrade early in the morning. In the morning I see the VPN is down and I have to click reconnect, even when there is interesting traffic.

dotdash

@barronc
I've also come across this issue with VTI tunnels not reconnecting after an outage. I used the script referenced here: https://www.reddit.com/r/PFSENSE/comments/ceg1qb/ipsec_site_to_site_no_auto_restart/ as a starting point. I created a simpler script and run it via cron before work starts in the office for the day, and periodically throughout the day.

bbrendon

@dotdash Not resolved here. Thanks for the link.

jimp

On one end, set Child SA Close Action to Restart/Reconnect. Do not set it on both sides or you'll likely end up with duplicate child SAs due to collisions in negotiation.

VTI cannot be triggered on demand because VTI does not support trap policies.
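The "trap not found, unable to acquire reqid" lines quoted earlier in the thread look consistent with this: the kernel asks for an SA, but there is no trap policy for charon to match. As a hedged illustration, the sketch below counts those messages per reqid in a list of log lines. The function name and the regular expression are made up for this example, and the sample lines are copied from the logs posted above.

```python
import re

# Matches the charon message shown in the logs earlier in this thread.
ACQUIRE_FAILED = re.compile(r"\[CFG\] trap not found, unable to acquire reqid (\d+)")


def failed_acquires(log_lines):
    """Return a dict mapping reqid -> number of failed acquire messages."""
    counts = {}
    for line in log_lines:
        match = ACQUIRE_FAILED.search(line)
        if match:
            reqid = int(match.group(1))
            counts[reqid] = counts.get(reqid, 0) + 1
    return counts


sample = [
    "May 31 19:30:12 charon 65059 14[CFG] trap not found, unable to acquire reqid 1000",
    "May 31 19:30:09 charon 65059 01[ENC] <con3000|181> generating INFORMATIONAL response 1040 [ ]",
    "Aug 29 13:54:24 gw charon: 05[CFG] trap not found, unable to acquire reqid 0",
]
print(failed_acquires(sample))  # {1000: 1, 0: 1}
```

A steadily growing count while the tunnel is down would suggest traffic is arriving but cannot trigger the tunnel, which matches the description above.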

bbrendon

@jimp said in IPSEC + VTI + IKEV2 - will not auto-reconnect:

On one end, set Child SA Close Action to Restart/Reconnect.

Yes, but that setting doesn't solve the reconnect issue.

jimp

That's exactly what it does in my testing. It will keep trying to reconnect.

Perhaps there is some other edge case I'm not aware of, but there isn't anything else to be done in strongSwan other than setting the child SA close action.

It already tries to start VTIs when loaded, and by setting that option for child SA close actions it will reconnect any time they are cleared.

At least for me it's been quite persistent about it.

shellbr

I'm having this same problem. VTI does not reconnect if the site is down for more than 5 minutes.
I've also been able to reproduce this in a lab environment. It's only a problem for those of us where one side is set to Responder Only, as is required at one of my sites due to NAT beyond my control. Tunnels that are not limited to Responder Only work fine because, even though one side gives up after 5 minutes and changes to Disconnected, the service is still listening for inbound requests, so the tunnel comes back up as soon as the far side is back online. This becomes a problem when only one side can initiate.

It would be nice to see an option to make it retry every x seconds and do it indefinitely.

dotdash

@jimp
When I was testing, the tunnel would reconnect after a short outage, but if I left it down for an hour, it would never re-establish. The other quirk I came across was that we got packet loss after switching to VTI until we set the MSS to 1360.
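For context on that MSS figure, here is a small sketch of the usual arithmetic. It assumes plain IPv4 and TCP headers of 20 bytes each with no options, and a 1500-byte WAN MTU. Neither assumption is stated in this thread, and the helper names are only illustrative.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one packet of size `mtu`."""
    return mtu - IPV4_HEADER - TCP_HEADER


def mtu_for_mss(mss: int) -> int:
    """Packet size produced by a full-sized segment with the given MSS."""
    return mss + IPV4_HEADER + TCP_HEADER


print(mtu_for_mss(1360))         # 1400
print(mss_for_mtu(1500))         # 1460
print(1500 - mtu_for_mss(1360))  # 100 bytes left for tunnel overhead
```

So an MSS of 1360 keeps inner packets at or below 1400 bytes, leaving about 100 bytes of headroom for the IPsec encapsulation on a 1500-byte link. The actual ESP overhead depends on the cipher and on whether NAT-T is in use, which this sketch does not model.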

jimp

I thought about this some more and have some ideas. I don't know if I'll get to them soon, though.

See https://redmine.pfsense.org/issues/12169

mdomnis

@jimp I'm playing with IPSEC + VTI + IKEv2 on the 2.6 RC and I am still seeing the tunnel (P1 and P2) remain down after a WAN outage of > ~5 minutes and subsequent recovery. I have tried setting the Child SA Close Action to Restart/Reconnect on one side, but that does not seem to help. I confirmed that the code listed in https://redmine.pfsense.org/issues/12169 is in place on my test box.

I'm not sure if having gateway monitoring enabled on the VTI would help in this situation, but I was forced to disable that on older versions due to ipsec restarting all tunnels any time ANY VTI gateway went down. So if I had a HQ site with 10 VTI tunnels to branch sites, any time ONE of them suffered an outage, there would be brief outages to all branches when the gateway goes down and again when it comes back up. No good. Disabling gateway monitoring on the VTI gateways helped tremendously with that problem, but now I'm having issues with the tunnel not reconnecting after a lengthy outage.

To me it seems like it retries for a certain period of time (about 5 minutes), and beyond that it is never going to reconnect by itself; an admin has to go and reconnect the tunnel manually.

bbrendon

@mdomnis Thank you for testing this on the betas.

I don't run betas, but our solution was a cron job that pings across the tunnel and restarts IPsec if the ping fails. It's very hacky, but it was the only workable option other than replacing pfSense. In the future I might try WireGuard, since VTI has been broken for many years now.
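A minimal sketch of that kind of watchdog is below. It is not the script used here or the one from the Reddit link. It assumes a Python 3 interpreter is available on the firewall, that FreeBSD's ping is used (-c for count, -S for source address), and that "ipsec restart" is an acceptable restart command on the version in question. The function names and the addresses in the example call are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command with its output discarded and return the exit status."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode


def tunnel_is_up(remote_ip: str, source_ip: str, runner: Runner = run, count: int = 3) -> bool:
    """Ping the far end of the tunnel from the local VTI address.

    Uses FreeBSD ping flags: -c sets the packet count, -S the source address.
    """
    return runner(["ping", "-c", str(count), "-S", source_ip, remote_ip]) == 0


def watchdog(remote_ip: str, source_ip: str, runner: Runner = run,
             restart_cmd: Sequence[str] = ("ipsec", "restart")) -> str:
    """Return "up" if the ping succeeds; otherwise run restart_cmd and return "restarted"."""
    if tunnel_is_up(remote_ip, source_ip, runner):
        return "up"
    runner(list(restart_cmd))  # assumption: this command is valid on the target box
    return "restarted"


# Example call with hypothetical tunnel addresses (far end, then local VTI):
# watchdog("10.6.106.2", "10.6.106.1")
```

The command runner is passed in so the decision logic can be exercised without touching a live tunnel. Restarting the whole daemon drops every tunnel on the box, so a per-connection command may be preferable where one is available.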

jimp

@mdomnis

Did you enable the new option on the VTI P2 to activate the new keep-alive feature?

ay

@jimp
For VTI tunnels, should we still be setting one side to Responder Only in 2.6?

Troubleshooting Duplicate IPsec SA Entries