IPsec not reconnecting after site failure

kevindd992002

So I have two sites (main and remote), each with a pfSense box, connected via a site-to-site IPsec tunnel using routed VTI. The main site uses a public static IP, but the remote site is behind CGNAT (so a private IP is assigned to its WAN interface). To make the tunnel work, I had to create a DDNS entry for the remote site's WAN interface and put that as the peer identifier in the main site's IPsec settings. I also had to check "Responder only" in the main site's IPsec settings. I have DPD enabled on both sides.

So to establish the connection, I have to click the Connect button under Status -> IPsec. After this, if I restart either of the pfSense boxes, the remote pfSense box reconnects and re-establishes the IPsec tunnel without any issues. The problem is that when either site has an Internet outage of, say, more than an hour, the tunnel does not reconnect automatically. I have to do the manual "Connect" process again under Status -> IPsec.

I also don't use the "Automatically ping host" feature in the phase 2 settings on either side because I already have gateway monitoring set up (pinging the IPsec interface IP on the far side). I read somewhere that this does the same thing with routed VTI.

@jimp Any ideas how I can solve the reconnection failure?

bbrendon

@kevindd992002 Did you make progress on this? There is a restart on child close option, but I have tried that and still do not get consistent connections. https://redmine.pfsense.org/issues/9767#note-1

kevindd992002

I know I resolved this in the past, but sorry, I forgot what I did because I have since switched to WireGuard. It's way faster than both OpenVPN and IPsec on a 200 Mbps link between the two sites.

shellbr

I was about to start a topic for this. I have your exact issue verbatim, so you saved all the typing! I've also been able to recreate the issue in a lab environment. If anyone wants to see any logs, just let me know how to collect the data you want to see and I'll be happy to share it.

shellbr

So I've been trying to figure this out in my lab environment. It seems that when the responder-only side (Site A) is taken offline, the other side (Site B) goes into "connecting" status for 5 minutes. If Site A is brought back online within that time, the tunnel reconnects. Otherwise, Site B changes to the "Disconnected" state and makes no further attempt to contact Site A. These are the last few lines in Site B's log:
Jul 11 16:30:06 rtr2 charon[69811]: 16[IKE] <con1000|2> giving up after 5 retransmits
Jul 11 16:30:06 rtr2 charon[69811]: 16[IKE] <con1000|2> establishing IKE_SA failed, peer not responding
Jul 11 16:30:06 rtr2 charon[69811]: 16[MGR] <con1000|2> checkin and destroy IKE_SA con1000[2]
Jul 11 16:30:06 rtr2 charon[69811]: 16[IKE] <con1000|2> IKE_SA con1000[2] state change: CONNECTING => DESTROYING
Jul 11 16:30:06 rtr2 charon[69811]: 16[MGR] checkin and destroy of IKE_SA successful

I've tried playing with the DPD and reauth values, but they make no difference. It's always 5 minutes, and the log shows the same "giving up after 5 retransmits" message. I'm not sure what setting is causing it to stop retrying so quickly.
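
For anyone trying to make sense of the "giving up after 5 retransmits" line: strongSwan's charon backs off exponentially between retransmits, so the time before it gives up on one attempt can be worked out. Below is a rough sketch of that arithmetic. It assumes the defaults documented for strongSwan (retransmit_tries = 5, retransmit_timeout = 4.0, retransmit_base = 1.8); pfSense may set different values, and the function name is made up for illustration. With those defaults a single attempt gives up after about 165 seconds, so the result does not have to match the 5 minutes observed above.

```python
def retransmit_waits(tries=5, timeout=4.0, base=1.8):
    """Seconds waited after the initial send and after each retransmit.

    Wait number i (starting at 0) is timeout * base**i. The last wait is
    the one after which the daemon gives up on the exchange.
    Defaults are the assumed strongSwan values, not read from pfSense.
    """
    return [timeout * base ** i for i in range(tries + 1)]


waits = retransmit_waits()
for i, wait in enumerate(waits):
    label = "initial packet" if i == 0 else f"retransmit {i}"
    print(f"after {label}: wait {wait:.0f}s")
print(f"gives up after about {sum(waits):.0f}s in total")
```

If the numbers on a given box differ, plugging its actual retransmit settings into the function should show how long one connection attempt lasts before charon stops.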

bbrendon

@shellbr There is another thread about this. Someone suggested a script there.
https://forum.netgate.com/post/992563
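
For illustration only, here is a minimal sketch of what such a watchdog could look like. This is not the script from the linked thread. It assumes a Python interpreter is available on the initiating box, that the far-side IPsec interface answers ping, and that the tunnel can be brought up from the command line. The peer address, the connection name, and the reconnect command are all placeholders to be adjusted.

```python
import subprocess

# Hypothetical placeholders, not values taken from this thread.
PEER_VTI_IP = "10.0.0.1"   # far-side IPsec interface address (assumed)
CONN_NAME = "con1000"      # connection name; check your own logs
# Assumed reconnect command for swanctl-based setups. Older stroke-based
# setups would use ["ipsec", "up", CONN_NAME] instead.
RECONNECT_CMD = ["swanctl", "--initiate", "--child", CONN_NAME]


def ping_peer(host=PEER_VTI_IP, count=3):
    """Return True if at least one echo reply comes back from host."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def initiate_tunnel(cmd=RECONNECT_CMD):
    """Run the reconnect command; True if it exited successfully."""
    return subprocess.run(cmd).returncode == 0


def watchdog_once(is_up, reconnect):
    """Run one check; call reconnect() only when is_up() reports failure."""
    if is_up():
        return "up"
    return "reconnected" if reconnect() else "reconnect-failed"


# Example call for a cron job running every few minutes:
# print(watchdog_once(ping_peer, initiate_tunnel))
```

The check and the reconnect action are passed in as functions so the decision logic can be tried out without touching the real tunnel.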