pfSense 2.2.4 rekey issues

Wolvesclaw

So I've got a lot of site-to-site tunnels. Most of them work just fine, but since 2.2.3 some of them stop working after 8 hours.

When I look at the pfSense status it says "connecting" while the counterpart thinks it's still connected. It's happening with two different Draytek Vigor routers (2760 Delight and 2710n). The only solution is to drop the tunnel on the router; it then comes back up within seconds and works just fine for another 8 hours.

doktornotor

Starting yet another thread about the same doesn't help.

Recently https://forum.pfsense.org/index.php?topic=97699.0 – and many others before.

cmb

First, upgrade to 2.2.4. There were some ID issues in 2.2.3 that might be impacting you there. Then, if it still happens, need some info. Logs from both sides around the time it occurs.

@doktornotor:

Starting yet another thread about the same doesn't help.

Except it's not the same as his issue. For that matter, all the issues you were hitting previously and are still complaining about have been resolved in the last 3 releases.

doktornotor

@cmb:

For that matter, all the issues you were hitting previously and are still complaining about have been resolved in the last 3 releases.

Seeing the amount of IPsec-related complaints here, frankly I'm not going to revert any of the tunnels back to IPsec for another year or so. Strongswan == steaming pile of poo ATM. And frankly, for those site-to-site tunnels, people don't give a damn about what kind of VPN is running there, they never see it… They need stable access to the other side of the tunnel, that's all. Sadly, that's not something IPsec has ever offered on 2.2.x with the strongswan thing behind it. Endless regressions, works in one version, quits in another. I also have better things to do than babysit the configuration for the changes in the code, plus - debugging any of this is an incredible PITA. Whoever wrote that logging code must be smoking something pretty strong. Useful info buried in heaps of noise.

Wolvesclaw

@cmb:

First, upgrade to 2.2.4. There were some ID issues in 2.2.3 that might be impacting you there. Then, if it still happens, need some info. Logs from both sides around the time it occurs.

@doktornotor:

Starting yet another thread about the same doesn't help.

Except it's not the same as his issue. For that matter, all the issues you were hitting previously and are still complaining about have been resolved in the last 3 releases.

I'm already sitting on 2.2.4 and still have the same behavior. Will get some logs next time it happens!

Wolvesclaw

Just happened with 1 site… Only got logs from the strongSwan side.

Aug 11 15:26:12 charon: 16[CFG] ignoring acquire, connection attempt pending
Aug 11 15:26:12 charon: 12[KNL] creating acquire job for policy 1.1.1.1/32|/0 === 2.2.2.2/32|/0 with reqid {32}

Aug 11 15:25:31 charon: 02[NET] <con31000|3422> sending packet: from 1.1.1.1[500] to 2.2.2.2[500] (196 bytes)
Aug 11 15:25:31 charon: 02[IKE] <con31000|3422> sending retransmit 5 of request message ID 0, seq 1
Aug 11 15:25:31 charon: 02[IKE] <con31000|3422> sending retransmit 5 of request message ID 0, seq 1

Aug 11 15:24:01 charon: 16[NET] <con31000|3422> sending packet: from 1.1.1.1[500] to 2.2.2.2[500] (196 bytes)
Aug 11 15:24:01 charon: 16[ENC] <con31000|3422> generating ID_PROT request 0 [ SA V V V V V V ]
Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> initiating Main Mode IKE_SA con31000[3422] to 2.2.2.2
Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> initiating Main Mode IKE_SA con31000[3422] to 2.2.2.2
Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> peer not responding, trying again (3/3)
Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> peer not responding, trying again (3/3)
Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> giving up after 5 retransmits
Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> giving up after 5 retransmits

Aug 11 15:23:49 charon: 05[CFG] ignoring acquire, connection attempt pending
Aug 11 15:23:49 charon: 11[KNL] creating acquire job for policy 1.1.1.1/32|/0 === 2.2.2.2/32|/0 with reqid {32}
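
For anyone sifting through similar output, here is a minimal Python sketch (not from the thread) that pulls the "giving up after N retransmits" events out of charon lines pasted in this format. The regex is an assumption based only on the lines shown above, and `parse_line` and `failed_attempts` are hypothetical helper names. Lines that pfSense logs twice are counted once.

```python
import re

# Assumed layout, inferred from the pasted lines only:
#   "Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> message text"
# The "<conn|sa>" part is optional (CFG/KNL lines above do not carry it).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+charon:\s+"
    r"(?P<thread>\d+)\[(?P<subsys>[A-Z]+)\]\s*"
    r"(?:<(?P<conn>[^|>]+)\|(?P<sa>\d+)>)?\s*(?P<msg>.*)$"
)
GIVE_UP_RE = re.compile(r"giving up after (\d+) retransmits")


def parse_line(line):
    """Split one charon log line into its fields, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def failed_attempts(lines):
    """Return (timestamp, connection, retransmits) for each distinct give-up event."""
    seen = set()
    events = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        hit = GIVE_UP_RE.search(rec["msg"])
        if hit is None:
            continue
        key = (rec["ts"], rec["conn"], rec["sa"])
        if key not in seen:  # skip the duplicated IKE lines
            seen.add(key)
            events.append((rec["ts"], rec["conn"], int(hit.group(1))))
    return events
```

Feeding it the lines of a saved log file, for example `failed_attempts(open("ipsec.log"))` with a hypothetical file name, gives one entry per failed negotiation, which makes it easier to line the failures up against the tunnel lifetime.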

Wolvesclaw

Router doesn't say much when it happens:
2015-08-17 20:01:43 connection: 810d68dc is dial-out and NOT for dynamic client; in_index=0, out_index=-1. U can try to reduce phase1 lifetime…
2015-08-17 20:01:43 Responding to Main Mode from x.x.x.x

cmb

Given this:
@Wolvesclaw:

Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> peer not responding, trying again (3/3)
Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> peer not responding, trying again (3/3)
Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> giving up after 5 retransmits
Aug 11 15:24:01 charon: 16[IKE] <con31000|3422> giving up after 5 retransmits

and this:
@Wolvesclaw:

2015-08-17 20:01:43 connection: 810d68dc is dial-out and NOT for dynamic client; in_index=0, out_index=-1. U can try to reduce phase1 lifetime…

It appears your Draytek may be configured as initiator-only, though the "Responding to" in the next line seems odd given that no actual response was ever received on the other end. There is some setting on the Draytek side similar to "initiator only" that shouldn't be set for a site-to-site VPN in this case, so that either end can initiate the connection.

Wolvesclaw

Yeah, you can switch the Drayteks to "Dial out only" and "always on". This is the setup that always worked for us.

On the problematic sites I switched to dial in AND out, so it's initiated when someone starts working at the site. But that does not really help. After 7.5 hours pfSense initiates the reconnect while the Draytek shows that it's still connected.

The workaround for now is to raise the phase 2 lifetime to 12 hours, so the problem occurs when nobody is working.
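
To see why a rekey starts shortly before the lifetime runs out, and where it lands with the 12-hour workaround, here is a rough sketch. It assumes strongSwan's documented ipsec.conf rule, rekey time = lifetime - (margintime + random(0, margintime * rekeyfuzz)), with the stock defaults of margintime 9m and rekeyfuzz 100%. The values pfSense actually writes may differ, and the thread does not say which lifetime is set to 8 hours, so treat the numbers as illustrative. `rekey_window` and `fmt` are hypothetical helper names.

```python
def rekey_window(lifetime_s, margintime_s=9 * 60, rekeyfuzz=1.0):
    """Return (earliest, latest) rekey start in seconds after the SA came up.

    Assumes: rekey = lifetime - (margintime + random(0, margintime * rekeyfuzz)).
    The defaults are strongSwan's stock ipsec.conf values, not values from the thread.
    """
    earliest = lifetime_s - (margintime_s + margintime_s * rekeyfuzz)
    latest = lifetime_s - margintime_s
    return earliest, latest


def fmt(seconds):
    """Format a duration in seconds as e.g. '7h42m'."""
    hours, rest = divmod(int(seconds), 3600)
    return f"{hours}h{rest // 60:02d}m"


if __name__ == "__main__":
    for hours in (8, 12):
        earliest, latest = rekey_window(hours * 3600)
        print(f"{hours}h lifetime: rekey between {fmt(earliest)} and {fmt(latest)}")
```

Under those assumptions an 8-hour lifetime puts the rekey somewhere between 7h42m and 7h51m after the tunnel came up, and a 12-hour lifetime between 11h42m and 11h51m, so raising the lifetime only moves the failure to a time when nobody is working.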