2.4.5 <-> 2.4.4-p3 IPsec tunnel stops passing traffic after ~48 hours
-
Hello,
We have three sites with pfSense firewalls in redundant pairs. We are relying on a combination of OpenVPN and IPsec for the site-to-site links*.
We have started to roll out pfSense 2.4.5. However, we've found that after a period of time (usually 48 hours, give or take an hour), traffic stops flowing over the IPsec tunnel.

This is the IPsec configuration from the site which has been upgraded to pfSense 2.4.5 (0.50.50.50):
conn con2000
	fragmentation = yes
	keyexchange = ikev2
	reauth = yes
	forceencaps = no
	mobike = no
	rekey = no
	installpolicy = yes
	type = tunnel
	dpdaction = restart
	dpddelay = 10s
	dpdtimeout = 60s
	auto = route
	left = 0.50.50.50
	right = 0.40.40.40
	leftid = 0.50.50.50
	ikelifetime = 28800s
	lifetime = 3600s
	ike = aes256gcm128-sha256-modp2048!
	esp = aes256gcm128-sha256-modp2048,aes256gcm96-sha256-modp2048,aes256gcm64-sha256-modp2048!
	leftauth = psk
	rightauth = psk
	rightid = 0.40.40.40
	rightsubnet = 172.16.0.0/21,172.16.16.0/21
	leftsubnet = 172.16.8.0/21,172.16.24.0/21
This is the IPsec configuration from the site which is still running pfSense 2.4.4-p3 (0.40.40.40):
conn con1000
	fragmentation = yes
	keyexchange = ikev2
	reauth = yes
	forceencaps = no
	mobike = no
	rekey = no
	installpolicy = yes
	type = tunnel
	dpdaction = restart
	dpddelay = 10s
	dpdtimeout = 60s
	auto = route
	left = 0.40.40.40
	right = 0.50.50.50
	leftid = 0.40.40.40
	ikelifetime = 28800s
	lifetime = 3600s
	ike = aes256gcm128-sha256-modp2048!
	esp = aes256gcm128-sha256-modp2048,aes256gcm96-sha256-modp2048,aes256gcm64-sha256-modp2048,aes256gcm128-sha256-modp2048,aes256gcm96-sha256-modp2048,aes256gcm64-sha256-modp2048,aes256gcm128-sha256-modp2048,aes256gcm96-sha256-modp2048,aes256gcm64-sha256-modp2048,aes256gcm128-sha256-modp2048,aes256gcm96-sha256-modp2048,aes256gcm64-sha256-modp2048!
	leftauth = psk
	rightauth = psk
	rightid = 0.50.50.50
	rightsubnet = 172.16.8.0/21,172.16.24.0/21
	leftsubnet = 172.16.0.0/21,172.16.16.0/21
The first thing that stands out is that the esp ciphers (for the P2 proposal) are repeated four times in the second site's configuration.
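For anyone comparing the two configs, a quick way to check whether the duplication actually changes the effective P2 proposal set is to deduplicate the list while preserving order. A minimal sketch (the esp string below is rebuilt from the configs above; `dedupe_proposals` is just an illustrative helper, not a pfSense or strongSwan function):

```python
# Minimal sketch: deduplicate a strongSwan esp/ike proposal list while
# preserving order. The duplicated entries in the 2.4.4-p3 config add no
# new proposals, so both sites should still negotiate the same transforms.
def dedupe_proposals(proposal_line: str) -> str:
    # Strip the trailing '!' (strongSwan's strict-proposal marker) if present.
    strict = proposal_line.endswith("!")
    body = proposal_line.rstrip("!")
    seen = []
    for p in body.split(","):
        if p not in seen:
            seen.append(p)
    return ",".join(seen) + ("!" if strict else "")

# The repeated list from the con1000 config (3 ciphers x 4 repetitions):
esp = ",".join(["aes256gcm128-sha256-modp2048",
                "aes256gcm96-sha256-modp2048",
                "aes256gcm64-sha256-modp2048"] * 4) + "!"
print(dedupe_proposals(esp))
# -> aes256gcm128-sha256-modp2048,aes256gcm96-sha256-modp2048,aes256gcm64-sha256-modp2048!
```

So the repetition looks like a config-generation quirk rather than a functional difference between the sites.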
When the problem occurs, the only indication of an issue is that DPD requests and responses are still being exchanged over the tunnel:
Apr 19 00:31:57 charon 15[ENC] <con2000|63> parsed INFORMATIONAL response 54 [ ]
Apr 19 00:31:57 charon 15[NET] <con2000|63> received packet: from 0.40.40.40[500] to 0.50.50.50[500] (57 bytes)
Apr 19 00:31:57 charon 15[NET] <con2000|63> sending packet: from 0.50.50.50[500] to 0.40.40.40[500] (57 bytes)
Apr 19 00:31:57 charon 15[ENC] <con2000|63> generating INFORMATIONAL request 54 [ ]
Apr 19 00:31:57 charon 15[IKE] <con2000|63> sending DPD request
Apr 19 00:31:46 charon 05[ENC] <con2000|63> parsed INFORMATIONAL response 53 [ ]
Apr 19 00:31:46 charon 05[NET] <con2000|63> received packet: from 0.40.40.40[500] to 0.50.50.50[500] (57 bytes)
Apr 19 00:31:46 charon 05[NET] <con2000|63> sending packet: from 0.50.50.50[500] to 0.40.40.40[500] (57 bytes)
Apr 19 00:31:46 charon 05[ENC] <con2000|63> generating INFORMATIONAL request 53 [ ]
Apr 19 00:31:46 charon 05[IKE] <con2000|63> sending DPD request
These are the equivalent logs from the other firewall (0.40.40.40):
Apr 19 00:29:46 charon 05[NET] <con1000|119> sending packet: from 0.40.40.40[500] to 0.50.50.50[500] (57 bytes)
Apr 19 00:29:46 charon 05[ENC] <con1000|119> generating INFORMATIONAL response 41 [ ]
Apr 19 00:29:46 charon 05[ENC] <con1000|119> parsed INFORMATIONAL request 41 [ ]
Apr 19 00:29:46 charon 05[NET] <con1000|119> received packet: from 0.50.50.50[500] to 0.40.40.40[500] (57 bytes)
Apr 19 00:29:36 charon 05[NET] <con1000|119> sending packet: from 0.40.40.40[500] to 0.50.50.50[500] (57 bytes)
Apr 19 00:29:36 charon 05[ENC] <con1000|119> generating INFORMATIONAL response 40 [ ]
Apr 19 00:29:36 charon 05[ENC] <con1000|119> parsed INFORMATIONAL request 40 [ ]
Apr 19 00:29:36 charon 05[NET] <con1000|119> received packet: from 0.50.50.50[500] to 0.40.40.40[500] (57 bytes)
After the above instance, we increased the logging level to diag for some aspects of IPsec. In this example, traffic stopped flowing over the tunnel between 00:00:06 and 00:00:12:
Apr 21 00:00:06 PF101 charon: 11[MGR] checkout IKEv2 SA with SPIs 8d092fb85ad68e18_i 3d019220e36fccd3_r
Apr 21 00:00:06 PF101 charon: 11[MGR] checkout IKEv2 SA with SPIs 8d092fb85ad68e18_i 3d019220e36fccd3_r
Apr 21 00:00:06 PF101 charon: 11[MGR] IKE_SA con2000[193] successfully checked out
Apr 21 00:00:06 PF101 charon: 11[MGR] IKE_SA con2000[193] successfully checked out
Apr 21 00:00:06 PF101 charon: 11[MGR] checkin IKE_SA con2000[193]
Apr 21 00:00:06 PF101 charon: 11[MGR] <con2000|193> checkin IKE_SA con2000[193]
Apr 21 00:00:06 PF101 charon: 11[MGR] checkin of IKE_SA successful
Apr 21 00:00:06 PF101 charon: 11[MGR] <con2000|193> checkin of IKE_SA successful
Apr 21 00:00:12 PF101 charon: 11[MGR] checkout IKEv2 SA with SPIs 8d092fb85ad68e18_i 3d019220e36fccd3_r
Apr 21 00:00:12 PF101 charon: 11[MGR] checkout IKEv2 SA with SPIs 8d092fb85ad68e18_i 3d019220e36fccd3_r
Apr 21 00:00:12 PF101 charon: 11[MGR] IKE_SA con2000[193] successfully checked out
Apr 21 00:00:12 PF101 charon: 11[MGR] IKE_SA con2000[193] successfully checked out
Apr 21 00:00:12 PF101 charon: 11[IKE] sending DPD request
Apr 21 00:00:12 PF101 charon: 11[IKE] <con2000|193> sending DPD request
Apr 21 00:00:12 PF101 charon: 11[IKE] queueing IKE_DPD task
Apr 21 00:00:12 PF101 charon: 11[IKE] <con2000|193> queueing IKE_DPD task
Apr 21 00:00:12 PF101 charon: 11[IKE] activating new tasks
Apr 21 00:00:12 PF101 charon: 11[IKE] <con2000|193> activating new tasks
Apr 21 00:00:12 PF101 charon: 11[IKE] activating IKE_DPD task
Apr 21 00:00:12 PF101 charon: 11[IKE] <con2000|193> activating IKE_DPD task
Apr 21 00:00:12 PF101 charon: 11[ENC] <con2000|193> generating INFORMATIONAL request 0 [ ]
Apr 21 00:00:12 PF101 charon: 11[NET] <con2000|193> sending packet: from 0.50.50.50[500] to 0.40.40.40[500] (57 bytes)
Apr 21 00:00:12 PF101 charon: 11[MGR] checkin IKE_SA con2000[193]
Apr 21 00:00:12 PF101 charon: 11[MGR] <con2000|193> checkin IKE_SA con2000[193]
Apr 21 00:00:12 PF101 charon: 11[MGR] checkin of IKE_SA successful
Apr 21 00:00:12 PF101 charon: 11[MGR] <con2000|193> checkin of IKE_SA successful
Apr 21 00:00:12 PF101 charon: 11[MGR] checkout IKEv2 SA by message with SPIs 8d092fb85ad68e18_i 3d019220e36fccd3_r
Apr 21 00:00:12 PF101 charon: 11[MGR] checkout IKEv2 SA by message with SPIs 8d092fb85ad68e18_i 3d019220e36fccd3_r
Apr 21 00:00:12 PF101 charon: 11[MGR] IKE_SA con2000[193] successfully checked out
Apr 21 00:00:12 PF101 charon: 11[MGR] IKE_SA con2000[193] successfully checked out
Apr 21 00:00:12 PF101 charon: 11[NET] <con2000|193> received packet: from 0.40.40.40[500] to 0.50.50.50[500] (57 bytes)
Apr 21 00:00:12 PF101 charon: 11[ENC] <con2000|193> parsed INFORMATIONAL response 0 [ ]
Apr 21 00:00:12 PF101 charon: 11[IKE] activating new tasks
Apr 21 00:00:12 PF101 charon: 11[IKE] <con2000|193> activating new tasks
Apr 21 00:00:12 PF101 charon: 11[IKE] nothing to initiate
Apr 21 00:00:12 PF101 charon: 11[IKE] <con2000|193> nothing to initiate
Apr 21 00:00:12 PF101 charon: 11[MGR] checkin IKE_SA con2000[193]
The notable new difference here, between the working and broken states, is the 'activating new tasks' and 'nothing to initiate' log messages. These are not logged on the 2.4.4-p3 firewall when the problem happens.
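In case it helps anyone watching for the same pattern, here's a rough sketch for pulling those marker messages out of a saved charon log excerpt. It just does plain text matching on the syslog-style format shown above; `find_marker` is an illustrative helper, not anything shipped with pfSense or strongSwan:

```python
import re

# Rough sketch: scan charon log lines for a marker message (e.g. the
# 'nothing to initiate' entries that only appeared in the broken state),
# returning (timestamp, connection) pairs so they can be lined up with
# the moment traffic stopped. Assumes the "<conn|num>" tagged log format
# shown in the excerpts above.
LINE = re.compile(
    r"^(?P<ts>\w{3} \d+ [\d:]+) .*?<(?P<conn>[^|>]+)\|\d+> (?P<msg>.*)$"
)

def find_marker(lines, marker="nothing to initiate"):
    hits = []
    for line in lines:
        m = LINE.match(line)
        if m and marker in m.group("msg"):
            hits.append((m.group("ts"), m.group("conn")))
    return hits
```

Feeding it the excerpt above with the default marker would flag con2000 at 00:00:12, which is exactly the window in which traffic stopped.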
We have now set the log level to 'diag' for more components and will hopefully have some more information when the problem occurs again. Are there any commands we can run, or particular things we should look out for, the next time we hit this situation? We are trying to determine whether we should plow on with the pfSense 2.4.5 upgrade or roll back the firewalls that have already received it.
Thanks
*We recently found, upon upgrading to 2.4.4-p3, that OpenVPN was CPU-bound on our virtualised instances, and we will be swapping it out for IPsec shortly after this issue is resolved.
-
Why do you have rekeying disabled? Are you 100% sure that your MTUs are correctly configured? PMTUD has never worked for me in any VPN scenario.
Also, why is your lifetime so low? It's almost a fifth of the value that pfSense sets by default. Try going back to that.
In the IPsec phase 2 of each host, have you set it to automatically ping the other side (under "Advanced Options")? There should be pings back and forth between the hosts at all times.
-
Thank you for your reply.
@rmccall2k16 said in 2.4.5 <-> 2.4.4-p3 IPsec tunnel stops passing traffic after ~48 hours:
Why do you have rekeying disabled? Are you 100% sure that your MTUs are correctly configured? PMTUD has never worked for me in any VPN scenario.
No idea why rekeying was disabled. We have enabled it now.
I don't believe MTUs are a problem here.

Also, why is your lifetime so low? It's almost a fifth of the value that pfSense sets by default. Try going back to that.
The P1 (IKE) lifetime and P2 (SA) lifetime values are the same as the ones we see when setting up a new connection from scratch. Do you think changing them would affect the issue we have?
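For reference on what those lifetimes mean in practice: with rekey disabled, a child SA simply expires at the end of its lifetime and nothing negotiates a replacement, whereas with rekeying enabled strongSwan starts rekeying a margin before expiry. A toy calculation (the 540 s figure is strongSwan's documented default margintime of 9 minutes; treat it as an assumption for your particular build, and note strongSwan also applies random jitter on top):

```python
# Toy calculation: when is a replacement SA negotiated?
# lifetime_s = hard expiry of the child SA (3600 s in the configs above)
# margin_s   = rekey head-start (strongSwan's default margintime is 9 min;
#              assumed here -- check your build's defaults)
def rekey_time(lifetime_s: int, margin_s: int = 540, rekey: bool = True):
    if not rekey:
        return None  # no replacement is negotiated; the SA just expires
    return lifetime_s - margin_s

print(rekey_time(3600))               # 3060 -> rekeys roughly 51 min in
print(rekey_time(3600, rekey=False))  # None -> SA just dies at the hour mark
```

That's why `rekey = no` is suspicious here: nothing proactively replaces the SAs, so everything rides on DPD or a fresh negotiation noticing the dead tunnel.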
In the IPSEC phase 2 of each host, have you set it to automatically ping the other side (Under "Advanced Options")? There should be pings back and forth between the hosts at all times.
Yes. However, this tunnel is always carrying traffic in both directions.
Thanks again
MT
-
Just to clarify, are you seeing the tunnels as up (both P1 and P2), but no traffic passing from one side to the other?
I'm seeing that on 2.4.4-p3 too, using VTIs though. I posted about it here https://forum.netgate.com/topic/149043/gateway-monitoring-gets-stuck-in-infinite-loop-when-using-multiple-vtis-on-sg-3100
Back then I thought I had drilled down to the root cause, gateway monitoring, but with some more reports I've read in this forum, I believe I was following a symptom rather than a cause.

Some more threads:
https://forum.netgate.com/topic/150508/ipsec-tunnels-work-for-several-hours-to-days-but-then-stop-routing-traffic/10
https://forum.netgate.com/topic/148857/ipsec-ikev2-error-trap-not-found-unable-to-acquire-reqid

It might all get fixed by the strongSwan patch mentioned in the last thread. Time will tell when it's going to be implemented.
-
@marcquark said in 2.4.5 <-> 2.4.4-p3 IPsec tunnel stops passing traffic after ~48 hours:
Just to clarify, are you seeing the tunnels as up (both P1 and P2), but no traffic passing from one side to the other?
I'm sorry, I'm not sure it is the same issue. We started the IPsec tunnel and everything was fine until around 48 hours later, at which point traffic stopped flowing over the tunnel, save for the DPD requests and responses, which suggested the tunnel itself was fine.
I think we have resolved that problem too. Again, we have no idea why rekeying was disabled on the P1s, but having enabled it, the tunnels have been working faultlessly for just over 10 days.