IPSec Phase 1 Renegotiation - Multiple SAs no Traffic
-
I found an issue that I'm able to reproduce. I have GRE over IPsec passing traffic from an HP MSR 20-10 router to a pfSense virtual machine in Hyper-V. GRE works just fine; it isn't until I add IPsec that I start having issues. I have disabled IPsec offloading on the virtual NIC and am using pfSense 2.2.
After the IPsec tunnel comes up initially, everything is OK until several hours later, when I notice pings getting dropped across the tunnel. Issuing the command to forcibly reset the IKE SA on the HP router several times allows the tunnel to come back up fully about 20-30 seconds later. The same operation on pfSense doesn't do anything unless I disable the IPsec service and wait for the SAs to expire on the HP router.
I set the Phase 1 and Phase 2 lifetimes to the minimums supported on both ends of the tunnel (1440 seconds for Phase 1, 180 seconds for Phase 2) to reproduce the issue more quickly, and it worked. It seems to happen when Phase 1 attempts to renegotiate: the SAs get out of sync on both sides and appear to be fighting over who is in charge. Some packets pass, others don't, and the SAs flap on both sides of the tunnel.
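For anyone wanting to reproduce this the same way, the short lifetimes on the HP side amount to something like the following (an annotated sketch from memory; the policy name "test" is a placeholder and exact Comware 5 command syntax may vary by release):

```
ike proposal 1
 sa duration 1440
#
ipsec policy test 1 isakmp
 sa duration time-based 180
#
```

The `sa duration` under the IKE proposal is the Phase 1 (IKE SA) lifetime in seconds; the time-based `sa duration` under the IPsec policy is the Phase 2 (IPsec SA) lifetime. The pfSense side is just the Phase 1/Phase 2 lifetime fields in the GUI.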
The closest match I've been able to find on the forums so far, though it's likely not relevant, is a rather old post: https://forum.pfsense.org/index.php?topic=16274.0
I've tried disabling DPD and disabling and re-enabling TCP large receive offload and checksum offload. Nothing seems to have any lasting effect.
I know the issue is between the HP MSR router and pfSense: when using pfSense on both ends of the configuration, the tunnel stays up without issue. I've upgraded the HP MSR router firmware a few times, but that hasn't had any effect either.
I'm stuck and looking for ideas on what to try next.
-
Which version of IKE are you using? If v2 then the following may be relevant:
I have had similar problems, but I've found a workaround.
My configuration:
Branch offices: pfSense v2.2-RELEASE.
Headquarters: FortiOS v4.0 MR3 patch 7, behind a NAT device (pfSense v2.2-RELEASE).
Each branch office is connected to HQ via multiple IPsec (Phase 2) SAs.
My problems:
1: charon fails to initiate an IKE (v2) connection to FortiOS; it can only work as responder.
2: charon fails to rekey the IKE (Phase 1) SA.
Workaround: configure very long lifetimes for both Phase 1 and Phase 2 on the pfSense device. Namely:
pfSense 2.2: Phase 1 90000s, Phase 2 90000s.
FortiOS 4.0: Phase 1 9000s, Phase 2 3000s.
P.S. Don't set a lifetime < 1080 seconds, i.e. 18 minutes.
-
@dusan - I'm using IKEv1, and I only set the incredibly short lifetimes on both Phase 1 and Phase 2 to make the issue recur more quickly. I've thought about increasing the lifetimes as a workaround.
I only have a single Phase 2 in the config, and it's set for transport mode to use the GRE tunnel. Any chance it's related to the rekey bug noted in the pfSense blog? https://blog.pfsense.org/?p=1546 I'm OK waiting for a patch; I just need to know one is on the horizon that's applicable to this situation.
I've attempted to go to IKEv2; both endpoints should support it, but I'm new to IKEv2 and having a little trouble understanding the differences in how Phase 1 and Phase 2 happen. I see many failed attempts at establishing the tunnel with IKEv2, so I know I'm part of the way there.
I'm more than happy to post configs and debug traces; I'm just not sure how to get the latter from pfSense so it's useful in a post.
-
> @dusan - I'm using IKEv1, and I only set the incredibly short lifetimes on both Phase 1 and Phase 2 to make the issue recur more quickly. I've thought about increasing the lifetimes as a workaround.
I've had rekeying problems between pfSense 2.2 and FortiGate 4.0 under IKEv1, too. Unfortunately, the trick with unequal lifetimes does not work under v1 (see below).
18 minutes is the maximal variation of the (randomized) lifetime set by default in charon (strongSwan), the new IKE daemon in pfSense 2.2; it follows from charon's defaults of a 9-minute rekey margin with 100% fuzz, so rekeying can start up to 18 minutes before the nominal lifetime expires. A lifetime <= 18 minutes therefore may not reflect the real use case; it may actually trigger more exceptional conditions and more errors. For testing purposes, 20 minutes should be the minimum.
> I've attempted to go to IKEv2; both endpoints should support it, but I'm new to IKEv2 and having a little trouble understanding the differences in how Phase 1 and Phase 2 happen. I see many failed attempts at establishing the tunnel with IKEv2, so I know I'm part of the way there.
The most visible difference between IKEv1 and IKEv2 is perhaps the semantics of the configured lifetime. In v1, both sides must negotiate a common lifetime, so even if the lifetime is configured differently, both sides will share a common lifetime with a small random variation. Thus, which side plays which role (initiator/responder) in the next rekeying is completely random. In v2, the two sides do not negotiate the lifetime; each side keeps its own value for the negotiated SA. So for rekeying, if the configured lifetimes are different enough, the roles are deterministic: the side with the shorter configured lifetime becomes the initiator (and the other side the responder).
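To illustrate, under IKEv2 you can exploit this by giving the side you want as the rekey initiator a clearly shorter lifetime. A sketch in strongSwan ipsec.conf terms for the pfSense side (connection name and peer address are placeholders; pfSense 2.2 generates this file from the GUI fields rather than letting you edit it directly):

```
conn branch-to-hq
        keyexchange=ikev2
        left=%defaultroute
        right=203.0.113.1      # placeholder peer address
        ikelifetime=90000s     # Phase 1 lifetime; peer configured much lower (e.g. 9000s)
        lifetime=90000s        # Phase 2 lifetime; peer configured much lower (e.g. 3000s)
        auto=start
```

With the peer's lifetimes far below these values, the peer deterministically initiates every rekey and pfSense only ever responds, which is the essence of the workaround above.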
A more precise description of the difference can be found in Section 2.8 of RFC 7296 (October 2014).
> I only have a single Phase 2 in the config, and it's set for transport mode to use the GRE tunnel. Any chance it's related to the rekey bug noted in the pfSense blog? https://blog.pfsense.org/?p=1546
I can't tell.
-
I'll test using IKEv1 with 3600 seconds for Phase 1 and 1200 seconds for Phase 2 and see how long it takes to bomb out. I hadn't thought to try keying different lifetimes for either phase in IKEv1; a pity that wouldn't work anyway. If it behaves as it has in the past, it will fail right around the hour mark, when Phase 1 rekeys.
I was talking to an associate at work today about the issue, and he described a problem with similar symptoms. In his case, dynamic routing and route injection were in play: the default gateway ended up being pushed through the IPsec tunnel, which would collapse the tunnel, let the default gateway become reachable again, bring the tunnel back up, and so on. I doubt that's what is happening to me here, though: I'm using all static routes, just a /24 on either side of the tunnel, with a /30 network for the GRE tunnel and a single IP on each side.
While writing this reply, I did notice about 6 seconds of dropped pings when Phase 2 rekeyed; another Phase 2 rekey dropped only about 2 pings.
The HP router has an SSL VPN server in it, but I think it's designed to work only with clients, not with another VPN server; at least I can't find any configuration examples.
-
No net change: after Phase 1 rekeys, the SAs flap just as before. I was switching back to just a GRE tunnel (disabling the IPsec Phase 1 entry), and when I did, the pfSense box crashed. I was unable to submit the crash dump file because this lab environment isn't connected to the Internet. I've done that before without incident, but not while actively having issues.
Any ideas how to upload the crash dump file when the box isn't connected to the Internet? The dump is likely gone already, but I will attempt to reproduce the situation that caused the crash.
-
I found a bug report in strongSwan that may be related to what I've been experiencing:
https://wiki.strongswan.org/issues/597
The ticket was closed, but there wasn't any indication whether the bug was actually confirmed as resolved.
-
I came across another post about making pfSense act as a responder only in 2.2.1 (not yet released): https://forum.pfsense.org/index.php?topic=89475.0
I'm thinking there's an outside chance this might also help fix the problem I've been seeing. There doesn't appear to be a way to force the HP router to act as an IKEv1 initiator only, as in some examples I've found for Cisco devices that appear to help with this type of issue.
In the interim between now and a maintenance release, when I get time to experiment with IKEv2, I'll post the results in this thread as well.
-
Looks like the answer is: use IKEv2. I finally found a config on the HP router that works with pfSense; it's been stable now for over 8 hours. I didn't realize how much faster the tunnel negotiation would be, either! The missing piece was in the Phase 1 proposal:
ikev2 proposal 1 encryption aes-cbc-128 integrity sha1 prf sha1 group 2
The tunnel wouldn't come up until I added "prf sha1". I'm new to IKEv2, so I'm not entirely sure what that did for me, because I didn't seem to have a corresponding setting in pfSense for Phase 1. I had thought it was the hash, but that's covered by "integrity sha1".
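For what it's worth, IKEv2 negotiates the pseudo-random function (PRF, used to derive keying material) as its own transform, separate from the integrity algorithm. As far as I can tell, strongSwan (which pfSense 2.2 uses) simply derives the PRF from the integrity/hash algorithm when none is given explicitly, which would explain why there's no matching field in the pfSense GUI. In strongSwan proposal notation (illustrative only; this is not a pfSense GUI setting), these two should be equivalent:

```
# modp1024 is DH group 2; with no explicit PRF, it follows the integrity algorithm
ike = aes128-sha1-modp1024
# the same proposal with the PRF spelled out
ike = aes128-sha1-prfsha1-modp1024
```

The HP side just makes the same choice explicit, which is why "prf sha1" was the missing piece for a match.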
The only oddity to report is that pfSense seems to hold on to old Phase 2 SAs for quite some time. It might have been because I had a mismatch on the Phase 2 SA lifetime. I've matched them up now, and so far so good. I'll post more results tomorrow, along with some reference config for the HP router in case anyone else runs into something like this.
-
The leftover SAs haven't recurred; I have a working config! Thanks @dusan for nudging me to approach this differently: IKEv2 was the answer! Now all I have to do is implement this in production.
There was a severe lack of info on HP routers, especially with GRE over IPsec involved. I believe most of the configs I found were in fact IPsec over GRE (they terminated the VPN on the tunnel interface, and I wasn't sure how to translate that into pfSense). I've attached pfSense screenshots of a working config, which I based on a YouTube video: https://www.youtube.com/watch?v=YPYFcya3Qls
What's missing from the pfSense screenshots are the necessary firewall rules to allow ESP, GRE, etc. in through the WAN interface, plus a floating rule to permit all IP traffic over the GRE tunnel.
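For anyone rebuilding those rules, they amount to something like the following pf rules (illustrative only: in pfSense you create these in the GUI, $wan_if and $gre_if are placeholder macros, and 1.1.1.2 stands in for the HP router's WAN address from the config below):

```
# WAN rules: allow IKE, NAT-T, ESP and GRE from the HP peer
pass in on $wan_if proto udp from 1.1.1.2 to any port { 500, 4500 }
pass in on $wan_if proto esp from 1.1.1.2
pass in on $wan_if proto gre from 1.1.1.2
# Floating-style rule: permit all IP traffic on the assigned GRE interface
pass on $gre_if all
```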
Here's the config from the HP router:
#
 version 5.20, Release 2513P45
#
 firewall enable
 firewall fragments-inspect
#
 domain default enable system
#
acl number 2000 name WAN_Block
 description WAN interface ACL
 rule permit source 1.1.1.1 0
acl number 3500 name test_gre
 rule permit ip
#
vlan 1
 description Network Management VLAN
#
vlan 3999
 description TEST VLAN
#
ikev2 proposal 1
 encryption aes-cbc-128
 integrity sha1
 prf sha1
 group 2
#
ikev2 policy test.ikev2
 proposal 1
#
ikev2 keyring test.ikev2
 peer test.ikev2
  address 1.1.1.1
  identity address 1.1.1.2
  pre-shared-key local cipher $c$3$Yuq221T+ag6huY0FNUH3Yh6Cj7RuD3vIINwRxg==
  pre-shared-key remote cipher $c$3$Yuq221T+ag6huY0FNUH3Yh6Cj7RuD3vIINwRxg==
#
ikev2 profile test.ikev2
 keyring test.ikev2
 identity local address 1.1.1.2
 match identity remote address 1.1.1.1
#
ipsec transform-set test
 encapsulation-mode transport
 transform esp
 esp authentication-algorithm sha1
 esp encryption-algorithm aes-cbc-256 aes-ctr-256
#
ipsec policy test.ikev2 1 isakmp
 security acl 3500
 ikev2 profile test.ikev2
 transform-set test
#
interface Ethernet0/0
 port link-mode route
 description WAN IP(s) for VPN - NAT/PT pool assigned
 firewall packet-filter 2000 inbound
 nat outbound
 ip address 1.1.1.2 255.255.255.0
 ipsec policy test.ikev2
 undo lldp enable
#
interface Vlan-interface1
 description Network Management VLAN
 ip address 172.16.4.1 255.255.255.0
 ip flow-ordering internal
#
interface Vlan-interface3999
 ip address 172.16.5.1 255.255.255.0
#
interface Ethernet0/1
 port link-mode bridge
 port link-type trunk
 port trunk permit vlan all
 loopback-detection enable
 loopback-detection control enable
#
interface Tunnel1
 mtu 1400
 ip address 10.10.10.2 255.255.255.252
 tcp mss 1340
 source Ethernet0/0
 destination 1.1.1.2
#
 ip route-static 0.0.0.0 0.0.0.0 1.1.1.254
 ip route-static 172.16.4.0 255.255.255.0 10.10.10.1
#
-
You normally have the GRE interface assigned, so you definitely need rules for that!