60+ tunnels. IKEv2 OK. All IKEv1 tunnels randomly drop traffic. Tunnel status stays OK

  • I hope someone can point us in the right direction. I think it all started when we upgraded boxes to 2.2.x and started using IKEv2 for IPsec on several connections.

    The IPsec connections we have using IKEv2 are all OK. All the other IPsec connections were OK for years, but they have all been dropping traffic for several weeks/months now.

    We first thought it could be hardware, but we upgraded from an ALIX APU to a C2558 system.

    The other side are all ALIX.2D13 boxes or ALIX.APU boxes. Running pfSense 2.0.1 to 2.1.5. The 2.2.x boxes running on IKEv2 are ok.

    The tunnel status is OK on both ends, but all traffic is dead. If I disconnect the tunnel on our side, it starts reconnecting and after a few seconds everything is working again.

    I tried to take a look at the log on our pfSense but with over 60 tunnels there is a lot of logging. I also don't understand why it worked for years and now it seems like it is happening because we are using IKEv1 & IKEv2 tunnels on our system. Or is it because of using Strongswan at our end and Racoon on the other end?

    A quick look in the log on the other end of some boxes shows 'IPsec-SA expired'. It looks like traffic is dead after that, but I'm not quite sure.

    I hope it's a simple setting that will fix the problem  :D.

  • On a side note, are these IKEv2 interconnects between pfSense boxen, or are they third-party clients?

  • The other side are all ALIX.2D13 boxes or ALIX.APU boxes. Running pfSense 2.0.1 to 2.1.5. The 2.2.x boxes running on IKEv2 are ok.

    Only using pfSense  ;)

  • If you're not on 2.2.6 already on the 2.2.x side, upgrade. Some of the issues fixed in the latest strongswan version could be the source of occasional IKEv1 rekeying issues.

  • All the updates are installed as soon as possible so the box is on 2.2.6

  • Ok good.

    So where you have one side logging that its SA is expired, and the other end shows it's still up, you almost certainly have a lifetime mismatch (assuming after the expired log there isn't a log that it renegotiated). If there are no logs on the racoon side beyond the expiry, then there isn't any traffic being initiated from that end across the tunnel, and the other side is still using an old SA that is valid as far as it's concerned.

    In that circumstance, sounds like you're probably not using DPD. Granted on some of the really old versions, at least 2.0.x and earlier, it likely doesn't work reliably. Should have DPD enabled on both sides on at least 2.1.x and newer versions, as that would even recover from a config mismatch, plus a variety of other possible reasons one end might drop an SA and the other end wouldn't.
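
A lifetime mismatch like the one suspected above can be checked mechanically once the settings from both ends are written down side by side. The following is only a minimal sketch: the key names (`p1_lifetime`, `p2_lifetime`, `dpd`) and the helper `setting_mismatches` are hypothetical, the values have to be copied by hand from each box, and the mismatching remote value is made up for illustration. The 28800 and 3600 second figures are the defaults reported elsewhere in this thread.

```python
def setting_mismatches(local, remote):
    """Return {setting: (local_value, remote_value)} for every setting that differs."""
    keys = sorted(set(local) | set(remote))
    return {
        key: (local.get(key), remote.get(key))
        for key in keys
        if local.get(key) != remote.get(key)
    }


# Hypothetical values typed in from the two ends of one tunnel.
local_side = {"p1_lifetime": 28800, "p2_lifetime": 3600, "dpd": True}
remote_side = {"p1_lifetime": 28800, "p2_lifetime": 28800, "dpd": True}

for name, (ours, theirs) in setting_mismatches(local_side, remote_side).items():
    print(f"{name}: local={ours} remote={theirs}")
```

An empty result only says the values that were typed in agree; it does not prove the running daemons negotiated the same thing.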

  • Just checked 1 of the failing tunnels. 2.2.6 <-> 2.1.5 says it's up on both sides. No traffic.

    2.1.5 log showing
    Dec 24 19:43:32 racoon: [xx]: INFO: IPsec-SA expired: ESP x.x.x.x[500]->x.x.x.x[500] spi=3435406966(0xccc42676)
    Dec 24 19:43:32 racoon: [xx]: INFO: IPsec-SA expired: ESP/Tunnel x.x.x.x[500]->x.x.x.x[500] spi=266953082(0xfe9617a)

    DPD is ON at all configurations. Even tried to switch it off to see what happens. No luck.
    Lifetime on phase 1 28800 seconds - default on all the boxes
    Lifetime on phase 2 3600 seconds - default on all the boxes
    I'm pretty sure the configurations match because I checked most of them and they worked for years until now. The mismatched configurations we had were corrected when 2.2 became available (a long time ago). Also, 1 or 2 mistakes are possible, but 30+ tunnels are failing.

    Eventually the tunnel will come back up, but not within a reasonable time. With over 30 tunnels failing, it's pretty annoying.
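
With 60+ tunnels, pulling the expiry messages out of the racoon log by hand is tedious. The sketch below extracts the protocol, endpoints and SPI from lines shaped like the two quoted in this post. The regular expression is modelled only on those two lines, so other racoon versions or log formats may need a different pattern (an assumption), and `expired_sas` is a hypothetical helper name.

```python
import re

# Pattern modelled on the two log lines quoted in the thread (assumption:
# other racoon builds may word or order the message differently).
EXPIRED_RE = re.compile(
    r"racoon: .*IPsec-SA expired: (?P<proto>\S+) "
    r"(?P<src>\S+)\[\d+\]->(?P<dst>\S+)\[\d+\] "
    r"spi=(?P<spi_dec>\d+)\(0x(?P<spi_hex>[0-9a-fA-F]+)\)"
)


def expired_sas(lines):
    """Return (proto, src, dst, spi_hex) for every 'IPsec-SA expired' line."""
    found = []
    for line in lines:
        match = EXPIRED_RE.search(line)
        if match:
            found.append(
                (match["proto"], match["src"], match["dst"], match["spi_hex"].lower())
            )
    return found


sample = [
    "Dec 24 19:43:32 racoon: [xx]: INFO: IPsec-SA expired: ESP x.x.x.x[500]->x.x.x.x[500] spi=3435406966(0xccc42676)",
    "Dec 24 19:43:32 racoon: [xx]: INFO: IPsec-SA expired: ESP/Tunnel x.x.x.x[500]->x.x.x.x[500] spi=266953082(0xfe9617a)",
]

for entry in expired_sas(sample):
    print(entry)
```

The hex SPI is the form shown on the SAD tab, so the extracted values can be compared directly against what each end currently lists.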

  • Ok if they both think they're up, that's different, sounded like from the description one side was expired and down. Do you have the "Prefer old SAs" option enabled on the 2.0.x/2.1.x sides? That's under System>Advanced. That checkbox no longer exists in 2.2.x versions, and was rarely if ever desirable on older versions but got enabled more often than it should have been. It should be disabled.

  • Checked about 10 boxes. Some of them had the option enabled. Unchecked it for now. The ones that were already unchecked are also failing, so I don't think it will help, but it's worth a try.

  • Also compare the SPIs between them, Status>IPsec, SAD tab. More than one pair? Should be one entry in each direction. Where there are multiple pairs, and one end was set to prefer old, that would have been a problem up until the old one that end was using expired. The ones where prefer old was enabled, that's certainly been a problem at some point.

    Comparing the SPIs when both show up but traffic isn't passing would be telling as well.
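
The SPI comparison described above can be done by eye for one tunnel, but a small script helps when many tunnels are involved. This is a rough sketch under stated assumptions: the SAD entries are copied by hand from Status>IPsec, SAD tab on each box into (source, destination, SPI) tuples, the addresses are documentation placeholders, the SPI `0a1b2c3d` is invented for the example, and `compare_sads` is a hypothetical helper.

```python
from collections import Counter


def compare_sads(local_sad, remote_sad):
    """Compare (src, dst, spi) entries copied from both ends of one tunnel."""
    local = {(src, dst, spi.lower()) for src, dst, spi in local_sad}
    remote = {(src, dst, spi.lower()) for src, dst, spi in remote_sad}

    def crowded(sad):
        # Directions that carry more than one SA, i.e. more than one pair.
        counts = Counter((src, dst) for src, dst, _ in sad)
        return sorted(pair for pair, n in counts.items() if n > 1)

    return {
        "only_local": sorted(local - remote),
        "only_remote": sorted(remote - local),
        "multiple_local": crowded(local),
        "multiple_remote": crowded(remote),
    }


# Placeholder addresses; the last remote SPI is made up for illustration.
local_entries = [
    ("192.0.2.1", "198.51.100.1", "ccc42676"),
    ("198.51.100.1", "192.0.2.1", "0fe9617a"),
]
remote_entries = [
    ("192.0.2.1", "198.51.100.1", "ccc42676"),
    ("198.51.100.1", "192.0.2.1", "0fe9617a"),
    ("198.51.100.1", "192.0.2.1", "0a1b2c3d"),
]

for key, value in compare_sads(local_entries, remote_entries).items():
    print(key, value)
```

An SPI present on only one end, or a direction with more than one entry, marks the tunnels worth looking at first.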

  • After upgrading 2 remote side boxes to the latest version of pfSense the tunnels are not dropping anymore. IPSec settings (on IKEv1) didn't change during/after upgrade.

    Before the upgrade, tunnels to these boxes were dropping multiple times a day. So I was wondering whether this could all be a problem between Racoon and Strongswan?

    Checked settings (again) on other (not updated) boxes: "dpd enabled", "prefer old SAs" disabled, and matching phase 1 and phase 2 settings. Tunnels still dropping... Didn't have time to dig into the log.

  • I have had a similar issue:

    Connections think they are up
    'Restarting IPSEC' doesn't seem to fix the links
    Stopping and starting often does

    I do see in the logs:
    rc.newipsecdns: IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.  (ALL ARE STATIC)

    A lot of these errors:
    no matching CHILD_SA config found  (Starting and stopping IPSEC without altering anything works)
    Generating QUICK_MODE request 2190518743 [ HASH SA No ID ID ]

    When trying to stop a single connection and restart from either end:
    Unable to delete SAD entry with SPI c57b432d: No such file or directory (2)

  • It looks like DPD is the problem. Disabled it on 15 tunnels (both sides). All 15 connections are stable for at least a day now.

    DPD is still active on the "Strongswan" boxes. Not having any problems with them.
