IPsec tunnels drop during P1 rekey (version 2.4.3-RELEASE-p1)



  • Hi all-
    I see some posts that are very similar, so my apologies if I should have tagged this one under one of the others. However, the similar posts were between pfSense and other firewall systems.
    Three of our IPsec VPN tunnels randomly drop during P1 rekey. We have 6 other tunnels that have been up and stable. We upgraded to 2.4.3_1 about a month ago. As we have been adding or changing circuits, we have been upgrading the encryption protocols.
    Last Thursday we brought up a new connection between 2 offices using 500Mbps CL fiber and set up the VPN tunnel. It was fine for 2 days. On the 3rd day we had a few pings drop (enough for VMware vCenter to notice the loss of heartbeat). By the morning of the 4th day there were a few more drops, some lasting 1 to 5 minutes, after which the tunnel renegotiated successfully and worked again. We went ahead and disabled this connection and went back to our Charter fiber connection using the less robust encryption.
    On Monday two other tunnels also dropped a time or two. Tuesday was worse, so we rebooted the firewall at headquarters on Tuesday night. It was good until Wednesday midday, then there were short drops here and there and longer ones into the evening, never longer than about 5 minutes.
    One of the remote firewalls was on 2.3.5, so I tried to upgrade it last night to the current version. It is now being shipped back to us because it is down and not coming up (like this one - https://forum.netgate.com/topic/121294/warning-php-startup-unable-to-load-dynamic-library-usr-local-lib-php-2013120).
    We can of course start downgrading the encryption to see if we get a more stable connection, but we are surprised at the issue. From some of the posts I have read, this has been an issue since 2.4.2.
    From the remote firewall -
    Sep 5 18:16:20 charon 12[ENC] <con3|153> generating IKE_AUTH request 1 [ IDi N(INIT_CONTACT) IDr AUTH N(ESP_TFC_PAD_N) SA TSi TSr N(MULT_AUTH) N(EAP_ONLY) N(MSG_ID_SYN_SUP) ]
    Sep 5 18:16:20 charon 12[NET] <con3|153> sending packet: from 50.252.72.25[500] to 65.152.72.148[500] (256 bytes)
    Sep 5 18:16:20 charon 12[NET] <con3|153> received packet: from 65.152.72.148[500] to 50.252.72.25[500] (224 bytes)
    Sep 5 18:16:20 charon 12[ENC] <con3|153> parsed IKE_AUTH response 1 [ IDr AUTH N(ESP_TFC_PAD_N) SA TSi TSr N(AUTH_LFT) ]
    Sep 5 18:16:20 charon 12[IKE] <con3|153> authentication of '65.152.72.148' with pre-shared key successful
    Sep 5 18:16:20 charon 12[IKE] <con3|153> IKE_SA con3[153] established between 50.252.72.25[50.252.72.25]...65.152.72.148[65.152.72.148]
    Sep 5 18:16:20 charon 12[IKE] <con3|153> scheduling reauthentication in 41375s
    Sep 5 18:16:20 charon 12[IKE] <con3|153> maximum IKE_SA lifetime 41915s
    Sep 5 18:16:20 charon 12[IKE] <con3|153> received ESP_TFC_PADDING_NOT_SUPPORTED, not using ESPv3 TFC padding
    Sep 5 18:16:20 charon 12[IKE] <con3|153> CHILD_SA con3{2694} established with SPIs c3c3e452_i c80f587e_o and TS 192.168.60.0/24|/0 === 192.168.9.0/24|/0
    Sep 5 18:16:20 charon 12[IKE] <con3|153> received AUTH_LIFETIME of 41829s, scheduling reauthentication in 41289s
    Sep 5 18:16:23 charon 12[IKE] <con4|152> sending DPD request
    Sep 5 18:16:23 charon 12[ENC] <con4|152> generating INFORMATIONAL request 15 [ ]
    Sep 5 18:16:23 charon 12[NET] <con4|152> sending packet: from 50.252.72.25[500] to 216.147.164.74[500] (80 bytes)
    Sep 5 18:16:23 charon 12[NET] <con4|152> received packet: from 216.147.164.74[500] to 50.252.72.25[500] (80 bytes)
    Sep 5 18:16:23 charon 12[ENC] <con4|152> parsed INFORMATIONAL response 15 [ ]
    Sep 5 18:16:27 charon 10[KNL] <con2|150> unable to query SAD entry with SPI caaf1079: No such file or directory (2)
    Sep 5 18:16:27 charon 10[KNL] <con2|150> unable to query SAD entry with SPI cb5ba20a: No such file or directory (2)
    Sep 5 18:16:27 charon 10[KNL] <con2|150> unable to query SAD entry with SPI c899e4c6: No such file or directory (2)
    Sep 5 18:16:32 charon 05[KNL] <con2|150> unable to query SAD entry with SPI caaf1079: No such file or directory (2)
    Sep 5 18:16:32 charon 05[KNL] <con2|150> unable to query SAD entry with SPI cb5ba20a: No such file or directory (2)
    Sep 5 18:16:32 charon 05[KNL] <con2|150> unable to query SAD entry with SPI c899e4c6: No such file or directory (2)
    Sep 5 18:16:33 charon 05[IKE] <con4|152> sending DPD request
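
    For anyone sifting through a longer capture, here is a minimal sketch of how lines like the ones above could be grouped by connection to spot the repeated "unable to query SAD entry" messages. The regular expression is inferred only from this excerpt, and the function names (parse_line, sad_failures) are made up for illustration; other pfSense or strongSwan versions may format lines differently.

```python
import re
from collections import Counter

# Layout inferred from the excerpt above (an assumption, not a documented format):
# timestamp, "charon", thread[SUBSYSTEM], <connection|IKE_SA id>, message.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+charon\s+"
    r"(?P<thread>\d+)\[(?P<subsys>[A-Z]+)\]\s+"
    r"<(?P<conn>[^|>]+)\|(?P<sa>\d+)>\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one charon log line as a dict, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def sad_failures(lines):
    """Count kernel-interface 'unable to query SAD entry' messages per connection."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["subsys"] == "KNL" and "unable to query SAD entry" in rec["msg"]:
            counts[rec["conn"]] += 1
    return counts


# Three lines taken from the excerpt above.
sample = [
    "Sep 5 18:16:23 charon 12[IKE] <con4|152> sending DPD request",
    "Sep 5 18:16:27 charon 10[KNL] <con2|150> unable to query SAD entry with SPI caaf1079: No such file or directory (2)",
    "Sep 5 18:16:32 charon 05[KNL] <con2|150> unable to query SAD entry with SPI caaf1079: No such file or directory (2)",
]

print(sad_failures(sample))  # Counter({'con2': 2})
```

    In the excerpt, all of these messages belong to con2, which is not the connection (con3) that had just finished negotiating.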

    I just looked at the logs from the HQ firewall and realized I did not grab the right ones. I will repost when it happens again.

    After the reboot we are having fewer issues, but still random drops. Hoping 2.4.4 will take care of it.

    Again, I am hoping not to have to keep re-tweaking settings to keep these working correctly.

    P1: IKEv2, Mutual PSK, AES 256 bits, SHA256, DH 14 (2048), lifetime 42400 seconds

    P2: ESP, AES256-GCM 128 bits, hash SHA256, lifetime 3600 seconds
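
    As a rough sanity check, the P1 lifetime configured here can be lined up against the timers in the log excerpt earlier in this post. The numbers below are copied from those log lines; the interpretation is an assumption. The 540 s gap is consistent with strongSwan's documented default margintime of 9 minutes, but whether pfSense leaves that default untouched is not confirmed anywhere in this thread.

```python
# Values copied from the log excerpt and the P1 setting in this post (seconds).
configured_lifetime = 42400   # P1 lifetime as configured
scheduled_reauth = 41375      # "scheduling reauthentication in 41375s"
max_ike_lifetime = 41915      # "maximum IKE_SA lifetime 41915s"
peer_auth_lifetime = 41829    # "received AUTH_LIFETIME of 41829s"
rescheduled_reauth = 41289    # "... scheduling reauthentication in 41289s"

# Gap between the scheduled reauthentication and the hard IKE_SA expiry.
hard_margin = max_ike_lifetime - scheduled_reauth

# Same gap, measured against the lifetime announced by the peer.
peer_margin = peer_auth_lifetime - rescheduled_reauth

# How far ahead of the configured lifetime the first reauthentication was scheduled.
early_by = configured_lifetime - scheduled_reauth

# Full 3600 s P2 lifetimes that fit before the first scheduled reauthentication.
p2_lifetimes_per_reauth = scheduled_reauth // 3600

print(hard_margin, peer_margin, early_by, p2_lifetimes_per_reauth)  # 540 540 1025 11
```

    So each side aims to reauthenticate about 9 minutes before its hard limit, and the hard limit itself sits somewhat below the configured 42400 seconds.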

    Our IPsec tunnels have been rock solid for years on pfSense. If anyone sees anything obvious, please let me know.

    On some of the other tunnels we are also seeing multiple P2 sessions under IPsec > Status, as I have seen in some of the other posts. I don't know whether that is a related issue or not. One of them showed 5 P2s (when there should be one), all showing traffic. That tunnel has not dropped, though.



  • Hey, just a little side note: I did a search and replace on the IPs in the logs so they read better, but the addresses shown are just random IPs I dropped in. Thanks


  • Netgate

    Yeah. Need the logs around a failure. That doesn't really tell us anything.

    It is probably not something in the encryption settings. They should either be agreed-upon and the tunnel established or no agreement and no tunnel.

    You should disable SHA256 on any GCM ESPs for performance reasons. AES-GCM is authenticated in the encryption algorithm itself. There is no need for a second hashing step.
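
    To illustrate the point, here is a small sketch of a check that flags a hash selected alongside an AEAD cipher in a P2 proposal. The helper names (is_aead, redundant_hashes) and the marker list are hypothetical and not part of pfSense or strongSwan; the list is illustrative rather than exhaustive.

```python
# Substrings that mark an AEAD (combined-mode) cipher name. Illustrative only.
AEAD_MARKERS = ("GCM", "CCM", "CHACHA20POLY1305")


def is_aead(cipher):
    """Return True if the cipher name looks like an AEAD algorithm."""
    name = cipher.upper().replace("-", "").replace("_", "")
    return any(marker in name for marker in AEAD_MARKERS)


def redundant_hashes(cipher, hashes):
    """Return the hash algorithms that add nothing because the cipher already authenticates."""
    return list(hashes) if is_aead(cipher) else []


# The P2 proposal from the first post: AES256-GCM with SHA256 also selected.
print(redundant_hashes("AES256-GCM", ["SHA256"]))  # ['SHA256']

# A non-AEAD cipher still needs its separate hash, so nothing is flagged.
print(redundant_hashes("AES", ["SHA256"]))  # []
```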

    Set the IPsec logs (VPN > IPsec, Advanced) to: IKE SA, IKE Child SA, and Configuration Backend to Diag. All others to Control.

    That will give you tunnel negotiation logs down to the protocols being agreed upon, etc.

    There are no known issues such as this for 2.4.4 to fix. I would concentrate on finding what the issue is in the tunnel configuration or the transport between the sites.



  • @derelict
    Thanks much for the feedback. It was very helpful in pointing us towards getting the logging right. Between the time we posted the original entry and your response, I noticed that the published pfSense guide now has more guidance on the recommended settings.
    We updated P1 to AES128-GCM 128 bits, SHA256 and DH 14 (2048), with a lifetime of 28800. Dead Peer Detection was not turned on either, so we enabled it with the default settings.
    P2 is now AES128-GCM 128 bits (instead of auto), with no hash algorithm selected, as you indicated. We enabled PFS with key group 14 (2048) and set the lifetime to 7200 seconds.

    We made the changes Saturday morning the 8th. Stable since. (knocking on wood)

    We also noticed that we had not enabled AES-NI crypto acceleration (System > Advanced > Miscellaneous > Cryptographic Hardware).

    It is enabled now as well.

    Will set up the advanced logging if we have more issues. The other postings I had read did not provide that information.


  • Netgate

    Awesome. Please post back if you see continuing issues.