My IPSEC service hangs
-
I was asked by Netgate technical support to upgrade to pfSense Plus 22.05.
The issue persisted. I will continue working with support.
-
@mr-ortizx I wish there was some way to help point them towards the root of the issue. We know it's due to the vici socket getting overwhelmed/locked up. When it happens, if you run
sockstat | grep -i vici
you can see that charon is overwhelmed. It started as about once a week for me and now it's every ~12 hours, it seems. Tunnels expire every 8 hours, so it doesn't appear to be directly related to the tunnels reconnecting. Opening a command prompt and running
pgrep -f charon
to get the PIDs, then
kill -9 [pid] [pid]
works, as long as you restart the IPsec service twice afterwards (not sure why it needs to be restarted twice). We know what the problem is, and I'd be willing to provide any logs that help, as I understand it is some sort of "rare" issue. If anyone from Netgate sees this, I'd be more than willing to assist in getting this resolved.
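Roughly, the whole manual recovery looks like this (just a sketch of the steps above; exact output and PIDs will obviously vary):
sockstat | grep -i vici
# when it's hung you'll see the charon.vici socket swamped with entries
pgrep -f charon
# note the PIDs it prints, then
kill -9 [pid] [pid]
# and restart the IPsec service twice (from the GUI or the console)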
-
@ablizno said in My IPSEC service hangs:
If anyone from netgate sees this, I'd be more than willing to assist in getting this resolved.
This is the redmine associated with this issue: https://redmine.pfsense.org/issues/13014
Any contribution you can provide should be posted there.
-
Since I updated to 2.7, my IPsec service has not crashed once and I have had no drops on my running VPNs.
-
@auroramus It's so strange how the update can fix the problem for some, but not others. What type of hardware are you running for pfSense and how many IPsec tunnels do you have? I think the issue has more to do with the latter, rather than the former, but the more information we have to work with, the better.
There are a few reasons I think it's related to the number of tunnels. Other than the crash only occurring (in my tests) on firewalls with a large number of tunnels, I also noticed various IPsec pages in the web UI started to be slow and/or time out when trying to add new IPsec gateways/tunnels and apply the changes (editing the settings of current ones and applying the changes works as it should). This started occurring right after adding my ~30th tunnel and still happens. It's probably a separate issue, but it's odd that its occurrence seems tied to the number of gateways/tunnels, just like the crashing issues (in my tests). It's like some services are overloaded when there are too many tunnels (which we know to be the case with charon).
I wouldn't mind trying to perform a test on 2.7, with a similar number of tunnels as you have, to see if there are still issues.
-
@rcoleman-netgate We can definitely post information over there, but do you have a way to get someone to look at it? The last comment from someone who works on pfSense (Brad Davis, a developer?) was 2.5 months ago when he said that they thought it was fixed, but needed more testing.
Since then, people have replied saying that it was not fixed and is still an issue. I'm sure that many of us wouldn't mind testing and providing whatever information is needed to get this fixed, but that can't happen until someone who works on pfSense is actively involved and tells us what they need from us.
-
@gassyantelope I've seen hints at a solution that is being tested but not a lot of specifics at this time. If there's anything that would be testable it will appear in the redmine notes.
-
I have the same problem; maybe this can help:
- pfSense 2.6 on VMware, with only ONE IPsec v2 VTI tunnel configured. After some weeks of (light) work, the tunnel stops. There is no way to resume it: it negotiates something but does not come up. Restarting the IPsec service does not help; the whole pfSense box has to be restarted.
- pfSense 2.5 on VMware with A LOT OF v2/v1 VTI tunnels configured. It has worked well without any problem for a long time.
- pfSense 2.6 on Hyper-V with three tunnels configured. It seems to be working (23 days of uptime currently).
When the problem occurs, the IPsec logs report only the error "[CFG] trap not found, unable to acquire reqid 5002", but I'm not sure whether that is "normal" with VTI tunnels.
-
I just uploaded a kernel trace file to the redmine page. The trace shows what happens at the point it crashes. Hopefully this will help get this fixed.
-
Has there been any update or possible workaround for this? I've been researching this for days and have tried everything besides changing platforms (which I'm close to doing at this point). To be frank, I've lost a lot of faith in the stability of pfSense after so many years of no issues.
The issue seems to be completely random. Sometimes it will skip a few days and other times it happens multiple times a day, but mostly at 3am when our tunnels are being used the most, leading to multiple early-hour calls for me over the past 2 weeks. I've confirmed it's not hardware related. I swapped the affected machine with new hardware, did a fresh install of the latest 2.7 dev build (it was on 2.6, but I saw someone on another post mention that upgrading to 2.7 fixed it for them) and restored most settings manually from scratch. The only things I restored from a backup config were the few hundred IPsec Phase 2 entries we have and our certs/users, which would take most of a day to recreate manually through the GUI. The hardware is more than enough for the amount of traffic we see (less than 50 Mbps). I have all hardware acceleration and offloading off for troubleshooting, even AES-NI. We have tons of tunnels, but the total amount of traffic is not significant; probably 1-5 gigabytes of data is sent per night over the course of 12 hours.
Today it happened on our only other site's firewall, which I hadn't seen it happen on before and which I haven't made any real changes to while troubleshooting the original problem firewall. The only commonality between the two is that they share an IPsec tunnel with about 50 Phase 2 entries between them, and it only really started happening after that tunnel was put in place. I noticed the issue happen about once every 3 weeks at first. I figured it was some sort of hardware acceleration problem and disabled offloading, since I had recently enabled it when creating that new tunnel; another 3-4 weeks would go by without issue, then it would happen again. Now it happens multiple times a day, almost every day.
Today, I've tried lowering our number of tunnels by disabling "Split Connections" on both ends, which reduces the number of connections by about 50. I prefer to have each connection split normally just to be able to better track individual tunnels on the status page, but we'll see if it makes a difference here. I'm not a programmer, but my gut tells me that having hundreds of IPsec Phase 2 entries isn't the norm and might be linked to the issue, based on what others have said. I've also lowered the verbosity of IPsec logging to Audit. The IPsec logs have never shown anything to indicate an issue and just stop generating when the issue occurs anyway, so if the issue is log related (as I've seen others say it happens when logs need to be rotated), I should see a decrease in frequency.
The only other pattern I've noticed is that it's happened multiple days in a row at 3am. There's a cron job in pfSense by default that checks the system, called "rc.periodic daily" (to my knowledge; I couldn't find any specifics on what it does), which runs every day at 3am, so it could be linked to that, but it's not definitive since otherwise the issue wouldn't be happening during the day. There are also weekly and monthly ones. It's worth noting that today the issue happened on both my firewalls around 5am, which is when the "rc.periodic monthly" job was scheduled to run, since this was the 1st of the month. Again, it could be a coincidence, but I'm going to try moving the daily cron job to a specific time (or disabling it completely) and see if it starts happening at that time instead of the small hours when our tunnels become mission critical.
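If anyone wants to check when those jobs fire on their own box, the schedule should be visible in the system crontab (a quick check, assuming the periodic entries live in /etc/crontab as on a stock install):
grep periodic /etc/crontab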
Any help would be greatly appreciated.
-
@slimjim2321 Welcome to the club, how nice of you to join us :)
There's been no update on the Redmine post since I posted the kernel trace log a month ago. I don't have any hope that it's going to be fixed any time soon. It appears that Netgate doesn't care about the problem at all, as this has been going on for months (maybe years) and has just been ignored. Many of us have provided the information they asked for, but have heard nothing since.
It's a shame that a fix for an issue that breaks core functionality is such a low priority for them. They should really start advertising that pfSense doesn't support working IPsec VPNs at this point, because that's where we're at. It's very frustrating, that's for sure. We've moved all our clients away from Netgate/pfSense to another manufacturer at this point. I still have a single pfSense box that I use for testing, in hopes that this will be fixed in 5 years.
-
@gassyantelope
Hi, we are at the same point. We cannot use a system where IPsec is the main feature and have it be unstable. Which platform did you choose?
-
@gassyantelope I've been having this issue as well and have been following this forum post and the post in the Redmine. Clearly nothing is being done, so I decided to take it into my own hands and wrote a script that I run every minute with a cron job. I feel like it's pretty straightforward: I'm looking at the length of the charon.vici queue and storing it in queueLength. Then, if queueLength is greater than 0, I kill the charon processes and restart IPsec twice. I've done a few different iterations of the script, but this one seems to work perfectly. Hope this helps!
#!/bin/sh
# Length of the charon.vici listen queue (the qlen column from netstat -Lan)
queueLength=$(netstat -Lan | grep charon.vici | cut -c 7)
if [ $((queueLength)) -gt 0 ]; then
    /usr/bin/killall -9 charon
    /usr/local/sbin/pfSsh.php playback restartipsec
    sleep 10
    /usr/local/sbin/pfSsh.php playback restartipsec
fi
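To run it every minute I just added a cron entry for it. Assuming the script is saved somewhere like /root/vici_watchdog.sh (that name/path is only an example) and marked executable, the entry would look something like the line below; on pfSense the Cron package is probably the cleanest way to add it so it persists:
* * * * * root /bin/sh /root/vici_watchdog.sh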
-
@david11717 Hi, sorry for not replying sooner. Since I upgraded to 2.7 I've not had any issues; my IPsec has been holding strong.
-
@auroramus Interesting. I'm on the newest 2.7 dev and I'm still having the issue.
-
@david11717 Upgrading to 2.7 didn't fix my issue either. Thank you so much for the script.
Before the holiday weekend I greatly increased the log size allowed before rotation (since I have hundreds of free GBs on that device), disabled IPsec logging entirely, and disabled split connections on one of my bigger tunnels. One of these three changes seems to have kept it from happening over the weekend. I have a feeling it's being triggered by log rotation.
-
@david11717 I don't know if this applies to you, but I discovered an IPsec issue today when deploying a bunch of SG-2100 boxes (ARM64 CPU boxes).
If I use AES-GCM for encryption (both 128 and 256 bit), as guided by Netgate, the boxes will stall/become unresponsive after a while if there is more than one Phase 2 tunnel active in the tunnel. Boxes with only one Phase 2 tunnel do not seem to suffer the issue. Disabling SafeXcel (HW acceleration) does not mitigate the issue.
But changing the cipher on the tunnel to AES256 (not GCM; I believe it is really AES256-CBC) resolves the issue. I still have a lot of testing to do, but it's quite evident that the change of cipher resolves the issue.
-
@keyser 99% of my IPSec VPNs use AES256. I have one that uses AES256-GCM and I changed it to AES256 in the testing process (a month or two ago). The issue persisted. That being said, the devices I'm using are x86_64 and not ARM based.
-
-
@slimjim2321 said in My IPSEC service hangs:
@david11717 Upgrading to 2.7 didn't fix my issue either. Thank you so much for the script.
Before the holiday weekend I greatly increased the log size allowed before rotation (since I have hundreds of free GBs on that device), disabled IPsec logging entirely, and disabled split connections on one of my bigger tunnels. One of these three changes seems to have kept it from happening over the weekend. I have a feeling it's being triggered by log rotation.
Did any of these steps completely resolve your problem since then? My script is still working great, but I'd rather fix the issue passively than have a cron job running every minute.