My IPSEC service hangs
-
My IPsec service seems to hang every so often, requiring a reboot before the page will load again.
Also, my existing tunnels seem to disconnect and struggle to reconnect.
-
-
@auroramus I've seen this problem when the IPSEC log file reaches maximum size and the IPSEC service just hangs. I'm not sure whether the logs rotate successfully or not, but rebooting is the only way I've found to get the service back. I've curtailed IPSEC logging to avoid this issue.
-
If it were a one-off or only happened every now and then I wouldn't mind, but it seems like I have to reboot it every day.
-
I wonder if this is related to an issue I am having. When you are having the issue, if you run
netstat -Lan
what is the output? In my case it seems to be a charon.vici overflow issue, as the result shows "unix 4/0/3 /var/run/charon.vici". System logs also show an influx of sonewconn Listen queue overflow entries.
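A quick way to check both of those at once, assuming the stock FreeBSD tools and the default socket path, is:
netstat -Lan | grep vici     # listen queue for the charon.vici socket; a backed-up queue looks like "unix 4/0/3 /var/run/charon.vici"
dmesg | grep sonewconn       # kernel "Listen queue overflow" messages, if any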
I believe the issue is related to this:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262743
-
I'm also having the same issue. All IPsec VPNs will randomly go down and won't come back up unless I reboot the firewall. Sometimes they'll stay up for a day, but sometimes it happens multiple times per day (3 times today).
I've attempted manually reconnecting the VPNs via the IPsec status page and restarting the IPsec service, but neither of those ends up doing anything and I'm forced to reboot the firewalls. When it happens, my system logs also show
sonewconn: pcb 0xfffff80095042200: Listen queue overflow
entries. The process associated with that hex value always points to charon.
The problem only appears to be happening on firewalls that have a lot of tunnels. Firewalls that only have a couple of IPsec VPNs seem to be fine (thus far). All of the firewalls are running 2.6.0, but I've seen reports of people having the problem on some earlier versions as well.
I'd consider this a pretty serious issue, especially since IPsec VPNs can be considered a core component of firewalls and are heavily utilized throughout the industry. I'm almost at the point where I'm going to have to replace the firewalls with something else, since I can't keep having to reboot the firewalls throughout the day.
Hopefully this can be looked into and fixed. I'd be glad to provide any logs that may be needed.
-
@gassyantelope I've got a case open with Netgate support on this issue. They asked for a status output with IPSEC logging enabled when the appliance faults again.
-
@ablizno I'll check that when I submit my status output to Netgate. The odd thing is it only recently started, perhaps it is overall load related.
-
I have noticed it only occurs once the maximum number of log entries is hit, so if it's set to 2000 and the logs reach that, IPsec crashes.
-
@auroramus Out of curiosity, where is the log setting you changed? I'm wondering if it is different than what I changed. I tried increasing the log space from 500KB to 20MB and still had the VPNs crash today within a few hours. So unless there is a different setting I missed, I'm thinking it may not be related.
-
Hi, so I go to Status > IPsec, and there you can set the log entries. I have the GUI log entries set to 5 and left the retention blank.
-
@glreed735 Hello, I am experiencing this exact same issue, see my log settings.
Have you received a resolution from support?
-
I do not have a support package in place.
-
@auroramus - Not yet, the first pass through the logs highlighted some issues, but they wanted a larger sample of data to work from pending the next failure.
-
I am no coding expert, but it seems like once the logs reach maximum capacity, rather than overwriting the logs it crashes the IPsec service.
That's what it looks like to me.
No matter what setting I change it to, whether it's a low log count or a high one, it maxes out and then kills the service, and unless you restart it, it will not work.
-
Once I clear the logs I manage to get past the screen I mentioned above (collecting IPsec status info) and can see my connections, but when you hit Connect it attempts and then stops and doesn't do anything. The only way to get them connected again is a restart.
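For what it's worth, the log can also be emptied from a shell (Diagnostics > Command Prompt) instead of the GUI; this assumes the plain-text log location used by recent pfSense releases:
: > /var/log/ipsec.log     # truncate the IPsec log in place (assumed default path)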
-
I also found this post:
This might be entirely normal behaviour; IPSec and many other forms of VPN tunnels connect only when there is traffic to transmit.
Take for example you have an 8 hour lifetime on the IKE (Phase 1) tunnel. The tunnel will connect upon some traffic being transmitted down the tunnel and will always terminate as soon as 8 hours has passed since it came up. Only if packets are still trying to be sent down the tunnel will the tunnel come back up again and continue transmitting traffic for another 8 hours. The down and up happens very quickly and packets may not even be lost. This is for security reasons to refresh the security associations.
Some people choose to run a ping or similar constantly down the tunnels so it always looks to be connected except for the brief milliseconds to reassociate. I find this to be generally unnecessary.
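For reference, a keepalive like that is just a periodic ping sourced from inside the tunnel; a minimal sketch, with purely hypothetical tunnel addresses, would be:
#!/bin/sh
# Crude keepalive: ping the far side of the tunnel once a minute so there is
# always interesting traffic. 10.0.1.1 / 10.0.2.1 are placeholder addresses.
while true; do
    ping -c 1 -S 10.0.1.1 10.0.2.1 > /dev/null 2>&1
    sleep 60
done
-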
@auroramus were you ever able to run
netstat -Lan
and provide the output when all your tunnels are down?
-
@auroramus The behavior occurring is definitely not normal. I understand what that post is saying and completely agree that is normal IPsec behavior. The issue here is completely different though. The tunnels will never come back up once they all go down. I can ping, send data another way, etc., and they won't ever come back up until a restart is performed.
I've had multiple cases where I had active connections over the tunnel (sending data the whole time) and then the issue occurs and all tunnels go down. This has occurred way before the default 8 hour life span (sometimes within an hour or two).
-
@gassyantelope Yes, 100% the behaviour is wrong, as it seems to crash the service, and this shouldn't happen.
-
I just paid for Enterprise support and I was told the following:
"Hello,
Unfortunately, this is a somewhat rare issue that has not been solved yet. It is much less prevalent in pfSense CE 2.5.2, 2.7, and pfSense Plus 22.05. There aren't any workarounds currently, so rolling back or upgrading are the only steps you can currently take to mitigate the issue. You may track the issue here:
https://redmine.pfsense.org/issues/13014
"
I hope this helps you guys. Even though the redmine says all tunnels continue to operate normally, Netgate support mentioned that they also see instances where all tunnels will drop, which is the case for all of us.
-
@mr-ortizx really appreciate you letting us know.
-
I have updated to 2.7. I will keep you guys updated.
-
@mr-ortizx Thanks man! At least we finally got an official response from them. I'm gonna do what @auroramus did and update to 2.7 as well to see if it helps at all. It can't hurt at this point.
-
@gassyantelope @auroramus Please let me know how it went after upgrading to version 2.7.
-
Hi guys,
So far so good with 2.7. I have not had a single drop in the tunnels for days now, so yes, give it a go and let me know.
-
I have been running 2.7 since the 30th of June and I have not had a single blip.
Let me know how you guys get on.
-
@auroramus I updated to 2.7 yesterday. It's only been 24 hours, but I haven't had the issue yet. That's already an improvement for me, seeing as I had to reboot the firewall once or twice a day when on 2.6. I'll provide another update in a few days. I'm crossing my fingers.
-
@gassyantelope I spoke too soon. I just had the issue occur on 2.7.
Disclosure: Potentially justifiable rant below :)
Investigating and fixing this issue really needs to be a higher priority at this point. There are reports about the issue from 5+ years ago, yet it still exists. The latest redmine issue report (from 3 months ago) hasn't had much traction, as far as someone actually investigating the problem. It just keeps having its target version pushed back over and over.
I get that there are other issues that need to be fixed as well, but this is an issue that, essentially, makes pfSense a nonviable option to use as a firewall in a production environment. Netgate states it to be a "somewhat rare" issue, yet there are many threads and redmine reports, spanning years, that show that this issue is more common than they make it out to be.
My company has primarily used WatchGuard firewalls for years, which are decent enough, but their capabilities are lacking in various areas (I'd prefer to move away from them, personally). We started installing some Netgate/pfSense devices for some "smaller" networks that only have 5-10 IPsec tunnels, and found pfSense to run stably and have far superior capabilities. We were ready to purchase ~30 Netgate firewalls to replace all of the WatchGuards, but wanted to test pfSense on a "larger" network (50+ IPsec tunnels) to make sure there were no issues before we pulled the trigger. That large network test led us to where we are today, exposing this issue that completely breaks IPsec VPNs constantly.
As much as I like pfSense (which I'll continue to use for my home lab) and really want to move away from WatchGuard and transition to Netgate/pfSense firewalls, that can't be done for as long as this issue continues to exist. A firewall with lackluster capabilities, but fully working IPsec VPNs, is better than a very capable firewall that has to be rebooted 1-2 times per day to get IPsec VPNs, which I'd consider a core feature of all firewalls, to stay up and work properly.
I'll be putting the WatchGuards back in place for now. I'll continue to monitor this thread and the redmine issue page for updates. I'm still willing to swap the pfSense firewall back in to assist with the testing of possible solutions, as I'd like to see this problem fixed some day. I just can't have pfSense be our day to day, primary, firewall in its current state.
Rant over.
-
@mr-ortizx Updated to latest 2.7 dev build, issue still occurs with the same frequency as before.
-
I was asked by Netgate technical support to upgrade to pfSense Plus 22.05.
The issue persisted. I will continue working with support.
-
@mr-ortizx I wish there was some way to help point them towards the root of the issue. We know it's due to the vici socket getting overwhelmed/locked up. When it happens, if you run
sockstat | grep -i vici
you can see charon is overwhelmed. It started as about once a week for me and now it seems to be every ~12 hours. Tunnels expire every 8 hours, so it doesn't appear to be directly related to the tunnels reconnecting. Opening Command Prompt and running
pgrep -f charon
to get the PIDs, then
kill -9 [pid] [pid]
works, as long as you restart the IPSEC service twice (not sure why it needs to be restarted twice); that seems to fix it. We know what the problem is, and I'd be willing to provide any logs that help, as I understand it is some sort of "rare" issue.
If anyone from netgate sees this, I'd be more than willing to assist in getting this resolved.
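To put that workaround in one place, this is roughly the sequence (the PIDs below are placeholders for whatever pgrep returns):
sockstat | grep -i vici    # confirm the backlog is on the charon.vici socket
pgrep -f charon            # list the charon PIDs
kill -9 12345 12346        # replace with the PIDs pgrep printed
# then restart the IPsec service (e.g. from Status > Services); in my case it takes two restarts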
-
@ablizno said in My IPSEC service hangs:
If anyone from netgate sees this, I'd be more than willing to assist in getting this resolved.
This is the redmine associated with this issue: https://redmine.pfsense.org/issues/13014
Any contribution you can provide should go in there.
-
Since I updated to 2.7 my IPsec service has not crashed once, and I have had no drops on my running VPNs.
-
@auroramus It's so strange how the update can fix the problem for some, but not others. What type of hardware are you running for pfSense and how many IPsec tunnels do you have? I think the issue has more to do with the latter, rather than the former, but the more information we have to work with, the better.
There are a few reasons I think it's related to the number of tunnels. Other than the crash only occurring (in my tests) on firewalls with a large number of tunnels, I also noticed various IPsec pages in the web UI started to be slow and/or time out when trying to add new IPsec gateways/tunnels and apply the changes (editing the settings of current ones and applying the changes works as it should). This started occurring right after adding my ~30th tunnel and still happens. It's probably a separate issue, but it's odd that its occurrence seems tied to the number of gateways/tunnels, just like the crashing issue (in my tests). It's like some services are overloaded when there are too many tunnels (which we know to be the case with charon).
I wouldn't mind trying to perform a test on 2.7, with a similar number of tunnels as you have, to see if there are still issues.
-
@rcoleman-netgate We can definitely post information over there, but do you have a way to get someone to look at it? The last comment from someone who works on pfSense (Brad Davis, a developer?) was 2.5 months ago when he said that they thought it was fixed, but needed more testing.
Since then, people have replied saying that it was not fixed and is still an issue. I'm sure that many of us wouldn't mind testing and providing whatever information is needed to get this fixed, but that can't happen until someone who works on pfSense is actively involved and tells us what they need from us.
-
@gassyantelope I've seen hints at a solution that is being tested but not a lot of specifics at this time. If there's anything that would be testable it will appear in the redmine notes.
-
I have the same problem; maybe this can help:
- pfSense 2.6 on VMware, with only ONE IPsec v2 VTI tunnel configured. After some weeks of (light) use the tunnel stops. There is no way to resume it: it negotiates something but it does not come up. Restarting the IPsec service does not help; the whole pfSense has to be restarted.
- pfSense 2.5 on VMware with A LOT OF v2/v1 VTI tunnels configured. It has been working well without any problem for a long time.
- pfSense 2.6 on Hyper-V with three tunnels configured. It seems to be working (23 days of uptime currently).
When the problem occurs the IPsec logs report only the error "[CFG] trap not found, unable to acquire reqid 5002", but I'm not sure whether that is "normal" with VTI tunnels.
-
I just uploaded a kernel trace file to the redmine page. The trace shows what happens at the point it crashes. Hopefully this will help get this fixed.
-
Has there been any update or possible workaround on this? I've been researching this for days and tried everything besides changing platforms (which I'm close to doing at this point). To be frank, I've lost a lot of faith in the stability of pfSense after so many years of no issues.
The issue seems to be completely random. Sometimes it will skip a few days and other times it happens multiple times a day, but mostly at 3am when our tunnels are being used the most, leading to multiple early-hours calls for me over the past 2 weeks.
I've confirmed it's not hardware related. I swapped the affected machine with new hardware, did a fresh install of the latest 2.7 dev build (it was on 2.6, but I saw on another post someone mention that upgrading to 2.7 fixed it for them), and restored most settings manually from scratch. The only things I restored from a backup config were the few hundred IPsec Phase 2 entries we have and our certs/users, which would take most of a day to recreate manually through the GUI. The hardware is more than enough for the amount of traffic we see (less than 50 Mbps). I have all hardware acceleration and offloading off for troubleshooting, even AES-NI. We have tons of tunnels, but the total amount of traffic is not significant; probably 1-5 gigabytes of data is sent per night over the course of 12 hours.
Today it happened on our only other site's firewall, which I haven't seen it happen on up until this point, nor have I made any real changes to it while troubleshooting the original problem firewall. The only commonality between the two is that they share an IPsec tunnel with about 50 Phase 2 entries between each other, and the issue only really started happening after that tunnel was put in place. I noticed the issue happen about once every 3 weeks at first. I figured it was some sort of hardware acceleration problem and disabled offloading, since I had recently enabled it when creating that new tunnel; another 3-4 weeks would go by without issue, and then it would happen again. Now it happens multiple times a day, almost every day.
Today, I've tried lowering our number of tunnels by disabling "Split Connections" on both ends, which will reduce the number of connections by about 50. I prefer to have each connection split, normally, just to be able to better track individual tunnels on the status page, but we'll see if it makes a difference here. I'm not a programmer, but my gut tells me that having hundreds of IPsec Phase 2 entries isn't the norm and might be linked to the issue, based on what others have said. I've also lowered the verbosity of IPsec logging to Audit. The IPsec logs have never shown anything to indicate an issue and just stop generating when the issue occurs anyway, so if the issue is log related (as I've seen others say it happens when logs need to be rotated), I should see a decrease in frequency.
The only other pattern I've noticed is that it's happened multiple days in a row at 3am. There's a cron job in pfSense by default, called "rc.periodic daily", that checks the system (to my knowledge; I couldn't find any specifics on what it does) every day at 3am, so it could be linked to that, but it's not definitive, since otherwise the issue wouldn't be happening during the day. There's also a weekly and a monthly one. It's worth noting that today the issue happened on both my firewalls around 5am, which is when the "rc.periodic monthly" job was scheduled to run, since this was the 1st of the month. Again, it could be a coincidence, but I'm going to try moving the daily cron job to a different time (or disabling it completely) and see if it starts happening at that time instead of in the small hours when our tunnels become mission critical.
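If it helps anyone line up the timing, the periodic schedule can be checked from a shell; on stock FreeBSD it lives in /etc/crontab (pfSense manages its own cron entries, so the exact location and contents may differ, and changes there may not stick):
grep periodic /etc/crontab
# stock FreeBSD defaults, for comparison:
# 1  3  *  *  *  root  periodic daily
# 15 4  *  *  6  root  periodic weekly
# 30 5  1  *  *  root  periodic monthly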
Any help would be greatly appreciated.