My IPSEC service hangs
-
@auroramus - Not yet, the first pass through the logs highlighted some issues, but they wanted a larger sample of data to work from pending the next failure.
-
I am no coding expert, but it seems like once the logs reach maximum capacity, rather than overwriting them it crashes the ipsec service.
That's what it looks like to me.
No matter what setting I change it to, whether it is a low log count or a high one, it maxes out and then kills the service, and unless you restart it will not work.
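One rough way to test that theory is to sample the IPsec log size and charon's liveness over time, then check whether the hang lines up with the log hitting its cap. A minimal sketch, assuming the stock pfSense log path (/var/log/ipsec.log):

```shell
# sample_ipsec_state: print one line with the IPsec log size and whether
# charon is running. Run it from cron (or a loop) and check what the last
# size was before a hang. /var/log/ipsec.log is the stock pfSense path.
sample_ipsec_state() {
    log=${1:-/var/log/ipsec.log}
    # wc -c is portable between FreeBSD and GNU; tr strips BSD's padding
    size=$(wc -c < "$log" 2>/dev/null | tr -d ' ')
    if pgrep -x charon >/dev/null 2>&1; then state=up; else state=down; fi
    echo "$(date '+%F %T') log=${size:-missing} charon=$state"
}
```

Correlating the last recorded size with your configured log limit would support (or rule out) the max-capacity theory.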
-
Once I clear the logs I manage to get past the screen I mentioned above (collecting IPsec status info) and see my connection, but when you hit connect it attempts and then stops, doesn't do anything. The only way to get them connected again is a restart.
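For what it's worth, if you're clearing the logs from the shell, truncating the file in place is safer than deleting it, because the daemon writing to it keeps its open file handle. A tiny helper (the ipsec.log path is assumed to be the stock location):

```shell
# truncate_log: empty a log file without removing it, so the daemon that
# has it open keeps writing to the same (now empty) file.
truncate_log() {
    : > "$1"
}
# e.g. truncate_log /var/log/ipsec.log
```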
-
I also found this post:
This might be entirely normal behaviour; IPSec and many other forms of VPN tunnels connect only when there is traffic to transmit.
Take for example you have an 8 hour lifetime on the IKE (Phase 1) tunnel. The tunnel will connect upon some traffic being transmitted down the tunnel and will always terminate as soon as 8 hours has passed since it came up. Only if packets are still trying to be sent down the tunnel will the tunnel come back up again and continue transmitting traffic for another 8 hours. The down and up happens very quickly and packets may not even be lost. This is for security reasons to refresh the security associations.
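Since the rekey only happens when traffic is flowing, some setups generate that traffic artificially. A minimal sketch of such a keepalive, using placeholder far-side addresses (substitute one reachable host per remote subnet) and meant to be run every minute from cron:

```shell
# keepalive: one ping per far-side host, just to generate tunnel traffic.
# -c 1: single probe; -W: reply timeout (seconds on Linux, ms on FreeBSD).
keepalive() {
    for host in "$@"; do
        if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
            echo "ok $host"
        else
            echo "down $host"
        fi
    done
}
# e.g. from cron: keepalive 10.0.1.1 10.0.2.1   (placeholder addresses)
```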
Some people choose to run a ping or similar constantly down the tunnels so it always looks to be connected except for the brief milliseconds to reassociate. I find this to be generally unnecessary.
-
@auroramus were you ever able to run
netstat -Lan
and provide the output when all your tunnels are down?
-
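For anyone comparing their own netstat -Lan output: on pfSense the vici control socket is a unix socket (typically /var/run/charon.vici), and the column to watch is the listen queue (qlen/incqlen/maxqlen). A hedged helper that flags a saturated backlog, assuming that FreeBSD column layout:

```shell
# check_vici_queue: read `netstat -Lan` output on stdin and report whether
# the charon.vici listen queue is full. Assumes the FreeBSD column layout
# "unix  qlen/incqlen/maxqlen  <socket path>".
check_vici_queue() {
    awk '/charon\.vici/ {
        split($2, q, "/")
        if (q[1] + 0 >= q[3] + 0 && q[3] + 0 > 0)
            print "vici backlog FULL (" $2 ") - charon likely wedged"
        else
            print "vici backlog ok (" $2 ")"
    }'
}
# live use: netstat -Lan | check_vici_queue
```

A full backlog would be consistent with the theory later in this thread that charon stops servicing vici clients.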
@auroramus The behavior occurring is definitely not normal. I understand what that post is saying and completely agree that is normal IPsec behavior. The issue here is completely different though. The tunnels will never come back up once they all go down. I can ping, send data another way, etc., and they won't ever come back up until a restart is performed.
I've had multiple cases where I had active connections over the tunnel (sending data the whole time) and then the issue occurs and all tunnels go down. This has occurred way before the default 8 hour life span (sometimes within an hour or two).
-
@gassyantelope Yes, 100% the behaviour is wrong, as it seems to crash the service, and this shouldn't happen.
-
I just paid for Enterprise support and I was told the following:
"Hello,
Unfortunately, this is a somewhat rare issue that has not been solved yet. It is much less prevalent in pfSense CE 2.5.2, 2.7, and pfSense Plus 22.05. There aren't any workarounds currently, so rolling back or upgrading are the only steps you can currently take to mitigate the issue. You may track the issue here:
https://redmine.pfsense.org/issues/13014
"
I hope this helps you guys. Even though the redmine says all tunnels continue to operate normally, Netgate support mentioned that they also see instances where all tunnels will drop, which is the case for all of us.
-
@mr-ortizx really appreciate you letting us know.
-
I have updated to 2.7. I will keep you guys updated.
-
@mr-ortizx Thanks man! At least we finally got an official response from them. I'm gonna do what @auroramus did and update to 2.7 as well to see if it helps at all. It can't hurt at this point.
-
@gassyantelope @auroramus Please let me know how it went after upgrading to version 2.7.
-
Hi guys,
So far so good with 2.7. I have not had a single drop in the tunnels for days now, so yeah, give it a go and let me know.
-
I have been running 2.7 since 30th June and I have not had a single blip.
Let me know how you guys get on.
-
@auroramus I updated to 2.7 yesterday. It's only been 24 hours, but I haven't had the issue yet. That's already an improvement for me, seeing as I had to reboot the firewall once or twice a day when on 2.6. I'll provide another update in a few days. I'm crossing my fingers.
-
@gassyantelope I spoke too soon. I just had the issue occur on 2.7.
Disclosure: Potentially justifiable rant below :)
Investigating and fixing this issue really needs to be a higher priority at this point. There are reports about the issue from 5+ years ago, yet it still exists. The latest redmine issue report (from 3 months ago) hasn't had much traction, as far as someone actually investigating the problem. It just keeps having its target version pushed back over and over.
I get that there are other issues that need to be fixed as well, but this is an issue that, essentially, makes pfSense a nonviable option to use as a firewall in a production environment. Netgate states it to be a "somewhat rare" issue, yet there are many threads and redmine reports, spanning years, that show that this issue is more common than they make it out to be.
My company has primarily used WatchGuard firewalls for years, which are decent enough, but their capabilities are lacking in various areas (I'd prefer to move away from them, personally). We started installing some Netgate/pfSense devices for some "smaller" networks that only have 5-10 IPsec tunnels, and found pfSense to run stably and have far superior capabilities. We were ready to purchase ~30 Netgate firewalls to replace all of the WatchGuards, but wanted to test pfSense on a "larger" network (50+ IPsec tunnels) to make sure there were no issues before we pulled the trigger. That large network test led us to where we are today, exposing this issue that completely breaks IPsec VPNs constantly.
As much as I like pfSense (which I'll continue to use for my home lab) and really want to move away from WatchGuard and transition to Netgate/pfSense firewalls, that can't be done for as long as this issue continues to exist. A firewall with lackluster capabilities, but fully working IPsec VPNs, is better than a very capable firewall that has to be rebooted 1-2 times per day to get IPsec VPNs, which I'd consider a core feature of all firewalls, to stay up and work properly.
I'll be putting the WatchGuards back in place for now. I'll continue to monitor this thread and the redmine issue page for updates. I'm still willing to swap the pfSense firewall back in to assist with the testing of possible solutions, as I'd like to see this problem fixed some day. I just can't have pfSense be our day to day, primary, firewall in its current state.
Rant over.
-
@mr-ortizx Updated to latest 2.7 dev build, issue still occurs with the same frequency as before.
-
I was asked by Netgate technical support to upgrade to pfSense Plus 22.05.
The issue persisted. I will continue working with support.
-
@mr-ortizx I wish there was some way to help point them towards the root of the issue. We know it's due to the vici socket getting overwhelmed/locked up. When it happens, if you run
sockstat | grep -i vici
you can see charon is overwhelmed. It started as about once a week for me and now it's every ~12 hours, it seems. Tunnels expire every 8 hours, so it doesn't appear to be directly related to the tunnels reconnecting. Opening the Command Prompt and running
pgrep -f charon
to get the PIDs, then
kill -9 [pid] [pid]
works, as long as you restart the IPsec service twice afterwards (not sure why it needs to be restarted twice); that seems to fix it. We know what the problem is, and I'd be willing to provide any logs that help, as I understand it is some sort of "rare" issue. If anyone from Netgate sees this, I'd be more than willing to assist in getting this resolved.
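Those manual steps can be rolled into one sketch. The default restart command here is an assumption (strongSwan's ipsec wrapper); how you bounce IPsec varies by pfSense version, so pass your own command as the first argument:

```shell
# recover_ipsec: force-kill wedged charon processes, then restart the
# service twice (the observation above is that one restart is not enough).
# The default restart command is strongSwan's wrapper and is an assumption;
# substitute whatever restarts IPsec on your pfSense version.
recover_ipsec() {
    restart_cmd=${1:-"/usr/local/sbin/ipsec restart"}
    pids=$(pgrep -x charon)          # -x: match the process name exactly
    if [ -n "$pids" ]; then
        echo "killing charon pids: $pids"
        kill -9 $pids
    fi
    $restart_cmd
    sleep 2                           # give the first restart a moment
    $restart_cmd                      # second restart, per the thread
}
```

This is only a stopgap for the hang, not a fix; kill -9 on charon will drop all tunnels until the restarts complete.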
-
@ablizno said in My IPSEC service hangs:
If anyone from netgate sees this, I'd be more than willing to assist in getting this resolved.
This is the redmine associated with this issue: https://redmine.pfsense.org/issues/13014
Any contribution you can provide should go there.