Netgate Discussion Forum

    My IPSEC service hangs

    76 Posts 15 Posters 18.5k Views
    • S
      slimjim2321
      last edited by slimjim2321

      Has there been any update or possible workaround on this? I've been researching this for days and have tried everything short of changing platforms (which I'm close to doing at this point). To be frank, I've lost a lot of faith in the stability of pfSense after so many years of no issues.
      The issue seems to be completely random. Sometimes it skips a few days and other times it happens multiple times a day, but mostly at 3am when our tunnels are used the most, which has led to multiple early-hour calls for me over the past 2 weeks.

      I've confirmed it's not hardware related. I swapped the affected machine for new hardware with a fresh install of the latest 2.7 dev build (I was on 2.6 but saw someone in another post mention that upgrading to 2.7 fixed it for them) and restored most settings manually from scratch. The only things I restored from a backup config were the few hundred IPsec Phase 2 entries we have and our certs/users, which would take most of a day to recreate manually through the GUI. The hardware is more than enough for the amount of traffic we see (less than 50 Mbps). I have all hardware acceleration and offloading turned off for troubleshooting, even AES-NI. We have tons of tunnels but the total traffic is not significant: probably 1-5 gigabytes of data is sent per night over the course of 12 hours.

      Today it happened on our only other site's firewall, where I hadn't seen it happen until now and which I hadn't really changed while troubleshooting the original problem firewall. The only commonality between the two is that they share an IPsec tunnel with about 50 Phase 2 entries, and the issue only really started after that tunnel was put in place. At first I noticed the issue about once every 3 weeks. I figured it was some sort of hardware acceleration problem and disabled offloading, since I had recently enabled it when creating that new tunnel. Another 3-4 weeks would go by without issue, then it would happen again. Now it happens multiple times a day, almost every day.

      Today I tried lowering our number of tunnels by disabling "Split Connections" on both ends, which will reduce the number of connections by about 50. I normally prefer to have each connection split so I can better track individual tunnels on the status page, but we'll see if it makes a difference here. I'm not a programmer, but my gut tells me that having hundreds of IPsec Phase 2 entries isn't the norm and might be linked to the issue, based on what others have said. I've also lowered the verbosity of IPsec logging to Audit. The IPsec logs have never shown anything to indicate an issue and just stop being written when the issue occurs anyway, so if the issue is log related (I've seen others say it happens when logs need to be rotated) I should see a decrease in frequency.

      The only other pattern I've noticed is that it's happened multiple days in a row at 3am. By default pfSense has a cron job called "rc.periodic daily" that runs every day at 3am (I couldn't find any specifics on what it does), so it could be linked to that, but that's not definitive since it wouldn't explain the issue happening during the day. There are also weekly and monthly ones. Worth noting that today the issue happened on both my firewalls around 5am, which is when the "rc.periodic monthly" job was scheduled to run since this was the 1st of the month. Again, it could be a coincidence, but I'm going to try moving the daily cron job to a different time (or disabling it completely) and see if the issue starts happening at that time instead of in the small hours when our tunnels are mission critical.
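One way to check the time-of-day theory instead of relying on memory is to note a timestamp each time the hang is seen and then tally the entries by hour. A minimal sketch, assuming a hypothetical log with one `YYYY-MM-DD HH:MM:SS` line per hang; the function name and the log format are illustrative and not something pfSense provides:

```shell
#!/bin/sh
# Tally hang timestamps by hour of day.
# Input (stdin): one "YYYY-MM-DD HH:MM:SS" line per observed hang (hypothetical format).
# Output: "HH count" lines, sorted by hour.
hangs_by_hour() {
    awk 'NF >= 2 { split($2, t, ":"); count[t[1]]++ }
         END { for (h in count) print h, count[h] }' | sort
}
```

Run as `hangs_by_hour < /root/ipsec_hangs.log` (path hypothetical). If the counts pile up in the hours where the periodic jobs run, that supports the cron theory; an even spread argues against it.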

      Any help would be greatly appreciated.

      • G
        gassyantelope @slimjim2321
        last edited by gassyantelope

        @slimjim2321 Welcome to the club, how nice of you to join us :)

        There's been no update on the Redmine post since I posted the kernel trace log a month ago. I don't have any hope that it's going to be fixed any time soon. It appears that Netgate doesn't care about the problem at all, as this has been going on for months (maybe years) and has just been ignored. Many of us have provided information, which they asked for, but then haven't heard anything since.

        It's a shame that a fix, for an issue that breaks core functionality, is a low priority for them. They should really start advertising that pfSense doesn't support working IPsec VPNs at this point, because that's where we're at. It's very frustrating, that's for sure. We've moved all our clients away from Netgate/pfSense to another manufacturer at this point. I still have a single pfSense box that I use for testing, in hopes that this will be fixed in 5 years.

        • T
          Topogigio @gassyantelope
          last edited by

          @gassyantelope
          Hi, we are at the same point. We cannot use a system where IPsec is the main feature we need and it is not stable.

          Which platform did you choose?

          • david11717D
            david11717 @gassyantelope
            last edited by david11717

            @gassyantelope I've been having this issue as well and have been following this forum post and the post on Redmine. Clearly nothing is being done, so I decided to take it into my own hands and wrote a script that I run every minute with a cron job. I feel like it's pretty straightforward: I look up the length of the charon.vici listen queue and store it in queueLength. Then, if queueLength is greater than 0, I kill the charon processes and restart IPsec twice. I've done a few different iterations of the script but this one seems to work perfectly. Hope this helps!

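            #!/bin/sh

            # Read the listen-queue length (first field of qlen/incqlen/maxqlen)
            # for the charon.vici socket from netstat's listen-queue output.
            queueLength=$(netstat -Lan | awk '/charon\.vici/ { split($2, q, "/"); print q[1]; exit }')

            # A non-empty queue is taken as the sign that charon has hung.
            # Default to 0 if the socket was not listed at all.
            if [ "${queueLength:-0}" -gt 0 ]; then
                    # Kill the hung daemon, then restart IPsec twice with a pause in between.
                    /usr/bin/killall -9 charon
                    /usr/local/sbin/pfSsh.php playback restartipsec
                    sleep 10
                    /usr/local/sbin/pfSsh.php playback restartipsec
            fi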
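The queue check can also be split out so the parsing is testable against captured `netstat -Lan` output without touching a live firewall. A sketch, assuming the FreeBSD listen-queue layout with `qlen/incqlen/maxqlen` in the second column; `vici_qlen` and `vici_is_stuck` are hypothetical helper names, not part of pfSense or strongSwan:

```shell
#!/bin/sh
# Print the number of unaccepted connections (qlen) queued on the charon.vici
# socket, reading `netstat -Lan` output from stdin. Prints 0 if no such socket
# is listed.
vici_qlen() {
    awk '/charon\.vici/ { split($2, q, "/"); n = q[1]; found = 1; exit }
         END { print (found ? n + 0 : 0) }'
}

# Succeed (exit status 0) when the queue is non-empty, which the script in
# this post treats as the sign of a hung charon.
vici_is_stuck() {
    [ "$(vici_qlen)" -gt 0 ]
}
```

For example, `netstat -Lan | vici_is_stuck && echo "charon.vici backlog detected"` reports the condition without restarting anything, which is useful for confirming the symptom before wiring in the kill and restart.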
            
            • A
              auroramus @david11717
              last edited by

              @david11717 Hi, sorry for not replying sooner. Since I upgraded to 2.7 I've not had any issues; my IPsec has been holding strong.

              • david11717D
                david11717 @auroramus
                last edited by

                @auroramus Interesting. I'm on the newest 2.7 dev and I'm still having the issue.

                • S
                  slimjim2321 @david11717
                  last edited by slimjim2321

                  @david11717 Upgrading to 2.7 didn't fix my issue either. Thank you so much for the script.

                  Before the holiday weekend I greatly increased the log size before rotation (since I have hundreds of free GB on that device), disabled IPsec logging entirely, and disabled Split Connections on one of my bigger tunnels. One of these three changes seems to have kept it from happening over the weekend. I have a feeling it's being triggered by log rotation.

                  • keyserK
                    keyser Rebel Alliance @david11717
                    last edited by

                    @david11717 I don’t know if this applies to you, but I discovered an IPsec issue today when deploying a bunch of SG-2100 boxes (ARM64 CPU boxes).
                    If I use AES-GCM for encryption (both 128 and 256 bit) as guided by Netgate, the boxes will stall/become unresponsive after a while if there is more than one Phase 2 active in the tunnel. Boxes with only one Phase 2 do not seem to suffer from the issue.

                    Disabling SafeXcel (HW acceleration) does not mitigate the issue.
                    But changing the cipher on the tunnel to AES256 (not GCM; I believe it is really AES256-CBC) resolves the issue.

                    I have a lot of testing to do still, but it’s quite evident the change of cipher resolves the issue.
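To find which tunnels would be affected by a cipher-specific problem like this, the proposals in the generated strongSwan configuration can be scanned for GCM. A sketch, assuming the configuration uses swanctl.conf syntax (on recent pfSense versions it is believed to be generated at /var/etc/ipsec/swanctl.conf, but treat that path as an assumption); `gcm_proposals` is a hypothetical helper name:

```shell
#!/bin/sh
# List proposal lines that include an AES-GCM cipher, reading swanctl.conf
# text from stdin. Matches both "proposals" (IKE) and "esp_proposals"
# (child SA) settings.
gcm_proposals() {
    grep -E '^[[:space:]]*(esp_)?proposals[[:space:]]*=' | grep -i 'gcm'
}
```

Usage would be `gcm_proposals < /var/etc/ipsec/swanctl.conf`. No output (and a non-zero exit status) means no proposal mentions GCM.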

                    Love the no fuss of using the official appliances :-)

                    • david11717D
                      david11717 @keyser
                      last edited by

                      @keyser 99% of my IPSec VPNs use AES256. I have one that uses AES256-GCM and I changed it to AES256 in the testing process (a month or two ago). The issue persisted. That being said, the devices I'm using are x86_64 and not ARM based.

                        • david11717D
                          david11717 @slimjim2321
                          last edited by

                          @slimjim2321 said in My IPSEC service hangs:

                          @david11717 Upgrading to 2.7 didn't fix my issue either. Thank you so much for the script.

                          Before the holiday weekend I greatly increased the log size before rotation (since I have hundreds of free GB on that device), disabled IPsec logging entirely, and disabled Split Connections on one of my bigger tunnels. One of these three changes seems to have kept it from happening over the weekend. I have a feeling it's being triggered by log rotation.

                          Did any of these steps completely resolve your problem since then? My script is still working great, but I'd love to actually fix the issue rather than have a cron job running every minute.

                          • F
                            Flukester
                            last edited by

                            Hi all

                            We have been getting something similar... we have around 25 IPsec site-to-site VPNs and they have been very unstable. Sometimes the VPNs function for weeks, then they all drop, we cannot log into the management UI, and we need a reboot to fix it... sometimes they drop twice in one day.

                            Been working with TAC enterprise support on this...

                            We applied an IPsec kernel hotfix but it still died...

                            Currently trying a few things to fix it -

                            Increased IPsec log file size to 1MB
                            Put in a 6hr cron job to remove IPsec log file archives
                            Disabled VPNs that are not finished yet so they are not spamming the logs
                            Removed IPsec widget from dashboard

                            Now just waiting to see if we have any stability... if stable it could be down to any one of these changes. Touch wood
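The archive cleanup in the second item could look roughly like this. It is a sketch only: the archive naming (`ipsec.log.*` next to the live `ipsec.log`) and the `/var/log` default are assumptions, and `purge_ipsec_archives` is a hypothetical name.

```shell
#!/bin/sh
# Delete rotated IPsec log archives in the given directory, leaving the live
# ipsec.log untouched. Assumes archives are named ipsec.log.<suffix>.
# Defaults to /var/log when no directory is given.
purge_ipsec_archives() {
    dir=${1:-/var/log}
    find "$dir" -maxdepth 1 -type f -name 'ipsec.log.*' -exec rm -f {} +
}
```

A script that calls this function could then be scheduled every six hours with a cron time spec of `0 */6 * * *`.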

                            • F
                              Flukester
                              last edited by

                              New changes on top of those I made before...

                              Reduced all the 'chatter' clogging up the IPsec logs, which also makes them much more readable for diagnostic purposes.

                              VPN / IPsec / Advanced Settings / IPsec Logging

                              IKE SA: Diag changed to Audit
                              IKE Child SA: Diag changed to Audit
                              Networking: Control changed to Audit
                              Message encoding: Control changed to Audit

                              • S
                                slimjim2321 @david11717
                                last edited by

                                @david11717 Since my last post 18 days ago I haven't experienced the issue. I'm unsure which of the fixes I put in place actually solved it for me, though. I have since re-enabled the "rc.periodic daily" cron jobs that are in pfSense by default and the issue hasn't returned, so that's not it. So my issue was solved either by completely disabling IPsec logging (which isn't ideal) along with greatly increasing the log rotation size (I have it set to 100 MB, default is 500 KB), or by disabling Split Connections for my largest tunnel. I still have Split Connections enabled for all of my other tunnels, so I don't think that was the cause either.
                                My best guess is that it was being triggered when the logs rotated, and that removing the majority of my log volume (IPsec) simply slowed it down. If I'm right, it'll eventually happen again in a few months' time given how slowly the logs fill up now.
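If log rotation really is the trigger, the time until the next occurrence can be estimated from the rotation threshold and how fast the log grows. A back-of-the-envelope sketch; `days_until_rotation` is a hypothetical helper, and the growth rate in the example is a made-up figure, not a measurement from this thread:

```shell
#!/bin/sh
# Estimate whole days (rounded up) until a log reaches its rotation size.
# Args: rotation limit (bytes), current size (bytes), growth (bytes per day).
days_until_rotation() {
    limit=$1; current=$2; per_day=$3
    [ "$per_day" -gt 0 ] || { echo "growth must be positive" >&2; return 1; }
    [ "$current" -lt "$limit" ] || { echo 0; return 0; }
    echo $(( (limit - current + per_day - 1) / per_day ))
}
```

With a 100 MiB limit, an empty log, and an assumed 1 MiB of growth per day, `days_until_rotation 104857600 0 1048576` prints `100`, which is in line with the "few months" guess. With a 500 KB limit the same growth would hit rotation within a day.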

                                • david11717D
                                  david11717 @slimjim2321
                                  last edited by

                                  @slimjim2321 I guess I'll try each of those separately for a few days and see how it goes. Thanks for the update!

                                  • F
                                    Flukester
                                    last edited by

                                    Just an update: since the changes I made we have now been up for 7 days... too early to say for sure, but things are looking good

                                    • david11717D
                                      david11717 @Flukester
                                      last edited by

                                      @flukester I made the same changes you detailed above and the issue still happens for me. At this point I've stopped playing with it and my cron job fixes the issue within about 60 seconds of it happening. It's kind of ridiculous, honestly.

                                      • R
                                        romczak
                                        last edited by romczak

                                        I have the same problem with the firewall, so I purchased Enterprise support. They pointed me to https://redmine.pfsense.org/issues/13014 and said to follow what is there... without any explanation... Working with Palo Alto and Cisco daily, I was quite stunned to get a typical online forum response.
                                        I guess you get what you pay for, but if anyone is considering buying the subscription, I DO NOT recommend it.

                                        Right now I am running the script above, except that the service restarts don't fix IPsec for me, so I replaced them with a system reboot. I am using the box for IPsec only, so when it goes down I already have an outage.

                                        • G
                                          gassyantelope @romczak
                                          last edited by gassyantelope

                                          @romczak Yeah, it's been pretty concerning how this whole situation has been handled. It's one thing if there are problems when using pfSense community edition, since that's always been free and open source and you can't expect everything to be fixed ASAP. It's another thing when Netgate sells devices, support, and their paid edition of pfSense, and markets them as being stable and robust, yet they can't even do IPsec reliably.

                                          I'd expect such common/core functionality to work properly when buying their devices and/or their "enterprise" pfSense edition. We're now going on 8+ months since the issue was submitted on Redmine, but there were other similar issues submitted over a year or two ago. It's a really bad look for them to take people's money for a product that lacks reliable IPsec functionality, something that I've never seen any other firewall software/company struggle with. It's just as bad that you can pay for support, but they don't seem to do much more than point you to the public issue tracker (Redmine, which is free for everyone) or the community forum.

                                          The good thing is that the issue on Redmine is finally getting some responses from some developers. Hopefully that means we'll have a real fix soon. Though, it still makes me worry about how future problems may be handled. If there's another issue with core functionality in the future, is it going to be another 8+ month wait to get that fixed as well?

                                          • T
                                            Topogigio
                                            last edited by

                                            I'm leaving pfSense, this is the only solution. The product (considering it from all points of views) simply is now too bad to count on it for production environments and real life.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.