• Categories
  • Recent
  • Tags
  • Popular
  • Users
  • Search
  • Register
  • Login
Netgate Discussion Forum
  • Categories
  • Recent
  • Tags
  • Popular
  • Users
  • Search
  • Register
  • Login

VPNs Disconnecting / Reported Memory Issue

IPsec
5
14
982
Loading More Posts
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • M
    marcquark @Kash
    last edited by marcquark May 24, 2020, 9:55 PM May 24, 2020, 9:54 PM

    Sounds incredibly familiar. I'm operating the same platform (SG-3100), and having exactly the same problem as you're describing.
    Are you running an HA cluster setup or just a single SG-3100? What platforms are used on the remote ends of your tunnels? How big is the environment for your SG-3100 (how many interfaces, VLANs, firewall rules, connected devices etc.)?
    Can you confirm that the frequency at which the bug occurrs increases with the number of tunnels? Are you using only VTI tunnels? No VTI at all? A mixture? Can the issue be narrowed down to VTI?

    We have a tunnel setup of SG-3100 <-> pfSense (x86 Dell HW) <-> Ubiquiti ER-X. The worst part about the bug is, when the first tunnel goes into this weird failed state, it actually even affects the second tunnel (between x86 pfSense and ER-X). This leads me to believe there may be a more general software problem. Though my last hope is still that it's SG-3100 specific and can somehow be fixed even if the fix is to replace the hardware...
    It's still concerning that a bug on one device (say an SG-3100 in a branch office) can have such severe consequences and potentially affect other tunnels running on a connected device (say an XG-1541 in an HQ or Datacenter). That's a big showstopper for any serious business/enterprise deployment...

    G 1 Reply Last reply May 27, 2020, 12:06 PM Reply Quote 0
    • V
      viktor_g Netgate
      last edited by May 27, 2020, 6:06 AM

      What version of pfSense? 2.4.4-p3 or 2.4.5?

      Please see https://redmine.pfsense.org/issues/10176

      G 1 Reply Last reply May 27, 2020, 12:03 PM Reply Quote 0
      • G
        Graham.Collinson @viktor_g
        last edited by May 27, 2020, 12:03 PM

        @viktor_g said in VPNs Disconnecting / Reported Memory Issue:

        What version of pfSense? 2.4.4-p3 or 2.4.5?

        Please see https://redmine.pfsense.org/issues/10176

        (I work with Kash)
        2.4.4-p3 We don't want to risk an upgrade at the moment to see if the problem is still present.
        I don't think the problem you linked to is relevant at the moment but I'll double check, thanks.

        I've set net.inet.ipsec.debug = 1 now and we saw the following at the time of the memory error:

        
        May 24 00:41:20 pfSense-le kernel: key_setsaval: unable to initialize SA type 3.
        

        This error appears to be output in sys/netipsec/key.c during xform_init

        I can't see anything further I can do to narrow down where in the kernel it's failing without building my own now.
        We're capturing some vmstat -m and vmstat -z output but I'm not seeing anything growing significantly at this point. The 'RADIX NODE' zone is gradually increasing but doesn't seem to be a problem.
        I'm guessing that as we see this problem on a 3100 but never saw it on the 2440 that it replaced that the problem is something to do with a crypto driver that is present on ARM only.

        1 Reply Last reply Reply Quote 0
        • G
          Graham.Collinson @marcquark
          last edited by May 27, 2020, 12:06 PM

          @marcquark said in VPNs Disconnecting / Reported Memory Issue:

          Sounds incredibly familiar. I'm operating the same platform (SG-3100), and having exactly the same problem as you're describing.
          Are you running an HA cluster setup or just a single SG-3100? What platforms are used on the remote ends of your tunnels? How big is the environment for your SG-3100 (how many interfaces, VLANs, firewall rules, connected devices etc.)?
          Can you confirm that the frequency at which the bug occurrs increases with the number of tunnels? Are you using only VTI tunnels? No VTI at all? A mixture? Can the issue be narrowed down to VTI?

          We have a tunnel setup of SG-3100 <-> pfSense (x86 Dell HW) <-> Ubiquiti ER-X. The worst part about the bug is, when the first tunnel goes into this weird failed state, it actually even affects the second tunnel (between x86 pfSense and ER-X). This leads me to believe there may be a more general software problem. Though my last hope is still that it's SG-3100 specific and can somehow be fixed even if the fix is to replace the hardware...
          It's still concerning that a bug on one device (say an SG-3100 in a branch office) can have such severe consequences and potentially affect other tunnels running on a connected device (say an XG-1541 in an HQ or Datacenter). That's a big showstopper for any serious business/enterprise deployment...

          We're not using VTI at all. No HA either. Are you seeing the "Cannot allocate memory" error we see?

          1 Reply Last reply Reply Quote 0
          • M
            marcquark
            last edited by May 27, 2020, 5:25 PM

            I'm in exactly the same boat, don't want to upgrade because of all the bug reports. Waiting at least for 2.4.5-p1.

            That log entry is an an interesting find!
            As mentioned earlier, the fact that the bug seemingly propagates makes me skeptical whether it's ARM-specific. It seems so illogical that a failed tunnel between (1) an SG-3100 and (2) an x86 platform can affect other tunnels on (2) to the point of complete loss of connectivity.

            I have a virtual lab with VTI between two pfSense VMs, FRR included for OSPF. It has been running for a little over a week now. In the beginning i consistently saw 2 SAs, today i noticed that number increased to 4. Will continue to observe. But it looks like the duplicate SA creation is reproducible.

            @Graham-Collinson said in VPNs Disconnecting / Reported Memory Issue:

            We're not using VTI at all. No HA either. Are you seeing the "Cannot allocate memory" error we see?

            Yes, at least on the SG-3100. Not on the x86 device though.

            1 Reply Last reply Reply Quote 0
            • G
              Graham.Collinson
              last edited by Jun 8, 2020, 11:11 AM

              I've been reading a lot of kernel source code. I can see there's a point when setting up an ipsec session that a decision is made on which driver to use. There's a setting that can be for a specific driver, hardware and/or software.
              The main difference between our SG-2440 and the 3100 for crypto appears to be that the 3100 has a "Marvell Cryptographic Engine and Security Accelerator" (CESA). We've never had any issue like this with 2440s so I started looking at the driver code for the CESA. I can't see anything obvious for why it might be leaking but I though I might as well try to eliminate it if I can.

              The setting you can use to influence the driver that will be used for crypto operations in ipsec is net.inet.ipsec.crypto_support. By default this is set to 50331648. In the source code I can see these 2 defintions:

              #define CRYPTO_FLAG_HARDWARE	0x01000000	/* hardware accelerated */
              #define CRYPTO_FLAG_SOFTWARE	0x02000000	/* software implementation */
              

              which in decimal is
              hardware = 16777216
              software = 33554432
              (50331648 total)

              so I changed this to be software only (33554432) in the pfsense front end under system->advanced->system tunables
              added as a new tunable with description “software only crypto - original value 50331648”

              We've been up for 7 days now with no reported memory problems. This is the longest we've had the firewall stay up for. Fingers crossed I've found the thing causing us our problems. I still can't track down exactly what is wrong with the CESA driver though.

              Note that the crypto hardware option under system->advanced->miscellaneous doesn't affect this ipsec crypto setting at all.

              1 Reply Last reply Reply Quote 1
              • G
                Graham.Collinson
                last edited by Jun 8, 2020, 11:13 AM

                (If hardware and software are set then hardware is preferred by the kernel code.)

                1 Reply Last reply Reply Quote 1
                • G
                  Graham.Collinson
                  last edited by Jun 11, 2020, 11:15 AM

                  reported as freebsd bug : https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247163

                  1 Reply Last reply Reply Quote 1
                  • M
                    marcquark
                    last edited by Jun 18, 2020, 7:02 PM

                    Just curious, are you still doing fine with disabled Hardware Crypto? I did what Jim proposed here https://redmine.pfsense.org/issues/10176 - set one side to responder only and applied the fix for VTIs. So far i haven't seen a duplicate SA again, and my tunnels are still working. It's too soon to build confidence though, we've had the setup seemingly stable before. So i'll keep observing but it looks promising.

                    G 1 Reply Last reply Jun 19, 2020, 1:14 PM Reply Quote 0
                    • G
                      Graham.Collinson @marcquark
                      last edited by Jun 19, 2020, 1:14 PM

                      @marcquark yep, all good for us now. Only change is the disabled hardware

                      1 Reply Last reply Reply Quote 0
                      • G
                        Graham.Collinson
                        last edited by Jun 21, 2020, 12:33 PM

                        Updates on the freebsd bug report. It looks as though cesa is restricted to a max of 64 sessions at a time. Somehow we're running out of sessions which causes the memory error we've seen. There is no way to tweak the max number of sessions without compiling your own version.
                        The memory management has changed in freebsd 12 such that this is probably not an issue. I'm not sure what the plans are for pfsense to use to freebsd 12.

                        https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247163

                        1 Reply Last reply Reply Quote 0
                        • M
                          marcquark
                          last edited by marcquark Jun 22, 2020, 11:47 AM Jun 22, 2020, 11:45 AM

                          I believe that a session limit of 64 could explain both of our situations. In my case, under certain circumstances, duplicate Child SAs would be created (even though with VTI there should only be one).

                          I imagine software crypto on the SG-3100 yields lousy performance, so if you're willing to experiment, try what Jim suggested in the bug ticket i referenced. If you use VTI you'll have to manually apply the patch he committed here https://redmine.pfsense.org/projects/pfsense/repository/revisions/9a69dd4b8ff6eeeaf5779b7388a10743afae8e91/diff/src/etc/inc/vpn.inc

                          K 1 Reply Last reply Jun 22, 2020, 5:02 PM Reply Quote 0
                          • K
                            Konstanti @marcquark
                            last edited by Konstanti Jun 22, 2020, 5:07 PM Jun 22, 2020, 5:02 PM

                            @marcquark
                            Hi
                            I also struggled with SA duplicates for a long time , and as a result I wrote my own small program ( С ) that interacts with FreeBSD kernel ( PF_KEY Key Management API ) and Strongswan ( Vici and Stroke Protocol ) .
                            This program receives messages from the kernel when a new SA is installed and checks how many are currently in the system . If it finds a duplicate, it sends a Strongswan message to delete the duplicate
                            Here's what it looks like

                            Jun 16 10:05:43 daemon sa_checker started 
                            Jun 16 10:05:43 [KNL] successfully connected to PF_KEY_V2 socket
                            Jun 16 10:05:54 [KNL] received message from kernel,type of message SADB_DELETE,errno = 0 ,seq number = 53850 
                            Jun 16 10:05:54 [KNL] parsed data : SPI = 0xcecde75d , SRC ADDRESS = 10.3.100.100 , DST ADDRESS = 10.3.100.1
                            Jun 16 10:05:54 [KNL] received message from kernel,type of message SADB_ADD,errno = 0 ,seq number = 53851 
                            Jun 16 10:05:54 [KNL] parsed data : REQID = 0x3e8, SPI = 0xcecde75d , SRC ADDRESS = 10.3.100.100 , DST ADDRESS = 10.3.100.1, SA TYPE = IPsec ESP 
                            Jun 16 10:05:54 [KNL] received message from kernel,type of message SADB_ADD,errno = 0 ,seq number = 53852 
                            Jun 16 10:05:54 [KNL] parsed data : REQID = 0x3e8, SPI = 0xc459c817 , SRC ADDRESS = 10.3.100.1 , DST ADDRESS = 10.3.100.100, SA TYPE = IPsec ESP 
                            Jun 16 10:05:55 [KNL] successfully connected to stroke socket
                            Jun 16 10:05:55 [KNL] stroke message successfully sended,waiting for reply ....
                            Jun 16 10:05:55 [KNL] server reply is '     con1000{243}:  INSTALLED, TUNNEL, reqid 1000, ESP SPIs: cb0c99c0_i c0d72295_o ' 
                            Jun 16 10:05:55 [KNL] server reply is '     con1000{244}:  INSTALLED, TUNNEL, reqid 1000, ESP SPIs: cecde75d_i c459c817_o ' 
                            Jun 16 10:05:55 [IKE] total child's SA for deleting 1 
                            Jun 16 10:05:55 [IKE] CHILD SA  for deleting 243 
                            Jun 16 10:05:55 [IKE] closing CHILD_SA con1000{243} with SPIs cb0c99c0_i (172 bytes) c0d72295_o (71116 bytes) and TS 0.0.0.0/0|/0 === 0.0.0.0/0|/0
                            Jun 16 10:05:55 [IKE] sending DELETE for ESP CHILD_SA with SPI cb0c99c0
                            Jun 16 10:05:55 [ENC] generating INFORMATIONAL request 1535 [ D ]
                            Jun 16 10:05:55 [NET] sending packet: from 10.3.100.1[500] to 10.3.100.100[500] (80 bytes)
                            Jun 16 10:05:55 [NET] received packet: from 10.3.100.100[500] to 10.3.100.1[500] (80 bytes)
                            Jun 16 10:05:55 [ENC] parsed INFORMATIONAL response 1535 [ D ]
                            Jun 16 10:05:55 [IKE] received DELETE for ESP CHILD_SA with SPI c0d72295
                            Jun 16 10:05:55 [IKE] CHILD_SA closed
                            Jun 16 10:05:55 [IKE] CHILD SA with uniqueid 243 has deleted successfully
                            Jun 16 10:05:55 [KNL] received message from kernel,type of message SADB_DELETE,errno = 0 ,seq number = 53853 
                            Jun 16 10:05:55 [KNL] parsed data : SPI = 0xc5cdc8dd , SRC ADDRESS = 10.3.100.102 , DST ADDRESS = 10.3.100.1
                            Jun 16 10:05:55 [KNL] received message from kernel,type of message SADB_ADD,errno = 0 ,seq number = 53854 
                            Jun 16 10:05:55 [KNL] parsed data : REQID = 0x7d0, SPI = 0xc5cdc8dd , SRC ADDRESS = 10.3.100.102 , DST ADDRESS = 10.3.100.1, SA TYPE = IPsec ESP 
                            Jun 16 10:05:55 [KNL] received message from kernel,type of message SADB_ADD,errno = 0 ,seq number = 53855 
                            Jun 16 10:05:55 [KNL] parsed data : REQID = 0x7d0, SPI = 0xcd2b03b8 , SRC ADDRESS = 10.3.100.1 , DST ADDRESS = 10.3.100.102, SA TYPE = IPsec ESP 
                            Jun 16 10:05:56 [KNL] successfully connected to stroke socket
                            Jun 16 10:05:56 [KNL] stroke message successfully sended,waiting for reply ....
                            Jun 16 10:05:56 [KNL] server reply is '     con2000{242}:  INSTALLED, TUNNEL, reqid 2000, ESP SPIs: ca05bb45_i c4a73853_o ' 
                            Jun 16 10:05:56 [KNL] server reply is '     con2000{245}:  INSTALLED, TUNNEL, reqid 2000, ESP SPIs: c5cdc8dd_i cd2b03b8_o ' 
                            Jun 16 10:05:56 [IKE] total child's SA for deleting 1 
                            Jun 16 10:05:56 [IKE] CHILD SA  for deleting 242 
                            Jun 16 10:05:56 [IKE] closing CHILD_SA con2000{242} with SPIs ca05bb45_i (0 bytes) c4a73853_o (216292 bytes) and TS 0.0.0.0/0|/0 === 0.0.0.0/0|/0
                            Jun 16 10:05:56 [IKE] sending DELETE for ESP CHILD_SA with SPI ca05bb45
                            Jun 16 10:05:56 [ENC] generating INFORMATIONAL request 127 [ D ]
                            Jun 16 10:05:56 [NET] sending packet: from 10.3.100.1[500] to 10.3.100.102[500] (80 bytes)
                            Jun 16 10:05:56 [NET] received packet: from 10.3.100.102[500] to 10.3.100.1[500] (80 bytes)
                            Jun 16 10:05:56 [ENC] parsed INFORMATIONAL response 127 [ D ]
                            Jun 16 10:05:56 [IKE] received DELETE for ESP CHILD_SA with SPI c4a73853
                            Jun 16 10:05:56 [IKE] CHILD_SA closed
                            Jun 16 10:05:56 [IKE] CHILD SA with uniqueid 242 has deleted successfully
                            

                            If you are interested, I can give you a program for the test

                            1 Reply Last reply Reply Quote 0
                            • First post
                              Last post
                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.