VPNs Disconnecting / Reported Memory Issue
-
Hi
We have an issue with our Netgate SG-3100. We have multiple VPN's configured that eventually stop passing traffic and fail to reconnect, until a soft reboot of the firewall is issued. Once the reboot completes, all VPNs reconnect fine and start functioning as expected, until the issue is hit again (after roughly 2 days).
Having reviewing the ipsec logs, we've unearthed the following memory allocation error:
Apr 2 05:22:19 pfSense-le charon: 12[KNL] <con13000|200> unable to add SAD entry with SPI cb36cc85: Cannot allocate memory (12)
Oddly however, at the time that the issue is reported, there's plenty of memory available. We've got a suspicion that this may indicate an issue with Kernel allocating memory properly.
I did some searching in the strongswan code and found that this error messages appears to be getting passed back by the kernel after a pfkey_send call.
https://github.com/strongswan/strongswan/blob/master/src/libcharon/plugins/kernel_pfkey/kernel_pfkey_ipsec.c
if (pfkey_send(this, msg, &out, &len) != SUCCESS) { DBG1(DBG_KNL, "unable to add SAD entry with SPI %.8x", ntohl(id->spi)); return FAILED; } else if (out->sadb_msg_errno) { DBG1(DBG_KNL, "unable to add SAD entry with SPI %.8x: %s (%d)", ntohl(id->spi), strerror(out->sadb_msg_errno), out->sadb_msg_errno); free(out); return FAILED; }
we're getting that second form of the error message where there has been an error number passed back. So the question is - Why would the kernel pfkey run out of memory and how can we check for that and/or increase it?
-
Sounds incredibly familiar. I'm operating the same platform (SG-3100), and having exactly the same problem as you're describing.
Are you running an HA cluster setup or just a single SG-3100? What platforms are used on the remote ends of your tunnels? How big is the environment for your SG-3100 (how many interfaces, VLANs, firewall rules, connected devices etc.)?
Can you confirm that the frequency at which the bug occurrs increases with the number of tunnels? Are you using only VTI tunnels? No VTI at all? A mixture? Can the issue be narrowed down to VTI?We have a tunnel setup of SG-3100 <-> pfSense (x86 Dell HW) <-> Ubiquiti ER-X. The worst part about the bug is, when the first tunnel goes into this weird failed state, it actually even affects the second tunnel (between x86 pfSense and ER-X). This leads me to believe there may be a more general software problem. Though my last hope is still that it's SG-3100 specific and can somehow be fixed even if the fix is to replace the hardware...
It's still concerning that a bug on one device (say an SG-3100 in a branch office) can have such severe consequences and potentially affect other tunnels running on a connected device (say an XG-1541 in an HQ or Datacenter). That's a big showstopper for any serious business/enterprise deployment... -
What version of pfSense? 2.4.4-p3 or 2.4.5?
Please see https://redmine.pfsense.org/issues/10176
-
@viktor_g said in VPNs Disconnecting / Reported Memory Issue:
What version of pfSense? 2.4.4-p3 or 2.4.5?
Please see https://redmine.pfsense.org/issues/10176
(I work with Kash)
2.4.4-p3 We don't want to risk an upgrade at the moment to see if the problem is still present.
I don't think the problem you linked to is relevant at the moment but I'll double check, thanks.I've set net.inet.ipsec.debug = 1 now and we saw the following at the time of the memory error:
May 24 00:41:20 pfSense-le kernel: key_setsaval: unable to initialize SA type 3.
This error appears to be output in sys/netipsec/key.c during xform_init
I can't see anything further I can do to narrow down where in the kernel it's failing without building my own now.
We're capturing some vmstat -m and vmstat -z output but I'm not seeing anything growing significantly at this point. The 'RADIX NODE' zone is gradually increasing but doesn't seem to be a problem.
I'm guessing that as we see this problem on a 3100 but never saw it on the 2440 that it replaced that the problem is something to do with a crypto driver that is present on ARM only. -
@marcquark said in VPNs Disconnecting / Reported Memory Issue:
Sounds incredibly familiar. I'm operating the same platform (SG-3100), and having exactly the same problem as you're describing.
Are you running an HA cluster setup or just a single SG-3100? What platforms are used on the remote ends of your tunnels? How big is the environment for your SG-3100 (how many interfaces, VLANs, firewall rules, connected devices etc.)?
Can you confirm that the frequency at which the bug occurrs increases with the number of tunnels? Are you using only VTI tunnels? No VTI at all? A mixture? Can the issue be narrowed down to VTI?We have a tunnel setup of SG-3100 <-> pfSense (x86 Dell HW) <-> Ubiquiti ER-X. The worst part about the bug is, when the first tunnel goes into this weird failed state, it actually even affects the second tunnel (between x86 pfSense and ER-X). This leads me to believe there may be a more general software problem. Though my last hope is still that it's SG-3100 specific and can somehow be fixed even if the fix is to replace the hardware...
It's still concerning that a bug on one device (say an SG-3100 in a branch office) can have such severe consequences and potentially affect other tunnels running on a connected device (say an XG-1541 in an HQ or Datacenter). That's a big showstopper for any serious business/enterprise deployment...We're not using VTI at all. No HA either. Are you seeing the "Cannot allocate memory" error we see?
-
I'm in exactly the same boat, don't want to upgrade because of all the bug reports. Waiting at least for 2.4.5-p1.
That log entry is an an interesting find!
As mentioned earlier, the fact that the bug seemingly propagates makes me skeptical whether it's ARM-specific. It seems so illogical that a failed tunnel between (1) an SG-3100 and (2) an x86 platform can affect other tunnels on (2) to the point of complete loss of connectivity.I have a virtual lab with VTI between two pfSense VMs, FRR included for OSPF. It has been running for a little over a week now. In the beginning i consistently saw 2 SAs, today i noticed that number increased to 4. Will continue to observe. But it looks like the duplicate SA creation is reproducible.
@Graham-Collinson said in VPNs Disconnecting / Reported Memory Issue:
We're not using VTI at all. No HA either. Are you seeing the "Cannot allocate memory" error we see?
Yes, at least on the SG-3100. Not on the x86 device though.
-
I've been reading a lot of kernel source code. I can see there's a point when setting up an ipsec session that a decision is made on which driver to use. There's a setting that can be for a specific driver, hardware and/or software.
The main difference between our SG-2440 and the 3100 for crypto appears to be that the 3100 has a "Marvell Cryptographic Engine and Security Accelerator" (CESA). We've never had any issue like this with 2440s so I started looking at the driver code for the CESA. I can't see anything obvious for why it might be leaking but I though I might as well try to eliminate it if I can.The setting you can use to influence the driver that will be used for crypto operations in ipsec is net.inet.ipsec.crypto_support. By default this is set to 50331648. In the source code I can see these 2 defintions:
#define CRYPTO_FLAG_HARDWARE 0x01000000 /* hardware accelerated */ #define CRYPTO_FLAG_SOFTWARE 0x02000000 /* software implementation */
which in decimal is
hardware = 16777216
software = 33554432
(50331648 total)so I changed this to be software only (33554432) in the pfsense front end under system->advanced->system tunables
added as a new tunable with description “software only crypto - original value 50331648”We've been up for 7 days now with no reported memory problems. This is the longest we've had the firewall stay up for. Fingers crossed I've found the thing causing us our problems. I still can't track down exactly what is wrong with the CESA driver though.
Note that the crypto hardware option under system->advanced->miscellaneous doesn't affect this ipsec crypto setting at all.
-
(If hardware and software are set then hardware is preferred by the kernel code.)
-
reported as freebsd bug : https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247163
-
Just curious, are you still doing fine with disabled Hardware Crypto? I did what Jim proposed here https://redmine.pfsense.org/issues/10176 - set one side to responder only and applied the fix for VTIs. So far i haven't seen a duplicate SA again, and my tunnels are still working. It's too soon to build confidence though, we've had the setup seemingly stable before. So i'll keep observing but it looks promising.
-
@marcquark yep, all good for us now. Only change is the disabled hardware
-
Updates on the freebsd bug report. It looks as though cesa is restricted to a max of 64 sessions at a time. Somehow we're running out of sessions which causes the memory error we've seen. There is no way to tweak the max number of sessions without compiling your own version.
The memory management has changed in freebsd 12 such that this is probably not an issue. I'm not sure what the plans are for pfsense to use to freebsd 12.https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=247163
-
I believe that a session limit of 64 could explain both of our situations. In my case, under certain circumstances, duplicate Child SAs would be created (even though with VTI there should only be one).
I imagine software crypto on the SG-3100 yields lousy performance, so if you're willing to experiment, try what Jim suggested in the bug ticket i referenced. If you use VTI you'll have to manually apply the patch he committed here https://redmine.pfsense.org/projects/pfsense/repository/revisions/9a69dd4b8ff6eeeaf5779b7388a10743afae8e91/diff/src/etc/inc/vpn.inc
-
@marcquark
Hi
I also struggled with SA duplicates for a long time , and as a result I wrote my own small program ( С ) that interacts with FreeBSD kernel ( PF_KEY Key Management API ) and Strongswan ( Vici and Stroke Protocol ) .
This program receives messages from the kernel when a new SA is installed and checks how many are currently in the system . If it finds a duplicate, it sends a Strongswan message to delete the duplicate
Here's what it looks likeJun 16 10:05:43 daemon sa_checker started Jun 16 10:05:43 [KNL] successfully connected to PF_KEY_V2 socket Jun 16 10:05:54 [KNL] received message from kernel,type of message SADB_DELETE,errno = 0 ,seq number = 53850 Jun 16 10:05:54 [KNL] parsed data : SPI = 0xcecde75d , SRC ADDRESS = 10.3.100.100 , DST ADDRESS = 10.3.100.1 Jun 16 10:05:54 [KNL] received message from kernel,type of message SADB_ADD,errno = 0 ,seq number = 53851 Jun 16 10:05:54 [KNL] parsed data : REQID = 0x3e8, SPI = 0xcecde75d , SRC ADDRESS = 10.3.100.100 , DST ADDRESS = 10.3.100.1, SA TYPE = IPsec ESP Jun 16 10:05:54 [KNL] received message from kernel,type of message SADB_ADD,errno = 0 ,seq number = 53852 Jun 16 10:05:54 [KNL] parsed data : REQID = 0x3e8, SPI = 0xc459c817 , SRC ADDRESS = 10.3.100.1 , DST ADDRESS = 10.3.100.100, SA TYPE = IPsec ESP Jun 16 10:05:55 [KNL] successfully connected to stroke socket Jun 16 10:05:55 [KNL] stroke message successfully sended,waiting for reply .... Jun 16 10:05:55 [KNL] server reply is ' con1000{243}: INSTALLED, TUNNEL, reqid 1000, ESP SPIs: cb0c99c0_i c0d72295_o ' Jun 16 10:05:55 [KNL] server reply is ' con1000{244}: INSTALLED, TUNNEL, reqid 1000, ESP SPIs: cecde75d_i c459c817_o ' Jun 16 10:05:55 [IKE] total child's SA for deleting 1 Jun 16 10:05:55 [IKE] CHILD SA for deleting 243 Jun 16 10:05:55 [IKE] closing CHILD_SA con1000{243} with SPIs cb0c99c0_i (172 bytes) c0d72295_o (71116 bytes) and TS 0.0.0.0/0|/0 === 0.0.0.0/0|/0 Jun 16 10:05:55 [IKE] sending DELETE for ESP CHILD_SA with SPI cb0c99c0 Jun 16 10:05:55 [ENC] generating INFORMATIONAL request 1535 [ D ] Jun 16 10:05:55 [NET] sending packet: from 10.3.100.1[500] to 10.3.100.100[500] (80 bytes) Jun 16 10:05:55 [NET] received packet: from 10.3.100.100[500] to 10.3.100.1[500] (80 bytes) Jun 16 10:05:55 [ENC] parsed INFORMATIONAL response 1535 [ D ] Jun 16 10:05:55 [IKE] received DELETE for ESP CHILD_SA with SPI c0d72295 Jun 16 10:05:55 [IKE] CHILD_SA closed Jun 16 10:05:55 [IKE] CHILD SA with uniqueid 243 has deleted successfully Jun 16 10:05:55 [KNL] received message from kernel,type of message SADB_DELETE,errno = 0 ,seq number = 53853 Jun 16 10:05:55 [KNL] parsed data : SPI = 0xc5cdc8dd , SRC ADDRESS = 10.3.100.102 , DST ADDRESS = 10.3.100.1 Jun 16 10:05:55 [KNL] received message from kernel,type of message SADB_ADD,errno = 0 ,seq number = 53854 Jun 16 10:05:55 [KNL] parsed data : REQID = 0x7d0, SPI = 0xc5cdc8dd , SRC ADDRESS = 10.3.100.102 , DST ADDRESS = 10.3.100.1, SA TYPE = IPsec ESP Jun 16 10:05:55 [KNL] received message from kernel,type of message SADB_ADD,errno = 0 ,seq number = 53855 Jun 16 10:05:55 [KNL] parsed data : REQID = 0x7d0, SPI = 0xcd2b03b8 , SRC ADDRESS = 10.3.100.1 , DST ADDRESS = 10.3.100.102, SA TYPE = IPsec ESP Jun 16 10:05:56 [KNL] successfully connected to stroke socket Jun 16 10:05:56 [KNL] stroke message successfully sended,waiting for reply .... Jun 16 10:05:56 [KNL] server reply is ' con2000{242}: INSTALLED, TUNNEL, reqid 2000, ESP SPIs: ca05bb45_i c4a73853_o ' Jun 16 10:05:56 [KNL] server reply is ' con2000{245}: INSTALLED, TUNNEL, reqid 2000, ESP SPIs: c5cdc8dd_i cd2b03b8_o ' Jun 16 10:05:56 [IKE] total child's SA for deleting 1 Jun 16 10:05:56 [IKE] CHILD SA for deleting 242 Jun 16 10:05:56 [IKE] closing CHILD_SA con2000{242} with SPIs ca05bb45_i (0 bytes) c4a73853_o (216292 bytes) and TS 0.0.0.0/0|/0 === 0.0.0.0/0|/0 Jun 16 10:05:56 [IKE] sending DELETE for ESP CHILD_SA with SPI ca05bb45 Jun 16 10:05:56 [ENC] generating INFORMATIONAL request 127 [ D ] Jun 16 10:05:56 [NET] sending packet: from 10.3.100.1[500] to 10.3.100.102[500] (80 bytes) Jun 16 10:05:56 [NET] received packet: from 10.3.100.102[500] to 10.3.100.1[500] (80 bytes) Jun 16 10:05:56 [ENC] parsed INFORMATIONAL response 127 [ D ] Jun 16 10:05:56 [IKE] received DELETE for ESP CHILD_SA with SPI c4a73853 Jun 16 10:05:56 [IKE] CHILD_SA closed Jun 16 10:05:56 [IKE] CHILD SA with uniqueid 242 has deleted successfully
If you are interested, I can give you a program for the test