Netgate 2100 IPsec S2S AES GCM and SafeXcel mbuf overload
-
Hello everyone,
I ran into an mbuf overload after changing the S2S setting (Netgate 6100 – 2100) from AES256 to AES128-GCM.
If I start the NAS backup and use GCM, mbuf usage grows and grows, and after 1-2 h it reaches the limit and the SG-2100 no longer responds. Asynchronous Cryptography makes no difference; it is on now, and everything is working fine with CBC again. I also rebooted the 2100 after the change.
Both Netgates are running 22.01; the 6100 uses Intel QAT and the 2100 uses SafeXcel for hardware crypto support.
P1: AES256, SHA256, DH19, MOBIKE and DPD
P2: ESP, AES256, SHA256, DH19
Packages used on both sites:
The NAS backup job on the 6100 site uses TCP-BBR, if that matters.
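To put a number on "grows and grows", here is a minimal Python sketch that estimates how long until the mbuf limit is hit, assuming roughly linear growth between two readings. The function name `hours_until_full` and the sample readings are made up for illustration; only the limit of 70608 comes from a figure posted later in this thread.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, int]], limit: int) -> Optional[float]:
    """Estimate the hours left until mbuf usage reaches `limit`.

    `samples` holds (hours_since_start, mbufs_in_use) pairs, oldest first.
    The estimate uses the average growth rate between the first and the
    last sample. Returns None if usage is flat or shrinking.
    """
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0 or u1 <= u0:
        return None  # no upward trend, so nothing to extrapolate
    rate_per_hour = (u1 - u0) / (t1 - t0)
    return (limit - u1) / rate_per_hour


# Hypothetical readings: 2000 mbufs at the start, 22000 after half an hour.
readings = [(0.0, 2000), (0.5, 22000)]
remaining = hours_until_full(readings, limit=70608)
print(f"Estimated hours until the limit: {remaining:.2f}")
```

A steadily shrinking estimate across several readings would fit the leak theory; a `None` result means the counters are not climbing at all.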
-
The 2100 may be physically limited and not capable of handling that scenario. Have you tried an x86 firewall, even a desktop with dual NICs, in place of the 2100 to see if the issue persists?
-
The SG-1100 worked before I had to upgrade to the SG-2100, because I needed an SSD and the RAM.
The SoC is fine and the hardware crypto support is great: "AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS,SHA1,SHA256,SHA384,SHA512".
The IPsec tunnel works with CBC without a problem; I think there is a problem with GCM and the NIC queue.
To me, it looks like a memory leak in the crypto engine or the NIC queue.
It’s not a hardware problem, it’s a software problem, but I am not a developer. Now I need help from a developer, but maybe I can help to troubleshoot.
-
@nocling If it is a reproducible bug in the 2100, then you need to file a bug report in Redmine.
https://redmine.pfsense.org/
-
@nocling said in Netgate 2100 S2S mbuf overload:
The SG-1100 worked before I had to upgrade to the SG-2100, because I needed an SSD and the RAM.
The SoC is fine and the hardware crypto support is great: "AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS,SHA1,SHA256,SHA384,SHA512".
The IPsec tunnel works with CBC without a problem; I think there is a problem with GCM and the NIC queue.
To me, it looks like a memory leak in the crypto engine or the NIC queue.
It’s not a hardware problem, it’s a software problem, but I am not a developer. Now I need help from a developer, but maybe I can help to troubleshoot.
Interesting - I agree it looks like a memory leak, and likely connected to SafeXcel acceleration of GCM.
Try disabling the SafeXcel accelerator and see if it becomes stable (albeit slower) with GCM.
-
OK, now it works with GCM: MBUF usage is at 3% (1776/70608).
My parents' ISP changed the provisioning of the cable modem at the beginning of the week, and since then pfSense can no longer obtain an IPv6 prefix.
Now the SG-2100 only works with IPv4 and the problem is not there.
SafeXcel Crypto: Yes (active)
Asynchronous Cryptography (active)
When I encountered the error, the IPsec tunnel was connected using the IPv6 WAN IP on both sides on port 500; now it uses IPv4 on port 4500.
It looks like it interacts with IPv6 in some way, which triggers the MBUF overload.
So now I have to see if I can fix the IPv6 problem in order to reproduce the error.
Interesting issue...
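For anyone who wants to log that dashboard figure over time, here is a minimal Python sketch that pulls the used/max pair out of a pasted string such as the one above and computes the percentage. The function name `parse_mbuf_usage` is made up for illustration, and the code only parses text; it does not query pfSense itself.

```python
import re
from typing import Tuple


def parse_mbuf_usage(text: str) -> Tuple[int, int, float]:
    """Extract the first 'used/max' pair from `text`.

    Returns (used, limit, percent_used). Raises ValueError if no pair
    is found or if the limit is zero.
    """
    match = re.search(r"(\d+)\s*/\s*(\d+)", text)
    if match is None:
        raise ValueError(f"no used/max pair found in {text!r}")
    used, limit = int(match.group(1)), int(match.group(2))
    if limit == 0:
        raise ValueError("mbuf limit must not be zero")
    return used, limit, 100.0 * used / limit


used, limit, percent = parse_mbuf_usage("MBUF 3% (1776/70608)")
print(f"{used} of {limit} mbufs in use ({percent:.1f}%)")
```

Logging one such reading every few minutes during the backup should show quickly whether usage stays flat or keeps climbing.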
-
@nocling said in Netgate 2100 S2S mbuf overload:
OK, now it works with GCM: MBUF usage is at 3% (1776/70608).
My parents' ISP changed the provisioning of the cable modem at the beginning of the week, and since then pfSense can no longer obtain an IPv6 prefix.
Now the SG-2100 only works with IPv4 and the problem is not there.
SafeXcel Crypto: Yes (active)
Asynchronous Cryptography (active)
When I encountered the error, the IPsec tunnel was connected using the IPv6 WAN IP on both sides on port 500; now it uses IPv4 on port 4500.
It looks like it interacts with IPv6 in some way, which triggers the MBUF overload.
So now I have to see if I can fix the IPv6 problem in order to reproduce the error.
Interesting issue...
Very good observation that it is related to tunneling over IPv6 vs IPv4. When connected on port 500 it’s using ESP directly, whereas on port 4500 it’s using NAT traversal. Whether that has an influence, or whether it is only the IP protocol version, still needs to be tested.
Good find!
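To separate the two variables (IP version vs. encapsulation), the combinations worth testing can be listed like this. It is only a sketch in Python; the names `IP_VERSIONS`, `ENCAPSULATIONS` and `build_matrix` are made up for illustration, and it records no results, just the cases.

```python
from itertools import product
from typing import List, Tuple

IP_VERSIONS = ("IPv4", "IPv6")

# Native ESP travels as IP protocol 50 with IKE on UDP 500;
# NAT traversal wraps both IKE and ESP in UDP 4500.
ENCAPSULATIONS = {
    "native ESP": "IKE on UDP 500, ESP as IP protocol 50",
    "NAT-T": "IKE and ESP inside UDP 4500",
}


def build_matrix() -> List[Tuple[str, str, str]]:
    """Return every (ip_version, encapsulation, detail) combination."""
    return [
        (ip, encap, detail)
        for ip, (encap, detail) in product(IP_VERSIONS, ENCAPSULATIONS.items())
    ]


for ip, encap, detail in build_matrix():
    print(f"{ip:4}  {encap:10}  {detail}")
```

So far the thread only covers two of the four rows: IPv6 with native ESP and IPv4 with NAT-T.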
-
Yesterday, after 8 h of VPN backup with GCM and SafeXcel, the MBUF overload was building up again.
At the moment it looks like SafeXcel is triggering the MBUF overload, but I'll watch it again for another 24 hours.
-
@nocling said in Netgate 2100 S2S AES GCM and SafeXcel mbuf overload:
Yesterday, after 8 h of VPN backup with GCM and SafeXcel, the MBUF overload was building up again.
At the moment it looks like SafeXcel is triggering the MBUF overload, but I'll watch it again for another 24 hours.
Over IPv6 again, or this time over IPv4?
-
IPv4. IPv6 is broken by the ISP and I don't have the time to investigate a fix; IPsec is more important.
The only thing I changed was setting SafeXcel to inactive.
-
@nocling said in Netgate 2100 S2S AES GCM and SafeXcel mbuf overload:
IPv4. IPv6 is broken by the ISP and I don't have the time to investigate a fix; IPsec is more important.
The only thing I changed was setting SafeXcel to inactive.
Okay, so it’s a general SafeXcel issue when using GCM on the 2100 in your situation. It would be interesting to know whether anyone else can confirm this, or whether it’s some setting/parameter specific to your setup.
-
1.5 days without SafeXcel active, no problem here:
-
AES-GCM-128 and SafeXcel are active again, and the MBUF is already filling up again.
Also, the GUI is slower than usual when I access it through the S2S tunnel.
Now I'm back to CBC-256, with a reboot to clear the MBUF.
-
@nocling said in Netgate 2100 S2S AES GCM and SafeXcel mbuf overload:
AES-GCM-128 and SafeXcel are active again, and the MBUF is already filling up again.
Also, the GUI is slower than usual when I access it through the S2S tunnel.
Now I'm back to CBC-256, with a reboot to clear the MBUF.
Seems pretty clear where the issue is.
What's the speed difference (throughput) between CBC-256 and GCM-128 in your setup, where I assume the SG-2100 is the bottleneck and not your WAN speed?
-
The WAN upload is the limit.
But with the new appliances on both ends, it would be nice to be able to use GCM.
So it would be nice if someone from Netgate could now take a look at the whole thing and see if the error can be reproduced on their side.
-
Thanks for posting this! I have a Netgate 6100 connected to a 2100 through a VTI IPsec tunnel. Once there was medium or heavier traffic from the 6100 to the 2100, such as a file transfer from one NAS to another (nothing too heavy, internet speed of 100 Mbit/s), the entire VPN tunnel crashed and would not come up again until a reboot. I couldn't understand what on earth it was and tried every single setting and detail, but it never occurred to me that AES-GCM could be the issue.
I have now changed from AES-GCM to AES-CBC on the site-to-site tunnel and it suddenly became rock stable.
So there is definitely something to the AES-GCM theory on the Netgate 2100.
-
@phlmike said in Netgate 2100 S2S AES GCM and SafeXcel mbuf overload:
@nocling If it is a reproducible bug in the 2100, then you need to file a bug report in Redmine.
https://redmine.pfsense.org/
Please, you guys - remember to fill out the Redmine bug report. Otherwise this won’t get fixed.
-
It appears Bug #13074 ( https://redmine.pfsense.org/issues/13074 ) has been created for this.
-