10 GbE performance of site-to-site IPSec

morgank

I am attempting to test 10 GbE performance of site-to-site IPSec with pfSense on a SuperMicro E300-9D-8CN8TP mini server. The upshot is that I am measuring unexpectedly low performance, in particular with 1500 MTU. I have tried many different performance tuning techniques (see details below), but have had no success finding parameters that would improve performance.

A block diagram of my test setup is shown below. I have four SuperMicro E300 servers. Two have CentOS 7 installed and two have pfSense 2.4.4-p3 CE installed. The CentOS server on the right runs an iperf server. The CentOS server on the left runs an iperf client and connects through a site-to-site IPSec tunnel created between the two pfSense servers.

+-----------------------+    +-----------------------+
| CentOS Server (left)  |    | CentOS Server (right) |
+-----------------------+    +-----------------------+
            |                            |            
+-----------------------+    +-----------------------+
| pfSense Server (left) |oooo| pfSense Server (right)|
+-----------------------+    +-----------------------+

The left CentOS server and left pfSense server have a 10GBase-T LAN link. The right CentOS server and right pfSense server also have a 10GBase-T LAN link. The two pfSense servers have a 10GBase-T WAN link through which the IPSec tunnel is formed.

The relevant specs of my specific SuperMicro E300 servers are as follows:

Intel Xeon processor D-2146NT, 8-Core, 16 Threads
AES-NI CPU crypto
16 GB ECC DDR4 memory
250 GB NVMe M.2
4x 1GbE, 2x 10GBase-T, 2x 10G SFP+

The pfSense 2.4.4-p3 servers are configured as follows:

Hyperthreading enabled
Packet Filtering (pf) enabled
KPI enabled
IPSec:
- Phase 1: IKEv1, Mutual RSA, Main Mode, AES-128, SHA-256, Diffie Hellman Group 14 (2048 bit)
- Phase 2: ESP, AES-128 GCM, SHA-256, PFS Key Group 14 (2048 bit)

I ran several tests varying the MTU and the number of parallel streams in iperf. The results are shown in the table below.

+------+-----------+---------------+
| MTU  |   Streams |    Throughput |
+------+-----------+---------------+
|      |         1 |     1.30 Gbps |
| 1500 +-----------+---------------+
|      |         8 |     1.81 Gbps |
+------+-----------+---------------+
|      |         1 |     6.13 Gbps |
| 9000 +-----------+---------------+
|      |         8 |     8.76 Gbps |
+------+-----------+---------------+
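
To put the table in per-packet terms, here is a rough sketch that converts each measured throughput into an approximate packet rate. It assumes every packet carries a full MTU-sized IP datagram and ignores Ethernet, ESP, and TCP/IP header overhead, so the numbers are estimates only. The function name packets_per_second is just an illustrative helper.

```python
# Rough packets-per-second estimate from the measured throughput.
# Assumption: every packet is a full MTU-sized IP datagram; Ethernet,
# ESP, and TCP/IP header overhead are ignored.

results = [
    # (MTU in bytes, iperf streams, measured throughput in Gbps)
    (1500, 1, 1.30),
    (1500, 8, 1.81),
    (9000, 1, 6.13),
    (9000, 8, 8.76),
]


def packets_per_second(mtu_bytes, gbps):
    """Approximate packet rate for a given MTU and throughput."""
    return gbps * 1e9 / (mtu_bytes * 8)


for mtu, streams, gbps in results:
    kpps = packets_per_second(mtu, gbps) / 1e3
    print(f"MTU {mtu}, {streams} stream(s): {gbps:.2f} Gbps ~ {kpps:.0f} kpps")
```

Under these assumptions, all four tests land somewhere between roughly 85 and 151 thousand packets per second, even though the throughput differs by about 5x between the two MTU sizes. That would be consistent with a per-packet limit rather than a per-byte one, but it is only a back-of-the-envelope estimate.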

My first thought was that this is a performance tuning issue, but I have tried the suggestions at the following links and I could not find a combination that seemed to make any difference.

[1] https://docs.netgate.com/pfsense/en/latest/hardware/tuning-and-troubleshooting-network-cards.html
[2] https://rerepi.wordpress.com/2008/04/19/tuning-freebsd-sysoev-rit/
[3] https://forum.netgate.com/topic/131645/10gbe-tuning-do-net-inet-tcp-recvspace-kern-ipc-maxsockbuf-etc-matter
[4] https://calomel.org/freebsd_network_tuning.html
[5] https://calomel.org/network_performance.html
[6] https://forum.netgate.com/topic/122619/solved-10gb-link-1gb-speeds
[7] https://forum.netgate.com/topic/131517/10gbps-performance-issue
[8] https://forum.netgate.com/topic/65103/10gbe-tuning
[9] https://forum.netgate.com/topic/136352/performance-tuning-for-10gb-connection
[10] https://forum.netgate.com/topic/132394/10gbit-performance-testing
[11] https://forums.freebsd.org/threads/high-cpu-interrupts-on-the-router-igb-driver-how-to-fix.28219/

Now I suspect that the interrupt processing is the biggest bottleneck. Here is a snapshot of the output from "top" on the left pfSense server during the 1500 MTU test. Note that one of the CPUs is doing 100% interrupt processing.

[2.4.4-RELEASE][admin@left]/root: top -P
last pid: 49108;  load averages:  3.22,  0.96,  0.37    up 0+00:10:29  11:30:39
50 processes:  3 running, 47 sleeping
CPU 0:   0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
CPU 1:   0.0% user,  0.0% nice,  0.0% system, 20.9% interrupt, 79.1% idle
CPU 2:   6.7% user,  0.0% nice, 28.7% system,  2.8% interrupt, 61.8% idle
CPU 3:   0.8% user,  0.0% nice, 13.4% system, 18.1% interrupt, 67.7% idle
CPU 4:   2.4% user,  0.0% nice, 16.9% system,  8.3% interrupt, 72.4% idle
CPU 5:   5.5% user,  0.0% nice, 33.9% system,  3.5% interrupt, 57.1% idle
CPU 6:   4.3% user,  0.0% nice, 22.8% system,  1.2% interrupt, 71.7% idle
CPU 7:   3.5% user,  0.0% nice, 31.1% system,  3.5% interrupt, 61.8% idle
CPU 8:   5.1% user,  0.0% nice, 27.6% system,  0.8% interrupt, 66.5% idle
CPU 9:   5.9% user,  0.0% nice, 24.8% system,  3.9% interrupt, 65.4% idle
CPU 10:  4.3% user,  0.0% nice, 29.5% system,  3.1% interrupt, 63.0% idle
CPU 11:  6.3% user,  0.0% nice, 33.5% system,  2.4% interrupt, 57.9% idle
CPU 12:  6.7% user,  0.0% nice, 26.4% system,  2.8% interrupt, 64.2% idle
CPU 13:  2.4% user,  0.0% nice, 29.5% system,  2.8% interrupt, 65.4% idle
CPU 14:  6.3% user,  0.0% nice, 24.8% system,  2.8% interrupt, 66.1% idle
CPU 15:  7.9% user,  0.0% nice, 22.8% system,  2.4% interrupt, 66.9% idle
Mem: 114M Active, 48M Inact, 496M Wired, 18M Buf, 15G Free
Swap: 3979M Total, 3979M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 7942 root         17  52    0 54284K 18156K sigwai 14   1:02 111.66% charon
36554 root          2 101    0 12400K 12520K CPU5    5   0:53  94.69% ntpd

As an aside, what in the world is ntpd doing that requires so much CPU? I tried disabling ntpd and it didn't affect the throughput performance, but it is still curious why ntpd needs so much CPU.
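
For anyone who wants to compare snapshots without eyeballing them, here is a minimal sketch that parses the per-CPU lines of "top -P" and lists the CPUs that are nearly out of idle time, along with what they are busy doing. The input is abbreviated to the first three CPU lines from the snapshot above. The 5% idle threshold and the function names are my own assumptions, not anything pfSense provides.

```python
import re

# Per-CPU lines copied from the "top -P" snapshot above (first three CPUs only).
TOP_OUTPUT = """\
CPU 0:   0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
CPU 1:   0.0% user,  0.0% nice,  0.0% system, 20.9% interrupt, 79.1% idle
CPU 2:   6.7% user,  0.0% nice, 28.7% system,  2.8% interrupt, 61.8% idle
"""

CPU_LINE = re.compile(r"^CPU\s+(\d+):\s+(.*)$")
FIELD = re.compile(r"([\d.]+)%\s+(\w+)")


def parse_top_cpus(text):
    """Return {cpu_id: {state: percent}} for each 'CPU n:' line."""
    cpus = {}
    for line in text.splitlines():
        match = CPU_LINE.match(line.strip())
        if match:
            cpu_id = int(match.group(1))
            cpus[cpu_id] = {
                name: float(pct) for pct, name in FIELD.findall(match.group(2))
            }
    return cpus


def busy_cpus(cpus, idle_threshold=5.0):
    """List (cpu_id, dominant_state) for CPUs with idle time below the threshold.

    idle_threshold is an arbitrary cutoff chosen for illustration.
    """
    saturated = []
    for cpu_id, states in sorted(cpus.items()):
        if states["idle"] < idle_threshold:
            busy = {k: v for k, v in states.items() if k != "idle"}
            saturated.append((cpu_id, max(busy, key=busy.get)))
    return saturated


print(busy_cpus(parse_top_cpus(TOP_OUTPUT)))
```

On the abbreviated input this reports only CPU 0, with "interrupt" as the dominant state, which matches what the snapshot shows.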

Here is the dmesg information about one of the 10 GbE NICs:

ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.9.9-k> mem 0xfa000000-0xfaffffff,0xfb018000-0xfb01ffff irq 46 at device 0.0 numa-domain 0 on pci14
ixl0: using 1024 tx descriptors and 1024 rx descriptors
ixl0: fw 3.1.57069 api 1.5 nvm 3.33 etid 80001006 oem 1.262.0
ixl0: PF-ID[0]: VFs 32, MSIX 129, VF MSIX 5, QPs 384, MDIO shared
ixl0: Using MSIX interrupts with 9 vectors
ixl0: Allocating 8 queues for PF LAN VSI; 8 queues active
ixl0: Ethernet address: ac:1f:6b:7d:82:ca
ixl0: SR-IOV ready
queues is 0xfffffe000199a000
ixl0: netmap queues/slots: TX 8/1024, RX 8/1024

Here are the contents of /boot/loader.conf (everything should be at default values):

kern.cam.boot_delay=10000
kern.ipc.nmbclusters="1000000"
kern.ipc.nmbjumbop="524288"
kern.ipc.nmbjumbo9="524288"
boot_multicons="YES"
boot_serial="YES"
console="comconsole.efi"
comconsole_speed="115200"
autoboot_delay="3"
hw.usb.no_pf="1"

I am happy to provide sysctl outputs or any other information that might be useful.

morgank

UPDATE: It seems that the ixl driver has been ported to iflib and is available in FreeBSD 12. As pfSense is adopting FreeBSD 12 for its 2.5 release, I decided to try a development snapshot of pfSense 2.5 (pfSense-CE-memstick-2.5.0-DEVELOPMENT-amd64-20191031-1313.img).

Below are the updated results for both pfSense 2.4.4-p3 and pfSense 2.5.0-dev:

+------+-----------+---------------+---------------+
| MTU  |   Streams | pfSense 2.4.4 | pfSense 2.5.0 |
+------+-----------+---------------+---------------+
|      |         1 |     1.30 Gbps |     1.66 Gbps |
| 1500 +-----------+---------------+---------------+
|      |         8 |     1.81 Gbps |     2.52 Gbps |
+------+-----------+---------------+---------------+
|      |         1 |     6.13 Gbps |     2.92 Gbps*|
| 9000 +-----------+---------------+---------------+
|      |         8 |     8.76 Gbps |     9.77 Gbps |
+------+-----------+---------------+---------------+

*The result for pfSense 2.5.0 with 9000 MTU and 1 stream is suspect. Throughput dropped dramatically with pfSense 2.5.0 in this test, while it improved in all of the other tests. I also saw some curious behavior during this specific test compared to the others. In all other tests the throughput was relatively constant from second to second. During this test the per-second throughput varied wildly, from 1 to 6 Gbps.
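
To quantify the difference between the two versions, here is a small sketch that computes the relative change for each row of the table above. The numbers are copied straight from the table, and percent_change is just an illustrative helper.

```python
# Throughput in Gbps, copied from the table above:
# (MTU, streams) -> (pfSense 2.4.4-p3, pfSense 2.5.0-dev)
results = {
    (1500, 1): (1.30, 1.66),
    (1500, 8): (1.81, 2.52),
    (9000, 1): (6.13, 2.92),
    (9000, 8): (8.76, 9.77),
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for (mtu, streams), (old, new) in results.items():
    change = percent_change(old, new)
    print(f"MTU {mtu}, {streams} stream(s): {old:.2f} -> {new:.2f} Gbps ({change:+.1f}%)")
```

This works out to gains of roughly 28%, 39%, and 12% for three of the tests, and a drop of about 52% for the suspect 9000 MTU single-stream test.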

The interrupt processing definitely improved with the move to pfSense 2.5.0. Below is a snapshot of the output from "top" on the left pfSense server during the 1500 MTU test. The interrupt processing and CPU usage seem to be distributed nicely across the cores. In fact, most of the CPUs seem to be underutilized, so I am no longer sure what bottleneck is limiting performance.

[2.5.0-DEVELOPMENT][admin@left]/root: top -P
last pid: 46970;  load averages:  1.65,  1.10,  0.97    up 0+16:56:20  08:44:13
49 processes:  2 running, 47 sleeping
CPU 0:   3.5% user,  0.0% nice, 15.7% system,  3.1% interrupt, 77.6% idle
CPU 1:   2.4% user,  0.0% nice, 14.2% system,  2.0% interrupt, 81.5% idle
CPU 2:   0.8% user,  0.0% nice, 23.1% system,  2.0% interrupt, 74.1% idle
CPU 3:   2.7% user,  0.0% nice, 18.0% system,  5.5% interrupt, 73.7% idle
CPU 4:   3.5% user,  0.0% nice, 20.4% system,  2.4% interrupt, 73.7% idle
CPU 5:   1.6% user,  0.0% nice, 23.1% system,  4.3% interrupt, 71.0% idle
CPU 6:   2.4% user,  0.0% nice, 16.9% system,  1.6% interrupt, 79.2% idle
CPU 7:   7.5% user,  0.0% nice, 52.5% system,  0.8% interrupt, 39.2% idle
CPU 8:   5.1% user,  0.0% nice, 24.7% system,  1.6% interrupt, 68.6% idle
CPU 9:   2.7% user,  0.0% nice, 16.9% system,  2.4% interrupt, 78.0% idle
CPU 10:  3.1% user,  0.0% nice, 16.1% system,  3.1% interrupt, 77.6% idle
CPU 11:  2.7% user,  0.0% nice, 24.3% system,  2.4% interrupt, 70.6% idle
CPU 12:  4.7% user,  0.0% nice, 21.2% system,  2.4% interrupt, 71.8% idle
CPU 13:  2.4% user,  0.0% nice, 22.0% system,  2.0% interrupt, 73.7% idle
CPU 14:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
CPU 15:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Mem: 77M Active, 39M Inact, 522M Wired, 69M Buf, 15G Free
Swap: 3979M Total, 3979M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
 8645 root         17  52    0    70M    20M sigwai   7   8:08 111.07% charon
26414 root          2  92    0    18M  6380K CPU0     0   6:52  99.59% ntpd