10 GbE performance of site-to-site IPSec



  • I am attempting to test 10 GbE performance of site-to-site IPSec with pfSense on a SuperMicro E300-9D-8CN8TP mini server. The upshot is that I am measuring unexpectedly low performance, in particular with 1500 MTU. I have tried many different performance tuning techniques (see details below), but have had no success finding parameters that would improve performance.

    A block diagram of my test setup is shown below. I have four SuperMicro E300 servers. Two have CentOS 7 installed and two have pfSense 2.4.4-p3 CE installed. The CentOS server on the right runs an iperf server. The CentOS server on the left runs an iperf client and connects through a site-to-site IPSec tunnel created between the two pfSense servers.

    +-----------------------+    +-----------------------+
    | CentOS Server (left)  |    | CentOS Server (right) |
    +-----------------------+    +-----------------------+
                |                            |            
    +-----------------------+    +-----------------------+
    | pfSense Server (left) |oooo| pfSense Server (right)|
    +-----------------------+    +-----------------------+
    

    The left CentOS server and left pfSense server have a 10GBase-T LAN link. The right CentOS server and right pfSense server also have a 10GBase-T LAN link. The two pfSense servers have a 10GBase-T WAN link through which the IPSec tunnel is formed.

    The relevant specs of my specific SuperMicro E300 servers are as follows:

    • Intel Xeon processor D-2146NT, 8-Core, 16 Threads
    • AES-NI CPU crypto
    • 16 GB ECC DDR4 memory
    • 250 GB NVMe M.2
    • 4x 1GbE, 2x 10GBase-T, 2x 10G SFP+

    The pfSense 2.4.4-p3 servers are configured as follows:

    • Hyperthreading enabled
    • Packet Filtering (pf) enabled
    • KPI enabled
    • IPSec:
      • Phase 1: IKEv1, Mutual RSA, Main Mode, AES-128, SHA-256, Diffie Hellman Group 14 (2048 bit)
      • Phase 2: ESP, AES-128 GCM, SHA-256, PFS Key Group 14 (2048 bit)
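
    For reference, here is a rough sketch of how much of each WAN packet the Phase 2 settings above leave for tunneled traffic. It assumes things not stated in this post: an IPv4 outer header with no options, tunnel mode, no NAT-T (UDP) encapsulation, and the usual AES-GCM ESP layout (8-byte ESP header, 8-byte IV, 2-byte trailer plus alignment padding, 16-byte ICV). The function names are just illustrative.

```python
# Assumed per-packet ESP tunnel-mode overhead for AES-GCM over IPv4 (no NAT-T).
OUTER_IP = 20      # outer IPv4 header, no options
ESP_HEADER = 8     # SPI + sequence number
GCM_IV = 8         # explicit IV carried in each packet
ESP_TRAILER = 2    # pad length + next header
GCM_ICV = 16       # integrity check value (tag)


def esp_packet_size(inner_len):
    """Size on the wire (IP layer) of one ESP packet carrying inner_len bytes."""
    # Inner packet + padding + trailer must be aligned to 4 bytes.
    pad = -(inner_len + ESP_TRAILER) % 4
    return OUTER_IP + ESP_HEADER + GCM_IV + inner_len + pad + ESP_TRAILER + GCM_ICV


def max_inner_packet(outer_mtu):
    """Largest inner packet that still fits in outer_mtu after encapsulation."""
    inner = outer_mtu
    while esp_packet_size(inner) > outer_mtu:
        inner -= 1
    return inner


for mtu in (1500, 9000):
    inner = max_inner_packet(mtu)
    print(f"WAN MTU {mtu}: inner packet up to {inner} bytes "
          f"({100.0 * inner / mtu:.1f}% of the outer packet)")
```

    Under those assumptions the encapsulation overhead is only a few percent at either MTU, so the extra header bytes alone would not account for a large throughput difference between 1500 and 9000 MTU.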

    I ran several tests varying the MTU and the number of parallel streams in iperf. The results are shown in the table below.

    +------+-----------+---------------+
    | MTU  |   streams |    throughput |
    +------+-----------+---------------+
    |      |         1 |     1.30 Gbps |
    | 1500 +-----------+---------------+
    |      |         8 |     1.81 Gbps |
    +------+-----------+---------------+
    |      |         1 |     6.13 Gbps |
    | 9000 +-----------+---------------+
    |      |         8 |     8.76 Gbps |
    +------+-----------+---------------+
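
    As a rough sanity check, the throughput figures in the table can be converted into approximate packet rates. This is only a back-of-the-envelope sketch: it assumes every packet is a full MTU in size and ignores header and ESP overhead, and the function name is just illustrative.

```python
def packets_per_second(gbps, mtu_bytes):
    """Approximate packet rate, assuming every packet is exactly mtu_bytes long."""
    return gbps * 1e9 / (mtu_bytes * 8)


# (MTU, parallel streams, measured throughput in Gbps) from the table above.
RESULTS = [
    (1500, 1, 1.30),
    (1500, 8, 1.81),
    (9000, 1, 6.13),
    (9000, 8, 8.76),
]

for mtu, streams, gbps in RESULTS:
    kpps = packets_per_second(gbps, mtu) / 1000
    print(f"MTU {mtu}, {streams} stream(s): ~{kpps:.0f} kpps")
```

    With these numbers all four tests land at roughly 85 to 151 thousand packets per second, even though the bit rates differ by a factor of almost seven. That would be consistent with a per-packet cost, rather than a per-byte cost, limiting throughput, although it does not prove it.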
    

    My first thought was that this is a performance tuning issue, but I have tried the suggestions at the following links and I could not find a combination that seemed to make any difference.

    [1] https://docs.netgate.com/pfsense/en/latest/hardware/tuning-and-troubleshooting-network-cards.html
    [2] https://rerepi.wordpress.com/2008/04/19/tuning-freebsd-sysoev-rit/
    [3] https://forum.netgate.com/topic/131645/10gbe-tuning-do-net-inet-tcp-recvspace-kern-ipc-maxsockbuf-etc-matter
    [4] https://calomel.org/freebsd_network_tuning.html
    [5] https://calomel.org/network_performance.html
    [6] https://forum.netgate.com/topic/122619/solved-10gb-link-1gb-speeds
    [7] https://forum.netgate.com/topic/131517/10gbps-performance-issue
    [8] https://forum.netgate.com/topic/65103/10gbe-tuning
    [9] https://forum.netgate.com/topic/136352/performance-tuning-for-10gb-connection
    [10] https://forum.netgate.com/topic/132394/10gbit-performance-testing
    [11] https://forums.freebsd.org/threads/high-cpu-interrupts-on-the-router-igb-driver-how-to-fix.28219/

    Now I suspect that the interrupt processing is the biggest bottleneck. Here is a snapshot of the output from "top" on the left pfSense server during the 1500 MTU test. Note that one of the CPUs is doing 100% interrupt processing.

    [2.4.4-RELEASE][admin@left]/root: top -P
    last pid: 49108;  load averages:  3.22,  0.96,  0.37    up 0+00:10:29  11:30:39
    50 processes:  3 running, 47 sleeping
    CPU 0:   0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
    CPU 1:   0.0% user,  0.0% nice,  0.0% system, 20.9% interrupt, 79.1% idle
    CPU 2:   6.7% user,  0.0% nice, 28.7% system,  2.8% interrupt, 61.8% idle
    CPU 3:   0.8% user,  0.0% nice, 13.4% system, 18.1% interrupt, 67.7% idle
    CPU 4:   2.4% user,  0.0% nice, 16.9% system,  8.3% interrupt, 72.4% idle
    CPU 5:   5.5% user,  0.0% nice, 33.9% system,  3.5% interrupt, 57.1% idle
    CPU 6:   4.3% user,  0.0% nice, 22.8% system,  1.2% interrupt, 71.7% idle
    CPU 7:   3.5% user,  0.0% nice, 31.1% system,  3.5% interrupt, 61.8% idle
    CPU 8:   5.1% user,  0.0% nice, 27.6% system,  0.8% interrupt, 66.5% idle
    CPU 9:   5.9% user,  0.0% nice, 24.8% system,  3.9% interrupt, 65.4% idle
    CPU 10:  4.3% user,  0.0% nice, 29.5% system,  3.1% interrupt, 63.0% idle
    CPU 11:  6.3% user,  0.0% nice, 33.5% system,  2.4% interrupt, 57.9% idle
    CPU 12:  6.7% user,  0.0% nice, 26.4% system,  2.8% interrupt, 64.2% idle
    CPU 13:  2.4% user,  0.0% nice, 29.5% system,  2.8% interrupt, 65.4% idle
    CPU 14:  6.3% user,  0.0% nice, 24.8% system,  2.8% interrupt, 66.1% idle
    CPU 15:  7.9% user,  0.0% nice, 22.8% system,  2.4% interrupt, 66.9% idle
    Mem: 114M Active, 48M Inact, 496M Wired, 18M Buf, 15G Free
    Swap: 3979M Total, 3979M Free
    
      PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
     7942 root         17  52    0 54284K 18156K sigwai 14   1:02 111.66% charon
    36554 root          2 101    0 12400K 12520K CPU5    5   0:53  94.69% ntpd
    

    As an aside, what in the world is ntpd doing that requires so much CPU? I tried disabling ntpd and it didn't affect the throughput performance, but it is still curious why ntpd needs so much CPU.
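
    To compare snapshots like the one above across test runs, a small script can pull the per-CPU percentages out of the "top -P" output and flag saturated cores. This is a minimal sketch that assumes the CPU lines have the same layout as in the snapshot; the function names and the 90% threshold are arbitrary choices of mine.

```python
import re

# Matches lines such as "CPU 0:   0.0% user,  0.0% nice, ..."
CPU_LINE = re.compile(r"^\s*CPU\s+(\d+):\s+(.*)$")
# Matches each "<percent>% <state>" pair on a CPU line.
FIELD = re.compile(r"([\d.]+)%\s+(\w+)")


def parse_top_cpus(text):
    """Return {cpu_index: {state: percent}} for every CPU line in text."""
    cpus = {}
    for line in text.splitlines():
        match = CPU_LINE.match(line)
        if match:
            fields = FIELD.findall(match.group(2))
            cpus[int(match.group(1))] = {name: float(pct) for pct, name in fields}
    return cpus


def saturated(cpus, state="interrupt", threshold=90.0):
    """Return the sorted CPU indexes whose `state` percentage is >= threshold."""
    return sorted(cpu for cpu, stats in cpus.items()
                  if stats.get(state, 0.0) >= threshold)


sample = """\
CPU 0:   0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
CPU 1:   0.0% user,  0.0% nice,  0.0% system, 20.9% interrupt, 79.1% idle
CPU 2:   6.7% user,  0.0% nice, 28.7% system,  2.8% interrupt, 61.8% idle
"""

print(saturated(parse_top_cpus(sample)))
```

    For the three sample lines, taken from the snapshot above, this reports only CPU 0 as saturated with interrupt processing.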

    Here is the dmesg information about one of the 10 GbE NICs:

    ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.9.9-k> mem 0xfa000000-0xfaffffff,0xfb018000-0xfb01ffff irq 46 at device 0.0 numa-domain 0 on pci14
    ixl0: using 1024 tx descriptors and 1024 rx descriptors
    ixl0: fw 3.1.57069 api 1.5 nvm 3.33 etid 80001006 oem 1.262.0
    ixl0: PF-ID[0]: VFs 32, MSIX 129, VF MSIX 5, QPs 384, MDIO shared
    ixl0: Using MSIX interrupts with 9 vectors
    ixl0: Allocating 8 queues for PF LAN VSI; 8 queues active
    ixl0: Ethernet address: ac:1f:6b:7d:82:ca
    ixl0: SR-IOV ready
    queues is 0xfffffe000199a000
    ixl0: netmap queues/slots: TX 8/1024, RX 8/1024
    

    Here are the contents of /boot/loader.conf (everything should be at default values):

    kern.cam.boot_delay=10000
    kern.ipc.nmbclusters="1000000"
    kern.ipc.nmbjumbop="524288"
    kern.ipc.nmbjumbo9="524288"
    boot_multicons="YES"
    boot_serial="YES"
    console="comconsole.efi"
    comconsole_speed="115200"
    autoboot_delay="3"
    hw.usb.no_pf="1"
    

    I am happy to provide sysctl outputs or any other information that might be useful.



  • UPDATE: It seems that the ixl driver has been ported to iflib and is available in FreeBSD 12. As pfSense is adopting FreeBSD 12 for its 2.5 release, I decided to try a development snapshot of pfSense 2.5 (pfSense-CE-memstick-2.5.0-DEVELOPMENT-amd64-20191031-1313.img).

    Below are the updated results for both pfSense 2.4.4-p3 and pfSense 2.5.0-dev:

    +------+-----------+---------------+---------------+
    | MTU  |   streams | pfSense 2.4.4 | pfSense 2.5.0 |
    +------+-----------+---------------+---------------+
    |      |         1 |     1.30 Gbps |     1.66 Gbps |
    | 1500 +-----------+---------------+---------------+
    |      |         8 |     1.81 Gbps |     2.52 Gbps |
    +------+-----------+---------------+---------------+
    |      |         1 |     6.13 Gbps |     2.92 Gbps*|
    | 9000 +-----------+---------------+---------------+
    |      |         8 |     8.76 Gbps |     9.77 Gbps |
    +------+-----------+---------------+---------------+
    

    *The result for pfSense 2.5.0 with 9000 MTU and 1 stream is suspect. Performance decreased dramatically with pfSense 2.5.0 for this test, while it improved for all of the other tests. I also saw some curious behavior during this specific test as compared to the others. In all other tests the throughput was relatively constant on a second-by-second basis. During this test the per-second throughput varied wildly, from 1 to 6 Gbps.
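
    To put numbers on the change between the two versions, here is a small sketch that computes the relative difference for each row of the table. The helper name is just illustrative.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# (MTU, parallel streams, pfSense 2.4.4 Gbps, pfSense 2.5.0 Gbps) from the table above.
RESULTS = [
    (1500, 1, 1.30, 1.66),
    (1500, 8, 1.81, 2.52),
    (9000, 1, 6.13, 2.92),
    (9000, 8, 8.76, 9.77),
]

for mtu, streams, old, new in RESULTS:
    print(f"MTU {mtu}, {streams} stream(s): {percent_change(old, new):+.1f}%")
```

    That works out to gains of roughly 28%, 39%, and 12% for three of the tests, and a drop of roughly 52% for the suspect 9000 MTU single-stream test.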

    The interrupt processing definitely improved with the move to pfSense 2.5.0. Below is a snapshot of the output from "top" on the left pfSense server during the 1500 MTU test. The interrupt processing and CPU usage seem to be distributed nicely across all of the cores. In fact, most of the CPUs seem to be underutilized, so I am no longer sure what bottleneck is limiting performance.

    [2.5.0-DEVELOPMENT][admin@left]/root: top -P
    last pid: 46970;  load averages:  1.65,  1.10,  0.97    up 0+16:56:20  08:44:13
    49 processes:  2 running, 47 sleeping
    CPU 0:   3.5% user,  0.0% nice, 15.7% system,  3.1% interrupt, 77.6% idle
    CPU 1:   2.4% user,  0.0% nice, 14.2% system,  2.0% interrupt, 81.5% idle
    CPU 2:   0.8% user,  0.0% nice, 23.1% system,  2.0% interrupt, 74.1% idle
    CPU 3:   2.7% user,  0.0% nice, 18.0% system,  5.5% interrupt, 73.7% idle
    CPU 4:   3.5% user,  0.0% nice, 20.4% system,  2.4% interrupt, 73.7% idle
    CPU 5:   1.6% user,  0.0% nice, 23.1% system,  4.3% interrupt, 71.0% idle
    CPU 6:   2.4% user,  0.0% nice, 16.9% system,  1.6% interrupt, 79.2% idle
    CPU 7:   7.5% user,  0.0% nice, 52.5% system,  0.8% interrupt, 39.2% idle
    CPU 8:   5.1% user,  0.0% nice, 24.7% system,  1.6% interrupt, 68.6% idle
    CPU 9:   2.7% user,  0.0% nice, 16.9% system,  2.4% interrupt, 78.0% idle
    CPU 10:  3.1% user,  0.0% nice, 16.1% system,  3.1% interrupt, 77.6% idle
    CPU 11:  2.7% user,  0.0% nice, 24.3% system,  2.4% interrupt, 70.6% idle
    CPU 12:  4.7% user,  0.0% nice, 21.2% system,  2.4% interrupt, 71.8% idle
    CPU 13:  2.4% user,  0.0% nice, 22.0% system,  2.0% interrupt, 73.7% idle
    CPU 14:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
    CPU 15:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
    Mem: 77M Active, 39M Inact, 522M Wired, 69M Buf, 15G Free
    Swap: 3979M Total, 3979M Free
    
      PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
     8645 root         17  52    0    70M    20M sigwai   7   8:08 111.07% charon
    26414 root          2  92    0    18M  6380K CPU0     0   6:52  99.59% ntpd
    
