Poor x710 / ixl network performance
-
I recently upgraded to new router hardware. The new hardware is very similar to the Netgate 1541 (my system is a Supermicro 5019D-4C-FN8TP) with a couple of minor exceptions:
- 4 core Xeon instead of 8 core
- 4x x710 10GbE ports instead of 2x
- 4x i350 1GbE ports instead of 2x
The new router has 16GB of DDR4 ECC memory. The status dashboard shows:
State Table size: 0%
MBUF Usage 1% (10850/1000000)
CPU usage: 1%
Memory usage: 8%
SWAP usage: 0%
I started with pfSense CE 2.6 and then upgraded to pfSense Plus 22.01. Everything is working fine, and the settings from my previous pfSense router imported without issue.
So what's the problem? I was testing connection speeds to/from the pfSense router with iperf3 and do not get 10GbE speeds to other 10GbE devices on the network (two Linux servers). All are plugged into the same 10GbE switch, and the two Linux boxes can run iperf3 against each other as client/server at full 10GbE speed. However, if I make the pfSense system the client or the server, iperf3 performance drops to ~3.8 Gbit/sec.
Doing some testing, I unchecked "Disable hardware TCP segmentation offload" and "Disable hardware large receive offload" to see if that improved things (yes, I know that is not recommended for a router; I tried it for testing only). After doing this, with pfSense as an iperf3 client connecting to a Linux iperf3 server, I got full 10GbE performance. However, the reverse test still left me at ~3.8 Gbit/sec.
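For anyone reproducing this, the same offload toggles can also be flipped temporarily from a pfSense/FreeBSD shell (the interface name here is an example; GUI settings reassert themselves when the interface is reconfigured):

```shell
# Re-enable TSO and LRO on the LAN NIC for testing (ixl1 is an example name):
ifconfig ixl1 tso4 tso6 lro

# Revert to the router-recommended state (offloads disabled):
ifconfig ixl1 -tso4 -tso6 -lro

# The current flags appear in the "options=" line of the output:
ifconfig ixl1 | grep options
```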
I have the following tunables set per the documentation:
kern.ipc.nmbclusters="1000000"
kern.ipc.nmbjumbop="524288"
hw.intr_storm_threshold="10000"
The LAN interface shows this configuration:
ixl1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
description: LAN
options=e100bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,RXCSUM_IPV6,
TXCSUM_IPV6>
ether xx:xx:xx:xx:xx:xx
inet6 xxxx::xxxx:xxxx:xxxx:xxxx%ixl1 prefixlen 64 scopeid 0x6
inet 192.168.1.1 netmask 0xfffff000 broadcast 192.168.15.255
inet 10.10.10.1 netmask 0xffffffff broadcast 10.10.10.1
media: Ethernet autoselect (10Gbase-T <full-duplex>)
status: active
nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
Driver / Firmware info:
dev.ixl.1.fw_version: fw 3.1.57069 api 1.5 nvm 3.33 etid 80001006 oem 1.262.0
dev.ixl.1.iflib.driver_version: 2.3.1-k
It seems to me the pfSense 10GbE ports should run at 10GbE when testing with iperf3. What am I missing?
-
It has been stated many times here that pfSense is not designed to be a TCP endpoint. It is designed to be a router. When you make it an iperf3 endpoint (either client or server), it incurs additional overhead that consumes CPU cycles. Those cycles then can't be used to help move traffic around.
If you search the forum for iperf3 tests, you will find several posts from even the last 12 months where users experience what you describe, but then test "through" pfSense and speeds are fine. And the responses from the Netgate team have always been the same -- pfSense is optimized for routing, not for use as an endpoint.
So set your pfSense box up as a router and run iperf3 tests between the two Linux boxes, but through pfSense, and you should get full speeds. In your setup, this might require moving one of the Linux boxes to a different port on pfSense and giving it a temporary new IP address so pfSense can route between the two boxes.
-
Thanks for the reply and testing tips for routing. I will try that once I get an additional SFP.
However, I did search for information on this before I posted, and while I found the "it's a router, not an endpoint" responses, there were really no details as to why this is the case. I was looking for some deeper discussion of why "router" is a special case that prevents a standard testing tool like iperf3 from showing true performance. In my use case, the hardware is way oversized for the job and gets nowhere near its limits during iperf3 testing, AFAICT.
My previous pfSense router had 1GbE interfaces and ran at wire speed as an iperf3 client or server (Intel i5-3470, max CPU usage 31%). The new pfSense hardware has 10GbE interfaces but can only manage ~3.8 Gbit/sec as an iperf3 client or server (Xeon D-2123IT, max CPU usage 32%). I do find it interesting that both routers seemed to max out around 32% CPU utilization, and in the case of the 10GbE router, that additional CPU resources weren't used. Too many incoming IRQs? A single-threaded process preventing faster processing?
This is not a complaint, pfsense is fantastic and I am fortunate to be able to run it as my router. I’m just curious as to what’s going on under the covers.
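One way to probe the single-thread question above is to watch per-core load during a run and to launch several independent iperf3 processes instead of one. This is only a sketch: the port numbers are arbitrary, and 192.168.1.1 is the LAN address from this thread.

```shell
# Watch per-core load during a test (FreeBSD top; -P shows each core separately):
top -P

# On the server side, start several independent iperf3 servers on different
# ports, so each one is its own process and can land on its own core:
iperf3 -s -p 5201 &
iperf3 -s -p 5202 &
iperf3 -s -p 5203 &

# On the client side, run one client per server port in parallel:
iperf3 -c 192.168.1.1 -p 5201 -t 30 &
iperf3 -c 192.168.1.1 -p 5202 -t 30 &
iperf3 -c 192.168.1.1 -p 5203 -t 30 &
wait
```

If the aggregate of the parallel runs is much higher than a single run, a per-process (single-core) limit is the likely ceiling.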
-
@ghostnet said in Poor x710 / ixl network performance:
Thanks for the reply and testing tips for routing. I will try that once I get an additional SFP.
However, I did search for information on this before I posted, and while I found the "it's a router, not an endpoint" responses, there were really no details as to why this is the case. I was looking for some deeper discussion of why "router" is a special case that prevents a standard testing tool like iperf3 from showing true performance. In my use case, the hardware is way oversized for the job and gets nowhere near its limits during iperf3 testing, AFAICT.
My previous pfSense router had 1GbE interfaces and ran at wire speed as an iperf3 client or server (Intel i5-3470, max CPU usage 31%). The new pfSense hardware has 10GbE interfaces but can only manage ~3.8 Gbit/sec as an iperf3 client or server (Xeon D-2123IT, max CPU usage 32%). I do find it interesting that both routers seemed to max out around 32% CPU utilization, and in the case of the 10GbE router, that additional CPU resources weren't used. Too many incoming IRQs? A single-threaded process preventing faster processing?
This is not a complaint, pfsense is fantastic and I am fortunate to be able to run it as my router. I’m just curious as to what’s going on under the covers.
10GbE requires quite a bit more processing power than 1GbE. And you are correct to surmise that threading comes into play much more once you surpass 1GbE rates. And how well threading works is dependent on the NIC, whether or not RSS is optimized in the kernel for that NIC, and exactly what type of traffic is flowing. A large elephant flow (meaning speed tests using a single network flow) will hobble any attempt at multithreading as the traffic from a given network flow has to be processed by the same CPU core.
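To see how well flows are spreading across a NIC's receive queues on FreeBSD/pfSense, one can watch the per-queue interrupt counters. The `ixl` name matches this thread's hardware; queue naming varies by driver.

```shell
# Per-queue interrupt counts for the ixl interfaces; run before and
# after a test and compare the deltas:
vmstat -i | grep ixl

# An elephant flow (single stream) will show one rxq counter climbing
# while the others stay nearly idle; many distinct flows should spread
# interrupts across all the queues via RSS.
```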
Running iperf3 with 1GbE hardware can hide more "sins". Those get exposed at 10GbE speeds and higher, though. Also, by default, iperf3 runs as a single-threaded process. This is true even when using multiple parallel streams: all the streams from a given iperf3 process use the same thread and thus the same CPU core. Here is a link explaining this, and also showing a way to launch multiple iperf3 processes to obtain multithreaded operation: https://fasterdata.es.net/performance-testing/network-troubleshooting-tools/iperf/multi-stream-iperf3/.
-
@ghostnet said in Poor x710 / ixl network performance:
Supermicro 5019D-4C-FN8TP
I have the same processor as you with a different motherboard; mine is an X11SDW-4C-TP13F, for reference. My NIC chipset is an X722, but that shouldn't make much difference.
I had problems getting full line speed myself until I made the following changes:
Uncheck Disable Hardware TCP Segmentation Offloading
Uncheck Disable Hardware Large Receive Offloading
Check/Enable Kernel PTI
Set MDS Mode to Mitigation Disabled
Before those changes I was getting around 5-6 Gbit/sec; after, well, see for yourself:
[ ID] Interval           Transfer     Bitrate         Retr
[  5] 0.00-60.00  sec  16.3 GBytes  2.33 Gbits/sec  654   sender
[  5] 0.00-60.11  sec  16.3 GBytes  2.33 Gbits/sec        receiver
[  7] 0.00-60.00  sec  5.24 GBytes   750 Mbits/sec  231   sender
[  7] 0.00-60.11  sec  5.23 GBytes   748 Mbits/sec        receiver
[  9] 0.00-60.00  sec  6.96 GBytes   996 Mbits/sec  243   sender
[  9] 0.00-60.11  sec  6.96 GBytes   994 Mbits/sec        receiver
[ 11] 0.00-60.00  sec  5.05 GBytes   723 Mbits/sec  707   sender
[ 11] 0.00-60.11  sec  5.05 GBytes   721 Mbits/sec        receiver
[ 13] 0.00-60.00  sec  6.09 GBytes   872 Mbits/sec  357   sender
[ 13] 0.00-60.11  sec  6.09 GBytes   870 Mbits/sec        receiver
[ 15] 0.00-60.00  sec  8.27 GBytes  1.18 Gbits/sec  373   sender
[ 15] 0.00-60.11  sec  8.27 GBytes  1.18 Gbits/sec        receiver
[ 17] 0.00-60.00  sec  9.46 GBytes  1.35 Gbits/sec  231   sender
[ 17] 0.00-60.11  sec  9.46 GBytes  1.35 Gbits/sec        receiver
[ 19] 0.00-60.00  sec  8.15 GBytes  1.17 Gbits/sec  350   sender
[ 19] 0.00-60.11  sec  8.15 GBytes  1.17 Gbits/sec        receiver
[SUM] 0.00-60.00  sec  65.5 GBytes  9.38 Gbits/sec  3146  sender
[SUM] 0.00-60.11  sec  65.5 GBytes  9.36 Gbits/sec        receiver
During the run here is my CPU usage:
CPU: 0.9% user, 0.0% nice, 14.4% system, 0.0% interrupt, 84.7% idle
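The mitigation and offload settings described in this post can be verified from a shell. These are the stock FreeBSD sysctl names, which I believe pfSense inherits -- worth double-checking on your own build:

```shell
# Kernel page-table isolation (Meltdown mitigation) state:
sysctl vm.pmap.pti

# MDS mitigation state (0 corresponds to mitigation disabled):
sysctl hw.mds_disable

# Confirm TSO/LRO are active again on the NIC (ixl1 is an example name):
ifconfig ixl1 | grep -E "TSO|LRO"
```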
-
Hi all,
All are plugged into the same 10GbE switch and the two Linux boxes can client / server using iperf3 at 10GbE speeds.
As @bmeeks answered, test it through the pfSense box, not from or to it.
Routers connect one or more networks, and firewalls separate one or more networks using rules. Depending on the rule set, performance narrows down step by step with every new rule, and any installed service (package) such as Snort will use more CPU (and electrical power) to hold that line speed.
On top of all that, the switch is presumably forwarding at layer 2, which is faster than routing at layer 3 -- please don't forget that.
Tweaking or tuning adapters depends heavily on the kind of network traffic involved. Sometimes you succeed by lowering numbers (mbuf) or disabling something, and sometimes by raising those numbers or enabling something else to get the result you want. So it is not easy to say "do this or that"; it depends on each individual use case, as I have often found here in the forum.
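A minimal sketch of the through-pfSense test recommended in this thread. The interface assignments and addresses are hypothetical; the point is that the two Linux boxes sit on different pfSense interfaces, so traffic is routed rather than terminated on the firewall:

```shell
# On Linux box A (e.g. 192.168.1.10 on the pfSense LAN), start a server:
iperf3 -s

# On Linux box B, moved to a second pfSense port (e.g. OPT1 with a
# temporary address such as 192.168.2.10), run the client against box A
# so every packet crosses pfSense as routed traffic:
iperf3 -c 192.168.1.10 -t 30

# A firewall rule on the OPT1 interface must permit the traffic;
# pfSense now forwards packets instead of acting as a TCP endpoint.
```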