10Gbit Performance Testing

tman222

Hi all,

I recently added a 10Gbit capability to my network and have started doing some basic 10Gbit performance testing using pfSense and two host computers. Wanted to share my experiences up to now below and also gather some feedback on the results from the community here:

Testing setup is as follows:

Host 1: i7 4790K based machine with 32GB RAM, Intel X550 NIC running Debian Linux 9.4
Host 2: i5 7600 based machine with 16GB RAM, Intel X550 NIC, running Windows 10
Switch: Ubiquiti ES-16-XG
pfSense: Supermicro 5018D-F8NT server with 16GB RAM, and additional Chelsio T520-SO-CR SFP+ add-on card.

Host 1 and Host 2 are on separate network networks segments (let's call them VLAN 1 and VLAN 2), and VLAN 1 is allowed to talk to VLAN 2 across the firewall without restrictions. Snort is active on both VLAN 1 and VLAN 2.

Physically, VLAN 1 uses port 1 on the Chelsio SFP+ card and VLAN 2 uses port 2. Connections to the switch are done via optical cables, but from the switch to both hosts standard CAT6 copper cabling is used, with the longest run being maybe close to 75ft.

Results:

The initial set of tests were done using iperf3 between Host 1 and Host 2 (Host 1 is client, Host 2 is server) using standard size ethernet frames (i.e. no jumbo frames). Using a single stream the max transfer speed I could achieve between Host 1 and 2 was just north of 3 Gbit/s. Scaling up the number of parallel streams allowed me to max out the interfaces: With 12-16 parallel streams I'm able to transmit 9.45 - 9.50 Gbit/s between Host 1 and Host 2 across the firewall using iperf3. At that point, however, the firewall's CPU is basically pegged at 100%.

Do these results seem reasonable/expected to you guys? I've tried tweaking the window size and buffer length but I am still not able to get more than about 3 Gbit/s for a single stream. Is there anything else I would tweak (i.e. turning off Snort perhaps)?

What other tests would you recommend for me to run to test performance north of 1Gbit/s?

Thanks in advance for your help and feedback, I appreciate it.

heper

i doubt you will get more out of it. once you have to go through NAT it'll reduce further.
reducing the # of firewall rules is probably the only thing you can do.

stephenw10

Is that 100% load on all cores? try running top aSH to see how the load is spread across the cores, especially in the single stream case.

Steve

jwt

@tman222 said in 10Gbit Performance Testing:

pfSense: Supermicro 5018D-F8NT server with 16GB RAM, and additional Chelsio T520-SO-CR SFP+ add-on card.

...

Host 1 and Host 2 are on separate network networks segments (let’s call them VLAN 1 and VLAN 2), and VLAN 1 is allowed to talk to VLAN 2 across the firewall without restrictions. Snort is active on both VLAN 1 and VLAN 2.

...

[3gbps]

This is exactly why TNSR was written.

If you want to really explore the depth of 'why', decrease the packet size from 1500, into a more realistic, let's say, "Simple IMIX" test.

The reason, as I've explained before, is that per-packet overheads are too high with kernel networking. When you start enabling 'pf', (both filtering and NAT), and Snort, those overheads are even higher.

As the packet size decreases, you need more packets per second to 'fill' the link. At 1500 byte frames, you only need around 880,000 packets per second to fill the link. At the opposite end, with 64-byte frames (the smallest technically allowed), you need around 14,880,000 packets per second to fill the link.

We put linux on a 10C Intel i7-6950X with a dual-port Intel xl710 40Gbps card, and did a similar 3-host test.

One stream: 804 kpps
Four streams: 2.93 Mpps
Eight streams: 5.16 Mpps

We ran TNSR on the same box:
One stream: 14.05 Mpps
Four streams: 32.23 Mpps
Eight streams: 42.60 Mpps (there are some limitations to the xl7x0 Intel cards)

One core on TNSR can forward at 14.05 Mpps. Start turning on ACLs and NAT at things slow down some, but they're still far faster (10X) than kernel networking.

tman222

Thanks guys, I really appreciate all the feedback! I took the suggestions from @jwt and tried to figure out how many packets I could push through my pfSense firewall between host 1 and host 2.

Essentially, I started decreasing the MSS size on iperf3 and increasing the number of parallel streams until I hit a plateau in terms of PPS transferred. There is a cool utility in Linux called BMON that allows you to monitor packets per second transferred alongside bandwidth, and this proved to be quite helpful from my testing.

After doing some additional testing I found that the maximum of packets that I could forward through the firewall between host 1 and host 2 were:

With Snort Enabled on VLAN 1 and VLAN 2: 0.91- 0.94 Mpps
With Snort Disabled on VLAN 1 and VLAN 2: 1.0 - 1.15 Mpps

A those levels, the all CPU cores were pegged at 100%. I have a few additional add-on packages running as well along with NAT of course. I suppose if I disabled those I might see a little higher pps values.

Doing some back of the envelope for different packet sizes then yields:

With Snort: 1500 byte packets: ~11Gbit/s (so limited by interface bandwidth), all the way down to ~474 Mbit/s for 64 byte packets.
Without Snort: 1500 byte packets: ~13Gbit/s (so limited by interface bandwidth), all the way down to ~550 Mbit/s for 64 byte packets.

For simple IMIX - https://en.wikipedia.org/wiki/Internet_Mix - I assume a weighted average of 340 bytes per packet for simplicity:

With Snort: ~2.5Gbit/s
Without Snort: ~3Gbit/s

Do you guys have any thoughts on these results? That is, do they seem reasonable given the specs of my pfSense box?

Thanks again for pointing out to me that what really matters for performance measurement is not just bandwidth, but the amount of packets per second that can be transferred across the firewall.

stephenw10

That certainly seems within the realm of what I'd expect. Especially if that was with all the cores pegged at 100%.
Hard to say with Snort due to the many variables involved.

Steve

tman222

Thanks @stephenw10 - I really appreciate the feedback.

I actually did a bit more research on and came to the conclusion that my results were a little on the low side given the hardware I was using.

I started to think what other pfSense add-on's or services I might have enabled that would be inspecting traffic and thus creating additional overhead. I then realized that besides Snort, I also had ntopng running. Looking into ntopng performance a little, I found this article back from 2016:

https://www.ntop.org/ntopng/best-practices-for-running-ntopng/

This part was especially interesting/insightful:

"Packet capture in ntopng has been designed to be as efficient as possible. We decided to have one processing thread per interface configured in ntopng. Depending on a) the CPU power b) number of hosts/flows, and c) packet capture technology, the number of packets-per-second ntopng can process can change. On a x86 server with PF_RING (non ZC) you can expect to process about 1 Mpps/interface, with PF_RING ZC at least 2/3 Mpps/interface (usually much more but typically not more than 5 Mpps)."

Realizing that ntopng was likely limiting performance, I went ahead and disabled ntopng and ran tests again - here are updated results:

With ntopng Disabled and Snort Enabled on VLAN 1 and VLAN 2: 1.10 - 1.20 Mpps
With ntopng Disabled and Snort Disabled on VLAN 1 and VLAN 2: 1.35 - 1.45 Mpps

Do these results seem reasonable to you guys? I was quite surprised what impact Snort and ntopng have on throughput. I guess these results suggest that with the smallest 64 byte packets my hardware can move about 1Gbit/s across the firewall and bandwidth goes up from there as packet size increases.

One question I do have: I was looking at some vendor routers and came across this model comparison page from Ubiquiti:

https://www.ubnt.com/edgemax/comparison/

Can someone shed light on how it is possible that their routers can move more pps with lesser hardware specs? Does it all just boil down to fast path routing making the performance difference?

Thanks again for your help and advice, I really appreciate it.

heper

because they use asics to offload it all. they don't have to deal with a general purpose kernel getting in the way of need for speed ;)
3yrs ago the pfsense project is/was working on netmap-fwd | ( blogpost) to work around the kernel for routing. This could potentially get true 10gbit routing to pfsense.

unfortunately i don't see it happening soon because it appears they decided to go for the 'tnsr' project to deal with high speeds.

i guess that is the downside when you got mouths to feed ... evolution is delayed because people moneys to survive :D

stephenw10

Yes, we're working on GM devs who can photosynthesize directly to get around that problem.

And also, yes, those units off-load the routing to dedicated hardware. Which is fine until you want to do something that there isn't a path for in the hardware and it goes back to kernel routing destroying the throughput.

Anyway that's all good data, thanks for testing and reporting back.

Steve

tman222

Thanks @heper and @stephenw10 that makes a lot more sense. What did you mean by GM devs?

I guess all this testing brings me to the $64,000 question: What likely path do you guys think will pfSense take above 1 Gbit/s? Will the base code be modified with something like netmap-fwd to allow more pps throughput, or is everything moving towards TNSR? If TNSR, will there be a nice GUI frontend built for it, or could it be possible that an open source or free version of it will be released at some point down the road?

Thanks again for all your help and feedback.

stephenw10

Genetically Modified (sorry!)

jwt

@tman222

Can someone shed light on how it is possible that their routers can move more pps with lesser hardware specs? Does it all just boil down to fast path routing making the performance difference?

The Cavium part has a way to accelerate IPv4 routing, but this quickly falls apart with any type of option processing, shaping, etc.

So, for that Tolly report there were two scenarios:

Without firewall: IP forwarding was enabled, but firewall and connection tracking features were disabled.
With firewall: IP forwarding and firewall features were enabled, but connection tracking features were still disabled.

The result is that any type of traffic shaping or even simple NAT (Masquerade) will take you out of the “fast path”, meaning now you’re running on a relatively slow (500MHz) pair of MIPS cores, (CN5020) and most of the code is so single-threaded, you’re only using one core.

This is why our focus is on VPP for the future. We can blow the doors off (>= 10X performance) the entire market using 64-bit Intel or ARM64 cores.

tman222

Thanks guys, I really appreciate it. Do you guys have any thoughts on the changes that might becoming to pfSense to make the platform work better at 10Gbit/s speeds? Thanks again.

tman222

Hi all,

I thought I would follow up on this topic with some additional testing results. Besides just running an iperf3 test across the firewall I also decided to run a few FLENT (https://flent.org/) tests, which I thought would be a bit more taxing compared to iperf3.

Testing setup is very similar with the exception of Host 2 now also running Debian Linux (i.e. host 1 and host 2 are now both Linux machines);

Test Results:

Flent RRUL Test (Snort Enabled, 4 Test Average): ~10.7 Gbit/s (sum of upload and download)
Flent RRUL Test (Snort Disabled, 4 Test Average): ~13 Gbit/s (sum of upload and download)

Flent RRUL Best Effort Test (Snort Disabled, 8 flows up/down, 1 Test): ~16 Gbit/s (sum of upload and download)
Flent RRUL Best Effort Test (Snort Enabled, 8 flows up/down, 1 Test): ~13.2 Gbit/s (sum of upload and download)
Flent RRUL Best Effort Test (Snort Disabled, 16 flows up/down, 1 Test): ~15.3 Gbit/s (sum of upload and download)

Flent tests send traffic in both directions in an effort to max out an interface/connection, so I thought it would make the most sense to provide the sum. It's interesting to see that these results fall more or less in line with the throughput I had predicted above. During these tests, the firewall's CPU was essentially pegged so I have reason to believe that this is the limit of what the machine can handle. Latencies did spike during these tests, but overall were surprisingly well behaved. As it would be expected, with Snort enabled, average latencies were worse than without.

Do guys have any thoughts on these results?

Thanks again for your feedback.

jwt

During these tests, the firewall's CPU was essentially pegged so I have reason to believe that this is the limit of what the machine can handle.

With kernel networking, yes.

tman222

Also wanted to share my additional discoveries which I posted in this thread:

https://forum.netgate.com/topic/105502/new-tcp-congestion-algorithm-bbr

It's impressive to see how much difference a simple algorithm change can make in terms of performance on a high speed link.

tman222

I wanted to follow up on this thread from a couple years and share some updated 10Gbit performance statistics using the latest version of pfSense (2.5.0 at the time of this writing). Overall, I have to say that I'm quite impressed and seeing 10 - 25% increases in performance (packet throughput) compared to when I posted this back in 2018. The testing setup is essentially the same (i.e. the only change I have made hardware wise is switch out the 2 port Chelsio T520 with its bigger brother, the 4 port Chelsio T540):

Host 1: i7 4790K based machine with 32GB RAM, Intel X550 NIC running Debian Linux 10.8
Host 2: i5 7600 based machine with 16GB RAM, Intel X550 NIC, running Debian Linux 10.8
Switch: Ubiquiti ES-16-XG
pfSense: Supermicro 5018D-F8NT server with 16GB RAM, and additional Chelsio T540-SO-CR SFP+ add-on card.

Host 1 and Host 2 are on separate network networks segments (let's call them VLAN 1 and VLAN 2), and VLAN 1 is allowed to talk to VLAN 2 across the firewall without restrictions. Snort is active on both VLAN 1 and VLAN 2.

Even with Snort enabled, I'm now seeing 1.3 - 1.5 million packets of throughput across the firewall when running a Flent RRUL test. The average is probably closer to 1.35 - 1.40 million packets. I like the Flent RRUL test because it is full duplex, i.e. tests upload and download at the same time (4 parallel RX streams and 4 parallel TX streams, tested for 60 seconds):

https://flent.org/tests.html#the-realtime-response-under-load-rrul-test

Flent RRUL Test Results:

Please ignore the avg ping values, these don't appear to be accurate latency calculations by the test.

One of the top test results:

                             avg       median          
 Ping (ms) ICMP   :         7.14         4.53 ms              
 Ping (ms) UDP BE :       545.44         4.05 ms              
 Ping (ms) UDP BK :       516.01         5.16 ms             
 Ping (ms) UDP EF :       743.00         2.83 ms              
 Ping (ms) avg    :       601.48         4.44 ms              
 TCP download BE  :      1714.08      1775.34 Mbits/s         
 TCP download BK  :      2416.47      2488.45 Mbits/s         
 TCP download CS5 :      2377.02      2407.92 Mbits/s         
 TCP download EF  :      2289.63      2323.54 Mbits/s         
 TCP download avg :      2199.30      2223.70 Mbits/s         
 TCP download sum :      8797.20      8894.63 Mbits/s         
 TCP totals       :     17689.26     17899.85 Mbits/s         
 TCP upload BE    :      2318.72      2407.06 Mbits/s         
 TCP upload BK    :      1867.99      1952.64 Mbits/s         
 TCP upload CS5   :      2375.49      2423.72 Mbits/s         
 TCP upload EF    :      2329.86      2427.93 Mbits/s         
 TCP upload avg   :      2223.01      2255.39 Mbits/s         
 TCP upload sum   :      8892.06      9019.27 Mbits/s

Closer to average:

                             avg       median         
 Ping (ms) ICMP   :         3.02         1.79 ms           
 Ping (ms) UDP BE :       693.13         2.92 ms              
 Ping (ms) UDP BK :       693.84         2.49 ms             
 Ping (ms) UDP EF :       701.94         2.65 ms             
 Ping (ms) avg    :       696.30         2.71 ms              
 TCP download BE  :      1371.36      1323.14 Mbits/s         
 TCP download BK  :      2508.64      2556.88 Mbits/s         
 TCP download CS5 :      2356.60      2475.18 Mbits/s         
 TCP download EF  :      1318.73      1310.40 Mbits/s         
 TCP download avg :      1888.83      1929.42 Mbits/s         
 TCP download sum :      7555.33      7717.70 Mbits/s         
 TCP totals       :     16483.28     16708.92 Mbits/s         
 TCP upload BE    :      1882.59      2066.45 Mbits/s         
 TCP upload BK    :      2427.06      2475.99 Mbits/s         
 TCP upload CS5   :      2195.50      2269.01 Mbits/s         
 TCP upload EF    :      2422.80      2523.43 Mbits/s         
 TCP upload avg   :      2231.99      2265.72 Mbits/s         
 TCP upload sum   :      8927.95      9062.89 Mbits/s

Overall, I'm very happy with these results and it shows me that there is a lot of life left in this Xeon-D hardware (purchased back in 2017), if / when multi-gigabit internet service is available.