10Gbit Performance Testing
-
Because they use ASICs to offload it all. They don't have to deal with a general-purpose kernel getting in the way of the need for speed ;)
Three years ago the pfSense project was working on netmap-fwd (blog post) to work around the kernel for routing. This could potentially bring true 10Gbit routing to pfSense. Unfortunately, I don't see it happening soon, because it appears they decided to go with the 'TNSR' project to deal with high speeds.
I guess that is the downside when you've got mouths to feed... evolution is delayed because people need money to survive :D
-
Yes, we're working on GM devs who can photosynthesize directly to get around that problem.
And also, yes, those units offload the routing to dedicated hardware. Which is fine until you want to do something there isn't a path for in the hardware, and it falls back to kernel routing, destroying the throughput.
Anyway that's all good data, thanks for testing and reporting back.
Steve
-
Thanks @heper and @stephenw10, that makes a lot more sense. What did you mean by GM devs?
I guess all this testing brings me to the $64,000 question: what path do you guys think pfSense will take above 1 Gbit/s? Will the base code be modified with something like netmap-fwd to allow higher pps throughput, or is everything moving towards TNSR? If TNSR, will there be a nice GUI frontend built for it, or could an open source or free version of it be released at some point down the road?
Thanks again for all your help and feedback.
-
Genetically Modified (sorry!)
-
Can someone shed light on how it is possible that their routers can move more pps with lower hardware specs? Does it all just boil down to fast-path routing making the performance difference?
The Cavium part has a way to accelerate IPv4 routing, but this quickly falls apart with any type of option processing, shaping, etc.
So, for that Tolly report there were two scenarios:
- Without firewall: IP forwarding was enabled, but firewall and connection tracking features were disabled.
- With firewall: IP forwarding and firewall features were enabled, but connection tracking features were still disabled.
The result is that any type of traffic shaping or even simple NAT (Masquerade) will take you out of the "fast path", meaning you're now running on a relatively slow (500 MHz) pair of MIPS cores (CN5020), and most of the code is so single-threaded that you're only using one core.
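To put rough numbers on that slow path, here's a back-of-envelope sketch in Python; the cycles-per-packet figure is an assumption for illustration, not a measured value:

# Rough estimate of what a single slow core can forward in software
# once traffic falls off the hardware fast path.
CORE_HZ = 500e6          # one CN5020 MIPS core at 500 MHz
CYCLES_PER_PKT = 2000    # assumed per-packet forwarding cost (illustrative guess)

slow_path_pps = CORE_HZ / CYCLES_PER_PKT
print(f"slow path: ~{slow_path_pps / 1e6:.2f} Mpps")   # ~0.25 Mpps

# For comparison, 10GbE needs ~0.81 Mpps even at 1500-byte packets
# (1538 bytes on the wire including Ethernet framing), and ~14.88 Mpps
# at minimum-size frames. A single slow core can't come close.
line_rate_pps = 10e9 / (1538 * 8)
print(f"10GbE @ 1500B: ~{line_rate_pps / 1e6:.2f} Mpps")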
This is why our focus is on VPP for the future. We can blow the doors off the entire market (>= 10X performance) using 64-bit Intel or ARM64 cores.
-
Thanks guys, I really appreciate it. Do you guys have any thoughts on the changes that might be coming to pfSense to make the platform work better at 10Gbit/s speeds? Thanks again.
-
Hi all,
I thought I would follow up on this topic with some additional testing results. Besides just running an iperf3 test across the firewall, I also decided to run a few Flent (https://flent.org/) tests, which I thought would be a bit more taxing than iperf3.
The testing setup is very similar, with the exception that Host 2 is now also running Debian Linux (i.e. Host 1 and Host 2 are now both Linux machines).
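For anyone wanting to reproduce this, the runs were along the lines of the following (the hostname is a placeholder, and flent drives netperf under the hood, so netserver must be running on the target host):

# standard RRUL test: 4 TCP flows up, 4 down, plus latency probes
flent rrul -l 60 -H host2.example.net

# best-effort variant used for the "Best Effort" results below;
# check 'flent --list-tests' for the exact test names and the
# parameters that control flow counts
flent rrul_be -l 60 -H host2.example.net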
Test Results:
Flent RRUL Test (Snort Enabled, 4 Test Average): ~10.7 Gbit/s (sum of upload and download)
Flent RRUL Test (Snort Disabled, 4 Test Average): ~13 Gbit/s (sum of upload and download)
Flent RRUL Best Effort Test (Snort Disabled, 8 flows up/down, 1 Test): ~16 Gbit/s (sum of upload and download)
Flent RRUL Best Effort Test (Snort Enabled, 8 flows up/down, 1 Test): ~13.2 Gbit/s (sum of upload and download)
Flent RRUL Best Effort Test (Snort Disabled, 16 flows up/down, 1 Test): ~15.3 Gbit/s (sum of upload and download)
Flent tests send traffic in both directions in an effort to max out an interface/connection, so I thought it would make the most sense to provide the sum. It's interesting to see that these results fall more or less in line with the throughput I had predicted above. During these tests, the firewall's CPU was essentially pegged, so I have reason to believe that this is the limit of what the machine can handle. Latencies did spike during these tests, but overall were surprisingly well behaved. As would be expected, average latencies were worse with Snort enabled than without.
Do you guys have any thoughts on these results?
Thanks again for your feedback.
-
During these tests, the firewall's CPU was essentially pegged, so I have reason to believe that this is the limit of what the machine can handle.
With kernel networking, yes.
-
I also wanted to share some additional discoveries, which I posted in this thread:
https://forum.netgate.com/topic/105502/new-tcp-congestion-algorithm-bbr
It's impressive to see how much difference a simple algorithm change can make in terms of performance on a high-speed link.
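For reference, on the Debian hosts this is just a sysctl change (this is my understanding of the host-side tweak from that thread, not pfSense-side configuration; BBR needs kernel 4.9 or newer, and the fq qdisc is recommended for its pacing):

# /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/)
net.core.default_qdisc=fq
net.ipv4.tcp_congestion_control=bbr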
-
I wanted to follow up on this thread from a couple of years ago and share some updated 10Gbit performance statistics using the latest version of pfSense (2.5.0 at the time of this writing). Overall, I have to say that I'm quite impressed: I'm seeing 10-25% increases in performance (packet throughput) compared to when I posted this back in 2018. The testing setup is essentially the same (i.e. the only hardware change I have made is swapping out the 2-port Chelsio T520 for its bigger brother, the 4-port Chelsio T540):
- Host 1: i7 4790K based machine with 32GB RAM, Intel X550 NIC running Debian Linux 10.8
- Host 2: i5 7600 based machine with 16GB RAM, Intel X550 NIC, running Debian Linux 10.8
- Switch: Ubiquiti ES-16-XG
- pfSense: Supermicro 5018D-F8NT server with 16GB RAM and an additional Chelsio T540-SO-CR SFP+ add-on card.
Host 1 and Host 2 are on separate network segments (let's call them VLAN 1 and VLAN 2), and VLAN 1 is allowed to talk to VLAN 2 across the firewall without restrictions. Snort is active on both VLAN 1 and VLAN 2.
Even with Snort enabled, I'm now seeing 1.3-1.5 million packets per second of throughput across the firewall when running a Flent RRUL test. The average is probably closer to 1.35-1.40 million. I like the Flent RRUL test because it is full duplex, i.e. it tests upload and download at the same time (4 parallel RX streams and 4 parallel TX streams, tested for 60 seconds):
https://flent.org/tests.html#the-realtime-response-under-load-rrul-test
Flent RRUL Test Results:
Please ignore the avg ping values; these don't appear to be accurate latency calculations by the test.
One of the top test results:
                         avg       median
 Ping (ms) ICMP   :     7.14         4.53 ms
 Ping (ms) UDP BE :   545.44         4.05 ms
 Ping (ms) UDP BK :   516.01         5.16 ms
 Ping (ms) UDP EF :   743.00         2.83 ms
 Ping (ms) avg    :   601.48         4.44 ms
 TCP download BE  :  1714.08      1775.34 Mbits/s
 TCP download BK  :  2416.47      2488.45 Mbits/s
 TCP download CS5 :  2377.02      2407.92 Mbits/s
 TCP download EF  :  2289.63      2323.54 Mbits/s
 TCP download avg :  2199.30      2223.70 Mbits/s
 TCP download sum :  8797.20      8894.63 Mbits/s
 TCP totals       : 17689.26     17899.85 Mbits/s
 TCP upload BE    :  2318.72      2407.06 Mbits/s
 TCP upload BK    :  1867.99      1952.64 Mbits/s
 TCP upload CS5   :  2375.49      2423.72 Mbits/s
 TCP upload EF    :  2329.86      2427.93 Mbits/s
 TCP upload avg   :  2223.01      2255.39 Mbits/s
 TCP upload sum   :  8892.06      9019.27 Mbits/s
Closer to average:
                         avg       median
 Ping (ms) ICMP   :     3.02         1.79 ms
 Ping (ms) UDP BE :   693.13         2.92 ms
 Ping (ms) UDP BK :   693.84         2.49 ms
 Ping (ms) UDP EF :   701.94         2.65 ms
 Ping (ms) avg    :   696.30         2.71 ms
 TCP download BE  :  1371.36      1323.14 Mbits/s
 TCP download BK  :  2508.64      2556.88 Mbits/s
 TCP download CS5 :  2356.60      2475.18 Mbits/s
 TCP download EF  :  1318.73      1310.40 Mbits/s
 TCP download avg :  1888.83      1929.42 Mbits/s
 TCP download sum :  7555.33      7717.70 Mbits/s
 TCP totals       : 16483.28     16708.92 Mbits/s
 TCP upload BE    :  1882.59      2066.45 Mbits/s
 TCP upload BK    :  2427.06      2475.99 Mbits/s
 TCP upload CS5   :  2195.50      2269.01 Mbits/s
 TCP upload EF    :  2422.80      2523.43 Mbits/s
 TCP upload avg   :  2231.99      2265.72 Mbits/s
 TCP upload sum   :  8927.95      9062.89 Mbits/s
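As a sanity check, those packet rates line up with the TCP totals above if you assume mostly MTU-sized packets; here's a quick back-of-envelope, where the 1500-byte packet size is an assumption:

# Convert Flent's "TCP totals" throughput into an approximate packet rate,
# assuming near-MTU (1500-byte) packets.
def approx_mpps(total_mbit_s: float, pkt_bytes: int = 1500) -> float:
    return total_mbit_s * 1e6 / (pkt_bytes * 8) / 1e6

print(approx_mpps(17689.26))  # top run: ~1.47 Mpps
print(approx_mpps(16483.28))  # closer-to-average run: ~1.37 Mpps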
Overall, I'm very happy with these results; they show me that there is a lot of life left in this Xeon-D hardware (purchased back in 2017) if/when multi-gigabit internet service becomes available.