Solved - 10G link, 1G speeds
-
It would seem that any sort of tuning actually makes it run slower. I have seen no marked improvement over the pfSense defaults.
It seems I need to get better hardware that is more suited to the task. The good news is a 40G card is on the way, along with a 40G switch (6 ports).
I am curious to see what kind of hurt I can put on this box with 40G gear.
You can mark this thread closed; I am moving on to more important things. I will open a new one when the 40G gear gets here and I have a chance to tinker.
-
So I have an update. The 40G NIC from Mellanox performs wonderfully on vanilla FreeBSD and Linux; however, I see the same performance with pfSense that I was getting with the 10G NICs. I would like to know what the differences are from the raw BSD kernel.
I really love pfSense; it makes otherwise complicated tasks easy. But these performance issues should be addressed.
-
@johnkeates:
Before you throw it all out: try polling. This isn't always the solution, but if you are starving due to interrupts, polling might solve some of it.
I am not familiar with that. Any good places to start?
Edit:
ifconfig mlxen0 polling
Client connecting to ..., TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 110 MBytes 922 Mbits/sec
[ 4] 0.0- 1.0 sec 64.6 MBytes 542 Mbits/sec
[ 5] 0.0- 1.0 sec 53.2 MBytes 447 Mbits/sec
[SUM] 0.0- 1.0 sec 228 MBytes 1.91 Gbits/sec
[ 3] 1.0- 2.0 sec 110 MBytes 925 Mbits/sec
[ 5] 1.0- 2.0 sec 57.4 MBytes 481 Mbits/sec
[ 4] 1.0- 2.0 sec 56.5 MBytes 474 Mbits/sec
[SUM] 1.0- 2.0 sec 224 MBytes 1.88 Gbits/sec
[ 3] 2.0- 3.0 sec 112 MBytes 936 Mbits/sec
[ 4] 2.0- 3.0 sec 54.5 MBytes 457 Mbits/sec
[ 5] 2.0- 3.0 sec 59.9 MBytes 502 Mbits/sec
[SUM] 2.0- 3.0 sec 226 MBytes 1.90 Gbits/sec
[ 4] 3.0- 4.0 sec 52.8 MBytes 442 Mbits/sec
[ 3] 3.0- 4.0 sec 113 MBytes 948 Mbits/sec
[ 5] 3.0- 4.0 sec 62.1 MBytes 521 Mbits/sec
[SUM] 3.0- 4.0 sec 228 MBytes 1.91 Gbits/sec
ifconfig mlxen0 -polling
------------------------------------------------------------
Client connecting to ..., TCP port 5001
TCP window size: 85.0 KByte (default)
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 108 MBytes 905 Mbits/sec
[ 5] 0.0- 1.0 sec 109 MBytes 915 Mbits/sec
[ 4] 0.0- 1.0 sec 107 MBytes 898 Mbits/sec
[SUM] 0.0- 1.0 sec 324 MBytes 2.72 Gbits/sec
[ 5] 1.0- 2.0 sec 108 MBytes 904 Mbits/sec
[ 4] 1.0- 2.0 sec 107 MBytes 898 Mbits/sec
[ 3] 1.0- 2.0 sec 107 MBytes 901 Mbits/sec
[SUM] 1.0- 2.0 sec 322 MBytes 2.70 Gbits/sec
[ 5] 2.0- 3.0 sec 108 MBytes 910 Mbits/sec
[ 4] 2.0- 3.0 sec 107 MBytes 900 Mbits/sec
[ 3] 2.0- 3.0 sec 108 MBytes 906 Mbits/sec
[SUM] 2.0- 3.0 sec 324 MBytes 2.72 Gbits/sec
-
So I have an update. The 40G NIC from Mellanox performs wonderfully on vanilla FreeBSD and Linux; however, I see the same performance with pfSense that I was getting with the 10G NICs. I would like to know what the differences are from the raw BSD kernel.
pfSense runs pf (packet filter) and NAT as additional stages in the packet path, and that work is not done on a plain FreeBSD or Linux install. If you want to compare the two, that is the most likely explanation. On top of that, it can also depend on the hardware in use: a high-clocked Xeon E3 (3.7 GHz, 4C/8T) will likely push more throughput than a C2758-based machine.
I really love pfSense; it makes otherwise complicated tasks easy. But these performance issues should be addressed.
Get hardware with more horsepower, or better-specced CPUs (and RAM), and there will be nothing left to address.
-
To debug this a bit more, try setting up pfSense as a test with no NAT enabled. At the same time, disable pf in the advanced settings. With that done, try an iperf test again. If we're gonna figure out why this is happening, we're gonna need to start excluding stuff.
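For reference, a minimal way to do the pf part from the shell (a sketch only; the iperf invocation matches the iperf2 output posted earlier in this thread, and the server address is a placeholder):
pfctl -d                         # disable pf (same effect as the GUI checkbox)
iperf -c <server-ip> -P 3 -i 1   # three parallel TCP streams, 1-second intervals
pfctl -e                         # re-enable pf afterwards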
On the other hand, if you need this to work, you might be better off buying support at Netgate since they build pfSense.
-
I agree with your point, and these are not complaints. If I wanted this to just work, I would stick with Fedora. However, I'm just trying to get to the bottom of what appears to be a pfSense-specific issue. With pfctl -d I still only get around 5G and high CPU usage / interrupts. Are there settings that I am missing? This is a clean install with default settings.
On FreeBSD and Linux there is almost no CPU utilization, as it's mostly offloaded to the NIC. However, I'm not seeing this reflected in the pfSense build.
Thanks, all, for your input and time.
-
@BlueKobold:
So I have an update. The 40G NIC from Mellanox performs wonderfully on vanilla FreeBSD and Linux; however, I see the same performance with pfSense that I was getting with the 10G NICs. I would like to know what the differences are from the raw BSD kernel.
pfSense runs pf (packet filter) and NAT as additional stages in the packet path, and that work is not done on a plain FreeBSD or Linux install. If you want to compare the two, that is the most likely explanation. On top of that, it can also depend on the hardware in use: a high-clocked Xeon E3 (3.7 GHz, 4C/8T) will likely push more throughput than a C2758-based machine.
I'm only routing packets, no NAT. Also, with pf fully disabled I still get very high utilization numbers.
I really love pfSense; it makes otherwise complicated tasks easy. But these performance issues should be addressed.
Get hardware with more horsepower, or better-specced CPUs (and RAM), and there will be nothing left to address.
There isn't really a need for better equipment; it works fine with other options.
-
Have you tried to run VyOS on your hardware? With basic NAT and firewalling enabled it will allow you to assess what your hardware is really capable of as a basic gateway/firewall.
-
Hmm, next would probably be comparing sysctl output (I guess just getting both sysctl outputs and running a diff on them will do), and perhaps kernel/driver build configs (again, a diff should suffice).
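A minimal sketch of that comparison (the filenames are arbitrary):
sysctl -a > /tmp/freebsd-sysctl.txt    # run on the stock FreeBSD box
sysctl -a > /tmp/pfsense-sysctl.txt    # run on the pfSense box
diff -u /tmp/freebsd-sysctl.txt /tmp/pfsense-sysctl.txt | less   # after copying both to one machine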
-
There are some cheap ways to increase the throughput.
1. Increase MTU
If you are lucky you can use jumbo frames throughout your environment; with an MTU of 9000 (the maximum usable in VMware) instead of 1500, this can give up to a factor of 6 in throughput (see the ifconfig sketch after this post). However, if you talk to the outside world you are likely to create a bottleneck due to the need to fragment.
2. Packet rates
For high packet rates with small packets this will not help. There is a limit in FreeBSD's packet processing which may be lower than in other network stacks; compare, for example:
http://rhelblog.redhat.com/2015/09/29/pushing-the-limits-of-kernel-networking/
A good source is the BSD Router Project:
https://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr
They also give figures for pf.
3. Real-world examples
Always remember to measure through the device: [PC1] --> [pfSense system] --> [PC2]
I can give some real-world examples: ESXi guests with 8 CPUs (2.6 GHz) allow pushing 5 Gbit/s with MTU 1500, so I assume real hardware should be able to achieve higher throughput.
The main problem seems to be the high interrupt rate.
I did some measurements on an X710 40 Gbit/s card (8 CPUs, > 2 GHz) and was able to reach throughputs around 12.3 Gbit/s.
As far as I have heard, the limit with commodity hardware seems to be around 26 Gbit/s:
https://www.ntop.org/products/packet-capture/pf_ring/pf_ring-zc-zero-copy/
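To illustrate item 1 only (a sketch, not a recommendation for every setup; mlxen0 is the interface name used earlier in this thread, and every device on the segment must agree on the MTU):
ifconfig mlxen0 mtu 9000               # one-off change from the shell
# on plain FreeBSD, persist it in /etc/rc.conf:
# ifconfig_mlxen0="up mtu 9000"
# on pfSense, set the MTU field on the interface's configuration page instead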
-
There are some cheap ways to increase the throughput.
1. Increase MTU
If you are lucky you can use jumbo frames throughout your environment; with an MTU of 9000 (the maximum usable in VMware) instead of 1500, this can give up to a factor of 6 in throughput. However, if you talk to the outside world you are likely to create a bottleneck due to the need to fragment.
2. Packet rates
For high packet rates with small packets this will not help. There is a limit in FreeBSD's packet processing which may be lower than in other network stacks; compare, for example:
http://rhelblog.redhat.com/2015/09/29/pushing-the-limits-of-kernel-networking/
A good source is the BSD Router Project:
https://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr
They also give figures for pf.
3. Real-world examples
Always remember to measure through the device: [PC1] --> [pfSense system] --> [PC2]
I can give some real-world examples: ESXi guests with 8 CPUs (2.6 GHz) allow pushing 5 Gbit/s with MTU 1500, so I assume real hardware should be able to achieve higher throughput.
The main problem seems to be the high interrupt rate.
I did some measurements on an X710 40 Gbit/s card (8 CPUs, > 2 GHz) and was able to reach throughputs around 12.3 Gbit/s.
As far as I have heard, the limit with commodity hardware seems to be around 26 Gbit/s:
https://www.ntop.org/products/packet-capture/pf_ring/pf_ring-zc-zero-copy/
The 'problem' isn't in FreeBSD. He tried a plain FreeBSD install and it works fine there. It is in some difference between the settings in pfSense and FreeBSD, probably pf config, interface config, kernel config or sysctl changes.
-
I am not sure I understand the problem correctly.
Your setup looks like this:
[System 1 (network 1)] --> [Device under test] --> [System 2 (network 2)]
Right?
You use a FreeBSD system as router/firewall and achieve higher throughput than with pfSense?
If this is the case you should check all network settings, drivers, sysctls, etc.; maybe there is a setting which is not identical. Using those settings should then lead to higher throughput.
If you are just measuring speed via iperf3 to the pfSense system itself, hardware acceleration makes a huge difference, although it is not recommended for a system doing routing. Check the flags (LRO, TSO, etc., to name a few options which can give huge differences); changes usually also need a reboot to take effect.
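A minimal sketch of checking and toggling those flags from the shell, assuming the Mellanox interface name used earlier in the thread (mlxen0); the available flags differ per driver:
ifconfig -m mlxen0          # the capabilities= line lists what the driver supports
ifconfig mlxen0 -lro -tso   # turn LRO/TSO off (often preferred when forwarding)
ifconfig mlxen0 lro tso     # turn them back on for host-terminated iperf tests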
-
The 'problem' isn't in FreeBSD. He tried a plain FreeBSD install and it works fine there. It is in some difference between the settings in pfSense and FreeBSD, probably pf config, interface config, kernel config or sysctl changes.
I am pretty sure that pfSense is not just something on top of FreeBSD; since version 2.2.x it has been more and more a special or custom build based on the original kernel, with many, many changes.
If the Netgate team or the pfSense team was able to push ~40 Gbit/s over an IPsec tunnel using an Intel QAT card, and that card has no network ports of its own, then in my opinion pfSense itself must also be able to handle that speed, given ports that support and allow that full throughput.
-
I will pull the defaults from FreeBSD. I’m confident pfSense is fully capable of what I’m looking for. I’m just missing something.
It is looking like an offload issue, as in seemingly nothing is offloaded to the NIC. I have tried 3 different cards (Intel X520, Chelsio T5, Mellanox X3 40G), all with nearly identical results. The limit of this gear with no offloads would seem to be around 4G.
On a recent Linux kernel (Fedora 26) there is almost no CPU load, as it's all being done on the card.
Thanks for the continued help and interest in this post. Yet another reason to push forward with pfSense. This is a great community.
-
There are some cheap ways to increase the throughput.
1. Increase MTU
If you are lucky you can use jumbo frames throughout your environment; with an MTU of 9000 (the maximum usable in VMware) instead of 1500, this can give up to a factor of 6 in throughput. However, if you talk to the outside world you are likely to create a bottleneck due to the need to fragment.
2. Packet rates
For high packet rates with small packets this will not help. There is a limit in FreeBSD's packet processing which may be lower than in other network stacks; compare, for example:
http://rhelblog.redhat.com/2015/09/29/pushing-the-limits-of-kernel-networking/
A good source is the BSD Router Project:
https://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr
They also give figures for pf.
3. Real-world examples
Always remember to measure through the device: [PC1] --> [pfSense system] --> [PC2]
I can give some real-world examples: ESXi guests with 8 CPUs (2.6 GHz) allow pushing 5 Gbit/s with MTU 1500, so I assume real hardware should be able to achieve higher throughput.
The main problem seems to be the high interrupt rate.
I did some measurements on an X710 40 Gbit/s card (8 CPUs, > 2 GHz) and was able to reach throughputs around 12.3 Gbit/s.
As far as I have heard, the limit with commodity hardware seems to be around 26 Gbit/s:
https://www.ntop.org/products/packet-capture/pf_ring/pf_ring-zc-zero-copy/
From [device] <--> [device]
I get wire-line speed.
From [device] --> [pfSense] --> [device]
This is where the issue resides.
I would be happy with something close to half wire-line speed on 10G, because this device is doing more than just routing traffic. However, I am really quite a distance from that without pegging the CPU at 100% in interrupts.
-
Here are stats from the same link on the same router using CentOS 7.4. These are with the factory defaults and no iptables rules enabled.
------------------------------------------------------------
Client connecting to ..., TCP port 5001
TCP window size: 85.0 KByte (default)
[ ID] Interval Transfer Bandwidth
[ 5] 0.0- 1.0 sec 256 MBytes 2.15 Gbits/sec
[ 4] 0.0- 1.0 sec 270 MBytes 2.26 Gbits/sec
[ 3] 0.0- 1.0 sec 258 MBytes 2.17 Gbits/sec
[ 6] 0.0- 1.0 sec 327 MBytes 2.75 Gbits/sec
[SUM] 0.0- 1.0 sec 1.09 GBytes 9.32 Gbits/sec
[ 5] 1.0- 2.0 sec 242 MBytes 2.03 Gbits/sec
[ 4] 1.0- 2.0 sec 251 MBytes 2.11 Gbits/sec
[ 3] 1.0- 2.0 sec 281 MBytes 2.36 Gbits/sec
[ 6] 1.0- 2.0 sec 337 MBytes 2.83 Gbits/sec
[SUM] 1.0- 2.0 sec 1.09 GBytes 9.33 Gbits/sec
^C[ 5] 0.0- 2.6 sec 679 MBytes 2.15 Gbits/sec
[ 4] 0.0- 2.6 sec 715 MBytes 2.27 Gbits/sec
[ 3] 0.0- 2.6 sec 718 MBytes 2.28 Gbits/sec
[ 6] 0.0- 2.6 sec 818 MBytes 2.60 Gbits/sec
[SUM] 0.0- 2.6 sec 2.86 GBytes 9.29 Gbits/sec
The CPU utilization is almost zero.
-
And these are the default options that are turned on for the NIC in Linux.
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ipv6: on
scatter-gather: on
tx-scatter-gather: on
tx-tcp-segmentation: on
tx-tcp6-segmentation: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
rx-vlan-stag-hw-parse: on
rx-vlan-stag-filter: on [fixed]
busy-poll: on [fixed]
I have no idea how to translate these to BSD options, but I am thinking my issue lies here: what is offloaded for the NIC to handle.
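For what it's worth, a rough mapping of those ethtool names onto FreeBSD ifconfig capability flags (a sketch; the interface name is an assumption, and options like scatter-gather, receive-hashing/RSS and busy-poll have no direct ifconfig toggle, they live in the driver):
# rx-checksumming / tx-checksumming          -> rxcsum / txcsum (rxcsum6 / txcsum6 for IPv6)
# tx-tcp-segmentation / tx-tcp6-segmentation -> tso (tso4 / tso6)
# rx-vlan-filter                             -> vlanhwfilter
ifconfig mlxen0 rxcsum txcsum tso      # enable the offloads
ifconfig mlxen0 | grep options         # confirm which ones actually took effect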
-
I think in BSD those settings are still set with ifconfig, enabling an option by name and disabling it with a - prefix. If the cards need firmware to run (and most do), perhaps we should also take that into account.
Currently, we know that by default, the hardware should be capable of pushing 2Gbit+ with no high loads. So it's not a hardware issue and we know it's not a BSD issue either since it works with FreeBSD.
This leaves us with:
- compile-time options in the kernel/drivers
- firmware versions if the drivers differ in version and have different firmware blobs
- sysctl values
Try getting sysctl -A output from FreeBSD and from pfSense and compare the two. Also check the PCI messages.
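A quick way to capture the PCI side on both installs (both tools ship with stock FreeBSD and pfSense):
pciconf -lvc | less        # devices with vendor strings and PCIe capabilities (link speed/width)
dmesg | grep -i mlx        # driver probe/attach messages for the Mellanox card (driver name is an assumption)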
-
Well, the good news is I have managed to get around 4G with pf enabled, and nearly wire line with pf disabled. That is solid progress.
There were a couple of options I had to enable in loader.conf.local:
compat.linuxkpi.mlx4_enable_sys_tune="1"
net.link.ifqmaxlen="2048"
net.inet.tcp.soreceive_stream="1"
net.inet.tcp.hostcache.cachelimit="0"
compat.linuxkpi.mlx4_inline_thold="0"
compat.linuxkpi.mlx4_log_num_mgm_entry_size="7"
compat.linuxkpi.mlx4_high_rate_steer="1"
These options seem to be helping me make solid progress. I am 1 Gbit/s away from my goal of 5 Gbit/s with pf enabled.
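For anyone following along, a quick way to confirm loader.conf.local tunables were actually picked up at boot (kenv prints the kernel environment set by the loader):
kenv | grep mlx4            # the compat.linuxkpi.mlx4_* tunables should show up here
kenv net.link.ifqmaxlen     # prints the value set at boot, e.g. 2048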
I think those are really quite reasonable numbers for this machine; expecting anything else is asking a bit much.
I checked the sysctls on the FreeBSD box; they are nearly identical.
Thanks all for your time and help. It is genuinely appreciated.
I will keep tinkering and post updates.
-
Are those sysctls the same on the FreeBSD install?