Solved - 10Gb link running at 1Gb speeds
-
Here are stats from the same link on the same router using CentOS 7.4. These are with the factory defaults and no iptables enabled.
------------------------------------------------------------
Client connecting to ..., TCP port 5001
TCP window size: 85.0 KByte (default)
[ ID] Interval Transfer Bandwidth
[ 5] 0.0- 1.0 sec 256 MBytes 2.15 Gbits/sec
[ 4] 0.0- 1.0 sec 270 MBytes 2.26 Gbits/sec
[ 3] 0.0- 1.0 sec 258 MBytes 2.17 Gbits/sec
[ 6] 0.0- 1.0 sec 327 MBytes 2.75 Gbits/sec
[SUM] 0.0- 1.0 sec 1.09 GBytes 9.32 Gbits/sec
[ 5] 1.0- 2.0 sec 242 MBytes 2.03 Gbits/sec
[ 4] 1.0- 2.0 sec 251 MBytes 2.11 Gbits/sec
[ 3] 1.0- 2.0 sec 281 MBytes 2.36 Gbits/sec
[ 6] 1.0- 2.0 sec 337 MBytes 2.83 Gbits/sec
[SUM] 1.0- 2.0 sec 1.09 GBytes 9.33 Gbits/sec
^C[ 5] 0.0- 2.6 sec 679 MBytes 2.15 Gbits/sec
[ 4] 0.0- 2.6 sec 715 MBytes 2.27 Gbits/sec
[ 3] 0.0- 2.6 sec 718 MBytes 2.28 Gbits/sec
[ 6] 0.0- 2.6 sec 818 MBytes 2.60 Gbits/sec
[SUM] 0.0- 2.6 sec 2.86 GBytes 9.29 Gbits/sec
The CPU utilization is almost zero.
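(For context: output like the above is what a parallel iperf2 client prints; the run was roughly the following, with the server address elided as in the original.)
# 4 parallel TCP streams, reporting every second
iperf -c <server-ip> -P 4 -i 1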
-
And these are the default offload options that are enabled for the NIC in Linux (as reported by ethtool).
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ipv6: on
scatter-gather: on
tx-scatter-gather: on
tx-tcp-segmentation: on
tx-tcp6-segmentation: on
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on [fixed]
rx-vlan-stag-hw-parse: on
rx-vlan-stag-filter: on [fixed]
busy-poll: on [fixed]
I have no idea how to translate these to BSD options, but I am thinking my issue lies here: in what is offloaded to the NIC.
-
I think in BSD those settings are still set with ifconfig, enabling an option by name and disabling it with a "-" prefix. If the cards need firmware to run (and most do), perhaps we should also take that into account.
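As a rough translation of the Linux list above, a sketch assuming the Mellanox interface shows up as mlxen0 (check ifconfig -m for what the driver actually supports):
# List enabled options plus every capability the driver supports
ifconfig -m mlxen0
# Checksum offload and segmentation map roughly to rxcsum/txcsum/tso
ifconfig mlxen0 rxcsum txcsum tso
# A leading '-' disables a feature, e.g. turn off large receive offload
ifconfig mlxen0 -lro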
Currently we know that, by default, the hardware is capable of pushing 2Gbit+ with no significant load. So it's not a hardware issue, and we know it's not a BSD issue either, since it works with FreeBSD.
This leaves us with:
- compile-time options in the kernel/drivers
- firmware versions if the drivers differ in version and have different firmware blobs
- sysctl settings
Try getting sysctl -a output from FreeBSD and from pfSense and compare those. Also check the PCI messages.
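A minimal sketch of that comparison (file paths are illustrative):
# On each box, dump every sysctl to a file, then diff the two
sysctl -a > /tmp/sysctl-freebsd.txt    # on the FreeBSD install
sysctl -a > /tmp/sysctl-pfsense.txt    # on the pfSense install
diff /tmp/sysctl-freebsd.txt /tmp/sysctl-pfsense.txt | less
# Check how the card enumerated on the PCI bus; lane width/speed matter at 10Gb
pciconf -lvc | grep -B1 -A6 -i mellanox
dmesg | grep -i mlx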
-
Well, the good news is I have managed to get around 4G with pf enabled, and nearly wireline with pf disabled. That is solid progress.
There were a couple of options I had to enable in /boot/loader.conf.local:
compat.linuxkpi.mlx4_enable_sys_tune="1"
net.link.ifqmaxlen="2048"
net.inet.tcp.soreceive_stream="1"
net.inet.tcp.hostcache.cachelimit="0"
compat.linuxkpi.mlx4_inline_thold="0"
compat.linuxkpi.mlx4_log_num_mgm_entry_size="7"
compat.linuxkpi.mlx4_high_rate_steer="1"
These options seem to be helping me make solid progress. I am 1Gbit/s away from my goal of 5Gbit/s with pf enabled.
I think those are really quite reasonable numbers for this machine; expecting anything else is asking a bit much.
I checked the sysctls from the FreeBSD box; they are nearly identical.
Thanks all for your time and help. It is genuinely appreciated.
I will keep tinkering and post updates.
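For anyone replaying this: those are boot-time tunables, so they only apply after a reboot. A quick sanity check that they were picked up (a sketch):
# Confirm the tunables were applied at boot
sysctl net.link.ifqmaxlen net.inet.tcp.soreceive_stream net.inet.tcp.hostcache.cachelimit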
-
Are those sysctls the same on the FreeBSD install?
-
No, they were not required on the FreeBSD install or the Linux install; the defaults just seem to work. I also didn't have a real ruleset in pf with FreeBSD like I do on this box, so that will surely affect the performance numbers.
[ 15] 26.0-27.0 sec 31.1 MBytes 261 Mbits/sec
[ 3] 26.0-27.0 sec 49.9 MBytes 418 Mbits/sec
[ 8] 26.0-27.0 sec 53.9 MBytes 452 Mbits/sec
[ 11] 26.0-27.0 sec 35.4 MBytes 297 Mbits/sec
[ 16] 26.0-27.0 sec 43.1 MBytes 362 Mbits/sec
[ 17] 26.0-27.0 sec 48.1 MBytes 404 Mbits/sec
[ 14] 26.0-27.0 sec 54.8 MBytes 459 Mbits/sec
[ 4] 26.0-27.0 sec 45.5 MBytes 382 Mbits/sec
[ 10] 26.0-27.0 sec 62.0 MBytes 520 Mbits/sec
[ 6] 26.0-27.0 sec 24.2 MBytes 203 Mbits/sec
[ 7] 26.0-27.0 sec 14.2 MBytes 120 Mbits/sec
[ 9] 26.0-27.0 sec 38.0 MBytes 319 Mbits/sec
[ 18] 26.0-27.0 sec 33.2 MBytes 279 Mbits/sec
[ 13] 26.0-27.0 sec 16.8 MBytes 141 Mbits/sec
[ 12] 26.0-27.0 sec 30.6 MBytes 257 Mbits/sec
[ 5] 26.0-27.0 sec 23.8 MBytes 199 Mbits/sec
[SUM] 26.0-27.0 sec 605 MBytes 5.07 Gbits/sec
[ 3] 27.0-28.0 sec 51.4 MBytes 431 Mbits/sec
[ 16] 27.0-28.0 sec 43.1 MBytes 362 Mbits/sec
[ 15] 27.0-28.0 sec 31.0 MBytes 260 Mbits/sec
[ 4] 27.0-28.0 sec 47.9 MBytes 402 Mbits/sec
[ 10] 27.0-28.0 sec 57.6 MBytes 483 Mbits/sec
[ 8] 27.0-28.0 sec 49.2 MBytes 413 Mbits/sec
[ 13] 27.0-28.0 sec 16.1 MBytes 135 Mbits/sec
[ 17] 27.0-28.0 sec 46.6 MBytes 391 Mbits/sec
[ 14] 27.0-28.0 sec 55.6 MBytes 467 Mbits/sec
[ 6] 27.0-28.0 sec 23.0 MBytes 193 Mbits/sec
[ 12] 27.0-28.0 sec 29.2 MBytes 245 Mbits/sec
[ 18] 27.0-28.0 sec 34.8 MBytes 292 Mbits/sec
[ 5] 27.0-28.0 sec 23.1 MBytes 194 Mbits/sec
[ 7] 27.0-28.0 sec 11.9 MBytes 99.6 Mbits/sec
[ 9] 27.0-28.0 sec 41.0 MBytes 344 Mbits/sec
[ 11] 27.0-28.0 sec 42.0 MBytes 352 Mbits/sec
[SUM] 27.0-28.0 sec 604 MBytes 5.06 Gbits/sec
So with iperf running 16 threads I can reach my 5Gbit/s target with pf enabled, which is the limit of my system with its current configuration.
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
0 root -92 - 0K 5328K - 0 3:23 94.68% [kernel{mlxen0 rx cq}]
0 root -92 - 0K 5328K - 5 2:14 94.68% [kernel{mlxen0 rx cq}]
0 root -92 - 0K 5328K - 6 3:48 94.58% [kernel{mlxen0 rx cq}]
0 root -92 - 0K 5328K - 3 4:10 94.38% [kernel{mlxen0 rx cq}]
0 root -92 - 0K 5328K - 2 3:36 93.99% [kernel{mlxen0 rx cq}]
0 root -92 - 0K 5328K - 1 3:44 90.58% [kernel{mlxen0 rx cq}]
0 root -92 - 0K 5328K - 7 2:14 67.58% [kernel{mlxen0 rx cq}]
I don't know what "rx cq" means, so I don't know what to tinker with.
-
That's the receive completion queue, AFAIK. It seems the defaults on FreeBSD vs. pfSense must be different then. Is the ifconfig status output different as well?
For example, I have an interface that's set with:
en4: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=10b<RXCSUM,TXCSUM,VLAN_HWTAGGING,AV>
If you compare your ifconfig settings on FreeBSD vs. pfSense there might be a change there as well. Also, the driver settings could differ, but I'm not sure where they are stored for the Mellanox card.
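One place to look for the Mellanox driver's settings (a sketch; the exact sysctl tree can differ by driver version):
# Dump everything the mlx4/mlxen driver exposes via sysctl
sysctl -a | grep -i mlx
# And the loader tunables it picked up at boot
kenv | grep -i mlx
-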
OK, now I can confirm wireline speeds with this NIC.
It's my pf ruleset that is holding it back at this point.
[ ID] Interval Transfer Bandwidth
[ 4] 0.0- 1.0 sec 74.6 MBytes 626 Mbits/sec
[ 6] 0.0- 1.0 sec 152 MBytes 1.28 Gbits/sec
[ 8] 0.0- 1.0 sec 163 MBytes 1.37 Gbits/sec
[ 9] 0.0- 1.0 sec 76.2 MBytes 640 Mbits/sec
[ 13] 0.0- 1.0 sec 42.6 MBytes 358 Mbits/sec
[ 10] 0.0- 1.0 sec 58.4 MBytes 490 Mbits/sec
[ 12] 0.0- 1.0 sec 66.6 MBytes 559 Mbits/sec
[ 16] 0.0- 1.0 sec 63.2 MBytes 531 Mbits/sec
[ 14] 0.0- 1.0 sec 32.9 MBytes 276 Mbits/sec
[ 17] 0.0- 1.0 sec 37.4 MBytes 314 Mbits/sec
[ 18] 0.0- 1.0 sec 79.0 MBytes 663 Mbits/sec
[ 3] 0.0- 1.0 sec 57.5 MBytes 482 Mbits/sec
[ 5] 0.0- 1.0 sec 52.4 MBytes 439 Mbits/sec
[ 7] 0.0- 1.0 sec 29.1 MBytes 244 Mbits/sec
[ 15] 0.0- 1.0 sec 75.5 MBytes 633 Mbits/sec
[ 11] 0.0- 1.0 sec 71.1 MBytes 597 Mbits/sec
[SUM] 0.0- 1.0 sec 1.11 GBytes 9.50 Gbits/sec
[ 18] 1.0- 2.0 sec 49.0 MBytes 411 Mbits/sec
[ 6] 1.0- 2.0 sec 152 MBytes 1.28 Gbits/sec
[ 8] 1.0- 2.0 sec 127 MBytes 1.07 Gbits/sec
[ 10] 1.0- 2.0 sec 70.2 MBytes 589 Mbits/sec
[ 12] 1.0- 2.0 sec 70.4 MBytes 590 Mbits/sec
[ 15] 1.0- 2.0 sec 70.6 MBytes 592 Mbits/sec
[ 14] 1.0- 2.0 sec 25.9 MBytes 217 Mbits/sec
[ 11] 1.0- 2.0 sec 68.0 MBytes 570 Mbits/sec
[ 7] 1.0- 2.0 sec 61.0 MBytes 512 Mbits/sec
[ 13] 1.0- 2.0 sec 55.9 MBytes 469 Mbits/sec
[ 16] 1.0- 2.0 sec 73.0 MBytes 612 Mbits/sec
[ 17] 1.0- 2.0 sec 30.8 MBytes 258 Mbits/sec
[ 4] 1.0- 2.0 sec 81.5 MBytes 684 Mbits/sec
[ 3] 1.0- 2.0 sec 41.0 MBytes 344 Mbits/sec
[ 5] 1.0- 2.0 sec 47.1 MBytes 395 Mbits/sec
[ 9] 1.0- 2.0 sec 81.5 MBytes 684 Mbits/sec
[SUM] 1.0- 2.0 sec 1.08 GBytes 9.27 Gbits/sec
[ 18] 2.0- 3.0 sec 48.0 MBytes 403 Mbits/sec
[ 4] 2.0- 3.0 sec 84.9 MBytes 712 Mbits/sec
[ 3] 2.0- 3.0 sec 47.6 MBytes 400 Mbits/sec
[ 5] 2.0- 3.0 sec 49.0 MBytes 411 Mbits/sec
[ 6] 2.0- 3.0 sec 163 MBytes 1.37 Gbits/sec
[ 7] 2.0- 3.0 sec 65.5 MBytes 549 Mbits/sec
[ 8] 2.0- 3.0 sec 119 MBytes 997 Mbits/sec
[ 10] 2.0- 3.0 sec 90.2 MBytes 757 Mbits/sec
[ 9] 2.0- 3.0 sec 82.6 MBytes 693 Mbits/sec
[ 13] 2.0- 3.0 sec 59.9 MBytes 502 Mbits/sec
[ 12] 2.0- 3.0 sec 57.8 MBytes 484 Mbits/sec
[ 16] 2.0- 3.0 sec 55.5 MBytes 466 Mbits/sec
[ 15] 2.0- 3.0 sec 57.6 MBytes 483 Mbits/sec
[ 11] 2.0- 3.0 sec 66.2 MBytes 556 Mbits/sec
[ 14] 2.0- 3.0 sec 33.9 MBytes 284 Mbits/sec
[ 17] 2.0- 3.0 sec 33.4 MBytes 280 Mbits/sec
[SUM] 2.0- 3.0 sec 1.09 GBytes 9.34 Gbits/sec
[ 18] 3.0- 4.0 sec 42.1 MBytes 353 Mbits/sec
[ 4] 3.0- 4.0 sec 94.5 MBytes 793 Mbits/sec
[ 3] 3.0- 4.0 sec 43.4 MBytes 364 Mbits/sec
[ 5] 3.0- 4.0 sec 47.4 MBytes 397 Mbits/sec
[ 6] 3.0- 4.0 sec 171 MBytes 1.44 Gbits/sec
[ 7] 3.0- 4.0 sec 65.1 MBytes 546 Mbits/sec
[ 8] 3.0- 4.0 sec 92.8 MBytes 778 Mbits/sec
[ 9] 3.0- 4.0 sec 82.9 MBytes 695 Mbits/sec
[ 16] 3.0- 4.0 sec 60.4 MBytes 506 Mbits/sec
[ 15] 3.0- 4.0 sec 57.4 MBytes 481 Mbits/sec
[ 11] 3.0- 4.0 sec 69.4 MBytes 582 Mbits/sec
[ 13] 3.0- 4.0 sec 67.2 MBytes 564 Mbits/sec
[ 10] 3.0- 4.0 sec 91.8 MBytes 770 Mbits/sec
[ 14] 3.0- 4.0 sec 30.9 MBytes 259 Mbits/sec
[ 17] 3.0- 4.0 sec 36.6 MBytes 307 Mbits/sec
[ 12] 3.0- 4.0 sec 57.5 MBytes 482 Mbits/sec
[SUM] 3.0- 4.0 sec 1.08 GBytes 9.31 Gbits/sec
We can go ahead and mark this thread solved; my box will run at (near) wire speed for the 10G test machines.
The fix was as follows.
/boot/loader.conf.local:
compat.linuxkpi.mlx4_enable_sys_tune="1"
net.link.ifqmaxlen="2048"
net.inet.tcp.soreceive_stream="1"
net.inet.tcp.hostcache.cachelimit="0"
compat.linuxkpi.mlx4_inline_thold="0"
compat.linuxkpi.mlx4_high_rate_steer="1"
compat.linuxkpi.mlx4_log_num_mgm_entry_size="7"
sysctls:
hw.mlxen0.conf.rx_size=2048
hw.mlxen0.conf.tx_size=2048
kern.ipc.maxsockbuf=16777216          # Maximum socket buffer size
net.link.vlan.mtag_pcp=0              # Retain VLAN PCP information as packets are passed up the stack
net.route.netisr_maxqlen=2048         # Maximum routing socket dispatch queue length
net.inet.ip.intr_queue_maxlen=2048    # Maximum size of the IP input queue
net.inet.tcp.recvspace=131072         # Initial receive socket buffer size
net.inet.tcp.sendspace=131072         # Initial send socket buffer size
Next I will measure actual throughput in pps, because in doing this testing I learned that wire speed doesn't seem to mean much. That was pointed out to me a couple of times; I was just obsessed with starting from a place that is equal(ish) with Linux. I'm sure someone else will find these settings useful for a Mellanox ConnectX-3 adapter.
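A sketch of how that pps measurement could be done (interface name and server address are placeholders):
# Watch per-second packet counts on the interface while traffic runs
netstat -w 1 -I mlxen0
# Drive small-packet load: 64-byte UDP payloads across 16 streams (iperf2)
iperf -c <server-ip> -u -b 1000M -l 64 -P 16 -t 30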
Should I put my Chelsio T5 back? I know this hardware will do what I am asking given the right tuning.
Thanks again!
-
Yes, so PPS means how much you can actually process, as a lower bound. If you can process a billion packets per second with tiny packets, then any bigger packet will just get you even more bandwidth.
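For a rough sense of scale, standard Ethernet framing math (not measured in this thread):
# 10 Gbit/s line rate with minimum-size frames:
#   64 B frame + 20 B preamble/inter-frame gap = 84 B = 672 bits on the wire
#   10,000,000,000 / 672    = ~14.88 Mpps
# With full-size frames:
#   1518 B frame + 20 B overhead = 1538 B = 12,304 bits
#   10,000,000,000 / 12,304 = ~813 kpps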
Also, it seems that at least half of the tuning settings are for the hardware driver (mlx) itself, so I imagine that if you use a Chelsio card you'll need to find the settings for that driver as well.
-
Nice to see that the issue is resolved.
I will check if some of these settings are useful for increasing throughput through VMs as well.
Thanks for sharing.
-
I know this is an old thread, sorry to rezz.
I work with the "other popular" software based firewall.
I just got finished running 10Gb testing on 28 core HP DL380G9s running 4 10Gb NICs.
I ran into very similar speed constraints during my testing. Out of the box, the security gateway would only push 3 to 4 Gb/s via iperf3 (24 streams). By using sim_affinity (binding a NIC to a specific CPU core), I was able to get the box to run at 6 to 8 Gb/s.
https://sc1.checkpoint.com/documents/R77/CP_R77_PerformanceTuning_WebAdmin/6731.htm (search for "sim" to jump to the sim affinity section.)
This, of course, was not good enough, because the goal was to reach 20Gb/s using bonded NICs.
It turns out that the fix was to enable "multi-queue" instead. This allows each interface to have multiple queues, serviced by the licensed cores.
https://sc1.checkpoint.com/documents/R77/CP_R77_Firewall_WebAdmin/92711.htm
This allowed us to max out 20Gb/s easily, and I suspect even 40Gb/s would be easily maxed out as well.
So, my question is: does pfSense have a similar setting to allow a single interface to be serviced by multiple CPU cores? There could, of course, be issues with enabling this. One issue is packet reordering, but since this is an Internet gateway, I don't see that as a big deal.
Thoughts?
Edit:
Not sure if this is the same, but it seems similar:
https://www.netgate.com/docs/pfsense/hardware/tuning-and-troubleshooting-network-cards.html
-
Multiple queues should exist for ix or ixl interfaces by default. You can configure a fixed number using those options if you wish; otherwise the system will add as many as the driver supports or you have cores for.
You should see the queues in
top -aSH
at the command line.
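For reference, a sketch of pinning a fixed queue count for the Intel drivers via boot-time tunables (tunable names per the FreeBSD ix/ixl drivers; verify against your driver version):
# /boot/loader.conf.local
hw.ix.num_queues="8"     # ixgbe (ix) interfaces; 0 = autoconfigure
hw.ixl.max_queues="8"    # XL710 (ixl) interfaces
# After a reboot, each queue should show its own interrupt vector
vmstat -i | grep -i ix
Steve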