Solved - 10Gb link, 1Gb speeds



  • I will start by saying how thankful I am pfsense exists. It has been a core piece of my infra for many years.

    Now to my issue, which I just can’t seem to shake out.

    I have a C2758-based machine that I am trying to get 10 Gbit/s of throughput on. However, it doesn’t seem to be anywhere near that. I started with a Chelsio T4 card and thought that was the issue, so I got an Intel X520-SR2 and a Chelsio T5. All three cards give nearly the same performance.

    iperf tests from server to server on my network yield only 1.12 Gbit/s at best when routing.

    When testing iperf between the same machines on the same L2 segment I get around 7 Gbit/s, which is an issue all its own.

    Overall I would be happy with 5-6 Gbit/s, but I am a far cry from that.

    I have tried all of the googleable tuning guides to no avail.

    I also thought it could be pfsense, so I gave openbsd a spin… with similar results.

    Is this server even capable of 10 Gbit/s? It’s nearly the same as what you buy from Netgate (should have just bought it from them).

    I have 5 of these servers
    One running fedora 26 - mellanox connect x
    two centos 7 - mellanox connect x
    one pfsense - Chelsio T520
    one openbsd - Intel X520
    One freenas - Chelsio S320 (not a c2758 - intel xeon)

    All connected via fiber to an arista 10GB switch

    Everything except the pfsense and openbsd boxes will get nearly 10G (8ish)
    It makes no sense why these two machines want to be weirdos about their line rate, but it gets the full line rate of a 1GB link.

    Things I have tried so far
    Swapping drives from known 10GB working machines to the pfsense machine
    swapping SFP+ modules
    new fiber cables
    swapping fibre cables
    moving to known working 10G ports on the switch

    I am totally at a loss here.. I just can't make sense of it.

    This is a repost from the routing section



  • I have a c2758 based machine that I am trying to get 10gb of throughput on.

    iPerf will show you what the box can really deliver, but under real-world conditions, and depending
    on the protocol, you will mostly see something between 2 Gbit/s and 4 Gbit/s passing through.

    However it doesn’t seem to be anywhere near that. I started with a chelsio t4 card and
    thought that was the issue. So I got an intel x520-sr2 and a chelsio t5. Results from the
    three all net closely the same performance.

    I would try the following: insert each card under pfSense and run another iPerf speed test, but
    with more streams, so that you really saturate the NIC and the whole board. Then install CentOS
    once and run the same test again. That way you can rule out the hardware itself and the drivers
    as the problem, because in both of those cases we are not able to help you.

    iperf tests from server to server on my network only yields 1.12gb at best when routing.

    That most likely does not saturate the hardware. In short, please use more streams to find out where this
    piece of hardware or the card saturates. I would start with 8 streams and go up from there.

    When testing iperf from the same machines on the same l2 I get around 7gb which is a issue all its own.

    The best numbers will show up from server to client. This might sound harsh, but what good is it
    to get the result you wished for out of a test if you never reach it under real-world conditions?

    Overall I would be happy with 5-6gb but I am a far cry from that.

    According to the pfSense website, to get 1 Gbit/s you should be using server-grade hardware, modern
    PCIe NICs and a CPU above 2.0 GHz. With that you are in the range of 1 Gbit/s, not 10 Gbit/s, with
    pfSense, which is based on FreeBSD.

    I have tried all of the googleable tuning guides to no end.

    That might be an option for you once we can be sure this is not a hardware-related failure.

    I also thought it could be pfsense, so I gave openbsd a spin… with similar results.

    I would rather take the latest stable CentOS 7.x and test again.

    Is this server even capable of 10gb. It’s nearly the same as what you buy from netgate
    (should have just bought it from them).

    Perhaps not today, but with an OS that knows how to use all CPU cores, QAT, netmap-fwd or some
    other feature offered by the hardware or the software, it would surely be able to do so. And
    if not, you could try stronger hardware such as Intel Core i7, Xeon E3 or Xeon E5 CPUs.

    I would really like to know how the two C2758 boxes send and receive under CentOS or
    another Linux variant. If they achieve more there, then it perhaps depends on the driver or the OS,
    or some tuning needs to be done. Searching the forum will also turn up some interesting threads
    on how other users were able to get around 9.6 Gbit/s in an iPerf test.



  • Thanks for the reply

    I have 5 of these C2758 servers; two already have CentOS on them, and one runs OpenBSD, which gives me the same numbers as pfSense.

    With FreeBSD 11 and the settings for everything I am seeing nearly 3 Gbit/s using iperf. I do understand that there are other tools out there, but it's what I have all my metrics in.

    When I use the linux box as a router, I see roughly the same numbers as vanilla BSD.

    I would be happy with even half the throughput I get client to server.

    Thanks again.

    If I find a solution to this issue, I will report back.

    Next I am going to try a fresh install of pfsense, the one I have now has been upgraded for the last several years.

    pfSense has been just fantastic, and until I started doing performance tuning, I really didn't even notice…

    Seeing as I have a 10 Gbit network, I would really like to get the most out of every device on it.



  • So using the latest development version of pfSense I am getting 1.4 Gbit/s with pf on and the default rule set, and 3.0 Gbit/s with pf turned off.

    A BIOS update was worth about 400 Mbit/s.



  • Looks like this equipment just isn't capable of routing traffic at 10 Gbit/s, or even 5 Gbit/s. Looking at the system, interrupt load reaches nearly 90% on one core when I run this test.

    /0  /1  /2  /3  /4  /5  /6  /7  /8  /9  /10
        Load Average  |||||

    /0%  /10  /20  /30  /40  /50  /60  /70  /80  /90  /100
    root          idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    root          idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    root          idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    root          idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    root          idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    root          idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    root          idle XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    root          intr XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    root          idle XXXXXX 
    root          intr X

    Guess I will just put pfsense on some better gear and hope for better results



  • Before you throw it all out: try polling. This isn't always the solution, but if you are starving due to interrupts, polling might solve some of it.



  • With FreeBSD 11 and the settings for everything I am seeing nearly 3GB per second using iperf. I do understand that there are other tools out there, but its what I have all my metrics in.

    iPerf is OK, but you should use its parallel streams option (-P) to run more streams across the wire
    at the same time. Would this do the trick? Or perhaps the box is a little underpowered to get a real 10 Gbit/s out of it.



  • Try some of the changes in here. If they don't help, change them back to default.

    https://forum.pfsense.org/index.php?topic=113496.msg631076#msg631076

    Those are some crazy high interrupts for only 1.4Gb/s. I'm getting 25%-30% cpu usage when doing 2Gb/s(1Gb bi-directional) through pfSense, 4Gb/s total, with 64byte packets with HFSC traffic shaping enabled on LAN and WAN.

    I also find it interesting that nearly all the load is on one core. My load is evenly distributed, unless you're using a single stream to test.



  • I am only using single streams for testing. It could be that my testing is flawed; however, when I run the exact same test on another machine with nearly the exact same config, I get the results I am seeking.

    For instance on a linux installed machine I get the throughput I am looking for on a single old mellanox link. It also doesn't seem to tax the machine as much.

    I have a ton of gear, and so I am setting this up on one of my blades… which is proving to be a challenge of its own.

    Thanks all for the help, but this hassle just isn't worth the 4 days I have put into it.



  • Thanks all for the help, but this hassle just isn't worth the 4 days I have put into it.

    • pfSense tuning for 10 Gbit Throughput
      My CPU frequency is 2.6 GHz; scaling to 3.8 GHz (Xeon E3-1275 Turbo Boost) is a linear factor of 1.46: 5.0 Gbit/s -> 7.3 Gbit/s

    • 10Gbe Tuning?
      I set the MTU on these to 9000 yesterday and 9000 on the iperf servers I'm using and was able to saturate (9.5Gb/s) the link.  So I'm pretty sure I'm hitting just one interface.

    • 10gbe firewall using open source tools
      We're using Xeon E3 boxes (1260L) with Intel 10 GbE nic's (520 series) and PFSense 2.0.1 and it's working really well. We peak around 9000 Mbps at 55% CPU utilization.

    I don't know these people personally and was not there when the tests were done, but I am
    pretty sure that with today's options like HT, SpeedStep and Turbo Boost, and perhaps with no
    PPPoE on the WAN or with 10 Gbit/s on the LAN, it is possible to get nearly 10 Gbit/s.
    But perhaps it also depends on the hardware used. If your FreeNAS box is able to deliver such
    numbers, what would then be the sticking point on pfSense? The packet filter, the rules, something
    else? I really don't know, but more and more threads about this show up here in the forum, so
    perhaps one day someone will deliver results and tips that work for everyone else too.
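
The clock-scaling estimate quoted under "pfSense tuning for 10 Gbit Throughput" above can be reproduced with a few lines of arithmetic. This is only a sketch of that poster's reasoning: it assumes throughput scales linearly with CPU frequency, which is their assumption and not a measured result, and the function name is made up for illustration.

```python
def scaled_throughput(measured_gbits: float, base_ghz: float, boost_ghz: float) -> float:
    """Estimate throughput at a higher clock, assuming purely linear scaling."""
    factor = boost_ghz / base_ghz
    return measured_gbits * factor


# Figures taken from the quoted excerpt: 5.0 Gbit/s measured at 2.6 GHz,
# Turbo Boost up to 3.8 GHz.
factor = 3.8 / 2.6
estimate = scaled_throughput(5.0, 2.6, 3.8)
print(f"factor {factor:.2f}, estimate {estimate:.1f} Gbit/s")
```

Real forwarding performance rarely scales this cleanly, so the result is best read as an upper bound on what the clock increase alone could buy.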



  • So I moved my pfsense machine to one of my blades. It's not new or anything fancy, but should have yielded better performance.
    And I was correct: the performance was about 3x that of the C2758.

    [  3]  0.0- 1.0 sec  370 MBytes  3.10 Gbits/sec
    [  3]  1.0- 2.0 sec  363 MBytes  3.05 Gbits/sec
    [  3]  2.0- 3.0 sec  365 MBytes  3.06 Gbits/sec
    [  3]  3.0- 4.0 sec  366 MBytes  3.07 Gbits/sec
    [  3]  4.0- 5.0 sec  368 MBytes  3.08 Gbits/sec
    [  3]  5.0- 6.0 sec  372 MBytes  3.12 Gbits/sec
    [  3]  6.0- 7.0 sec  373 MBytes  3.13 Gbits/sec
    [  3]  7.0- 8.0 sec  373 MBytes  3.13 Gbits/sec
    [  3]  8.0- 9.0 sec  375 MBytes  3.15 Gbits/sec
    [  3]  9.0-10.0 sec  373 MBytes  3.13 Gbits/sec

    I have hyperthreading disabled and the bios performance level set to maximum.
    However with this type of equipment, I expected things to move about twice as fast. I have a fairly simple ruleset.
    With PF turned off, I do get performance that is closer in line with the other systems I have on my network.

    For instance I have openstack routers (which are just linux/SNAT/iptables) that run around 6.5 to 7 G on the other blades.

    Now with multiple threads (4) I can get a little closer to my mark at around 6G per second.

    I know there are people out there getting near wireline speeds from their gear, I just don't know how they are doing it.



  • Try disabling all offloading options, and using polling and having bigger buffers.



  • It would seem that any sort of tuning actually makes it run slower. I have no marked improvements from the pfsense defaults.

    It would seem I need to get some better hardware that is more suited for the task. Good news is a 40G card is coming along with a 40G switch (6 ports).

    I am curious to see what kind of hurt I can put on this box with 40G gear.

    You can mark this thread closed, I am moving on to more important things. I will open a new one when the 40G gear gets here and I have a chance to tinker.



  • So I have an update. The 40G NIC from Mellanox performs wonderfully on vanilla FreeBSD and Linux; however, I see the same performance with pfSense that I was getting with the 10G NICs. I would like to know what the differences are from the raw BSD kernel.

    I really love pfSense, it makes my life so easy to do otherwise complicated stuff. But these performance issues should be addressed.



  • @johnkeates:

    Before you throw it all out: try polling. This isn't always the solution, but if you are starving due to interrupts, polling might solve some of it.

    I am not familiar with it. Any good places to start?

    edit
    ifconfig mlxen0 polling

    Client connecting to ..., TCP port 5001
    TCP window size: 85.0 KByte (default)
    ------------------------------------------------------------
    [ ID] Interval      Transfer    Bandwidth
    [  3]  0.0- 1.0 sec  110 MBytes  922 Mbits/sec
    [  4]  0.0- 1.0 sec  64.6 MBytes  542 Mbits/sec
    [  5]  0.0- 1.0 sec  53.2 MBytes  447 Mbits/sec
    [SUM]  0.0- 1.0 sec  228 MBytes  1.91 Gbits/sec
    [  3]  1.0- 2.0 sec  110 MBytes  925 Mbits/sec
    [  5]  1.0- 2.0 sec  57.4 MBytes  481 Mbits/sec
    [  4]  1.0- 2.0 sec  56.5 MBytes  474 Mbits/sec
    [SUM]  1.0- 2.0 sec  224 MBytes  1.88 Gbits/sec
    [  3]  2.0- 3.0 sec  112 MBytes  936 Mbits/sec
    [  4]  2.0- 3.0 sec  54.5 MBytes  457 Mbits/sec
    [  5]  2.0- 3.0 sec  59.9 MBytes  502 Mbits/sec
    [SUM]  2.0- 3.0 sec  226 MBytes  1.90 Gbits/sec
    [  4]  3.0- 4.0 sec  52.8 MBytes  442 Mbits/sec
    [  3]  3.0- 4.0 sec  113 MBytes  948 Mbits/sec
    [  5]  3.0- 4.0 sec  62.1 MBytes  521 Mbits/sec
    [SUM]  3.0- 4.0 sec  228 MBytes  1.91 Gbits/sec

    ifconfig mlxen0 -polling
    ------------------------------------------------------------
    Client connecting to ..., TCP port 5001
    TCP window size: 85.0 KByte (default)

    [ ID] Interval      Transfer    Bandwidth
    [  3]  0.0- 1.0 sec  108 MBytes  905 Mbits/sec
    [  5]  0.0- 1.0 sec  109 MBytes  915 Mbits/sec
    [  4]  0.0- 1.0 sec  107 MBytes  898 Mbits/sec
    [SUM]  0.0- 1.0 sec  324 MBytes  2.72 Gbits/sec
    [  5]  1.0- 2.0 sec  108 MBytes  904 Mbits/sec
    [  4]  1.0- 2.0 sec  107 MBytes  898 Mbits/sec
    [  3]  1.0- 2.0 sec  107 MBytes  901 Mbits/sec
    [SUM]  1.0- 2.0 sec  322 MBytes  2.70 Gbits/sec
    [  5]  2.0- 3.0 sec  108 MBytes  910 Mbits/sec
    [  4]  2.0- 3.0 sec  107 MBytes  900 Mbits/sec
    [  3]  2.0- 3.0 sec  108 MBytes  906 Mbits/sec
    [SUM]  2.0- 3.0 sec  324 MBytes  2.72 Gbits/sec



  • So i have an update. The 40G nic from mellanox performs wonderfully on vanilla FreeBSD and Linux, however I see the same performance with pfSense that I was getting with the 10GB nics. I would like to know what the differences are from the raw BSD kernel.

    pfSense runs pf (the packet filter), with NAT as a later step in pf processing, and none of that
    happens on a stock FreeBSD or Linux install. If you want to compare them, that is the most likely
    explanation. On top of this, it may also depend on the hardware used: with a Xeon E3 or a high-clocked
    Xeon E3 CPU (3.7 GHz, 7C/8T) you will probably get more throughput than with a C2758-based machine.

    I really love pfSense, it makes my life so easy to do otherwise complicated stuff. But these performance issues should be addressed.

    Take hardware with more horsepower, or stronger CPUs (and RAM), and there is nothing left that has to be addressed.



  • To debug this a bit more try setting up pfSense as a test with no NAT enabled. At the same time, disable pf in the advanced settings. With that done, try a iperf test again. If we're gonna figure out why this is happening, we're gonna need to start excluding stuff.

    On the other hand, if you need this to work, you might be better off buying support at Netgate since they build pfSense.



  • I agree with your point, and these are not complaints. If I wanted this to just work, I would stick with Fedora. However, I’m just trying to get to the bottom of what appears to be a pfSense-specific issue. With pfctl -d I still only get around 5 Gbit/s and high CPU/interrupt load. Are there settings that I am missing? This is a clean install with default settings.

    On FreeBSD and Linux there is almost no cpu utilization, as it’s mostly offloaded to the nic. However I’m not seeing this reflected in the pfsense build.

    Thanks all for your input and time.
    ~/D



  • @BlueKobold:

    So i have an update. The 40G nic from mellanox performs wonderfully on vanilla FreeBSD and Linux, however I see the same performance with pfSense that I was getting with the 10GB nics. I would like to know what the differences are from the raw BSD kernel.

    pfSense runs pf (the packet filter), with NAT as a later step in pf processing, and none of that
    happens on a stock FreeBSD or Linux install. If you want to compare them, that is the most likely
    explanation. On top of this, it may also depend on the hardware used: with a Xeon E3 or a high-clocked
    Xeon E3 CPU (3.7 GHz, 7C/8T) you will probably get more throughput than with a C2758-based machine.

    I’m only routing packets, no NAT. Also with pf fully disabled I still get very high utilization numbers.

    I really love pfSense, it makes my life so easy to do otherwise complicated stuff. But these performance issues should be addressed.

    Take hardware with more horsepower, or stronger CPUs (and RAM), and there is nothing left that has to be addressed.

    There isn’t really a need for better equipment, it works fine with other options.



  • Have you tried to run VyOS on your hardware? With basic NAT and firewalling enabled it will allow you to assess what your hardware is really capable of as a basic gateway/firewall.



  • Hmm, next would probably be comparing sysctl output (I guess just getting both sysctl outputs and running a diff on them will do), and perhaps kernel/driver build configs (again, a diff should suffice).



  • There are some cheap ways to increase the throughput.

    1. Increase MTU
    If you are lucky you can use jumbo frames throughout your environment. Assuming an MTU of 9000 (the maximum usable in VMware) instead of 1500, this can give up to a factor of 6 in throughput where the packet rate is the limit. However, if you talk to the outside world you are likely to create a bottleneck due to the need to fragment.

    2. Packet Rates
    For high packet rates with small packets this will not help. There is a limit in FreeBSD's packet processing which might be lower than in other network stacks. Compare for example:
    http://rhelblog.redhat.com/2015/09/29/pushing-the-limits-of-kernel-networking/
    A good source seems to be the BSD Router Project:
    https://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr

    They also give figures for pf.

    3. Real World examples
    Remember always to measure through the device:

    [ Pc1 ] --> [pfsense-system] --> [Pc2]

    I can give some real-world examples: ESXi guests with 8 CPUs (2.6 GHz) can push 5 Gbit/s with MTU 1500. Therefore I assume that real hardware should be able to achieve higher throughput.

    The main problem seems to be the high interrupt rate.

    I did some measurements on an X710 40 Gbit/s card (8 CPUs, > 2 GHz) and I was able to reach throughput of around 12.3 Gbit/s.
    As far as I have heard, with commodity hardware the limit seems to be 26 Gbit/s:
    https://www.ntop.org/products/packet-capture/pf_ring/pf_ring-zc-zero-copy/
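
The factor of 6 in point 1 comes from the packet rate: at a fixed bit rate, a 9000-byte MTU needs one sixth as many packets as a 1500-byte MTU, so a router that is limited by per-packet cost has that much more headroom. A rough sketch of the arithmetic, assuming every packet is exactly MTU-sized and ignoring Ethernet framing overhead (the function name is made up for illustration):

```python
def packets_per_second(bits_per_second: float, mtu_bytes: int) -> float:
    """Packets/s needed to carry the given bit rate with MTU-sized packets.

    Assumption: framing overhead (preamble, headers, inter-frame gap) is ignored.
    """
    return bits_per_second / (mtu_bytes * 8)


LINE_RATE = 10e9  # 10 Gbit/s

for mtu in (1500, 9000):
    pps = packets_per_second(LINE_RATE, mtu)
    print(f"MTU {mtu}: {pps:,.0f} packets/s at 10 Gbit/s")

ratio = packets_per_second(LINE_RATE, 1500) / packets_per_second(LINE_RATE, 9000)
print(f"packet-rate ratio: {ratio:.0f}x")
```

With small packets the MTU setting does not matter, which is why point 2 treats packet rate as a separate limit.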



  • @fwcheck:

    The 'problem' isn't in FreeBSD. He tried a plain FreeBSD install and it works fine there. It is in some difference between the settings in pfSense and FreeBSD, probably pf config, interface config, kernel config or sysctl changes.



  • I am not sure I understand the problem correctly.

    Your setup looks like this:

    [System 1 (network 1)] --> [Device under test] --> [System 2 (network 2)]

    Right?

    You use a FreeBSD system as router/firewall and achieve a higher throughput than with pfSense?
    If this is the case you should check all network settings, drivers, sysctl values etc.; maybe there is a setting which is
    not identical.
    Applying the FreeBSD settings should then lead to a higher throughput.

    If you are just measuring speed via iperf3 to the pfSense system itself, hardware acceleration makes a huge difference, and it is not recommended for a system doing routing. Check the flags (LRO and TSO, to name a few options that can make a huge difference); changing them usually also needs a reboot to take effect.



  • The 'problem' isn't in FreeBSD. He tried a plain FreeBSD install and it works fine there. It is in some difference between the settings in pfSense and FreeBSD, probably pf config, interface config, kernel config or sysctl changes.

    I am pretty sure that pfSense is not just something layered on top of FreeBSD. Since version 2.2.x it has become
    more and more of a special, custom build, based on the original kernel but with many, many changes.

    If the Netgate team or the pfSense team was able to push ~40 Gbit/s over an IPsec tunnel using an Intel QAT card, and
    that card has no ports of its own, then in my opinion pfSense must be able to handle that speed too,
    given ports that support that throughput.



  • I will pull the defaults from FreeBSD. I’m confident pfSense is fully capable of what I’m looking for. I’m just missing something.

    It is looking like an offload issue, as in seemingly nothing is offloaded to the NIC. I have tried 3 different cards {Intel X520, Chelsio T5, Mellanox X3 40G}, all with nearly identical results. The limit of this gear with no offloads would seem to be around 4G.

    On a recent Linux kernel (Fedora 26) there is almost no cpu load as it’s all being done in the card.

    Thanks for the continued help and interest in this post. Yet another reason to push forward with pfSense. This is a great community.



  • @fwcheck:

    From [device] <--> [device]
    I get wireline speed.

    From [device] --> [pfsense] --> [device]

    This is where the issue resides.

    I would be happy with something close to half wireline on 10G, because this device is doing more than just routing traffic. However, I am really quite a distance from that without hitting 100% interrupts.



  • Here are stats from the same link on the same router using centos 7.4. These are with the factory defaults and no iptables enabled.

    ------------------------------------------------------------
    Client connecting to ..., TCP port 5001
    TCP window size: 85.0 KByte (default)

    [ ID] Interval      Transfer    Bandwidth
    [  5]  0.0- 1.0 sec  256 MBytes  2.15 Gbits/sec
    [  4]  0.0- 1.0 sec  270 MBytes  2.26 Gbits/sec
    [  3]  0.0- 1.0 sec  258 MBytes  2.17 Gbits/sec
    [  6]  0.0- 1.0 sec  327 MBytes  2.75 Gbits/sec
    [SUM]  0.0- 1.0 sec  1.09 GBytes  9.32 Gbits/sec
    [  5]  1.0- 2.0 sec  242 MBytes  2.03 Gbits/sec
    [  4]  1.0- 2.0 sec  251 MBytes  2.11 Gbits/sec
    [  3]  1.0- 2.0 sec  281 MBytes  2.36 Gbits/sec
    [  6]  1.0- 2.0 sec  337 MBytes  2.83 Gbits/sec
    [SUM]  1.0- 2.0 sec  1.09 GBytes  9.33 Gbits/sec
    ^C[  5]  0.0- 2.6 sec  679 MBytes  2.15 Gbits/sec
    [  4]  0.0- 2.6 sec  715 MBytes  2.27 Gbits/sec
    [  3]  0.0- 2.6 sec  718 MBytes  2.28 Gbits/sec
    [  6]  0.0- 2.6 sec  818 MBytes  2.60 Gbits/sec
    [SUM]  0.0- 2.6 sec  2.86 GBytes  9.29 Gbits/sec

    The CPU utilization is almost zero.




  • And these are the default options that are turned on for the nic in linux.

    rx-checksumming: on
    tx-checksumming: on
    tx-checksum-ipv4: on
    tx-checksum-ipv6: on
    scatter-gather: on
    tx-scatter-gather: on
    tx-tcp-segmentation: on
    tx-tcp6-segmentation: on
    receive-hashing: on
    highdma: on [fixed]
    rx-vlan-filter: on [fixed]
    rx-vlan-stag-hw-parse: on
    rx-vlan-stag-filter: on [fixed]
    busy-poll: on [fixed]

    I have no idea how to translate these to bsd options. But I am thinking my issue lies here - what is offloaded for the nic to handle.



  • I think in BSD those settings are still set with ifconfig, where naming an option turns it on and a leading - turns it off (for example tso and -tso). If the cards need firmware to run (and most do), perhaps we should also take that into account.

    Currently, we know that by default, the hardware should be capable of pushing 2Gbit+ with no high loads. So it's not a hardware issue and we know it's not a BSD issue either since it works with FreeBSD.

    This leaves us with:

    • compile-time options in the kernel/drivers
    • firmware versions if the drivers differ in version and have different firmware blobs
    • sysctl

    Try getting sysctl -A from freebsd and from pfsense and compare those. Also check pci messages.
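
Once both dumps are saved as text, the comparison suggested above can be scripted. This is a minimal sketch, assuming the usual "name: value" line format of sysctl output; the function names are made up for illustration and the sample values are invented, not taken from either machine.

```python
def parse_sysctl(dump: str) -> dict:
    """Parse sysctl-style text ("name: value" per line) into a dict.

    Continuation lines of multi-line values are skipped in this sketch.
    """
    values = {}
    for line in dump.splitlines():
        name, sep, value = line.partition(": ")
        if sep and " " not in name:
            values[name] = value.strip()
    return values


def sysctl_diff(dump_a: str, dump_b: str) -> dict:
    """Return {name: (value_a, value_b)} for names that differ; None means missing."""
    left, right = parse_sysctl(dump_a), parse_sysctl(dump_b)
    return {
        name: (left.get(name), right.get(name))
        for name in sorted(left.keys() | right.keys())
        if left.get(name) != right.get(name)
    }


# Hypothetical sample data, for illustration only.
FREEBSD = """kern.ipc.nmbclusters: 1000000
net.link.ifqmaxlen: 50
net.inet.ip.fastforwarding: 1
"""
PFSENSE = """kern.ipc.nmbclusters: 1000000
net.link.ifqmaxlen: 2048
"""

for name, (a, b) in sysctl_diff(FREEBSD, PFSENSE).items():
    print(f"{name}: freebsd={a} pfsense={b}")
```

In practice the two strings would be read from files holding each machine's full dump; the file handling is left out to keep the sketch short.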



  • Well, the good news is I have managed to get around 4G with pf enabled, and nearly wireline with pf disabled. That is solid progress.

    There were a couple of options i had to enable in loader.conf.local

    compat.linuxkpi.mlx4_enable_sys_tune="1"
    net.link.ifqmaxlen="2048"
    net.inet.tcp.soreceive_stream="1"
    net.inet.tcp.hostcache.cachelimit="0"
    compat.linuxkpi.mlx4_inline_thold="0"
    compat.linuxkpi.mlx4_log_num_mgm_entry_size="7"
    compat.linuxkpi.mlx4_high_rate_steer="1"

    These options seem to be helping in making solid progress. I am 1G away from my goal of 5G per second with pf enabled.

    I think those are really quite reasonable numbers for this machine, expecting anything else is asking for a bit much.

    I checked the sysctl's from the freebsd box they are nearly identical.

    Thanks all for your time and help. It is genuinely appreciated.

    I will keep tinkering and post updates.



  • Are those sysctl's the same on the FreeBSD install?



  • No, they were not required on the FreeBSD install or the Linux install. The defaults just seem to work. I also didn't have a real ruleset in pf with FreeBSD like I do on this box, so that will surely affect performance numbers.

    [ 15] 26.0-27.0 sec  31.1 MBytes  261 Mbits/sec
    [  3] 26.0-27.0 sec  49.9 MBytes  418 Mbits/sec
    [  8] 26.0-27.0 sec  53.9 MBytes  452 Mbits/sec
    [ 11] 26.0-27.0 sec  35.4 MBytes  297 Mbits/sec
    [ 16] 26.0-27.0 sec  43.1 MBytes  362 Mbits/sec
    [ 17] 26.0-27.0 sec  48.1 MBytes  404 Mbits/sec
    [ 14] 26.0-27.0 sec  54.8 MBytes  459 Mbits/sec
    [  4] 26.0-27.0 sec  45.5 MBytes  382 Mbits/sec
    [ 10] 26.0-27.0 sec  62.0 MBytes  520 Mbits/sec
    [  6] 26.0-27.0 sec  24.2 MBytes  203 Mbits/sec
    [  7] 26.0-27.0 sec  14.2 MBytes  120 Mbits/sec
    [  9] 26.0-27.0 sec  38.0 MBytes  319 Mbits/sec
    [ 18] 26.0-27.0 sec  33.2 MBytes  279 Mbits/sec
    [ 13] 26.0-27.0 sec  16.8 MBytes  141 Mbits/sec
    [ 12] 26.0-27.0 sec  30.6 MBytes  257 Mbits/sec
    [  5] 26.0-27.0 sec  23.8 MBytes  199 Mbits/sec
    [SUM] 26.0-27.0 sec  605 MBytes  5.07 Gbits/sec
    [  3] 27.0-28.0 sec  51.4 MBytes  431 Mbits/sec
    [ 16] 27.0-28.0 sec  43.1 MBytes  362 Mbits/sec
    [ 15] 27.0-28.0 sec  31.0 MBytes  260 Mbits/sec
    [  4] 27.0-28.0 sec  47.9 MBytes  402 Mbits/sec
    [ 10] 27.0-28.0 sec  57.6 MBytes  483 Mbits/sec
    [  8] 27.0-28.0 sec  49.2 MBytes  413 Mbits/sec
    [ 13] 27.0-28.0 sec  16.1 MBytes  135 Mbits/sec
    [ 17] 27.0-28.0 sec  46.6 MBytes  391 Mbits/sec
    [ 14] 27.0-28.0 sec  55.6 MBytes  467 Mbits/sec
    [  6] 27.0-28.0 sec  23.0 MBytes  193 Mbits/sec
    [ 12] 27.0-28.0 sec  29.2 MBytes  245 Mbits/sec
    [ 18] 27.0-28.0 sec  34.8 MBytes  292 Mbits/sec
    [  5] 27.0-28.0 sec  23.1 MBytes  194 Mbits/sec
    [  7] 27.0-28.0 sec  11.9 MBytes  99.6 Mbits/sec
    [  9] 27.0-28.0 sec  41.0 MBytes  344 Mbits/sec
    [ 11] 27.0-28.0 sec  42.0 MBytes  352 Mbits/sec
    [SUM] 27.0-28.0 sec  604 MBytes  5.06 Gbits/sec

    So with iperf running 16 threads I can reach my 5G target with pf enabled. Which is the limit of my system with its current configuration.

    PID USERNAME  PRI NICE  SIZE    RES STATE  C  TIME    WCPU COMMAND
        0 root      -92    -    0K  5328K -      0  3:23  94.68% [kernel{mlxen0 rx cq}]
        0 root      -92    -    0K  5328K -      5  2:14  94.68% [kernel{mlxen0 rx cq}]
        0 root      -92    -    0K  5328K -      6  3:48  94.58% [kernel{mlxen0 rx cq}]
        0 root      -92    -    0K  5328K -      3  4:10  94.38% [kernel{mlxen0 rx cq}]
        0 root      -92    -    0K  5328K -      2  3:36  93.99% [kernel{mlxen0 rx cq}]
        0 root      -92    -    0K  5328K -      1  3:44  90.58% [kernel{mlxen0 rx cq}]
        0 root      -92    -    0K  5328K -      7  2:14  67.58% [kernel{mlxen0 rx cq}]

    I don't know what rx cq means, so I don't know what to tinker with.
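
    To put a number on how busy those threads are, the top output can be tallied with a short script. This is a rough sketch that assumes the column layout shown above (PID, USERNAME, PRI, NICE, SIZE, RES, STATE, C, TIME, WCPU, COMMAND); the function name and the 90% threshold are made up for illustration.

```python
def queue_threads(top_lines, marker="rx cq"):
    """Return (cpu, wcpu_percent) for each thread whose name contains marker."""
    rows = []
    for line in top_lines:
        if marker not in line:
            continue
        fields = line.split()
        # Whitespace-split columns: field 7 is the CPU number (C),
        # field 9 is WCPU with a trailing percent sign.
        cpu = int(fields[7])
        wcpu = float(fields[9].rstrip("%"))
        rows.append((cpu, wcpu))
    return rows

# Two lines copied from the top output above.
sample = [
    "    0 root      -92    -    0K  5328K -      0  3:23  94.68% [kernel{mlxen0 rx cq}]",
    "    0 root      -92    -    0K  5328K -      7  2:14  67.58% [kernel{mlxen0 rx cq}]",
]

rows = queue_threads(sample)
busy = [cpu for cpu, wcpu in rows if wcpu >= 90.0]  # arbitrary "saturated" cut-off
print(f"{len(rows)} rx threads seen, {len(busy)} at or above 90% WCPU")
```

    With the full listing above, most of the rx threads sit above 90%, which fits the box being CPU-bound at this rate.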



  • That's the receive queue AFAIK. It seems the defaults on FreeBSD vs. pfSense must be different then. Is the ifconfig status output different as well?

    For example, I have an interface that's set with:

    en4: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    options=10b<RXCSUM,TXCSUM,VLAN_HWTAGGING,AV>

    If you compare your ifconfig settings on FreeBSD vs. pfSense there might be a change there as well. Also, the driver settings could differ, but I'm not sure where they are stored for the Mellanox card.



  • Ok now I can confirm wireline speeds with this nic.

    It's my pf ruleset that is holding it back at this point.

    [ ID] Interval      Transfer    Bandwidth
    [  4]  0.0- 1.0 sec  74.6 MBytes  626 Mbits/sec
    [  6]  0.0- 1.0 sec  152 MBytes  1.28 Gbits/sec
    [  8]  0.0- 1.0 sec  163 MBytes  1.37 Gbits/sec
    [  9]  0.0- 1.0 sec  76.2 MBytes  640 Mbits/sec
    [ 13]  0.0- 1.0 sec  42.6 MBytes  358 Mbits/sec
    [ 10]  0.0- 1.0 sec  58.4 MBytes  490 Mbits/sec
    [ 12]  0.0- 1.0 sec  66.6 MBytes  559 Mbits/sec
    [ 16]  0.0- 1.0 sec  63.2 MBytes  531 Mbits/sec
    [ 14]  0.0- 1.0 sec  32.9 MBytes  276 Mbits/sec
    [ 17]  0.0- 1.0 sec  37.4 MBytes  314 Mbits/sec
    [ 18]  0.0- 1.0 sec  79.0 MBytes  663 Mbits/sec
    [  3]  0.0- 1.0 sec  57.5 MBytes  482 Mbits/sec
    [  5]  0.0- 1.0 sec  52.4 MBytes  439 Mbits/sec
    [  7]  0.0- 1.0 sec  29.1 MBytes  244 Mbits/sec
    [ 15]  0.0- 1.0 sec  75.5 MBytes  633 Mbits/sec
    [ 11]  0.0- 1.0 sec  71.1 MBytes  597 Mbits/sec
    [SUM]  0.0- 1.0 sec  1.11 GBytes  9.50 Gbits/sec
    [ 18]  1.0- 2.0 sec  49.0 MBytes  411 Mbits/sec
    [  6]  1.0- 2.0 sec  152 MBytes  1.28 Gbits/sec
    [  8]  1.0- 2.0 sec  127 MBytes  1.07 Gbits/sec
    [ 10]  1.0- 2.0 sec  70.2 MBytes  589 Mbits/sec
    [ 12]  1.0- 2.0 sec  70.4 MBytes  590 Mbits/sec
    [ 15]  1.0- 2.0 sec  70.6 MBytes  592 Mbits/sec
    [ 14]  1.0- 2.0 sec  25.9 MBytes  217 Mbits/sec
    [ 11]  1.0- 2.0 sec  68.0 MBytes  570 Mbits/sec
    [  7]  1.0- 2.0 sec  61.0 MBytes  512 Mbits/sec
    [ 13]  1.0- 2.0 sec  55.9 MBytes  469 Mbits/sec
    [ 16]  1.0- 2.0 sec  73.0 MBytes  612 Mbits/sec
    [ 17]  1.0- 2.0 sec  30.8 MBytes  258 Mbits/sec
    [  4]  1.0- 2.0 sec  81.5 MBytes  684 Mbits/sec
    [  3]  1.0- 2.0 sec  41.0 MBytes  344 Mbits/sec
    [  5]  1.0- 2.0 sec  47.1 MBytes  395 Mbits/sec
    [  9]  1.0- 2.0 sec  81.5 MBytes  684 Mbits/sec
    [SUM]  1.0- 2.0 sec  1.08 GBytes  9.27 Gbits/sec
    [ 18]  2.0- 3.0 sec  48.0 MBytes  403 Mbits/sec
    [  4]  2.0- 3.0 sec  84.9 MBytes  712 Mbits/sec
    [  3]  2.0- 3.0 sec  47.6 MBytes  400 Mbits/sec
    [  5]  2.0- 3.0 sec  49.0 MBytes  411 Mbits/sec
    [  6]  2.0- 3.0 sec  163 MBytes  1.37 Gbits/sec
    [  7]  2.0- 3.0 sec  65.5 MBytes  549 Mbits/sec
    [  8]  2.0- 3.0 sec  119 MBytes  997 Mbits/sec
    [ 10]  2.0- 3.0 sec  90.2 MBytes  757 Mbits/sec
    [  9]  2.0- 3.0 sec  82.6 MBytes  693 Mbits/sec
    [ 13]  2.0- 3.0 sec  59.9 MBytes  502 Mbits/sec
    [ 12]  2.0- 3.0 sec  57.8 MBytes  484 Mbits/sec
    [ 16]  2.0- 3.0 sec  55.5 MBytes  466 Mbits/sec
    [ 15]  2.0- 3.0 sec  57.6 MBytes  483 Mbits/sec
    [ 11]  2.0- 3.0 sec  66.2 MBytes  556 Mbits/sec
    [ 14]  2.0- 3.0 sec  33.9 MBytes  284 Mbits/sec
    [ 17]  2.0- 3.0 sec  33.4 MBytes  280 Mbits/sec
    [SUM]  2.0- 3.0 sec  1.09 GBytes  9.34 Gbits/sec
    [ 18]  3.0- 4.0 sec  42.1 MBytes  353 Mbits/sec
    [  4]  3.0- 4.0 sec  94.5 MBytes  793 Mbits/sec
    [  3]  3.0- 4.0 sec  43.4 MBytes  364 Mbits/sec
    [  5]  3.0- 4.0 sec  47.4 MBytes  397 Mbits/sec
    [  6]  3.0- 4.0 sec  171 MBytes  1.44 Gbits/sec
    [  7]  3.0- 4.0 sec  65.1 MBytes  546 Mbits/sec
    [  8]  3.0- 4.0 sec  92.8 MBytes  778 Mbits/sec
    [  9]  3.0- 4.0 sec  82.9 MBytes  695 Mbits/sec
    [ 16]  3.0- 4.0 sec  60.4 MBytes  506 Mbits/sec
    [ 15]  3.0- 4.0 sec  57.4 MBytes  481 Mbits/sec
    [ 11]  3.0- 4.0 sec  69.4 MBytes  582 Mbits/sec
    [ 13]  3.0- 4.0 sec  67.2 MBytes  564 Mbits/sec
    [ 10]  3.0- 4.0 sec  91.8 MBytes  770 Mbits/sec
    [ 14]  3.0- 4.0 sec  30.9 MBytes  259 Mbits/sec
    [ 17]  3.0- 4.0 sec  36.6 MBytes  307 Mbits/sec
    [ 12]  3.0- 4.0 sec  57.5 MBytes  482 Mbits/sec
    [SUM]  3.0- 4.0 sec  1.08 GBytes  9.31 Gbits/sec

    We can go ahead and mark this thread solved. My box will run at (near) wire speed for the 10G test machines.
    The fix was as follows:

    /boot/loader.conf.local
    compat.linuxkpi.mlx4_enable_sys_tune="1"
    net.link.ifqmaxlen="2048"
    net.inet.tcp.soreceive_stream="1"
    net.inet.tcp.hostcache.cachelimit="0"
    compat.linuxkpi.mlx4_inline_thold="0"
    compat.linuxkpi.mlx4_high_rate_steer="1"
    compat.linuxkpi.mlx4_log_num_mgm_entry_size="7"

    sysctls

    hw.mlxen0.conf.rx_size         2048
    hw.mlxen0.conf.tx_size         2048
    kern.ipc.maxsockbuf            16777216  Maximum socket buffer size
    net.link.vlan.mtag_pcp         0         Retain VLAN PCP information as packets are passed up the stack
    net.route.netisr_maxqlen       2048      maximum routing socket dispatch queue length
    net.inet.ip.intr_queue_maxlen  2048      Maximum size of the IP input queue
    net.inet.tcp.recvspace         131072    Initial receive socket buffer size
    net.inet.tcp.sendspace         131072    Initial send socket buffer size

    Next I will measure actual throughput in pps, because in doing this testing I learned that wire speed doesn't seem to mean much. That was pointed out to me a couple of times; I was just obsessed with starting from a place that is equal(ish) with Linux. I'm sure someone else will find these useful for a Mellanox ConnectX-3 adapter.

    Should I put my Chelsio T5 back in, now that I know this hardware will do what I am asking given the right tuning?

    Thanks again!



  • Yes, PPS tells you how much you can actually process, as a lower bound. If you can process a billion packets per second with tiny packets, then any bigger packet will just get you even more bandwidth.
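
    To make the arithmetic concrete, here is a small sketch relating packet rate, frame size and bandwidth. It assumes standard Ethernet framing, where every frame costs an extra 20 bytes on the wire (preamble, start delimiter and inter-frame gap); the function names and the frame sizes picked are only for illustration.

```python
# Per-frame cost on the wire that is not part of the Ethernet frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20

def line_rate_pps(link_bps, frame_bytes):
    """Packets per second needed to fill the link with frames of this size."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)

def throughput_gbps(pps, frame_bytes):
    """Frame bandwidth in Gbit/s delivered at a given packet rate."""
    return pps * frame_bytes * 8 / 1e9

# Packet rate required to saturate 10 Gbit/s at a few frame sizes.
for size in (64, 512, 1518):
    print(f"{size:>5}-byte frames: {line_rate_pps(10e9, size):,.0f} pps to fill 10G")

# The same packet rate yields far more bandwidth with larger frames.
for size in (64, 1500):
    print(f"1,000,000 pps of {size}-byte frames = {throughput_gbps(1_000_000, size):.3f} Gbit/s")
```

    Filling a 10G link with minimum-size 64-byte frames takes roughly 14.88 million packets per second, which is why a full-size-frame iperf run says little about worst-case forwarding capacity.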

    Also, it seems that at least half of the tuning settings are for the hardware driver (mlx) itself, so I imagine that if you use a Chelsio card you'll need to find the settings for that driver as well.



  • Nice to see that the issue is resolved.
    I will check whether some of these settings are useful for increasing throughput through VMs as well.
    Thanks for sharing.



  • I know this is an old thread, sorry to rezz.

    I work with the "other popular" software based firewall.
    I just got finished running 10Gb testing on 28 core HP DL380G9s running 4 10Gb NICs.
    I ran into very similar speed constraints during my testing. Out of the box the security gateway would only push 3 to 4 Gb/s via iperf3 (24 streams). By using sim_affinity (binding a NIC to a specific CPU core) I was able to get the box to run at 6 to 8 Gb/s.
    https://sc1.checkpoint.com/documents/R77/CP_R77_PerformanceTuning_WebAdmin/6731.htm (search for "sim" to jump to the sim affinity section.)

    This, of course, was not good enough, because the goal was to reach 20Gb/s using bonded NICs.

    It turns out that the fix was to enable "multi-queue" instead. This allows each interface to have multiple queues, which are serviced by as many cores as are licensed.
    https://sc1.checkpoint.com/documents/R77/CP_R77_Firewall_WebAdmin/92711.htm
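
    For anyone unfamiliar with the idea, here is a toy sketch of how multi-queue NICs typically spread traffic: a hash over each flow's addresses and ports picks the queue, so one flow always lands on one queue (and therefore one core). This is not the Check Point or FreeBSD implementation. Real NICs generally use a Toeplitz hash in hardware, and CRC32 stands in for it here purely for illustration; the addresses, ports and queue count are invented.

```python
import zlib
from collections import Counter

def queue_for_flow(src_ip, src_port, dst_ip, dst_port, proto, num_queues):
    """Pick a receive queue from the flow 5-tuple (CRC32 as a stand-in hash)."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}-{proto}".encode()
    return zlib.crc32(key) % num_queues

NUM_QUEUES = 8  # hypothetical: one queue per core on an 8-core box

# A single TCP stream: every packet carries the same 5-tuple, so every
# packet hashes to the same queue and only one core does the work.
single = {queue_for_flow("10.0.0.1", 50000, "10.0.0.2", 5001, "tcp", NUM_QUEUES)
          for _ in range(1000)}
print("queues used by one stream:", len(single))

# Sixteen parallel streams differ in source port, so they can land on
# different queues and be serviced by several cores at once.
spread = Counter(
    queue_for_flow("10.0.0.1", port, "10.0.0.2", 5001, "tcp", NUM_QUEUES)
    for port in range(50000, 50016)
)
print("queues used by 16 streams:", len(spread))
```

    Because the hash is per flow, packets within one connection stay in order; any reordering would be between different flows only. It also means a single-stream test cannot benefit from multiple queues.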

    This allowed us to max out 20Gb/s easily, and I suspect even 40Gb/s would be easily maxed out as well.

    So, my question is: does pfSense have a similar setting to allow a single interface to be serviced by multiple CPU cores? There could, of course, be issues with enabling this. One issue is packet reordering, but since this is an Internet gateway, I don't see that as a big deal.

    Thoughts?

    Edit:
    Not sure if this is the same, but it seems similar:
    https://www.netgate.com/docs/pfsense/hardware/tuning-and-troubleshooting-network-cards.html


  • Netgate Administrator

    Multiple queues should exist for ix or ixl interfaces by default. You can configure a fixed number using those options if you wish; otherwise the system will add as many as the driver supports or you have cores for.

    You should see the queues in top -aSH at the command line.

    Steve