Netgate Discussion Forum

    Solved - 10Gb link, 1Gb speeds

    Category: Hardware
    donnydavis:

      Here are stats from the same link on the same router running CentOS 7.4. These are with the factory defaults and no iptables rules enabled.

      ------------------------------------------------------------
      Client connecting to ..., TCP port 5001
      TCP window size: 85.0 KByte (default)

      [ ID] Interval      Transfer    Bandwidth
      [  5]  0.0- 1.0 sec  256 MBytes  2.15 Gbits/sec
      [  4]  0.0- 1.0 sec  270 MBytes  2.26 Gbits/sec
      [  3]  0.0- 1.0 sec  258 MBytes  2.17 Gbits/sec
      [  6]  0.0- 1.0 sec  327 MBytes  2.75 Gbits/sec
      [SUM]  0.0- 1.0 sec  1.09 GBytes  9.32 Gbits/sec
      [  5]  1.0- 2.0 sec  242 MBytes  2.03 Gbits/sec
      [  4]  1.0- 2.0 sec  251 MBytes  2.11 Gbits/sec
      [  3]  1.0- 2.0 sec  281 MBytes  2.36 Gbits/sec
      [  6]  1.0- 2.0 sec  337 MBytes  2.83 Gbits/sec
      [SUM]  1.0- 2.0 sec  1.09 GBytes  9.33 Gbits/sec
      ^C[  5]  0.0- 2.6 sec  679 MBytes  2.15 Gbits/sec
      [  4]  0.0- 2.6 sec  715 MBytes  2.27 Gbits/sec
      [  3]  0.0- 2.6 sec  718 MBytes  2.28 Gbits/sec
      [  6]  0.0- 2.6 sec  818 MBytes  2.60 Gbits/sec
      [SUM]  0.0- 2.6 sec  2.86 GBytes  9.29 Gbits/sec

      The CPU utilization is almost zero.

      [attachment: cpu-util-rtr.png]
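
      (For reference, output in this format comes from a classic iperf2 parallel-stream run; a minimal invocation that would produce it, assuming iperf2 on both ends and the default port 5001, is something like:)

      # on the server
      iperf -s

      # on the client: 4 parallel streams, report every second
      iperf -c <server-ip> -P 4 -i 1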

      donnydavis:

        And these are the default options that are turned on for the NIC in Linux.

        rx-checksumming: on
        tx-checksumming: on
        tx-checksum-ipv4: on
        tx-checksum-ipv6: on
        scatter-gather: on
        tx-scatter-gather: on
        tx-tcp-segmentation: on
        tx-tcp6-segmentation: on
        receive-hashing: on
        highdma: on [fixed]
        rx-vlan-filter: on [fixed]
        rx-vlan-stag-hw-parse: on
        rx-vlan-stag-filter: on [fixed]
        busy-poll: on [fixed]

        I have no idea how to translate these to BSD options, but I am thinking my issue lies here: what is offloaded for the NIC to handle.
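
        (The rough FreeBSD equivalents of those toggles are the interface capability flags, set with ifconfig; a minimal sketch of checking and flipping them, assuming the card shows up as mlxen0 as it does later in this thread:)

        # list the capabilities the driver supports
        ifconfig -m mlxen0

        # the current state is the "options=" line in the normal output
        ifconfig mlxen0

        # enable checksum offload, TSO and LRO...
        ifconfig mlxen0 rxcsum txcsum tso lro

        # ...or turn them off (a leading '-' disables a capability)
        ifconfig mlxen0 -rxcsum -txcsum -tso -lro

        (On pfSense the checksum/TSO/LRO offloads can also be toggled persistently under System > Advanced > Networking.)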

        Guest:

          I think in BSD those settings are still set with ifconfig using the + and - options. If the cards need firmware to run (and most do), perhaps we should also take that into account.

          Currently, we know that by default, the hardware should be capable of pushing 2Gbit+ with no high loads. So it's not a hardware issue and we know it's not a BSD issue either since it works with FreeBSD.

          This leaves us with:

          • compile-time options in the kernel/drivers
          • firmware versions if the drivers differ in version and have different firmware blobs
          • sysctl settings

          Try getting sysctl -a output from both FreeBSD and pfSense and compare them. Also check the PCI messages.
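
          (A quick way to do that comparison, assuming shell access on both boxes:)

          # dump everything on each box
          sysctl -a > /tmp/sysctl-freebsd.txt     # on the stock FreeBSD install
          sysctl -a > /tmp/sysctl-pfsense.txt     # on the pfSense box

          # copy one dump over and diff the two
          scp root@<freebsd-box>:/tmp/sysctl-freebsd.txt /tmp/
          diff /tmp/sysctl-freebsd.txt /tmp/sysctl-pfsense.txt | less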

          donnydavis:

            Well, the good news is I have managed to get around 4G with pf enabled, and nearly wire speed with pf disabled. That is solid progress.

            There were a couple of options I had to set in loader.conf.local:

            compat.linuxkpi.mlx4_enable_sys_tune="1"
            net.link.ifqmaxlen="2048"
            net.inet.tcp.soreceive_stream="1"
            net.inet.tcp.hostcache.cachelimit="0"
            compat.linuxkpi.mlx4_inline_thold="0"
            compat.linuxkpi.mlx4_log_num_mgm_entry_size="7"
            compat.linuxkpi.mlx4_high_rate_steer="1"
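
            (These are loader tunables, so they only take effect after a reboot; a quick way to confirm they were picked up, using nothing beyond the stock tools:)

            # loader.conf.local values land in the kernel environment
            kenv | grep -e mlx4 -e ifqmaxlen

            # the non-driver ones can also be read back as sysctls
            sysctl net.link.ifqmaxlen net.inet.tcp.soreceive_stream net.inet.tcp.hostcache.cachelimit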

            These options seem to be helping me make solid progress. I am 1G away from my goal of 5 Gbit/s with pf enabled.

            I think those are really quite reasonable numbers for this machine; expecting anything more is asking a bit much.

            I checked the sysctls from the FreeBSD box; they are nearly identical.

            Thanks all for your time and help. It is genuinely appreciated.

            I will keep tinkering and post updates.

            Guest:

              Are those sysctls the same on the FreeBSD install?

              donnydavis:

                No, they were not required on the FreeBSD install or the Linux install; the defaults just seem to work. I also didn't have a real ruleset in pf with FreeBSD like I do on this box, so that will surely affect the performance numbers.

                [ 15] 26.0-27.0 sec  31.1 MBytes  261 Mbits/sec
                [  3] 26.0-27.0 sec  49.9 MBytes  418 Mbits/sec
                [  8] 26.0-27.0 sec  53.9 MBytes  452 Mbits/sec
                [ 11] 26.0-27.0 sec  35.4 MBytes  297 Mbits/sec
                [ 16] 26.0-27.0 sec  43.1 MBytes  362 Mbits/sec
                [ 17] 26.0-27.0 sec  48.1 MBytes  404 Mbits/sec
                [ 14] 26.0-27.0 sec  54.8 MBytes  459 Mbits/sec
                [  4] 26.0-27.0 sec  45.5 MBytes  382 Mbits/sec
                [ 10] 26.0-27.0 sec  62.0 MBytes  520 Mbits/sec
                [  6] 26.0-27.0 sec  24.2 MBytes  203 Mbits/sec
                [  7] 26.0-27.0 sec  14.2 MBytes  120 Mbits/sec
                [  9] 26.0-27.0 sec  38.0 MBytes  319 Mbits/sec
                [ 18] 26.0-27.0 sec  33.2 MBytes  279 Mbits/sec
                [ 13] 26.0-27.0 sec  16.8 MBytes  141 Mbits/sec
                [ 12] 26.0-27.0 sec  30.6 MBytes  257 Mbits/sec
                [  5] 26.0-27.0 sec  23.8 MBytes  199 Mbits/sec
                [SUM] 26.0-27.0 sec  605 MBytes  5.07 Gbits/sec
                [  3] 27.0-28.0 sec  51.4 MBytes  431 Mbits/sec
                [ 16] 27.0-28.0 sec  43.1 MBytes  362 Mbits/sec
                [ 15] 27.0-28.0 sec  31.0 MBytes  260 Mbits/sec
                [  4] 27.0-28.0 sec  47.9 MBytes  402 Mbits/sec
                [ 10] 27.0-28.0 sec  57.6 MBytes  483 Mbits/sec
                [  8] 27.0-28.0 sec  49.2 MBytes  413 Mbits/sec
                [ 13] 27.0-28.0 sec  16.1 MBytes  135 Mbits/sec
                [ 17] 27.0-28.0 sec  46.6 MBytes  391 Mbits/sec
                [ 14] 27.0-28.0 sec  55.6 MBytes  467 Mbits/sec
                [  6] 27.0-28.0 sec  23.0 MBytes  193 Mbits/sec
                [ 12] 27.0-28.0 sec  29.2 MBytes  245 Mbits/sec
                [ 18] 27.0-28.0 sec  34.8 MBytes  292 Mbits/sec
                [  5] 27.0-28.0 sec  23.1 MBytes  194 Mbits/sec
                [  7] 27.0-28.0 sec  11.9 MBytes  99.6 Mbits/sec
                [  9] 27.0-28.0 sec  41.0 MBytes  344 Mbits/sec
                [ 11] 27.0-28.0 sec  42.0 MBytes  352 Mbits/sec
                [SUM] 27.0-28.0 sec  604 MBytes  5.06 Gbits/sec

                So with iperf running 16 parallel streams I can reach my 5G target with pf enabled, which is the limit of my system with its current configuration.

                PID USERNAME  PRI NICE  SIZE    RES STATE  C  TIME    WCPU COMMAND
                    0 root      -92    -    0K  5328K -      0  3:23  94.68% [kernel{mlxen0 rx cq}]
                    0 root      -92    -    0K  5328K -      5  2:14  94.68% [kernel{mlxen0 rx cq}]
                    0 root      -92    -    0K  5328K -      6  3:48  94.58% [kernel{mlxen0 rx cq}]
                    0 root      -92    -    0K  5328K -      3  4:10  94.38% [kernel{mlxen0 rx cq}]
                    0 root      -92    -    0K  5328K -      2  3:36  93.99% [kernel{mlxen0 rx cq}]
                    0 root      -92    -    0K  5328K -      1  3:44  90.58% [kernel{mlxen0 rx cq}]
                    0 root      -92    -    0K  5328K -      7  2:14  67.58% [kernel{mlxen0 rx cq}]

                I don't know what rx cq means, so I don't know what to tinker with.
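
                (Those threads are the mlx4 driver's per-queue receive completion queue workers; two stock tools show how that load is spread across queues and CPUs:)

                # per-CPU load of the per-queue kernel threads (this is what produced the listing above)
                top -aSH

                # interrupt distribution across the card's queue vectors
                vmstat -i | grep mlx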

                Guest:

                  That's the receive completion queue, AFAIK. It seems the defaults on FreeBSD vs. pfSense must be different, then. Is the ifconfig status output different as well?

                  For example, I have an interface that's set with:

                  en4: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
                        options=10b<RXCSUM,TXCSUM,VLAN_HWTAGGING,AV>

                  If you compare your ifconfig settings on FreeBSD vs. pfSense there might be a change there as well. Also, the driver settings could differ, but I'm not sure where they are stored for the Mellanox card.

                  donnydavis:

                    OK, now I can confirm wire speed with this NIC.

                    It's my pf ruleset that is holding it back at this point.

                    [ ID] Interval      Transfer    Bandwidth
                    [  4]  0.0- 1.0 sec  74.6 MBytes  626 Mbits/sec
                    [  6]  0.0- 1.0 sec  152 MBytes  1.28 Gbits/sec
                    [  8]  0.0- 1.0 sec  163 MBytes  1.37 Gbits/sec
                    [  9]  0.0- 1.0 sec  76.2 MBytes  640 Mbits/sec
                    [ 13]  0.0- 1.0 sec  42.6 MBytes  358 Mbits/sec
                    [ 10]  0.0- 1.0 sec  58.4 MBytes  490 Mbits/sec
                    [ 12]  0.0- 1.0 sec  66.6 MBytes  559 Mbits/sec
                    [ 16]  0.0- 1.0 sec  63.2 MBytes  531 Mbits/sec
                    [ 14]  0.0- 1.0 sec  32.9 MBytes  276 Mbits/sec
                    [ 17]  0.0- 1.0 sec  37.4 MBytes  314 Mbits/sec
                    [ 18]  0.0- 1.0 sec  79.0 MBytes  663 Mbits/sec
                    [  3]  0.0- 1.0 sec  57.5 MBytes  482 Mbits/sec
                    [  5]  0.0- 1.0 sec  52.4 MBytes  439 Mbits/sec
                    [  7]  0.0- 1.0 sec  29.1 MBytes  244 Mbits/sec
                    [ 15]  0.0- 1.0 sec  75.5 MBytes  633 Mbits/sec
                    [ 11]  0.0- 1.0 sec  71.1 MBytes  597 Mbits/sec
                    [SUM]  0.0- 1.0 sec  1.11 GBytes  9.50 Gbits/sec
                    [ 18]  1.0- 2.0 sec  49.0 MBytes  411 Mbits/sec
                    [  6]  1.0- 2.0 sec  152 MBytes  1.28 Gbits/sec
                    [  8]  1.0- 2.0 sec  127 MBytes  1.07 Gbits/sec
                    [ 10]  1.0- 2.0 sec  70.2 MBytes  589 Mbits/sec
                    [ 12]  1.0- 2.0 sec  70.4 MBytes  590 Mbits/sec
                    [ 15]  1.0- 2.0 sec  70.6 MBytes  592 Mbits/sec
                    [ 14]  1.0- 2.0 sec  25.9 MBytes  217 Mbits/sec
                    [ 11]  1.0- 2.0 sec  68.0 MBytes  570 Mbits/sec
                    [  7]  1.0- 2.0 sec  61.0 MBytes  512 Mbits/sec
                    [ 13]  1.0- 2.0 sec  55.9 MBytes  469 Mbits/sec
                    [ 16]  1.0- 2.0 sec  73.0 MBytes  612 Mbits/sec
                    [ 17]  1.0- 2.0 sec  30.8 MBytes  258 Mbits/sec
                    [  4]  1.0- 2.0 sec  81.5 MBytes  684 Mbits/sec
                    [  3]  1.0- 2.0 sec  41.0 MBytes  344 Mbits/sec
                    [  5]  1.0- 2.0 sec  47.1 MBytes  395 Mbits/sec
                    [  9]  1.0- 2.0 sec  81.5 MBytes  684 Mbits/sec
                    [SUM]  1.0- 2.0 sec  1.08 GBytes  9.27 Gbits/sec
                    [ 18]  2.0- 3.0 sec  48.0 MBytes  403 Mbits/sec
                    [  4]  2.0- 3.0 sec  84.9 MBytes  712 Mbits/sec
                    [  3]  2.0- 3.0 sec  47.6 MBytes  400 Mbits/sec
                    [  5]  2.0- 3.0 sec  49.0 MBytes  411 Mbits/sec
                    [  6]  2.0- 3.0 sec  163 MBytes  1.37 Gbits/sec
                    [  7]  2.0- 3.0 sec  65.5 MBytes  549 Mbits/sec
                    [  8]  2.0- 3.0 sec  119 MBytes  997 Mbits/sec
                    [ 10]  2.0- 3.0 sec  90.2 MBytes  757 Mbits/sec
                    [  9]  2.0- 3.0 sec  82.6 MBytes  693 Mbits/sec
                    [ 13]  2.0- 3.0 sec  59.9 MBytes  502 Mbits/sec
                    [ 12]  2.0- 3.0 sec  57.8 MBytes  484 Mbits/sec
                    [ 16]  2.0- 3.0 sec  55.5 MBytes  466 Mbits/sec
                    [ 15]  2.0- 3.0 sec  57.6 MBytes  483 Mbits/sec
                    [ 11]  2.0- 3.0 sec  66.2 MBytes  556 Mbits/sec
                    [ 14]  2.0- 3.0 sec  33.9 MBytes  284 Mbits/sec
                    [ 17]  2.0- 3.0 sec  33.4 MBytes  280 Mbits/sec
                    [SUM]  2.0- 3.0 sec  1.09 GBytes  9.34 Gbits/sec
                    [ 18]  3.0- 4.0 sec  42.1 MBytes  353 Mbits/sec
                    [  4]  3.0- 4.0 sec  94.5 MBytes  793 Mbits/sec
                    [  3]  3.0- 4.0 sec  43.4 MBytes  364 Mbits/sec
                    [  5]  3.0- 4.0 sec  47.4 MBytes  397 Mbits/sec
                    [  6]  3.0- 4.0 sec  171 MBytes  1.44 Gbits/sec
                    [  7]  3.0- 4.0 sec  65.1 MBytes  546 Mbits/sec
                    [  8]  3.0- 4.0 sec  92.8 MBytes  778 Mbits/sec
                    [  9]  3.0- 4.0 sec  82.9 MBytes  695 Mbits/sec
                    [ 16]  3.0- 4.0 sec  60.4 MBytes  506 Mbits/sec
                    [ 15]  3.0- 4.0 sec  57.4 MBytes  481 Mbits/sec
                    [ 11]  3.0- 4.0 sec  69.4 MBytes  582 Mbits/sec
                    [ 13]  3.0- 4.0 sec  67.2 MBytes  564 Mbits/sec
                    [ 10]  3.0- 4.0 sec  91.8 MBytes  770 Mbits/sec
                    [ 14]  3.0- 4.0 sec  30.9 MBytes  259 Mbits/sec
                    [ 17]  3.0- 4.0 sec  36.6 MBytes  307 Mbits/sec
                    [ 12]  3.0- 4.0 sec  57.5 MBytes  482 Mbits/sec
                    [SUM]  3.0- 4.0 sec  1.08 GBytes  9.31 Gbits/sec

                    We can go ahead and mark this thread solved; my box will run at (near) wire speed for the 10G test machines.
                    The fix was as follows:

                    /boot/loader.conf.local
                    compat.linuxkpi.mlx4_enable_sys_tune="1"
                    net.link.ifqmaxlen="2048"
                    net.inet.tcp.soreceive_stream="1"
                    net.inet.tcp.hostcache.cachelimit="0"
                    compat.linuxkpi.mlx4_inline_thold="0"
                    compat.linuxkpi.mlx4_high_rate_steer="1"
                    compat.linuxkpi.mlx4_log_num_mgm_entry_size="7"

                    sysctls

                    hw.mlxen0.conf.rx_size = 2048
                    hw.mlxen0.conf.tx_size = 2048
                    kern.ipc.maxsockbuf = 16777216        (maximum socket buffer size)
                    net.link.vlan.mtag_pcp = 0            (retain VLAN PCP information as packets are passed up the stack)
                    net.route.netisr_maxqlen = 2048       (maximum routing socket dispatch queue length)
                    net.inet.ip.intr_queue_maxlen = 2048  (maximum size of the IP input queue)
                    net.inet.tcp.recvspace = 131072       (initial receive socket buffer size)
                    net.inet.tcp.sendspace = 131072       (initial send socket buffer size)
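
                    (The loader.conf.local entries need a reboot; the sysctl values can be tried live first and then made persistent. A minimal sketch, assuming shell access on the pfSense box:)

                    # test a value at runtime
                    sysctl kern.ipc.maxsockbuf=16777216
                    sysctl net.inet.tcp.recvspace=131072 net.inet.tcp.sendspace=131072

                    # to keep runtime sysctls across reboots on pfSense, add them under
                    # System > Advanced > System Tunables (or /etc/sysctl.conf on plain FreeBSD)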

                    Next I will measure actual throughput in pps, because in doing this testing I learned that wire speed doesn't seem to mean much. That was pointed out to me a couple of times; I was just obsessed with starting from a place that is equal(ish) with Linux. I'm sure someone else will find these settings useful for a Mellanox ConnectX-3 adapter.

                    Should I put my Chelsio T5 back in? I know this hardware will do what I am asking given the right tuning.

                    Thanks again!

                    Guest:

                      Yes, PPS tells you how much you can actually process as a lower bound. If you can process a billion tiny packets per second, then any packet that is bigger will just get you even more bandwidth.

                      Also, it seems that at least half of the tuning settings are for the hardware driver (mlx) itself, so I imagine that if you use a Chelsio card you'll need to find the settings for that driver as well.
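
                      For a sense of scale, wire speed on 10GbE maps to very different packet rates depending on frame size (counting Ethernet preamble, header, FCS and the inter-frame gap, a 1500-byte MTU frame occupies 1538 bytes on the wire and a minimum-size 64-byte frame occupies 84):

                      1500-byte MTU frames:    10,000,000,000 / (1538 * 8) ≈ 812,000 pps
                      64-byte minimum frames:  10,000,000,000 / (84 * 8)   ≈ 14,880,000 pps

                      So a box that comfortably forwards around 1 Mpps can fill 10G with full-size frames but comes nowhere near it with small packets.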

                      fwcheck:

                        Nice to see that the issue is resolved.
                        I will check if some of these settings are useful for increasing throughput through VMs as well.
                        Thanks for sharing.

                        t1n_gate:

                          I know this is an old thread, sorry to resurrect it.

                          I work with the "other popular" software-based firewall.
                          I just got finished running 10Gb testing on 28-core HP DL380 G9s running four 10Gb NICs.
                          I ran into very similar speed constraints during my testing. Out of the box the security gateway would only push 3 to 4 Gb/s via iperf3 (24 streams). By using sim_affinity (binding a NIC to a specific CPU core) I was able to get the box to run at 6 to 8 Gb/s.
                          https://sc1.checkpoint.com/documents/R77/CP_R77_PerformanceTuning_WebAdmin/6731.htm (search for "sim" to jump to the sim affinity section.)

                          This, of course, was not good enough, because the goal was to reach 20Gb/s using bonded NICs.

                          It turns out that the fix was to enable "multi-queue" instead. This allows each interface to have multiple queues that are serviced by the licensed cores.
                          https://sc1.checkpoint.com/documents/R77/CP_R77_Firewall_WebAdmin/92711.htm

                          This allowed us to max out 20Gb/s easily, and I suspect even 40Gb/s would be easily maxed out as well.

                          So, my question is: does pfSense have a similar setting to allow a single interface to be serviced by multiple CPU cores? There could, of course, be issues with enabling this. One issue is packet reordering, but since this is an Internet gateway, I don't see that as a big deal.

                          Thoughts?

                          Edit:
                          Not sure if this is the same, but it seems similar:
                          https://www.netgate.com/docs/pfsense/hardware/tuning-and-troubleshooting-network-cards.html

                          stephenw10 (Netgate Administrator):

                            Multiple queues should exist for ix or ixl interfaces by default. You can configure a fixed number using those options if you wish; otherwise the system will add as many as the driver supports or you have cores for.

                            You should see the queues in top -aSH at the command line.

                            Steve
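
                            (A quick way to see how many queues the driver actually created, plus — as an assumption, since the exact tunable names vary by driver and FreeBSD version — the style of loader tunable used to pin the count:)

                            # each queue shows up as its own interrupt vector (and as a kernel thread in top -aSH)
                            vmstat -i | grep ix0

                            # assumed tunable names for /boot/loader.conf.local; check the driver's man page
                            hw.ix.num_queues="4"
                            hw.ixl.max_queues="4"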
