Netgate Discussion Forum

Multicore forwarding

TNSR

  • oudenos
    last edited by Mar 16, 2022, 1:17 PM

    I have been playing with TNSR in my home lab since yesterday, and so far I'm very impressed.

    My goal now is to achieve 10 Gbps line rate (14 Mpps) of plain IP forwarding, but I'm struggling. I've tried reserving cores (corelist-workers), setting up queues (num-rx-queues), and assigning queues to cores (rx-queue), but it doesn't seem to help: I barely get a 20% speedup (with 4 cores instead of one).

    Am I missing something?
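
    For reference, the 14 Mpps target is simply 10GbE line rate with minimum-size (64-byte) frames:

    64 B frame + 20 B per-frame overhead (8 B preamble + 12 B inter-frame gap) = 84 B = 672 bits
    10,000,000,000 bit/s / 672 bit ≈ 14.88 Mpps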

    • oudenos @oudenos
      last edited by Mar 16, 2022, 4:01 PM

      Just to provide more context: initial single-core forwarding runs at around 4.5-5 Mpps. Now, using 4 cores, I get 8.5 Mpps, which isn't even twice the single-core rate, so I'm definitely missing something.

      dataplane cpu corelist-workers 2
      dataplane cpu corelist-workers 3
      dataplane cpu corelist-workers 4
      dataplane cpu corelist-workers 5
      dataplane ethernet default-mtu 9000
      dataplane dpdk dev 0000:07:00.0 network num-rx-desc 4096
      dataplane dpdk dev 0000:07:00.0 network num-rx-queues 4
      
      interface TenGigabitEthernet7/0/0
          rx-queue 0 cpu 2
          rx-queue 1 cpu 3
          rx-queue 2 cpu 4
          rx-queue 3 cpu 5
      exit
      

      The card is a Mellanox ConnectX-4 Lx MCX4121A-XCAT
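
      For reference, the worker threads and the queue-to-worker placement can be checked from the VPP CLI; a rough sketch, assuming the standard VPP debug commands `show threads` and `show interface rx-placement` are reachable through dataplane shell:

      dataplane shell sudo vppctl show threads
      dataplane shell sudo vppctl show interface rx-placement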

      • Derelict LAYER 8 Netgate @oudenos
        last edited by Mar 24, 2022, 7:44 PM

        @oudenos How are you testing?

        • oudenos @Derelict
          last edited by Mar 25, 2022, 12:39 PM

           @derelict I'm using a MikroTik router as a traffic generator, connected to a switch so I have independent stats on packet traffic. Further testing suggests that the Mellanox card may be using only 2 receive queues, even though 4 are configured (perhaps some form of Receive Side Scaling gone wrong?).

           The same does not happen with an Intel X710, which correctly uses 4 queues and scales across 4 cores, hence operating at line rate.
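
           One way to see how many queues are actually receiving traffic is the per-queue counters (rx_q0_packets, rx_q1_packets, ...) in the extended stats; a sketch, assuming vppctl access through dataplane shell:

           dataplane shell sudo vppctl show hardware-interfaces TenGigabitEthernet7/0/0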

          • Derelict LAYER 8 Netgate @oudenos
            last edited by Mar 25, 2022, 1:45 PM

            @oudenos

             Try this for the Mellanox:

            no dataplane dpdk dev 0000:07:00.0 network num-rx-desc 4096
            dataplane ethernet default-mtu 1500
            dataplane dpdk no-multi-seg
            dataplane cpu corelist-workers 2
            dataplane cpu corelist-workers 3
            dataplane cpu corelist-workers 4
            dataplane cpu corelist-workers 5
            dataplane dpdk dev 0000:07:00.0 network num-rx-queues 4
            dataplane dpdk dev 0000:07:00.0 network num-tx-queues 5
            
            # drop the queue pinning
            interface TenGigabitEthernet7/0/0
                no rx-queue 0 cpu 2
                no rx-queue 1 cpu 3
                no rx-queue 2 cpu 4
                no rx-queue 3 cpu 5
            exit
            

            • oudenos @Derelict
              last edited by Mar 29, 2022, 8:16 AM

               @derelict Thank you for taking the time to look into this, and sorry for the delay; I couldn't test earlier.

               I tried the setup you suggested and there is indeed an improvement: it's now forwarding at around 12 Mpps (still 2 Mpps behind the Intel). Could you please share the reasoning behind it?

              As for the problem itself, my guess is the card is only using one rx queue:

              vpp# show hardware-interfaces 
                            Name                Idx   Link  Hardware
              TenGigabitEthernet7/0/0            1     up   TenGigabitEthernet7/0/0
                Link speed: 10 Gbps
                RX Queues:
                  queue thread         mode      
                  0     vpp_wk_0 (1)   polling   
                  1     vpp_wk_1 (2)   polling   
                  2     vpp_wk_2 (3)   polling   
                  3     vpp_wk_3 (4)   polling   
                Ethernet address b8:ce:f6:cc:f8:28
                Mellanox ConnectX-4 Family
                  carrier up full duplex mtu 9206 
                  flags: admin-up pmd tx-offload intel-phdr-cksum rx-ip4-cksum
                  Devargs: 
                  rx: queues 4 (max 1024), desc 1024 (min 0 max 65535 align 1)
                  tx: queues 5 (max 1024), desc 1024 (min 0 max 65535 align 1)
                  pci: device 15b3:1015 subsystem 15b3:0004 address 0000:07:00.00 numa 0
                  switch info: name 0000:07:00.0 domain id 0 port id 65535
                  max rx packet len: 65536
                  promiscuous: unicast off all-multicast on
                  vlan offload: strip off filter off qinq off
                  rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum tcp-lro 
                                     vlan-filter jumbo-frame scatter timestamp keep-crc 
                                     rss-hash buffer-split 
                  rx offload active: ipv4-cksum 
                  tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso 
                                     outer-ipv4-cksum vxlan-tnl-tso gre-tnl-tso geneve-tnl-tso 
                                     multi-segs mbuf-fast-free udp-tnl-tso ip-tnl-tso 
                  tx offload active: udp-cksum tcp-cksum 
                  rss avail:         ipv4-frag ipv4-tcp ipv4-udp ipv4-other ipv4 ipv6-tcp-ex 
                                     ipv6-udp-ex ipv6-frag ipv6-tcp ipv6-udp ipv6-other 
                                     ipv6-ex ipv6 l4-dst-only l4-src-only l3-dst-only l3-src-only 
                  rss active:        ipv4-frag ipv4-tcp ipv4-udp ipv4-other ipv4 ipv6-tcp-ex 
                                     ipv6-udp-ex ipv6-frag ipv6-tcp ipv6-udp ipv6-other 
                                     ipv6-ex ipv6 
                  tx burst mode: No MPW + SWP  + CSUM + INLINE + METADATA
                  rx burst mode: Vector SSE
              
                  tx frames ok                                         560
                  tx bytes ok                                        33990
                  rx frames ok                                  6432726393
                  rx bytes ok                                 385963591865
                  rx missed                                     1590501788
                  extended stats:
                    rx_good_packets                             6432726393
                    tx_good_packets                                    560
                    rx_good_bytes                             385963591865
                    tx_good_bytes                                    33990
                    rx_missed_errors                            1590501788
                    rx_q0_packets                               6432726352
                    rx_q0_bytes                               385963583350
                    rx_q1_packets                                        9
                    rx_q1_bytes                                       1638
                    rx_q2_packets                                       14
                    rx_q2_bytes                                       3898
                    rx_q3_packets                                       18
                    rx_q3_bytes                                       2979
                    tx_q0_packets                                        7
                    tx_q0_bytes                                        602
                    tx_q1_packets                                      542
                    tx_q1_bytes                                      32562
                    tx_q2_packets                                        8
                    tx_q2_bytes                                        560
                    tx_q3_packets                                        1
                    tx_q3_bytes                                         86
                    tx_q4_packets                                        2
                    tx_q4_bytes                                        180
                    rx_unicast_packets                          8023281551
                    rx_unicast_bytes                          481396893900
                    tx_unicast_packets                                 539
                    tx_unicast_bytes                                 32340
                    rx_multicast_packets                                59
                    rx_multicast_bytes                                7586
                    tx_multicast_packets                                20
                    tx_multicast_bytes                                1636
                    rx_broadcast_packets                                32
                    rx_broadcast_bytes                                6777
                    tx_broadcast_packets                                 2
                    tx_broadcast_bytes                                 120
                    tx_phy_packets                                     561
                    rx_phy_packets                              8023279352
                    tx_phy_bytes                                     36340
                    rx_phy_bytes                              513489889095
              
              • Derelict LAYER 8 Netgate @oudenos
                last edited by Mar 29, 2022, 1:36 PM

                 @oudenos What is the system setup? How many sockets, where is the NIC located, clock speed, core counts, memory amount and layout (1 DIMM, 6 DIMMs, memory clock), etc.?

                MLNX_DPDK_Quick_Start_Guide

                Was your testing of the Intel X710 in the same system or something else?

                • oudenos @Derelict
                  last edited by Mar 29, 2022, 2:36 PM

                   @derelict OK, so I run TNSR inside KVM with PCIe device passthrough for the NIC. The hypervisor itself is a two-socket NUMA system, but I allocated 8 cores from the same NUMA node and 8 GB of RAM to the VM.

                  The Intel X710 runs inside an identical system on another hypervisor.
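
                   For what it's worth, this is roughly how that kind of allocation is usually expressed in a libvirt domain XML; a sketch only, the host core IDs and node number are illustrative rather than my exact config:

                   <vcpu placement='static'>8</vcpu>
                   <cputune>
                     <!-- pin each vCPU to a host core taken from a single NUMA node (illustrative core IDs) -->
                     <vcpupin vcpu='0' cpuset='2'/>
                     <vcpupin vcpu='1' cpuset='3'/>
                     <vcpupin vcpu='2' cpuset='4'/>
                     <vcpupin vcpu='3' cpuset='5'/>
                     <vcpupin vcpu='4' cpuset='6'/>
                     <vcpupin vcpu='5' cpuset='7'/>
                     <vcpupin vcpu='6' cpuset='8'/>
                     <vcpupin vcpu='7' cpuset='9'/>
                   </cputune>
                   <numatune>
                     <!-- keep guest memory allocated from the same node -->
                     <memory mode='strict' nodeset='0'/>
                   </numatune>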

                  • Derelict LAYER 8 Netgate @oudenos
                    last edited by Derelict Mar 29, 2022, 7:21 PM

                    @oudenos

                     Did you allocate the cores on the same NUMA node where the CX4 resides?

                     The PCI slots are connected directly to one of the two CPUs' PCIe controllers. If the NIC is on NUMA1 and TNSR is running on NUMA0, then all PCIe requests have to go from Socket 0, to Socket 1, then to the NIC and back. That will put the hurt on performance.

                    • oudenos @Derelict
                      last edited by Mar 29, 2022, 7:43 PM

                      @derelict How can I check that?

                      • Derelict LAYER 8 Netgate @oudenos
                        last edited by Mar 30, 2022, 4:14 PM

                        @oudenos Something like this might help you map your system:

                        apt-get update
                        apt-get install hwloc
                        lstopo --output-format png > ~tnsr/lstopo.png

                        Then scp that image off and view it with your preferred method.
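
                         If hwloc isn't convenient, the NIC's NUMA node can also be read on the hypervisor straight from sysfs and compared with lscpu's CPU-to-node mapping (note the PCI address on the host may differ from the 0000:07:00.0 the VM sees):

                         cat /sys/bus/pci/devices/0000:07:00.0/numa_node
                         lscpu | grep -i numa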

                        • oudenos @Derelict
                          last edited by Apr 7, 2022, 8:19 AM

                           @derelict Thank you for your help, and sorry for the delay.
                           As you correctly pointed out, the NIC is owned by the "wrong" CPU. However, the same happens with the Intel one. Now I'm getting inconsistent results across reboots, which leads me to think some sort of Receive Side Scaling issue is involved. I'm working on setting up a more realistic testbed with TRex as the traffic generator and will also address NUMA pinning. I will get back to you as soon as I have results worth sharing with the community.

                          • LukeCage
                            last edited by May 26, 2022, 10:31 AM

                             I'm following the thread.

                             I'm looking forward to the results. Good work!

                            • oudenos @LukeCage
                              last edited by May 26, 2022, 1:17 PM

                               @lukecage Unfortunately, there is not much to share.
                               I noticed that changing the number of queues at runtime often requires rebooting both the VM (TNSR) and the hypervisor (Ubuntu 20.04 + KVM) to work properly; otherwise many packets get lost. This happens with both the Mellanox and the Intel, though the former seems to be more affected. No idea why; I suspect the only way to correctly reinitialize the NIC is to power-cycle it, or maybe I did something wrong with KVM. (*)
                               Also, I'm pretty sure both NICs use Receive Side Scaling (RSS), and I had to switch my traffic generator to TRex in order to get more flow entropy. I also tried an Intel E810 card at 100 Gbps, but it doesn't scale beyond one CPU core. Again, I believe some sort of queue-related issue is involved; perhaps I should try the latest versions of VPP, DPDK, driver, and firmware, but that requires building a TNSR-like distro from scratch, AFAIK.

                               (*) If someone is willing to give it a try, it may be worth starting with bare metal rather than KVM: DPDK uses hugepages, and KVM by default only uses 2 MB hugepages, not 1 GB.
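
                               For anyone who does try the KVM route anyway: 1 GB hugepages can usually be enabled by reserving them on the host at boot and backing the guest memory with them. A rough sketch; the kernel parameters and libvirt snippet may need adapting to your host:

                               # host kernel command line (e.g. via GRUB), then reboot
                               default_hugepagesz=1G hugepagesz=1G hugepages=8

                               # guest libvirt domain XML
                               <memoryBacking>
                                 <hugepages>
                                   <page size='1' unit='G'/>
                                 </hugepages>
                               </memoryBacking>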

                               Eventually, I decided to give up on this for the moment. It's a shame, but I don't have enough time to do all the tests.

                              Thank you very much to the community and @Derelict for the help they provided.

                              • LukeCage @oudenos
                                last edited by May 26, 2022, 2:36 PM

                                 @oudenos Which CPU are you using?

                                 My goal is 38 Mpps over a GRE tunnel (I can't test more because my upstream doesn't allow it).

                                 Via the interface directly, 98 Mpps.

                                 And why do you need to run multiple cores? TNSR performance is fine.

                                • oudenos @LukeCage
                                  last edited by May 26, 2022, 2:46 PM

                                   @lukecage Intel Xeon Silver 4210R.
                                   With proper NUMA pinning I can achieve 5.5 Mpps of IP forwarding per core. Splitting traffic across multiple queues should enable multicore processing.

                                   How did you reach those numbers without tuning?

                                  • LukeCage @oudenos
                                    last edited by May 26, 2022, 2:56 PM

                                    @oudenos

                                     I use it directly on bare metal, not via VMware or any other virtualization.

                                     After all, you are deploying a router: if you need high capacity, you should install TNSR directly on the server.

                                     My hardware specs:

                                     i9-9900K
                                     32 GB RAM
                                     240 GB SSD

                                    uptime
                                    14:51:44 up 69 days, 5:37, 2 users, load average: 1.00, 1.00, 1.00

                                     I was using the CentOS version of TNSR before; as you can tell from the uptime, I switched to Ubuntu on that date and have been running it on Ubuntu without any problems since.

                                    • oudenos @LukeCage
                                      last edited by May 26, 2022, 3:03 PM

                                      @lukecage Please run the following and post the result

                                      dataplane shell sudo vppctl
                                      show hardware-interfaces
                                      
                                      • LukeCage @oudenos
                                        last edited by May 26, 2022, 3:40 PM

                                         @oudenos Check your private messages.
