Multicore forwarding
-
Try this for the Mellanox:
no dataplane dpdk dev 0000:07:00.0 network num-rx-desc 4096
dataplane ethernet default-mtu 1500
dataplane dpdk no-multi-seg
dataplane cpu corelist-workers 2
dataplane cpu corelist-workers 3
dataplane cpu corelist-workers 4
dataplane cpu corelist-workers 5
dataplane dpdk dev 0000:07:00.0 network num-rx-queues 4
dataplane dpdk dev 0000:07:00.0 network num-tx-queues 5

# drop the queue pinning
interface TenGigabitEthernet7/0/0
    no rx-queue 0 cpu 2
    no rx-queue 1 cpu 3
    no rx-queue 2 cpu 4
    no rx-queue 3 cpu 5
exit
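Once the dataplane restarts, it's worth double-checking that the four RX queues really landed on different workers. From the TNSR CLI, something like the vppctl wrapper used later in this thread should show it:
dataplane shell sudo vppctl show hardware-interfaces
Each RX queue should be listed against its own vpp_wk_* thread.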
-
@derelict Thank you for taking the time to look into this, and sorry for the delay; I couldn't test earlier.
I tried the setup you suggested and there is indeed an improvement: it now forwards at around 12 Mpps (still about 2 Mpps behind the Intel). Could you please share the reasoning behind those settings?
As for the problem itself, my guess is that the card is only really using one RX queue:
vpp# show hardware-interfaces
              Name                Idx   Link  Hardware
TenGigabitEthernet7/0/0            1     up   TenGigabitEthernet7/0/0
  Link speed: 10 Gbps
  RX Queues:
    queue thread         mode
    0     vpp_wk_0 (1)   polling
    1     vpp_wk_1 (2)   polling
    2     vpp_wk_2 (3)   polling
    3     vpp_wk_3 (4)   polling
  Ethernet address b8:ce:f6:cc:f8:28
  Mellanox ConnectX-4 Family
    carrier up full duplex mtu 9206
    flags: admin-up pmd tx-offload intel-phdr-cksum rx-ip4-cksum
    Devargs:
    rx: queues 4 (max 1024), desc 1024 (min 0 max 65535 align 1)
    tx: queues 5 (max 1024), desc 1024 (min 0 max 65535 align 1)
    pci: device 15b3:1015 subsystem 15b3:0004 address 0000:07:00.00 numa 0
    switch info: name 0000:07:00.0 domain id 0 port id 65535
    max rx packet len: 65536
    promiscuous: unicast off all-multicast on
    vlan offload: strip off filter off qinq off
    rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum tcp-lro
                       vlan-filter jumbo-frame scatter timestamp keep-crc
                       rss-hash buffer-split
    rx offload active: ipv4-cksum
    tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum tcp-tso
                       outer-ipv4-cksum vxlan-tnl-tso gre-tnl-tso
                       geneve-tnl-tso multi-segs mbuf-fast-free udp-tnl-tso
                       ip-tnl-tso
    tx offload active: udp-cksum tcp-cksum
    rss avail:         ipv4-frag ipv4-tcp ipv4-udp ipv4-other ipv4
                       ipv6-tcp-ex ipv6-udp-ex ipv6-frag ipv6-tcp ipv6-udp
                       ipv6-other ipv6-ex ipv6 l4-dst-only l4-src-only
                       l3-dst-only l3-src-only
    rss active:        ipv4-frag ipv4-tcp ipv4-udp ipv4-other ipv4
                       ipv6-tcp-ex ipv6-udp-ex ipv6-frag ipv6-tcp ipv6-udp
                       ipv6-other ipv6-ex ipv6
    tx burst mode: No MPW + SWP + CSUM + INLINE + METADATA
    rx burst mode: Vector SSE

    tx frames ok                                         560
    tx bytes ok                                        33990
    rx frames ok                                  6432726393
    rx bytes ok                                 385963591865
    rx missed                                     1590501788
    extended stats:
      rx_good_packets                             6432726393
      tx_good_packets                                    560
      rx_good_bytes                             385963591865
      tx_good_bytes                                    33990
      rx_missed_errors                            1590501788
      rx_q0_packets                               6432726352
      rx_q0_bytes                               385963583350
      rx_q1_packets                                        9
      rx_q1_bytes                                       1638
      rx_q2_packets                                       14
      rx_q2_bytes                                       3898
      rx_q3_packets                                       18
      rx_q3_bytes                                       2979
      tx_q0_packets                                        7
      tx_q0_bytes                                        602
      tx_q1_packets                                      542
      tx_q1_bytes                                      32562
      tx_q2_packets                                        8
      tx_q2_bytes                                        560
      tx_q3_packets                                        1
      tx_q3_bytes                                         86
      tx_q4_packets                                        2
      tx_q4_bytes                                        180
      rx_unicast_packets                          8023281551
      rx_unicast_bytes                          481396893900
      tx_unicast_packets                                 539
      tx_unicast_bytes                                 32340
      rx_multicast_packets                                59
      rx_multicast_bytes                                7586
      tx_multicast_packets                                20
      tx_multicast_bytes                                1636
      rx_broadcast_packets                                32
      rx_broadcast_bytes                                6777
      tx_broadcast_packets                                 2
      tx_broadcast_bytes                                 120
      tx_phy_packets                                     561
      rx_phy_packets                              8023279352
      tx_phy_bytes                                     36340
      rx_phy_bytes                              513489889095
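For completeness, the per-worker load tells the same story. A quick check (standard vppctl commands, wrapped in the TNSR CLI the same way as later in this thread) would be something like:
dataplane shell sudo vppctl clear runtime
dataplane shell sudo vppctl show runtime
If only one vpp_wk_* thread shows a meaningful vector rate while traffic is running, that matches the rx_q0 counters above.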
-
@oudenos What is the system setup? How many sockets, where is the NIC located, clock speed, core counts, memory amount and layout (1 DIMM or 6 DIMMs), memory clock, and so on.
Was your testing of the Intel X710 done in the same system or in a different one?
-
@derelict OK, so I run TNSR inside KVM with PCIe device passthrough for the NIC. The hypervisor itself is a two-socket NUMA system, but I allocated 8 cores from the same NUMA node and 8 GB of RAM to the VM.
The Intel X710 runs inside an identical system on another hypervisor.
-
Did you allocate the cores on the same NUMA node the CX4 resides on?
The PCIe slots are each connected directly to one of the two CPUs. If the NIC is on NUMA node 1 and TNSR is running on NUMA node 0, then every PCIe request has to go from socket 0 to socket 1, then to the NIC and back. That will put the hurt on performance.
-
@derelict How can I check that?
-
@oudenos Something like this might help you map your system:
apt-get update
apt-get install hwloc
lstopo --output-format png > ~tnsr/lstopo.png
Then scp that image off and view it with your preferred method.
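Alternatively, if you'd rather not install anything, a plain sysfs/lscpu check (using the PCI address from your output above) should tell you the same thing:
cat /sys/bus/pci/devices/0000:07:00.0/numa_node   # NUMA node that owns the NIC (-1 means the kernel doesn't know)
lscpu | grep -i NUMA                              # which cores belong to which node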
-
@derelict Thank you for your help and sorry for the delay.
As you correctly pointed out, the NIC is owned by the "wrong" CPU. However, the same happens with the Intel one. I'm now getting inconsistent results across reboots, which leads me to think some sort of receive-side scaling is involved. I'm working on setting up a more realistic testbed with TRex as the traffic generator and will also address the NUMA pinning. I will get back to you as soon as I have results worth sharing with the community.
-
I'm following the thread and looking forward to the results. Good work!
-
@lukecage Unfortunately, there is not much to share.
I noticed that changing the number of queues at runtime often requires rebooting both the VM (TNSR) and the hypervisor (Ubuntu 20.04 + KVM) to work properly; otherwise many packets get lost. This happens with both the Mellanox and the Intel, though the former seems more affected. I have no idea why; I suspect the only way to correctly reinitialize the NIC is to power-cycle it, or maybe I did something wrong on the KVM side. (*)
Also, I'm pretty sure both NICs use Receive Side Scaling (RSS), and I had to switch my traffic generator to TRex in order to get more entropy in the flows. I also tried this on an Intel E810 card at 100 Gbps, but it doesn't scale to more than one CPU core there either. Again, I believe some queue-related thing went wrong; perhaps I should try the latest versions of VPP, DPDK, driver and firmware, but as far as I know that requires building a TNSR-like distro from scratch. (*) If someone is willing to give it a try, it may be worth starting with bare metal rather than KVM: DPDK uses hugepages, and KVM by default only supports 2 MB hugepages, not 1 GB ones.
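For anyone who wants to check the hugepage situation on their own hypervisor, something like this (plain Linux, nothing TNSR-specific) shows which sizes the kernel supports and what is reserved:
ls /sys/kernel/mm/hugepages/    # hugepage sizes supported by the running kernel
grep -i huge /proc/meminfo      # totals reserved/free for the default hugepage size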
Eventually, I decided to give up on this for the moment. It's a shame, but I don't have enough time to do all the tests.
Thank you very much to the community and @Derelict for the help they provided.
-
@oudenos Which CPU are you using?
My goal is 38 Mpps over a GRE tunnel (I can't test more because my upstream doesn't allow it).
Via the interface directly it's 98 Mpps.
And why do you need to run multiple cores? TNSR performance is fine.
-
@lukecage Intel Xeon Silver 4210R
With proper NUMA pinning I can achieve 5.5 Mpps of IP forwarding per core. Splitting into multiple queues should enable multicore processing. How did you reach those numbers without tuning?
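Back-of-the-envelope, assuming minimum-size 64-byte frames: a 10 GbE port tops out at roughly 14.88 Mpps, while 4 workers x 5.5 Mpps is about 22 Mpps, so once the queues actually spread the load the port, not the cores, should be the bottleneck.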
-
I use it directly on bare metal, not via VMware or any other virtualization. After all, you are deploying a router: if you need high capacity, you should install TNSR directly on the server.
My hardware specs:
i9-9900K
32 GB RAM
240 GB SSD

uptime
14:51:44 up 69 days, 5:37, 2 users, load average: 1.00, 1.00, 1.00

I was using the CentOS version of TNSR before; as the uptime shows, I switched to Ubuntu on that date and have been running it on Ubuntu without any problems since.
-
@lukecage Please run the following and post the result
dataplane shell sudo vppctl show hardware-interfaces
-
@oudenos Check your private messages.