hoping for 10Gbps, getting sub 1Gbps speed Xeon E3-1270 v5 3.6GHz
-
solved (mostly) - moved to an Intel X520-DA NIC, which supports multiple queues. The HP 10Gbps NICs seem to lack multi-queue support in the FreeBSD drivers. Now getting ~8Gbps
I spent an hour typing up a MUCH more detailed post but it keeps getting flagged as spam... sorry for such a low-quality question/post... I posted what it would allow in the replies below
hey folks
TLDR - goal is 10Gbps WAN + VLAN routing; reality is 600Mbps on DIY hardware and 2.5Gbps on the 6100; hoping for more. Is it the hardware, my config, or my misunderstanding?
Can a Xeon E3-1270 v5 3.6GHz handle 10Gbps routing?
...
First, I have a strong bias towards running pfSense on Netgate hardware. I do that, mostly Netgate 6100s, at all my sites including my main site, which this post concerns. I understand the 6100 maxes out around 2-2.5Gbps and I have no expectation otherwise. But with a 10Gbps symmetrical WAN circuit, we'd like to take advantage of more of that pipe.
Also, I know TNSR exists, but we aren't interested in migrating at this point.
I did (admittedly very) little research and thought an R230 with a Xeon E3-1270 v5 3.6GHz would be sufficient for single-threaded operations like traffic routing at 10Gbps. At least, I thought it'd be beefier than the chip in the 6100. Worth a test, right?
Seems like it is also maxing out on single-threaded routing - does that sound right?
-
After two days of pulling my hair out trying to restore the config from the 6100 to the R230 (thanks, pfBlockerNG, for borking the firewall in a restore :/ ), I have the R230 up and running. And the results are really quite poor.
My ISP has an iperf3 server three hops away from me. Details below on how I run iperf3 tests.
WAN (to the ISP's iperf3 server)
6100 - ~9.6 Gbits/sec
R230 - ~692 Mbits/sec
Intra-VLAN
6100 - ~2 Gbits/sec
R230 - 1.61 Mbits/sec ... in fact, after the first few packets, it drops to 0 Mbits/sec
For baseline, same hosts on the same subnet (VLAN): 9.27 Gbits/sec
DETAILS
Hardware
pfSense box
- Dell R230 Xeon E3-1270 v5 @ 3.6GHz
- 16GB
- 2x Samsung 850 SSD in ZFS redundant pool
- HP NC523SFP NIC in PCIe slot 2 (which I believe is a full 16 lanes)
Switches, cables & optics
- UniFi Aggregation 10G switches
- Intel 850nm SFP+ optics
- multimode patch cables (same ones used to get faster results with the 6100)
Testing
iperf3:
iperf3 -c server.fqdn.foo.bar -P 10
iperf3 -c server.fqdn.foo.bar -P 10 -R
iperf3 -c server.fqdn.foo.bar -P 10 -6
NOTE: while I have tried testing right from pfSense, all posted results are from one LAN host to another, and as indicated above the LAN hosts have no problem sending/receiving 10Gbps traffic.
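For reference, the same-subnet baseline above can be reproduced with a plain iperf3 server/client pair between the two LAN hosts; a minimal sketch (the hostname is a placeholder):
# on the first LAN host (placeholder name hostA), start a server:
iperf3 -s
# on the second LAN host, run 10 parallel streams, then the reverse direction:
iperf3 -c hostA.lan -P 10
iperf3 -c hostA.lan -P 10 -R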
system monitoring:
top -aSH
pfSense
- WAN - static IPv4, dynamic IPv6
- Hardware Checksum Offloading - tried both on and off
- Hardware TCP Segmentation Offloading - tried both on and off
- hn ALTQ support - on
-
**A more successful result from LAN to the ISP's iperf3**
Clear to see the single thread max out :/
0 root -92 - 0B 1376K CPU7 7 0:54 99.87% [kernel{ql1 rcvq}]
iperf3 -c 198.60.x.x -P 10
[SUM] 0.00-10.00 sec 1.89 GBytes 1.62 Gbits/sec sender
[SUM] 0.00-10.03 sec 1.88 GBytes 1.61 Gbits/sec receiver
-
@spacebass said in hoping for 10Gbps, getting sub 1Gbps speed Xeon E3-1270 v5 3.6GHz:
NC523SFP
What NIC chipset/driver is that? Those numbers seem really low.
Make sure you have multiple queues attached for each NIC.
Steve
-
@spacebass It's your 10GbE NIC that's causing issues.
I have worked with HPE hardware for 20 years, and the NC523 card is a dud. I can't quite remember the details, but the card took more than a year and a half's worth of Windows driver updates to its QLogic controller before it reliably delivered about 2Gbit performance. Until then it fluctuated wildly and stalled to zero any time you tried to push it above about a gigabit.
I can only guess how bad the driver state in FreeBSD is, but I'm quite sure the card is your culprit. Ditch it and get an Intel-based 10GbE NIC that has good driver support in pfSense.
-
@stephenw10 said in hoping for 10Gbps, getting sub 1Gbps speed Xeon E3-1270 v5 3.6GHz:
What NIC chipset/driver is that? Those numbers seem really low.
Thanks @stephenw10 for replying!
It is a QLogic 3200, which uses the qlxgb driver.
I applied the following tunables (but no real change):
kern.ipc.nmbjumbo9=262144
net.inet.tcp.recvbuf_max=262144
net.inet.tcp.recvbuf_inc=16384
kern.ipc.nmbclusters=1000000
kern.ipc.maxsockbuf=2097152
net.inet.tcp.recvspace=131072
net.inet.tcp.sendbuf_max=262144
net.inet.tcp.sendspace=65536
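For what it's worth, a quick sanity check after applying tunables like these (e.g. via System > Advanced > System Tunables) is to read the values back from the shell; a minimal sketch:
# read back a few of the tunables to confirm they are actually in effect
sysctl kern.ipc.nmbclusters kern.ipc.maxsockbuf
sysctl net.inet.tcp.recvbuf_max net.inet.tcp.sendbuf_max net.inet.tcp.recvspace net.inet.tcp.sendspace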
Make sure you have multiple queues attached for each NIC.
Can you elaborate? I'm not using traffic shaping and that's the only context I have for queues
-
@keyser the good news is that I have Intel cards on order...
That said, I have a ton of these HP cards and have no problem getting 10Gbps on Linux-based boxes... it could be drivers, but the qlxgb driver in FreeBSD is pretty tried and true.
-
Ah I wasn't sure if
ql1
there was that. OK, then at the very least I'd start by disabling all the hardware offloading options. But if you know exactly which driver it is, you can start looking for known bugs/workarounds.
But, yeah, if you can use an Intel NIC, you should.
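If you want to toggle those from the shell for a quick test (the persistent settings are the checkboxes under System > Advanced > Networking), a minimal sketch using the ql interface names from the top output, assuming the driver exposes these capabilities:
# temporarily disable checksum offload, TSO and LRO on both ql ports (lost on reboot)
ifconfig ql0 -txcsum -rxcsum -tso -lro
ifconfig ql1 -txcsum -rxcsum -tso -lro
# confirm which capabilities remain enabled
ifconfig ql0 | grep options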
-
@stephenw10 thanks - if I want to add the full (supposed) 8 queues, how would I go about it?
-
How many queues are you getting?
It's probably a sysctl or loader tunable for that driver. Without having one to test it's difficult to say.
Try
sysctl hw.ql
or maybe
sysctl hw.qlxgb
and see what values exist. Also check
sysctl dev.ql.0
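A combined check along those lines might look like this (a sketch; the qlxgb driver may simply not expose any queue-related OIDs at all):
# look for anything queue/ring/MSI-X related that the driver exposes
sysctl hw.ql hw.qlxgb 2>/dev/null
sysctl dev.ql.0 | grep -iE 'queue|ring|msi'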
-
@stephenw10 said in hoping for 10Gbps, getting sub 1Gbps speed Xeon E3-1270 v5 3.6GHz:
sysctl dev.ql.0
that's the key ... no multi-queue options there though, and I don't see any listed in the readme.txt in the driver's source.
I'll wait for the Intel NIC and see what I can get out of that
-
How many queues is it using by default? If it's just 1, that would explain the single-threaded performance.
-
@stephenw10 I'll fire the box up and check ... just curious, what am I looking for in the output of sysctl? I didn't see anything with 'mq' or 'queues' in the output when I first checked.
perhaps related - what does it tell us that I get closer to 4Gbps with the -R flag on iperf3 (i.e., inbound) vs 2Gbps without the flag?
I'm not overly determined to make this Qlogic card work, but this is a really good learning opportunity and I'm enjoying the process.
-
You might have more Rx queues than Tx queues, for example. Most drivers show that in the boot logs but I'm not sure qlxgb does.
vmstat -i
may also show it.
-
@stephenw10
Thanks for the continued help...Here's what I see
dev.ql.0.wake: 0
dev.ql.0.num_sds_rings: 4
dev.ql.0.num_rds_rings: 2
dev.ql.0.free_pkt_thres: 1024
dev.ql.0.snd_pkt_thres: 16
dev.ql.0.rcv_pkt_thres_d: 32
dev.ql.0.rcv_pkt_thres: 128
dev.ql.0.jumbo_replenish: 2
dev.ql.0.std_replenish: 8
dev.ql.0.debug: 0
dev.ql.0.fw_version: 4.16.50.1401759177
dev.ql.0.stats: 0
dev.ql.0.%parent: pci2
dev.ql.0.%pnpinfo: vendor=0x1077 device=0x8020 subvendor=0x103c subdevice=0x3733 class=0x020000
dev.ql.0.%location: slot=0 function=0 dbsf=pci0:2:0:0 handle=\_SB_.PCI0.PEG1.PEGP
dev.ql.0.%driver: ql
dev.ql.0.%desc: Qlogic ISP 80xx PCI CNA Adapter-Ethernet Function v1.1.36
I'm having trouble finding much documentation online for this driver... would snd_pkt_thres be the number of threads it is able to use (or is currently using) for the outbound queues?
Here's the output from a TrueNAS box with the same card which has no trouble moving 10Gbps traffic:
dev.ql.0.%desc: Qlogic ISP 80xx PCI CNA Adapter-Ethernet Function v1.1.36
root@matterhorn[~]# sysctl dev.ql.1
dev.ql.1.num_sds_rings: 4
dev.ql.1.num_rds_rings: 2
dev.ql.1.free_pkt_thres: 1024
dev.ql.1.snd_pkt_thres: 16
dev.ql.1.rcv_pkt_thres_d: 32
dev.ql.1.rcv_pkt_thres: 128
dev.ql.1.jumbo_replenish: 2
dev.ql.1.std_replenish: 8
dev.ql.1.debug: 0
dev.ql.1.fw_version: 4.20.1.1429931003
dev.ql.1.stats: 0
-
Mmm, so likely those are the default values. There should be a description of each tunable if you run:
sysctl -d dev.ql.0
-
@stephenw10 said in hoping for 10Gbps, getting sub 1Gbps speed Xeon E3-1270 v5 3.6GHz:
sysctl -d dev.ql.0
well that's a helpful command! thanks
still not seeing anything about queues though :/
dev.ql.0:
dev.ql.0.wake: Device set to wake the system
dev.ql.0.num_sds_rings: Number of Status Descriptor Rings
dev.ql.0.num_rds_rings: Number of Rcv Descriptor Rings
dev.ql.0.free_pkt_thres: Threshold for # of packets to free at a time
dev.ql.0.snd_pkt_thres: Threshold for # of snd packets
dev.ql.0.rcv_pkt_thres_d: Threshold for # of rcv pkts to trigger indication defered
dev.ql.0.rcv_pkt_thres: Threshold for # of rcv pkts to trigger indication isr
dev.ql.0.jumbo_replenish: Threshold for Replenishing Jumbo Frames
dev.ql.0.std_replenish: Threshold for Replenishing Standard Frames
dev.ql.0.debug: Debug Level
dev.ql.0.fw_version: firmware version
dev.ql.0.stats: Statistics
dev.ql.0.%parent: parent device
dev.ql.0.%pnpinfo: device identification
dev.ql.0.%location: device location relative to parent
dev.ql.0.%driver: device driver name
dev.ql.0.%desc: device description
Also interesting: when I do an iperf3 with -R, here's the top output on the pfSense box...
CPU 0:  0.0% user, 0.0% nice,  1.9% system, 86.5% interrupt, 11.6% idle
CPU 1:  0.4% user, 0.0% nice,  8.1% system,  2.7% interrupt, 88.8% idle
CPU 2:  0.0% user, 0.0% nice,  5.8% system, 54.8% interrupt, 39.4% idle
CPU 3:  0.4% user, 0.0% nice,  8.5% system,  6.6% interrupt, 84.6% idle
CPU 4:  0.8% user, 0.0% nice, 25.9% system,  6.2% interrupt, 67.2% idle
CPU 5:  0.4% user, 0.0% nice, 21.6% system,  7.3% interrupt, 70.7% idle
CPU 6:  0.4% user, 0.0% nice,  0.8% system, 51.4% interrupt, 47.5% idle
CPU 7:  0.4% user, 0.0% nice,  5.4% system,  8.9% interrupt, 85.3% idle
Mem: 305M Active, 201M Inact, 872M Wired, 14G Free
ARC: 404M Total, 63M MFU, 336M MRU, 32K Anon, 1218K Header, 3853K Other
     122M Compressed, 283M Uncompressed, 2.31:1 Ratio
Swap: 1024M Total, 1024M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
12 root -92 - 0B 528K CPU0 0 0:32 90.84% [intr{irq265: ql0}]
11 root 155 ki31 0B 128K CPU1 1 45:28 87.99% [idle{idle: cpu1}]
11 root 155 ki31 0B 128K CPU7 7 45:32 86.71% [idle{idle: cpu7}]
11 root 155 ki31 0B 128K CPU3 3 45:28 83.73% [idle{idle: cpu3}]
11 root 155 ki31 0B 128K RUN 5 45:17 71.47% [idle{idle: cpu5}]
11 root 155 ki31 0B 128K CPU4 4 45:15 70.76% [idle{idle: cpu4}]
12 root -92 - 0B 528K CPU2 2 0:24 60.84% [intr{irq266: ql0}]
0 root -92 - 0B 1376K - 6 0:31 54.00% [kernel{ql1 txq}]
11 root 155 ki31 0B 128K RUN 6 44:26 48.31% [idle{idle: cpu6}]
12 root -92 - 0B 528K WAIT 6 0:49 43.46% [intr{irq268: ql1}]
11 root 155 ki31 0B 128K RUN 2 44:57 37.84% [idle{idle: cpu2}]
12 root -92 - 0B 528K WAIT 6 0:49 23.37% [intr{irq264: ql0}]
11 root 155 ki31 0B 128K RUN 0 45:02 12.88% [idle{idle: cpu0}]
0 root -92 - 0B 1376K - 7 0:07 12.82% [kernel{ql0 rcvq}]
0 root -92 - 0B 1376K - 1 0:12 5.35% [kernel{ql1 rcvq}]
99487 avahi 20 0 13M 4500K select 5 0:06 2.82% avahi-daemon: running [washington.local]
12 root -92 - 0B 528K WAIT 4 0:20 2.34% [intr{irq267: ql0}]
0 root -92 - 0B 1376K - 1 0:11 1.83% [kernel{ql0 txq}]
-
Well at least 4 IRQs for ql0 there. Does
vmstat -i
show those?
Nothing about queues there, I agree. That's the sort of setting that would usually be a loader value though. Those are usually shown in hw.ql, but only values that are set are shown.
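Two quick checks along those lines (a sketch; sysctl -T lists only OIDs that are settable from the loader, so an empty result would mean there is nothing to tune):
# map interrupt vectors to the ql ports (roughly one vector per queue)
vmstat -i | grep ql
# list any loader-settable tunables for the driver; set values would go in /boot/loader.conf.local
sysctl -aT | grep -i ql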
-
Intel X520-DA2 update
tl;dr: better performance for sure, but still not 10Gbps. 8 CPU cores, each NIC using 4 queues.
I'm increasingly of the opinion that even with a beefy CPU, pfSense just doesn't like doing 10Gbps.
iperf3
iperf3 -c ISP's server -P 10
[SUM] 0.00-10.00 sec 5.64 GBytes 4.84 Gbits/sec 1455 sender
[SUM] 0.00-10.03 sec 5.63 GBytes 4.82 Gbits/sec receiver
iperf3 -c ISP's server -P 10 -R
[SUM] 0.00-10.03 sec 5.18 GBytes 4.43 Gbits/sec 4033 sender
[SUM] 0.00-10.00 sec 5.14 GBytes 4.42 Gbits/sec receiver
iperf3 -c local server on other vLAN -P 18
[SUM] 0.00-10.00 sec 5.40 GBytes 4.64 Gbits/sec 11944 sender
[SUM] 0.00-10.01 sec 5.39 GBytes 4.62 Gbits/sec receiver
top
last pid: 52809;  load averages: 0.71, 0.41, 0.36  up 0+00:28:16  16:09:59
742 threads: 10 running, 699 sleeping, 33 waiting
CPU 0:  0.0% user, 0.0% nice, 31.1% system, 0.0% interrupt, 68.9% idle
CPU 1:  0.0% user, 0.0% nice,  0.0% system, 0.0% interrupt,  100% idle
CPU 2:  0.4% user, 0.0% nice,  3.9% system, 0.0% interrupt, 95.7% idle
CPU 3:  0.4% user, 0.0% nice,  0.4% system, 0.0% interrupt, 99.2% idle
CPU 4:  0.0% user, 0.0% nice, 75.2% system, 0.0% interrupt, 24.8% idle
CPU 5:  0.4% user, 0.0% nice,  0.0% system, 0.0% interrupt, 99.6% idle
CPU 6:  0.4% user, 0.0% nice, 75.6% system, 0.0% interrupt, 24.0% idle
CPU 7:  0.0% user, 0.0% nice,  0.4% system, 0.0% interrupt, 99.6% idle
Mem: 297M Active, 178M Inact, 920M Wired, 14G Free
ARC: 384M Total, 53M MFU, 326M MRU, 32K Anon, 1086K Header, 3216K Other
     118M Compressed, 268M Uncompressed, 2.27:1 Ratio
Swap: 1024M Total, 1024M Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0B 128K CPU7 7 27:56 99.89% [idle{idle: cpu7}]
11 root 155 ki31 0B 128K CPU5 5 27:55 99.73% [idle{idle: cpu5}]
11 root 155 ki31 0B 128K CPU3 3 27:56 99.16% [idle{idle: cpu3}]
11 root 155 ki31 0B 128K CPU1 1 27:55 99.14% [idle{idle: cpu1}]
11 root 155 ki31 0B 128K RUN 2 26:31 95.65% [idle{idle: cpu2}]
0 root -76 - 0B 1376K - 6 1:13 78.85% [kernel{if_io_tqg_6}]
0 root -76 - 0B 1376K CPU4 4 1:16 72.63% [kernel{if_io_tqg_4}]
11 root 155 ki31 0B 128K RUN 0 26:34 69.36% [idle{idle: cpu0}]
0 root -76 - 0B 1376K - 0 1:27 30.30% [kernel{if_io_tqg_0}]
11 root 155 ki31 0B 128K RUN 4 26:34 27.30% [idle{idle: cpu4}]
11 root 155 ki31 0B 128K CPU6 6 26:57 21.08% [idle{idle: cpu6}]
0 root -76 - 0B 1376K - 2 1:40 3.32% [kernel{if_io_tqg_2}]
67535 unbound 20 0 107M 54M kqread 4 0:02 0.40% /usr/local/sbin/unbound -c /var/unbox
dmesg
root: dmesg | grep queues
ix0: Using 4 RX queues 4 TX queues
ix0: allocated for 4 queues
ix0: allocated for 4 rx queues
ix0: netmap queues/slots: TX 4/2048, RX 4/2048
ix1: Using 4 RX queues 4 TX queues
ix1: allocated for 4 queues
ix1: allocated for 4 rx queues
ix1: netmap queues/slots: TX 4/2048, RX 4/2048
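If you want to try pushing the ix ports past 4 queues, the generic iflib override tunables are the usual knob. A sketch for /boot/loader.conf.local, assuming this pfSense build honors them (values are illustrative; reboot, then re-check with dmesg | grep queues):
# request 8 Tx/Rx queues per port instead of the 4 seen above (driver/hardware may cap this lower)
dev.ix.0.iflib.override_ntxqs="8"
dev.ix.0.iflib.override_nrxqs="8"
dev.ix.1.iflib.override_ntxqs="8"
dev.ix.1.iflib.override_nrxqs="8"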
-
@spacebass
Pretty much exactly the same on my end. I will now try to get in touch with our ISP again to make sure it's not their core router being the culprit here!
https://www.reddit.com/r/PFSENSE/comments/137iv07/comment/jj6oqw4/?utm_source=share&utm_medium=web2x&context=3