Intel I210 low throughput w/ VLANs
-
I've spent the last 3 days trying to figure this one out... Just installed a Protectli FW4B with a Celeron J3160 and 8GB RAM running 2.7 CE. No packages except ACME for certs, no IPS/IDS or anything like that. Two interfaces in use: igb0 for WAN, igb1 for LAN. Running 21 VLANs under igb1, with the parent/untagged interface unused. Static IP directly on the WAN (no PPPoE or anything like that). No routing between VLANs; all traffic from each VLAN hits its VLAN interface and then gets 1:1 NAT'd to a public VIP on the /27 my ISP has assigned to me.
I'm having trouble routing more than 750Mb/s out of the VLANs (igb1.x) towards the Internet (igb0). Running iPerf traffic through the appliance from one of the VLANs to a public iPerf server (using -R), I get around 562Mb/s with 141 retries. CPU stays around 34% utilization, which leads me to believe this isn't CPU-bound (I think). Some boot log output is below in case it's helpful. I've tried tweaking the offload settings and currently have everything unchecked (enabled) except ALTQ support (I know it's advised to leave TSO and LRO checked, but I'm experimenting at this point to resolve this).
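For reference, the offload flags that are actually active on a NIC can be double-checked from the shell after toggling those boxes; a quick sanity check (igb1 used as an example here):
# Show which offloads are currently enabled on the interface (TSO4, LRO, VLAN_HWTAGGING, etc.)
ifconfig igb1 | grep options
# Show the full list of capabilities the driver supports
ifconfig -m igb1 | grep capabilities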
I read somewhere that increasing the RX and TX descriptors below from 1024 to 4096 may help, so that may be the next thing I'll try, thoughts? This thing is already deployed and running (stable), just not getting the expected speed, so I have to make tweaks during a maintenance window. I did notice while checking system activity that [kernel{if_io_tqg_3}] is only eating up a single core during these iPerf tests, maybe something there? I'd appreciate any insight or tweaks I can try to squeeze out more performance! Besides the large number of VLANs, I'm not doing anything special, no inter-VLAN routing. Just basic 1:1 NAT for each IP in each /30 VLAN sub-interface and minimal FW rules (like 5 floating rules and that's it).
cat /var/log/dmesg.boot | grep igb
igb0: <Intel(R) I210 Flashless (Copper)> port 0xe000-0xe01f mem 0xb1500000-0xb151ffff,0xb1520000-0xb1523fff at device 0.0 on pci1
igb0: NVM V0.6 imgtype6
igb0: Using 1024 TX descriptors and 1024 RX descriptors
igb0: Using 4 RX queues 4 TX queues
igb0: Using MSI-X interrupts with 5 vectors
igb0: Ethernet address: 00:e0:67:30:6e:b0
igb0: netmap queues/slots: TX 4/1024, RX 4/1024
igb1: <Intel(R) I210 Flashless (Copper)> port 0xd000-0xd01f mem 0xb1400000-0xb141ffff,0xb1420000-0xb1423fff at device 0.0 on pci2
igb1: NVM V0.6 imgtype6
igb1: Using 1024 TX descriptors and 1024 RX descriptors
igb1: Using 4 RX queues 4 TX queues
igb1: Using MSI-X interrupts with 5 vectors
igb1: Ethernet address: 00:e0:67:30:6e:b1
igb1: netmap queues/slots: TX 4/1024, RX 4/1024
igb2: <Intel(R) I210 Flashless (Copper)> port 0xc000-0xc01f mem 0xb1300000-0xb131ffff,0xb1320000-0xb1323fff at device 0.0 on pci3
igb2: NVM V0.6 imgtype6
igb2: Using 1024 TX descriptors and 1024 RX descriptors
igb2: Using 4 RX queues 4 TX queues
igb2: Using MSI-X interrupts with 5 vectors
igb2: Ethernet address: 00:e0:67:30:6e:b2
igb2: netmap queues/slots: TX 4/1024, RX 4/1024
igb3: <Intel(R) I210 Flashless (Copper)> port 0xb000-0xb01f mem 0xb1200000-0xb121ffff,0xb1220000-0xb1223fff at device 0.0 on pci4
igb3: NVM V0.6 imgtype6
igb3: Using 1024 TX descriptors and 1024 RX descriptors
igb3: Using 4 RX queues 4 TX queues
igb3: Using MSI-X interrupts with 5 vectors
igb3: Ethernet address: 00:e0:67:30:6e:b3
igb3: netmap queues/slots: TX 4/1024, RX 4/1024
-
@thetrevster I should also mention: I've seen throughput hit over 850Mb/s, but it hits 850, fades down to 600, ramps right back up to 850, then slowly declines again to around 600, and the cycle repeats over and over throughout a test, if that's indicative of anything. Maybe that points to the TX and RX descriptor adjustments mentioned above?
-
@thetrevster said in Intel I210 low throughput w/ VLANs:
I did notice while checking system activity that [kernel{if_io_tqg_3}] is only eating up a single core during these iPerf tests,
Right, 34% total CPU use could still be 100% of one core on a 4-core CPU. Try checking the per-core usage either at the command line using
top -HaSP
or in Diag > System Activity in the GUI. Are you sure the WAN will pass 1G up and down? You can try running iperf on pfSense directly and testing against it to confirm the LAN side is passing 1G.
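For example, something along these lines (assuming iperf3 is installed on pfSense; the 192.168.1.1 address is just a placeholder for your LAN interface IP):
# On pfSense, start an iperf3 server from the shell:
iperf3 -s
# From a client on the LAN, test towards the firewall itself:
iperf3 -c 192.168.1.1 -P 8 -t 30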
Steve
-
@stephenw10 it does appear a single core is getting eaten - so I suppose the real question is: is VLAN tagging offloaded to the NICs, or is each VLAN processed single-threaded within pfSense? I am sure the WAN does 1Gb/s; when I plug directly into the ISP-provided fiber switch, I see a consistent 930Mb/s symmetrical. I did perform the local iPerf test. I ran two tests simultaneously to get better CPU utilization, and the combined throughput across both tests was about 679Mb/s (332 + 347). CPU spiked to around 72%. Running a single iPerf test yielded an average of 674Mb/s, with CPU at 61%. I would have thought two iPerf tests running on separate cores could have hit a combined total of over 900Mb/s. All iPerf tests mentioned were run with 8 streams (-P8).
As for the LAN-to-pfSense test you mentioned, I performed that as well and saw results similar to the above. So this makes me think it's not one specific "side" but maybe something like NIC queuing? VLANs would be out of the equation for the pfSense-to-Internet iPerf tests, since the WAN port is a straight L3 physical interface with no sub-interfaces/tagging. Is it possible I need to tweak something with the I210 NICs?
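One thing I could check next time is whether traffic is actually spreading across all four queues, for example by watching the per-queue interrupt counters while a test runs (the exact vector names may differ a bit by version; igb1 is just the LAN NIC here):
# Per-queue MSI-X interrupt counts for igb1; run before and during a test to see which queues are incrementing
vmstat -i | grep igb1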
-
@thetrevster I have stumbled upon dev.igb.0.iflib.override_nrxds and dev.igb.0.iflib.override_ntxds. I'm wondering if those values need to be adjusted for each applicable igb interface in question. I was mainly looking at the content in post 15 here: https://hardforum.com/threads/pfsense-2-5-0-upgrade-results.2008073/
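If I understand it correctly (please correct me if not), those are loader tunables that would go in /boot/loader.conf.local, one pair per interface, and only take effect after a reboot; roughly something like:
# /boot/loader.conf.local - per-interface descriptor overrides (example values, applied at boot)
dev.igb.0.iflib.override_nrxds="4096"
dev.igb.0.iflib.override_ntxds="4096"
dev.igb.1.iflib.override_nrxds="4096"
dev.igb.1.iflib.override_ntxds="4096"
After a reboot, the dmesg lines above should then report the new descriptor counts instead of 1024.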
-
I wouldn't have expected VLANs to make any significant difference there.
You can certainly try setting a different value for the queue descriptors. That's not normally required for igb though.
-
@stephenw10 so I'm definitely still having issues, but I searched around and found the following forum post. It lines up almost exactly with what I'm running into, but it seems the OP there never found a solution…
https://forum.netgate.com/topic/148800/throughput-expectations-on-celeron-igb-driver-system
-
Ah, I was just about to suggest the same thing I did there. Is your CPU stuck at some low frequency mode?
Check:
sysctl dev.cpu.0
What does the
top -HaSP
output actually look like when you are testing?
-
@stephenw10 We might be getting somewhere... I'm not positive what I'm looking at, but the output of sysctl dev.cpu.0 is below. If I remember right, a freq value of 1601 indicates that Turbo Boost (or whatever this CPU has) is allowed, with the upper limit apparently being 2GHz according to the freq_levels below.
dev.cpu.0.temperature: 34.0C
dev.cpu.0.coretemp.throttle_log: 0
dev.cpu.0.coretemp.tjmax: 90.0C
dev.cpu.0.coretemp.resolution: 1
dev.cpu.0.coretemp.delta: 56
dev.cpu.0.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.0.cx_usage_counters: 1802411824 0 0
dev.cpu.0.cx_usage: 100.00% 0.00% 0.00% last 70us
dev.cpu.0.cx_lowest: C1
dev.cpu.0.cx_supported: C1/1/1 C2/2/500 C3/3/1000
dev.cpu.0.freq_levels: 1601/2000 1600/2000 1520/1900 1440/1800 1360/1700 1280/1600 1200/1500 1120/1400 1040/1300 960/1200 880/1100 800/1000 720/900 640/800 560/700 480/600
dev.cpu.0.freq: 1601
dev.cpu.0.%parent: acpi0
dev.cpu.0.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.0.%location: handle=\_PR_.CPU0
dev.cpu.0.%driver: cpu
dev.cpu.0.%desc: ACPI CPU
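For what it's worth, I can also watch the clock while a test is running to confirm it actually ramps up under load; a quick one-liner from the shell, for example:
# Print the current CPU 0 frequency (MHz) once per second while iperf runs; Ctrl+C to stop
while true; do sysctl -n dev.cpu.0.freq; sleep 1; done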
I did some further iPerf testing (iperf3 -c speedtest.sea11.us.leaseweb.net -p 5201-5210 -P4 -R -t 60) against a local server here in Seattle directly from the pfSense CLI and was able to see 940Mb/s, which is great! During that test, the dashboard showed ~64% CPU utilization, and your top command showed the output below:
last pid: 36502;  load averages: 1.21, 0.71, 0.55  up 14+13:49:52  17:35:04
307 threads:   8 running, 281 sleeping, 18 waiting
CPU 0:  0.8% user,  0.0% nice, 27.6% system, 15.4% interrupt, 56.3% idle
CPU 1:  1.6% user,  0.0% nice, 40.9% system,  9.4% interrupt, 48.0% idle
CPU 2:  0.8% user,  0.0% nice, 49.6% system, 29.9% interrupt, 19.7% idle
CPU 3:  3.9% user,  0.0% nice, 31.9% system, 13.8% interrupt, 50.4% idle
Mem: 40M Active, 269M Inact, 490M Wired, 56K Buf, 6991M Free
ARC: 130M Total, 19M MFU, 104M MRU, 324K Anon, 717K Header, 6000K Other
     100M Compressed, 259M Uncompressed, 2.58:1 Ratio
Swap: 1024M Total, 1024M Free

Message from syslogd@rtr01 at Nov 16 17:26:49 ...

  PID USERNAME  PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
35513 tstrotz   109    0    17M  7300K CPU1     1   0:07  74.43% iperf3 -c speedtest.sea11.us.leaseweb.net -p 5201-5210 -P4 -R -t 6
   11 root      187 ki31     0B    64K CPU0     0 330.4H  54.34% [idle{idle: cpu0}]
   11 root      187 ki31     0B    64K RUN      3 329.2H  51.07% [idle{idle: cpu3}]
   11 root      187 ki31     0B    64K RUN      1 330.6H  48.84% [idle{idle: cpu1}]
   12 root      -56    -     0B   240K RUN      2 106:00  46.99% [intr{swi1: netisr 1}]
    0 root      -60    -     0B  1488K CPU2     2 592:29  43.96% [kernel{if_io_tqg_2}]
    0 root      -60    -     0B  1488K -        1 563:46  30.88% [kernel{if_io_tqg_1}]
   11 root      187 ki31     0B    64K RUN      2 331.6H  20.12% [idle{idle: cpu2}]
   12 root      -60    -     0B   240K WAIT     3 118:31  15.05% [intr{swi1: netisr 2}]
    0 root      -60    -     0B  1488K -        3 751:32   4.99% [kernel{if_io_tqg_3}]
    0 root      -60    -     0B  1488K -        0 568:48   3.88% [kernel{if_io_tqg_0}]
   12 root      -60    -     0B   240K WAIT     0 401:39   2.31% [intr{swi1: netisr 3}]
   12 root      -60    -     0B   240K WAIT     1 413:23   2.22% [intr{swi1: netisr 0}]
    0 root      -64    -     0B  1488K -        0  60:28   0.40% [kernel{dummynet}]
75759 tstrotz    20    0    14M  4384K CPU3     3   0:01   0.16% top -HaSP
    7 root      -16    -     0B    16K pftm     3  12:06   0.12% [pf purge]
    0 root      -60    -     0B  1488K -        1  14:18   0.06% [kernel{if_config_tqg_0}]
    8 root      -16    -     0B    16K -        1   8:27   0.05% [rand_harvestq]
18647 dhcpd      20    0    25M    12M select   0   5:54   0.03% /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd
Testing over a tagged VLAN through the MikroTik, on the other hand, with the exact same iPerf command, I get the output below and a max of ~650Mb/s. Maybe it's the MikroTik switch? I can schedule time to go to the site to test a tagged VLAN directly off the pfSense box itself and specify the tag on my laptop NIC if needed.
last pid: 1926;  load averages: 1.15, 0.97, 0.72  up 14+13:54:27  17:39:39
306 threads:   7 running, 280 sleeping, 19 waiting
CPU 0:  2.8% user,  0.0% nice,  5.9% system, 22.8% interrupt, 68.5% idle
CPU 1:  0.8% user,  0.0% nice, 33.5% system,  8.7% interrupt, 57.1% idle
CPU 2:  2.0% user,  0.0% nice, 14.2% system,  6.3% interrupt, 77.6% idle
CPU 3:  0.8% user,  0.0% nice, 64.2% system,  1.2% interrupt, 33.9% idle
Mem: 39M Active, 268M Inact, 490M Wired, 56K Buf, 6992M Free
ARC: 129M Total, 20M MFU, 102M MRU, 464K Anon, 717K Header, 5961K Other
     100M Compressed, 259M Uncompressed, 2.59:1 Ratio
Swap: 1024M Total, 1024M Free

Message from syslogd@rtr01 at Nov 16 17:26:49 ...

  PID USERNAME  PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root      187 ki31     0B    64K RUN      2 331.7H  75.74% [idle{idle: cpu2}]
   11 root      187 ki31     0B    64K CPU0     0 330.5H  67.53% [idle{idle: cpu0}]
    0 root      -60    -     0B  1488K CPU3     3 752:18  62.08% [kernel{if_io_tqg_3}]
   11 root      187 ki31     0B    64K CPU1     1 330.7H  59.62% [idle{idle: cpu1}]
   11 root      187 ki31     0B    64K RUN      3 329.3H  34.97% [idle{idle: cpu3}]
    0 root      -60    -     0B  1488K CPU1     1 564:02  29.40% [kernel{if_io_tqg_1}]
   12 root      -60    -     0B   240K WAIT     0 401:56  26.67% [intr{swi1: netisr 3}]
    0 root      -60    -     0B  1488K -        2 593:03  17.13% [kernel{if_io_tqg_2}]
   12 root      -60    -     0B   240K WAIT     1 106:14   6.13% [intr{swi1: netisr 1}]
    0 root      -60    -     0B  1488K -        0 569:12   5.26% [kernel{if_io_tqg_0}]
28137 root       20    0    32M    11M kqread   3  42:21   4.38% nginx: worker process (nginx)
   12 root      -60    -     0B   240K WAIT     0 413:36   3.34% [intr{swi1: netisr 0}]
   12 root      -60    -     0B   240K WAIT     0 118:39   2.47% [intr{swi1: netisr 2}]
20020 root       20    0   145M    56M accept   2   1:39   1.85% php-fpm: pool nginx (php-fpm)
32572 root       20    0   149M    52M accept   2   2:00   1.76% php-fpm: pool nginx (php-fpm){php-fpm}
    0 root      -64    -     0B  1488K -        0  60:29   0.33% [kernel{dummynet}]
75759 tstrotz    20    0    14M  4384K CPU2     2   0:01   0.15% top -HaSP
74879 root       20    0    13M  3000K select   1   6:39   0.10% /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run/s
96522 root       20    0   107M    25M uwait    1   0:14   0.07% /usr/local/libexec/ipsec/charon --use-syslog{charon}
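If I do go on site for that test, and assuming the laptop runs Linux, the tagged sub-interface could be brought up roughly like this (eth0, VLAN ID 100, and the address are placeholders for whichever VLAN I end up testing):
# Create a VLAN 100 sub-interface on eth0 and assign it an address in that VLAN's /30 (all values are placeholders)
ip link add link eth0 name eth0.100 type vlan id 100
ip addr add 192.0.2.2/30 dev eth0.100
ip link set dev eth0.100 up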
-
Mmm, that does seem suspicious. I would normally expect a higher result when testing from a client behind the firewall. Unless that client itself is restricted.
You can see that in both cases no single core is maxed out. But when testing from the firewall directly the load created by iperf itself is larger than anything else.