Intel I210 low throughput w/ VLANs
-
I've spent the last 3 days trying to figure this one out... Just installed a Protectli FW4B with a Celeron J3160 and 8GB RAM running 2.7 CE. No packages except ACME for certs, no IPS/IDS or anything like that. Two interfaces in use: igb0 for WAN, igb1 for LAN. Running 21 VLANs under igb1, with the parent/untagged interface unused. Static IP directly on the WAN (no PPPoE or anything like that). No routing between VLANs; all traffic from each VLAN hits its VLAN interface and then gets 1:1 NAT'd to a public VIP on the /27 my ISP has assigned to me.
I'm having trouble routing more than 750Mb/s out of the VLANs (igb1.x) towards the Internet (igb0). Running iPerf traffic through the appliance from one of the VLANs to a public iPerf server (using -R), I get around 562Mb/s with 141 retries. CPU stays around 34% utilization, which leads me to believe this isn't CPU-bound (I think). Some boot log output is below in case it's helpful. I've tried tweaking the offload settings and currently have everything unchecked (enabled) except ALTQ support (I know it's advised to leave TSO and LRO checked, but I'm experimenting at this point to resolve this).
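For reference, the offload flags that are actually active on a NIC can be double-checked from the shell after toggling those boxes; a quick sanity check (igb1 used as an example here):
# Show which offloads are currently enabled on the interface (TSO4, LRO, VLAN_HWTAGGING, etc.)
ifconfig igb1 | grep options
# Show the full list of capabilities the driver supports
ifconfig -m igb1 | grep capabilities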
I read somewhere that increasing the RX and TX descriptors below from 1024 to 4096 may help, so that may be the next thing I'll try, thoughts? This thing is already deployed and running (stable), just not getting the expected speed, so I have to make tweaks during a maintenance window. I did notice while checking system activity that [kernel{if_io_tqg_3}] is only eating up a single core during these iPerf tests, maybe something there? I'd appreciate any insight or tweaks I can try to squeeze out more performance! Besides the large number of VLANs, I'm not doing anything special, no inter-VLAN routing. Just basic 1:1 NAT for each IP in each /30 VLAN sub-interface and minimal FW rules (like 5 floating rules and that's it).
cat /var/log/dmesg.boot | grep igb
igb0: <Intel(R) I210 Flashless (Copper)> port 0xe000-0xe01f mem 0xb1500000-0xb151ffff,0xb1520000-0xb1523fff at device 0.0 on pci1
igb0: NVM V0.6 imgtype6
igb0: Using 1024 TX descriptors and 1024 RX descriptors
igb0: Using 4 RX queues 4 TX queues
igb0: Using MSI-X interrupts with 5 vectors
igb0: Ethernet address: 00:e0:67:30:6e:b0
igb0: netmap queues/slots: TX 4/1024, RX 4/1024
igb1: <Intel(R) I210 Flashless (Copper)> port 0xd000-0xd01f mem 0xb1400000-0xb141ffff,0xb1420000-0xb1423fff at device 0.0 on pci2
igb1: NVM V0.6 imgtype6
igb1: Using 1024 TX descriptors and 1024 RX descriptors
igb1: Using 4 RX queues 4 TX queues
igb1: Using MSI-X interrupts with 5 vectors
igb1: Ethernet address: 00:e0:67:30:6e:b1
igb1: netmap queues/slots: TX 4/1024, RX 4/1024
igb2: <Intel(R) I210 Flashless (Copper)> port 0xc000-0xc01f mem 0xb1300000-0xb131ffff,0xb1320000-0xb1323fff at device 0.0 on pci3
igb2: NVM V0.6 imgtype6
igb2: Using 1024 TX descriptors and 1024 RX descriptors
igb2: Using 4 RX queues 4 TX queues
igb2: Using MSI-X interrupts with 5 vectors
igb2: Ethernet address: 00:e0:67:30:6e:b2
igb2: netmap queues/slots: TX 4/1024, RX 4/1024
igb3: <Intel(R) I210 Flashless (Copper)> port 0xb000-0xb01f mem 0xb1200000-0xb121ffff,0xb1220000-0xb1223fff at device 0.0 on pci4
igb3: NVM V0.6 imgtype6
igb3: Using 1024 TX descriptors and 1024 RX descriptors
igb3: Using 4 RX queues 4 TX queues
igb3: Using MSI-X interrupts with 5 vectors
igb3: Ethernet address: 00:e0:67:30:6e:b3
igb3: netmap queues/slots: TX 4/1024, RX 4/1024
-
@thetrevster I should also mention: I've seen throughput hit over 850Mb/s, but it hits 850, fades down to 600, ramps right back up to 850, then slowly declines again to around 600, and the cycle repeats over and over throughout a test, if that's indicative of anything. Maybe that points to the TX and RX descriptor adjustments mentioned above?
-
@thetrevster said in Intel I210 low throughput w/ VLANs:
I did notice while checking system activity that [kernel{if_io_tqg_3}] is only eating up a single core during these iPerf tests,
Right, 34% total CPU use could still be 100% of one core on a 4-core CPU. Try checking the per-core usage either at the command line using
top -HaSP
or in Diag > System Activity in the GUI. Are you sure the WAN will pass 1G up and down? You can try running iperf on pfSense directly and testing against it to confirm the LAN side is passing 1G.
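For example, something along these lines (assuming iperf3 is installed on pfSense; the 192.168.1.1 address is just a placeholder for your LAN interface IP):
# On pfSense, start an iperf3 server from the shell:
iperf3 -s
# From a client on the LAN, test towards the firewall itself:
iperf3 -c 192.168.1.1 -P 8 -t 30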
Steve
-
@stephenw10 it does appear a single core is getting eaten - so I suppose the real question is: is VLAN tagging offloaded to the NICs, or is each VLAN processed single-threaded within pfSense? I am sure the WAN does 1Gb/s; when I plug directly into the ISP-provided fiber switch, I see a consistent 930Mb/s symmetrical. I did perform the local iPerf test. I ran two tests simultaneously to get better CPU utilization, and the combined throughput across both tests was about 679Mb/s (332 + 347). CPU spiked to around 72%. Running a single iPerf test yielded an average of 674Mb/s, with CPU at 61%. I would have thought two iPerf tests running on separate cores could have hit a combined total of over 900Mb/s. All iPerf tests mentioned were run with 8 streams (-P8).
As for the LAN-to-pfSense test you mentioned, I performed that as well and saw results similar to the above. So this makes me think it's not one specific "side" but maybe something like NIC queuing? VLANs would be out of the equation for the pfSense-to-Internet iPerf tests, since the WAN port is a straight L3 physical interface with no sub-interfaces/tagging. Is it possible I need to tweak something with the I210 NICs?
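One thing I could check next time is whether traffic is actually spreading across all four queues, for example by watching the per-queue interrupt counters while a test runs (the exact vector names may differ a bit by version; igb1 is just the LAN NIC here):
# Per-queue MSI-X interrupt counts for igb1; run before and during a test to see which queues are incrementing
vmstat -i | grep igb1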
-
@thetrevster I have stumbled upon dev.igb.0.iflib.override_nrxds and dev.igb.0.iflib.override_ntxds. I'm wondering if those values need to be adjusted for each applicable igb interface in question. I was mainly looking at the content in post 15 here: https://hardforum.com/threads/pfsense-2-5-0-upgrade-results.2008073/
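If I understand it correctly (please correct me if not), those are loader tunables that would go in /boot/loader.conf.local, one pair per interface, and only take effect after a reboot; roughly something like:
# /boot/loader.conf.local - per-interface descriptor overrides (example values, applied at boot)
dev.igb.0.iflib.override_nrxds="4096"
dev.igb.0.iflib.override_ntxds="4096"
dev.igb.1.iflib.override_nrxds="4096"
dev.igb.1.iflib.override_ntxds="4096"
After a reboot, the dmesg lines above should then report the new descriptor counts instead of 1024.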
-
I wouldn't have expected VLANs to make any significant difference there.
You can certainly try setting a different value for the queue descriptors. That's not normally required for igb though.
-
@stephenw10 so I'm definitely still having issues, but I searched around and found the following forum post. It lines up almost exactly with what I'm running into, but it seems the OP there never found a solution…
https://forum.netgate.com/topic/148800/throughput-expectations-on-celeron-igb-driver-system
-
Ah, I was just about to suggest the same thing I did there. Is your CPU stuck at some low frequency mode?
Check:
sysctl dev.cpu.0
What does the
top -HaSP
output actually look like when you are testing?
-
@stephenw10 We might be getting somewhere... I'm not positive what I'm looking at, but the output of sysctl dev.cpu.0 is below. If I remember right, a freq value of 1601 indicates that Turbo Boost (or whatever this CPU has) is allowed, with the upper limit apparently being 2GHz according to the freq_levels below.
dev.cpu.0.temperature: 34.0C
dev.cpu.0.coretemp.throttle_log: 0
dev.cpu.0.coretemp.tjmax: 90.0C
dev.cpu.0.coretemp.resolution: 1
dev.cpu.0.coretemp.delta: 56
dev.cpu.0.cx_method: C1/mwait/hwc C2/mwait/hwc C3/mwait/hwc
dev.cpu.0.cx_usage_counters: 1802411824 0 0
dev.cpu.0.cx_usage: 100.00% 0.00% 0.00% last 70us
dev.cpu.0.cx_lowest: C1
dev.cpu.0.cx_supported: C1/1/1 C2/2/500 C3/3/1000
dev.cpu.0.freq_levels: 1601/2000 1600/2000 1520/1900 1440/1800 1360/1700 1280/1600 1200/1500 1120/1400 1040/1300 960/1200 880/1100 800/1000 720/900 640/800 560/700 480/600
dev.cpu.0.freq: 1601
dev.cpu.0.%parent: acpi0
dev.cpu.0.%pnpinfo: _HID=none _UID=0 _CID=none
dev.cpu.0.%location: handle=\_PR_.CPU0
dev.cpu.0.%driver: cpu
dev.cpu.0.%desc: ACPI CPU
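For what it's worth, I can also watch the clock while a test is running to confirm it actually ramps up under load; a quick one-liner from the shell, for example:
# Print the current CPU 0 frequency (MHz) once per second while iperf runs; Ctrl+C to stop
while true; do sysctl -n dev.cpu.0.freq; sleep 1; done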
I did some further iPerf testing (iperf3 -c speedtest.sea11.us.leaseweb.net -p 5201-5210 -P4 -R -t 60) against a local server here in Seattle directly from the pfSense CLI and was able to see 940Mb/s, which is great! During that test, the dashboard showed ~64% CPU utilization, and your top command showed the output below:
last pid: 36502;  load averages: 1.21, 0.71, 0.55  up 14+13:49:52  17:35:04
307 threads:   8 running, 281 sleeping, 18 waiting
CPU 0:  0.8% user,  0.0% nice, 27.6% system, 15.4% interrupt, 56.3% idle
CPU 1:  1.6% user,  0.0% nice, 40.9% system,  9.4% interrupt, 48.0% idle
CPU 2:  0.8% user,  0.0% nice, 49.6% system, 29.9% interrupt, 19.7% idle
CPU 3:  3.9% user,  0.0% nice, 31.9% system, 13.8% interrupt, 50.4% idle
Mem: 40M Active, 269M Inact, 490M Wired, 56K Buf, 6991M Free
ARC: 130M Total, 19M MFU, 104M MRU, 324K Anon, 717K Header, 6000K Other
     100M Compressed, 259M Uncompressed, 2.58:1 Ratio
Swap: 1024M Total, 1024M Free

Message from syslogd@rtr01 at Nov 16 17:26:49 ...

  PID USERNAME  PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
35513 tstrotz   109    0    17M  7300K CPU1     1   0:07  74.43% iperf3 -c speedtest.sea11.us.leaseweb.net -p 5201-5210 -P4 -R -t 6
   11 root      187 ki31     0B    64K CPU0     0 330.4H  54.34% [idle{idle: cpu0}]
   11 root      187 ki31     0B    64K RUN      3 329.2H  51.07% [idle{idle: cpu3}]
   11 root      187 ki31     0B    64K RUN      1 330.6H  48.84% [idle{idle: cpu1}]
   12 root      -56    -     0B   240K RUN      2 106:00  46.99% [intr{swi1: netisr 1}]
    0 root      -60    -     0B  1488K CPU2     2 592:29  43.96% [kernel{if_io_tqg_2}]
    0 root      -60    -     0B  1488K -        1 563:46  30.88% [kernel{if_io_tqg_1}]
   11 root      187 ki31     0B    64K RUN      2 331.6H  20.12% [idle{idle: cpu2}]
   12 root      -60    -     0B   240K WAIT     3 118:31  15.05% [intr{swi1: netisr 2}]
    0 root      -60    -     0B  1488K -        3 751:32   4.99% [kernel{if_io_tqg_3}]
    0 root      -60    -     0B  1488K -        0 568:48   3.88% [kernel{if_io_tqg_0}]
   12 root      -60    -     0B   240K WAIT     0 401:39   2.31% [intr{swi1: netisr 3}]
   12 root      -60    -     0B   240K WAIT     1 413:23   2.22% [intr{swi1: netisr 0}]
    0 root      -64    -     0B  1488K -        0  60:28   0.40% [kernel{dummynet}]
75759 tstrotz    20    0    14M  4384K CPU3     3   0:01   0.16% top -HaSP
    7 root      -16    -     0B    16K pftm     3  12:06   0.12% [pf purge]
    0 root      -60    -     0B  1488K -        1  14:18   0.06% [kernel{if_config_tqg_0}]
    8 root      -16    -     0B    16K -        1   8:27   0.05% [rand_harvestq]
18647 dhcpd      20    0    25M    12M select   0   5:54   0.03% /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcpd
Testing over a tagged VLAN through the MikroTik, on the other hand, with the exact same iPerf command, I get the output below and a max of ~650Mb/s. Maybe it's the MikroTik switch? I can schedule time to go to the site to test a tagged VLAN directly off the pfSense box itself and specify the tag on my laptop NIC if needed.
last pid: 1926;  load averages: 1.15, 0.97, 0.72  up 14+13:54:27  17:39:39
306 threads:   7 running, 280 sleeping, 19 waiting
CPU 0:  2.8% user,  0.0% nice,  5.9% system, 22.8% interrupt, 68.5% idle
CPU 1:  0.8% user,  0.0% nice, 33.5% system,  8.7% interrupt, 57.1% idle
CPU 2:  2.0% user,  0.0% nice, 14.2% system,  6.3% interrupt, 77.6% idle
CPU 3:  0.8% user,  0.0% nice, 64.2% system,  1.2% interrupt, 33.9% idle
Mem: 39M Active, 268M Inact, 490M Wired, 56K Buf, 6992M Free
ARC: 129M Total, 20M MFU, 102M MRU, 464K Anon, 717K Header, 5961K Other
     100M Compressed, 259M Uncompressed, 2.59:1 Ratio
Swap: 1024M Total, 1024M Free

Message from syslogd@rtr01 at Nov 16 17:26:49 ...

  PID USERNAME  PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root      187 ki31     0B    64K RUN      2 331.7H  75.74% [idle{idle: cpu2}]
   11 root      187 ki31     0B    64K CPU0     0 330.5H  67.53% [idle{idle: cpu0}]
    0 root      -60    -     0B  1488K CPU3     3 752:18  62.08% [kernel{if_io_tqg_3}]
   11 root      187 ki31     0B    64K CPU1     1 330.7H  59.62% [idle{idle: cpu1}]
   11 root      187 ki31     0B    64K RUN      3 329.3H  34.97% [idle{idle: cpu3}]
    0 root      -60    -     0B  1488K CPU1     1 564:02  29.40% [kernel{if_io_tqg_1}]
   12 root      -60    -     0B   240K WAIT     0 401:56  26.67% [intr{swi1: netisr 3}]
    0 root      -60    -     0B  1488K -        2 593:03  17.13% [kernel{if_io_tqg_2}]
   12 root      -60    -     0B   240K WAIT     1 106:14   6.13% [intr{swi1: netisr 1}]
    0 root      -60    -     0B  1488K -        0 569:12   5.26% [kernel{if_io_tqg_0}]
28137 root       20    0    32M    11M kqread   3  42:21   4.38% nginx: worker process (nginx)
   12 root      -60    -     0B   240K WAIT     0 413:36   3.34% [intr{swi1: netisr 0}]
   12 root      -60    -     0B   240K WAIT     0 118:39   2.47% [intr{swi1: netisr 2}]
20020 root       20    0   145M    56M accept   2   1:39   1.85% php-fpm: pool nginx (php-fpm)
32572 root       20    0   149M    52M accept   2   2:00   1.76% php-fpm: pool nginx (php-fpm){php-fpm}
    0 root      -64    -     0B  1488K -        0  60:29   0.33% [kernel{dummynet}]
75759 tstrotz    20    0    14M  4384K CPU2     2   0:01   0.15% top -HaSP
74879 root       20    0    13M  3000K select   1   6:39   0.10% /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run/s
96522 root       20    0   107M    25M uwait    1   0:14   0.07% /usr/local/libexec/ipsec/charon --use-syslog{charon}
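If I do go on site for that test, and assuming the laptop runs Linux, the tagged sub-interface could be brought up roughly like this (eth0, VLAN ID 100, and the address are placeholders for whichever VLAN I end up testing):
# Create a VLAN 100 sub-interface on eth0 and assign it an address in that VLAN's /30 (all values are placeholders)
ip link add link eth0 name eth0.100 type vlan id 100
ip addr add 192.0.2.2/30 dev eth0.100
ip link set dev eth0.100 up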
-
Mmm, that does seem suspicious. I would normally expect a higher result when testing from a client behind the firewall. Unless that client itself is restricted.
You can see that in both cases no single core is maxed out. But when testing from the firewall directly the load created by iperf itself is larger than anything else.