CPU Usage when network used

qwaven

Hi Steve,

I still had the shell open from the same transfer. Here is a more complete view.
I am not clear whether kernel{igb0 que (qid 0)} is different from intr{irq269: igb0:que 0}. For igb3, however, I see [intr{irq288: igb3:que 1}] and [intr{irq287: igb3:que 0}], which still seems low given that I have 4 cores, no? I have not adjusted anything like this manually.

PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0K 64K CPU3 3 25.4H 74.96% [idle{idle: cpu3}]
11 root 155 ki31 0K 64K RUN 1 25.4H 54.03% [idle{idle: cpu1}]
11 root 155 ki31 0K 64K RUN 0 25.3H 41.49% [idle{idle: cpu0}]
0 root -92 - 0K 688K CPU2 2 10:46 35.19% [kernel{igb0 que (qid 0)}]
11 root 155 ki31 0K 64K RUN 2 25.3H 33.86% [idle{idle: cpu2}]
12 root -92 - 0K 816K CPU1 1 3:36 31.32% [intr{irq288: igb3:que 1}]
12 root -92 - 0K 816K WAIT 0 3:40 29.27% [intr{irq287: igb3:que 0}]
12 root -92 - 0K 816K WAIT 0 5:50 17.34% [intr{irq269: igb0:que 0}]
78054 root 30 0 266M 221M RUN 1 2:13 16.83% /usr/local/bin/ntopng -d /v
78054 root 22 0 266M 221M uwait 3 0:12 9.10% /usr/local/bin/ntopng -d /v
78054 root 25 0 266M 221M uwait 0 0:11 7.71% /usr/local/bin/ntopng -d /v
78054 root 23 0 266M 221M uwait 3 0:11 7.62% /usr/local/bin/ntopng -d /v
78054 root 23 0 266M 221M nanslp 3 1:31 4.48% /usr/local/bin/ntopng -d /v
78054 root 21 0 266M 221M nanslp 1 0:48 4.16% /usr/local/bin/ntopng -d /v
78054 root 20 0 266M 221M nanslp 0 0:39 1.45% /usr/local/bin/ntopng -d /v
41253 unbound 20 0 65412K 44220K kqread 0 0:01 0.67% /usr/local/sbin/unbound -c
36170 root 21 0 98680K 39040K accept 3 0:06 0.62% php-fpm: pool nginx (php-fp
20 root -16 - 0K 16K - 0 0:37 0.57% [rand_harvestq]
0 root -92 - 0K 688K - 1 0:04 0.42% [kernel{igb3 que (qid 0)}]
12 root -92 - 0K 816K WAIT 3 0:34 0.34% [intr{irq290: igb3:que 3}]
78054 root 20 0 266M 221M bpf 1 0:03 0.25% /usr/local/bin/ntopng -d /v
198 root 20 0 9860K 4776K CPU0 0 0:07 0.25% top -aSH
75724 root 20 0 8428K 4984K kqread 0 0:04 0.21% redis-server: /usr/local/bi
23537 root 20 0 12912K 13032K usem 0 0:00 0.16% /usr/local/sbin/ntpd -g -c
12 root -72 - 0K 816K WAIT 3 0:14 0.14% [intr{swi1: netisr 0}]
50030 root 20 0 9464K 5868K select 3 0:10 0.14% /usr/local/sbin/miniupnpd -
22585 root 20 0 23592K 8804K kqread 3 0:01 0.12% nginx: worker process (ngin
12 root -60 - 0K 816K WAIT 0 1:21 0.11% [intr{swi4: clock (0)}]
65534 root 20 0 6600K 2356K bpf 3 0:07 0.08% /usr/local/sbin/filterlog -
0 root -92 - 0K 688K - 2 0:00 0.07% [kernel{igb3 que (qid 1)}]
339 root 36 0 98552K 39340K accept 1 0:13 0.07% php-fpm: pool nginx (php-fp
74721 root 20 0 50888K 35668K nanslp 3 0:02 0.07% /usr/local/bin/php -f /usr/
81162 root 20 0 6392K 2540K select 1 0:04 0.06% /usr/sbin/syslogd -s -c -c
78054 root 20 0 266M 221M nanslp 0 0:00 0.05% /usr/local/bin/ntopng -d /v
49333 dhcpd 20 0 12576K 7924K select 3 0:01 0.05% /usr/local/sbin/dhcpd -user
12 root -92 - 0K 816K RUN 2 0:20 0.04% [intr{irq289: igb3:que 2}]
78054 root 20 0 266M 221M select 0 0:00 0.04% /usr/local/bin/ntopng -d /v
19 root -16 - 0K 16K pftm 0 0:22 0.03% [pf purge]
44931 root 20 0 12904K 8152K select 0 0:01 0.03% sshd: root@pts/0 (sshd)
23537 root 20 0 12912K 13032K select 0 0:08 0.03% /usr/local/sbin/ntpd -g -c
36968 root 20 0 6900K 2444K nanslp 1 0:00 0.02% [dpinger{dpinger}]
36442 root 20 0 6900K 2444K nanslp 1 0:00 0.02% [dpinger{dpinger}]
12 root -88 - 0K 816K WAIT 0 0:06 0.01% [intr{irq257: xhci0}]
37136 root 20 0 6900K 2444K nanslp 1 0:00 0.01% [dpinger{dpinger}]
15 root -68 - 0K 80K - 3 0:05 0.01% [usb{usbus0}]
78054 root 20 0 266M 221M nanslp 0 0:00 0.01% /usr/local/bin/ntopng -d /v
15 root -68 - 0K 80K - 2 0:05 0.01% [usb{usbus0}]

Cheers!

stephenw10

@qwaven said in CPU Usage when network used:

[intr{irq290: igb3:que 3}]

It looks like you have 4 queues for igb3, which is what I expect for a 4 core CPU, but I only see one for igb0.
You might try running vmstat -i to confirm you do have the expected queues for each NIC. I thought they were all on-chip in that CPU, but maybe igb0 is different. In that case you might try using igb3, or one of the others, as WAN.

Steve

qwaven

So with vmstat I see the correct number:

irq269: igb0:que 0 57225866 135
irq270: igb0:que 1 421673 1
irq271: igb0:que 2 425910 1
irq272: igb0:que 3 421212 1
irq273: igb0:link 11 0

irq287: igb3:que 0 94141932 223
irq288: igb3:que 1 45221540 107
irq289: igb3:que 2 27199303 64
irq290: igb3:que 3 35826209 85
irq291: igb3:link 5 0

Cheers!

stephenw10

Mmm, but all the interrupt loading is on one queue. Do you have a PPPoE WAN?
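
To put a number on that imbalance, here is a minimal Python sketch that computes each queue's share of a NIC's interrupts from the `vmstat -i` lines quoted above. The function name `queue_shares` and the regular expression are made up for illustration, and the pattern assumes the `igbN:que N` naming seen in this thread (other drivers, such as ix, label their queues differently).

```python
import re
from collections import defaultdict

# vmstat -i lines copied from the post above (interrupt name, total, rate).
SAMPLE = """\
irq269: igb0:que 0 57225866 135
irq270: igb0:que 1 421673 1
irq271: igb0:que 2 425910 1
irq272: igb0:que 3 421212 1
irq273: igb0:link 11 0
irq287: igb3:que 0 94141932 223
irq288: igb3:que 1 45221540 107
irq289: igb3:que 2 27199303 64
irq290: igb3:que 3 35826209 85
irq291: igb3:link 5 0
"""

# Assumed line shape: "irqNNN: <nic>:que <n> <total> <rate>".
# The ":link" lines do not match and are skipped on purpose.
QUEUE_LINE = re.compile(r"^irq\d+:\s+(\w+):que (\d+)\s+(\d+)\s+(\d+)$")


def queue_shares(text):
    """Return {nic: {queue: fraction of that NIC's queue interrupts}}."""
    totals = defaultdict(dict)
    for line in text.splitlines():
        match = QUEUE_LINE.match(line.strip())
        if match:
            nic, queue, total, _rate = match.groups()
            totals[nic][int(queue)] = int(total)
    shares = {}
    for nic, queues in totals.items():
        nic_total = sum(queues.values())
        shares[nic] = {q: count / nic_total for q, count in queues.items()}
    return shares


shares = queue_shares(SAMPLE)
for nic, per_queue in sorted(shares.items()):
    summary = ", ".join(f"que {q}: {s:.1%}" for q, s in sorted(per_queue.items()))
    print(f"{nic}: {summary}")
```

With these figures, queue 0 carries nearly all of the igb0 interrupts, while igb3 spreads its load across all four queues.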

The single thread performance of the N3700 is... not good. And potentially much worse if turbo/burst is not working.

Do you see any significant improvement if you disable ntop-ng?

Steve

qwaven

Yes, the WAN is PPPoE. Is there something I can do to make it use more queues properly?

I can try and turn ntop off later to see what happens.

Cheers!

stephenw10

Ah! OK then, currently, you are limited to a single queue on the PPPoE interface and hence a single core.

See: https://redmine.pfsense.org/issues/4821

And the upstream: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=203856

You can probably gain some performance by setting the sysctl net.isr.dispatch to deferred in System > Advanced > System Tunables. That will require a reboot.

https://docs.netgate.com/pfsense/en/latest/hardware/tuning-and-troubleshooting-network-cards.html#pppoe-with-multi-queue-nics

Steve

qwaven

Tried the dispatch setting:
sysctl net.isr.dispatch
net.isr.dispatch: deferred

CPU seemed to be at about 50% utilization.

interrupt total rate
cpu0:timer 122117 254
cpu2:timer 121707 253
cpu3:timer 116674 243
cpu1:timer 115728 241
irq256: ahci0 11720 24
irq257: xhci0 2850 6
irq258: hdac0 2 0
irq260: t5nex0:evt 2 0
irq269: igb0:que 0 659069 1372
irq270: igb0:que 1 1457 3
irq271: igb0:que 2 516 1
irq272: igb0:que 3 515 1
irq273: igb0:link 3 0
irq274: pcib5 1 0
irq280: pcib6 1 0
irq286: pcib7 1 0
irq287: igb3:que 0 453042 943
irq288: igb3:que 1 573830 1194
irq289: igb3:que 2 755133 1572
irq290: igb3:que 3 438318 912
irq291: igb3:link 3 0
irq292: pcib8 1 0
Total 3372690 7020

qwaven

I have also now tried disabling ntop. CPU usage looks to be maybe 8-10% lower.

stephenw10

Is that 50% the total CPU usage? Did throughput increase?

Steve

qwaven

That is what was shown on the dashboard for CPU usage. If utilization is stuck on 1 core I am not sure there is anything else we can do.

As for throughput, it was about the same but I am not worrying about that as the source for the transfer may impact this as well. Ideally it would be great to see it closer to my actual speed but I'm not sure about testing it reliably.

Cheers!

qwaven

Hi again,

I'm assuming we've exhausted trying to improve the cpu utilization with this but I just wanted to say thanks for the help/efforts with this. I am still open to try anything though.

Cheers!

stephenw10

I suspect we might have. The single thread performance of that CPU is about equal to that of the Pentium M I used to run, and that was good for ~650Mbps. At least according to this:
https://www.cpubenchmark.net/compare/Intel-Core2-Duo-E4500-vs-Intel-Pentium-N3700-vs-Intel-Pentium-M-1.73GHz/936vs2513vs1160
Obviously that's synthetic and there are many variables etc. No PPPoE overhead in that test either.
The E4500 can pass Gigabit, just. (at full size TCP packets...many variables etc!).

If that is to be believed then it probably is running burst mode and I'm not sure there's much we can do before RSS is re-written in FreeBSD to allow multiple cores.

You probably could see better performance off-loading the PPPoE to another device. That would probably mean a double NAT scenario unfortunately.

Steve

qwaven

Hi Steve,

It's unfortunate about this RSS issue. I have another board that I plan to try out, though it's quite overkill, especially if only 1 core is going to be used for PPPoE. It does have some better on-board hardware that may help overall. It is, however, still just 2 GHz per core.

https://www.supermicro.com/products/motherboard/atom/A2SDi-H-TP4F.cfm

Cheers!

stephenw10

Yes. I have a PPPoE WAN but fortunately/unfortunately it's nowhere near fast enough to worry about this.

No benchmarks for the C3958, but if we assume it's the same as the C3858 with 4 more cores then it should give about 40% better single thread performance.

It does seem like a waste of cores unless you virtualise it.

Steve

qwaven

Hi Steve,

So I flipped it over. Performance so far looks drastically better. CPU in the GUI was about 5-6% while transferring over PPPoE. I believe it is still using just the 1 core.

PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0K 256K CPU1 1 7:39 97.26% [idle{idle: cpu1}]
11 root 155 ki31 0K 256K CPU10 10 7:41 97.12% [idle{idle: cpu10}]
11 root 155 ki31 0K 256K CPU13 13 7:33 96.96% [idle{idle: cpu13}]
11 root 155 ki31 0K 256K CPU7 7 7:45 96.85% [idle{idle: cpu7}]
11 root 155 ki31 0K 256K CPU11 11 7:38 96.51% [idle{idle: cpu11}]
11 root 155 ki31 0K 256K RUN 4 7:43 96.46% [idle{idle: cpu4}]
11 root 155 ki31 0K 256K CPU3 3 7:44 96.46% [idle{idle: cpu3}]
11 root 155 ki31 0K 256K CPU9 9 7:36 96.26% [idle{idle: cpu9}]
11 root 155 ki31 0K 256K CPU5 5 7:42 95.99% [idle{idle: cpu5}]
11 root 155 ki31 0K 256K RUN 8 7:19 95.56% [idle{idle: cpu8}]
11 root 155 ki31 0K 256K CPU6 6 7:42 95.12% [idle{idle: cpu6}]
11 root 155 ki31 0K 256K CPU2 2 7:42 94.98% [idle{idle: cpu2}]
11 root 155 ki31 0K 256K CPU12 12 7:40 93.93% [idle{idle: cpu12}]
11 root 155 ki31 0K 256K RUN 15 7:35 87.04% [idle{idle: cpu15}]
11 root 155 ki31 0K 256K CPU14 14 7:31 82.95% [idle{idle: cpu14}]
11 root 155 ki31 0K 256K RUN 0 7:24 79.60% [idle{idle: cpu0}]

irq298: ix0:q0 2716423 6058
irq299: ix0:q1 244578 545
irq300: ix0:q2 461159 1029
irq301: ix0:q3 243416 543
irq302: ix0:q4 378891 845
irq303: ix0:q5 124788 278
irq304: ix0:q6 478729 1068
irq305: ix0:q7 125913 281
irq306: ix0:link 1 0
irq307: ix1:q0 326596 728
irq308: ix1:q1 254938 569
irq309: ix1:q2 614196 1370
irq310: ix1:q3 250402 558
irq311: ix1:q4 388996 868
irq312: ix1:q5 128709 287
irq313: ix1:q6 492403 1098
irq314: ix1:q7 130143 290
irq315: ix1:link 1 0

ix0 is PPPoE and ix1 is the internal LANs.

I was thinking about virtualizing. However, I've seen a lot of discussion suggesting this is not a great choice for a firewall. I'm open to exploring this more, though. Do you have any thoughts? Proxmox was my first choice.

Cheers!

stephenw10

Nice, what sort of throughput were you seeing at that point?

I can't really advise on hypervisors, I'm not using anything right now.

A lot of people here are using Proxmox though. ESXi is also popular.

Steve

qwaven

Same throughput, but I believe this is more because of the source. I have not had a chance to test the internal network to see if anything there has improved. Will update once I have.

qwaven

So, testing with iperf3, I still don't seem to be getting anywhere close to 10G bandwidth.

It looks about spot on with 1G.

[ 41] 0.00-10.00 sec 56.4 MBytes 47.4 Mbits/sec 3258 sender
[ 41] 0.00-10.00 sec 56.4 MBytes 47.3 Mbits/sec receiver
[ 43] 0.00-10.00 sec 58.1 MBytes 48.8 Mbits/sec 3683 sender
[ 43] 0.00-10.00 sec 58.0 MBytes 48.6 Mbits/sec receiver
[SUM] 0.00-10.00 sec 1.10 GBytes 943 Mbits/sec 69930 sender
[SUM] 0.00-10.00 sec 1.10 GBytes 941 Mbits/sec receiver
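
As a rough sanity check on why these numbers look like a 1G link, the Python sketch below compares the reported rates with the best-case TCP payload rate over Gigabit Ethernet. It assumes a 1500-byte MTU and TCP timestamps enabled, neither of which is stated in this thread, and the function names are made up for illustration. It also assumes iperf3's usual units (MBytes as 1024*1024 bytes, Mbits as 10^6 bits), which matches the per-stream lines above.

```python
# Per-frame overhead on the wire for standard Ethernet, in bytes:
# preamble + start delimiter (8), Ethernet header (14), FCS (4), inter-frame gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12
IP_TCP_HEADERS = 20 + 20       # IPv4 + TCP headers without options
TCP_TIMESTAMP_OPTION = 12      # assumption: timestamps are enabled


def tcp_goodput_mbps(link_mbps, mtu=1500, timestamps=True):
    """Best-case TCP payload rate for a given line rate (no loss, full-size frames)."""
    payload = mtu - IP_TCP_HEADERS - (TCP_TIMESTAMP_OPTION if timestamps else 0)
    return link_mbps * payload / (mtu + ETHERNET_OVERHEAD)


def iperf_mbps(mbytes, seconds):
    """Convert an iperf3 'MBytes transferred' figure into Mbits/sec."""
    return mbytes * 1024 * 1024 * 8 / seconds / 1e6


one_gig = tcp_goodput_mbps(1000)
ten_gig = tcp_goodput_mbps(10000)
one_stream = iperf_mbps(56.4, 10)   # stream [41] above

print(f"1G best case:  {one_gig:.1f} Mbits/sec")
print(f"10G best case: {ten_gig:.1f} Mbits/sec")
print(f"stream [41]:   {one_stream:.1f} Mbits/sec")
```

Under those assumptions the 1G ceiling comes out within a couple of Mbits/sec of the [SUM] lines, which is what you would expect if some hop in the path is negotiating or forwarding at 1G.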

Any ideas?

This is literally SFP+ 10G interface on pfsense to switch to fileserver. The file server has two 10G bonded links. Nothing else running.

Cheers!

stephenw10

How many processes are you running there?

You have 8 queues so I don't expect to see any advantage over 8.

Is that result testing over 1G? What do you actually see over 10G?
I would anticipate something ~4Gbps maybe. Though if you're running iperf on the firewall it may reduce that.

Steve

qwaven

My test with iperf was sending 20 connections (which is what I saw someone's example on the internet doing) and it looks to pretty much saturate the link, as if it were 1G.

This is not 1G. This is using my internal network. Pfsense reports it as 10G, the switch is all 10G, and the file server has 2x10G.

Curious, why would running iperf on the firewall reduce this?

FYI, the CPU did not appear stressed in any way.

Cheers!