High CPU load on single CPU core
-
Hi All
We have been seeing a very weird issue on our pfSense box where WAN latency (1Gbit WAN) goes up from 0.7ms to over 100ms at times. While this occurs I noticed that one single CPU core is maxed out throughout.
In htop no command shows up that is maxing out the CPU, but when looking at System Activity in the pfSense UI we can see this:
PID USERNAME  PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
  0 root      -76    -     0B  1184K CPU1    1 868:11  99.76% [kernel{if_io_tqg_1}]
Which matches with our maxed out CPU core:
last pid: 39007;  load averages: 2.01, 1.96, 1.99    up 0+16:58:19  22:36:57
67 processes: 3 running, 64 sleeping
CPU 0:  8.6% user, 0.0% nice,  1.2% system, 0.0% interrupt, 90.2% idle
CPU 1:  0.0% user, 0.0% nice,  100% system, 0.0% interrupt,  0.0% idle
CPU 2:  6.9% user, 0.0% nice,  1.2% system, 0.0% interrupt, 91.9% idle
CPU 3:  0.0% user, 0.0% nice,  0.8% system, 6.3% interrupt, 93.0% idle
CPU 4:  0.4% user, 0.0% nice,  0.0% system, 0.0% interrupt, 99.6% idle
CPU 5:  0.4% user, 0.0% nice,  0.8% system, 0.0% interrupt, 98.8% idle
CPU 6:  2.0% user, 0.0% nice,  2.0% system, 0.0% interrupt, 96.1% idle
CPU 7:  0.0% user, 0.0% nice, 22.7% system, 0.0% interrupt, 77.3% idle
Does anyone know what if_io_tqg_1 is and what we might need to do to further diagnose what's going on?
pfSense specs:
CPU: Intel(R) Atom(TM) CPU C3758 @ 2.20GHz
RAM: 32GB
NICs: Ethernet Connection X553 1GbE
WAN uplink: 1Gbit
Approx. traffic via WAN: 200Mbit
-
@yswery said in High CPU load on single CPU core:
Does anyone know what if_io_tqg_1 is and what we might need to do to further diagnose what's going on?
That represents network queue handlers. Found some related posts for you, but no real solutions.
https://forum.netgate.com/topic/173523/what-is-kernel-if_io_tqq_x
https://forums.freebsd.org/threads/what-is-kernel-if_io_tqg-100-load-of-core.70642/
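If it helps to dig further from a shell, those threads can also be listed directly. A rough sketch, assuming stock FreeBSD top and procstat behaviour on the box:

# Roughly the same view as Diagnostics > System Activity:
# system processes (-S), threads (-H), full command names (-a)
top -aSH
# List the kernel's (pid 0) threads and pick out the per-NIC I/O taskqueue group threads
procstat -t 0 | grep if_io_tqg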
-
Mmm, that is where the load from pf itself appears. 100% of one CPU core on a C3758 is a lot for 200Mbps though. And pf loading would normally be spread across queues/cores unless the NICs are being deliberately limited to one queue.
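One quick way to check whether the interrupts can spread at all is to look at the vectors. A sketch, assuming vmstat here behaves as on stock FreeBSD:

# A multi-queue NIC shows one MSI-X vector per rx/tx queue (ix0:rxq0, ix0:rxq1, ...)
# A NIC limited to a single queue will only show one data vector for the interface
vmstat -i | grep 'ix[0-9]'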
Steve
-
@stephenw10 said in High CPU load on single CPU core:
Mmm, that is where the load from pf itself appears. 100% of one CPU core on a C3758 is a lot for 200Mbps though. And pf loading would normally be spread across queues/cores unless the NICs are being deliberately limited to one queue.
Steve
Do you have any idea or hint where I might be able to check whether there is an (accidental?) setting limiting each NIC to a single queue/core?
While this occurs (which has been happening more and more frequently over the past 2 weeks for us) we are seeing these spikes in latency and packet loss to our (directly connected) upstream.
When this issue isn't occurring we usually see under 1ms latency.
Is there a way to see what type of traffic is triggering this CPU use?
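For example, would sampling the WAN with tcpdump during a spike be a sensible way to check? Something like this rough top-talkers one-liner (assuming tcpdump on the box behaves like stock FreeBSD tcpdump, with ix3 as the WAN):

# Grab a quick 2000-packet sample on the WAN and count source address.port
# (very rough - non-IP frames will land in the counts oddly, but it shows the big flows)
tcpdump -qni ix3 -c 2000 2>/dev/null | awk '{print $3}' | sort | uniq -c | sort -rn | head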
The reason I think it's a "certain type of traffic" is that we sometimes see 200Mbps (or more if we run speedtests) without any issues at all.
-
What pfSense version are you running?
Easiest way to check the NIC queues is usually the boot log. You should see something like:
ix0: <Intel(R) X553 N (SFP+)> mem 0x80400000-0x805fffff,0x80604000-0x80607fff at device 0.0 on pci9
ix0: Using 2048 TX descriptors and 2048 RX descriptors
ix0: Using 4 RX queues 4 TX queues
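To check all the interfaces at once, something like this should work (assuming the boot log lives at /var/log/dmesg.boot as on a standard pfSense install):

# Pull the queue/descriptor allocation lines for every ix interface from the boot log
grep -E 'ix[0-9]+: (Using|allocated)' /var/log/dmesg.boot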
Steve
-
@stephenw10 said in High CPU load on single CPU core:
Easiest way to check the NIC queues is usually the boot log. You should see something like:
Oh interesting, this is what I see:
$ cat /var/log/dmesg.boot | grep ix3
ix3: <Intel(R) X553 (1GbE)> mem 0xdd200000-0xdd3fffff,0xdd600000-0xdd603fff at device 0.1 on pci7
ix3: Using 2048 TX descriptors and 2048 RX descriptors
ix3: Using an MSI interrupt
ix3: allocated for 1 queues
ix3: allocated for 1 rx queues
ix3: Ethernet address: ac:1f:6b:b1:d8:af
ix3: eTrack 0x8000087c
ix3: netmap queues/slots: TX 1/2048, RX 1/2048
ix3 being the WAN interface, but all 4 ixN devices are showing "allocated for 1 queues".
This is pfSense 2.6 CE, but the box is a few years old, from the v2.4 days, and has been incrementally updated (so it might have some settings it should no longer have).
-
Ah, maybe you have set queues to 1 in /boot/loader.conf.local?
That was a common tweak back in the FreeBSD 10 (or was it 8?) era when multiqueue drivers could prove unstable.
But those NICs should be using at least 4 queues.
I would still have expected it to pass far more traffic than that even with single-queue NICs, though. But it does explain why you are seeing the load on one core.
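If you want to see what is actually being applied, something along these lines should do it (the tunable names are from memory of the legacy ixgbe/igb era, so treat them as illustrative rather than a definitive list):

# Look for old queue/MSI-X tweaks in the loader files
grep -iE 'num_queues|msix' /boot/loader.conf /boot/loader.conf.local 2>/dev/null
# Entries like these would force the behaviour you're seeing:
#   hw.ix.num_queues="1"
#   hw.pci.enable_msix="0"
# kenv shows what the loader actually set at boot
kenv | grep -iE 'num_queues|msix'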
Steve
-
So I just removed a bunch of the older config entries, and now this is what I see:
ix3: <Intel(R) X553 (1GbE)> mem 0xdd200000-0xdd3fffff,0xdd600000-0xdd603fff at device 0.1 on pci7
ix3: Using 2048 TX descriptors and 2048 RX descriptors
ix3: Using 8 RX queues 8 TX queues
ix3: Using MSI-X interrupts with 9 vectors
ix3: allocated for 8 queues
ix3: allocated for 8 rx queues
And holy wow, things are running beautifully (for now)
What's even crazier is that when the network is under-utilised we WERE getting ~0.7ms to our transit provider, and now we're seeing a stable 0.3ms (and staying at 0.3ms under our regular 200Mbit load).
I might be counting my chickens before they hatch, but this change alone seems to have made a dramatic improvement (better than we have recorded in our historic smokepings for the past 2 years, even).
Thanks for pointing this out!
-
Nice.
-
@stephenw10
Hi,
I was looking at my own loader.conf since I didn't have a loader.conf.local. Is this normal?
kern.cam.boot_delay=10000
kern.ipc.nmbclusters="1000000"
kern.ipc.nmbjumbop="524288"
kern.ipc.nmbjumbo9="524288"
opensolaris_load="YES"
zfs_load="YES"
kern.geom.label.gptid.enable="0"
kern.geom.label.disk_ident.enable="0"
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
opensolaris_load="YES"
zfs_load="YES"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
net.link.ifqmaxlen="128"
autoboot_delay="3"
hw.hn.vf_transparent="0"
hw.hn.use_if_start="1"
net.link.ifqmaxlen="128"
Why so many "net.link.ifqmaxlen="128"" entries?
And some other double lines too.
-
@moonknight I get those double lines as well. It's really weird!
-
@moonknight said in High CPU load on single CPU core:
Why so many "net.link.ifqmaxlen="128"
Your machine is stuttering ...
(joke)
-
It's a known issue but it's only cosmetic. The duplicate entries don't hurt anything.
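If anyone does want to tidy the file up anyway, a quick way to spot the duplicated tunables (assuming standard sort/uniq on the box) is:

# Print only the lines that appear more than once, with a count of each
sort /boot/loader.conf | uniq -cd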
Steve