Stuck CPU usage for kernel{igb0 que}.

jptech

Hi,

I have a PC Engines APU2 running pfSense. Every once in a while it starts running extremely badly. I get very high latency to my gateway, and if I log in to the console and run top, I can see a process stuck with high CPU (core) usage. Example:

  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root     155 ki31     0K    64K CPU2    2 665.1H 100.00% idle{idle: cpu2}
   11 root     155 ki31     0K    64K RUN     1 664.7H 100.00% idle{idle: cpu1}
   11 root     155 ki31     0K    64K CPU0    0 662.9H  94.97% idle{idle: cpu0}
    0 root     -92    -     0K   352K CPU3    3 112:31  73.88% kernel{igb0 que}
   11 root     155 ki31     0K    64K RUN     3 665.8H  30.18% idle{idle: cpu3}

The kernel{igb0 que} process stays stuck at around 75%-90%, and everything network related runs very badly until I reboot. It happens very rarely, but I've seen it often enough to be confident that the stuck process is related to the bad performance. The problem is that I have no idea how to start debugging it.
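To make the symptom easier to spot, here is a minimal Python sketch that picks the busy non-idle threads out of top output like the sample above. The function name `busy_threads` and the 50% threshold are my own assumptions for illustration, and the parsing only assumes the 11-column layout shown in this post.

```python
# Sample copied from the top output above (header plus five rows).
TOP_OUTPUT = """\
  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   11 root     155 ki31     0K    64K CPU2    2 665.1H 100.00% idle{idle: cpu2}
   11 root     155 ki31     0K    64K RUN     1 664.7H 100.00% idle{idle: cpu1}
   11 root     155 ki31     0K    64K CPU0    0 662.9H  94.97% idle{idle: cpu0}
    0 root     -92    -     0K   352K CPU3    3 112:31  73.88% kernel{igb0 que}
   11 root     155 ki31     0K    64K RUN     3 665.8H  30.18% idle{idle: cpu3}
"""


def busy_threads(top_text, threshold=50.0):
    """Return (command, wcpu) for non-idle threads at or above threshold.

    Hypothetical helper: assumes 11 whitespace-separated columns with
    WCPU in column 10 and COMMAND (which may contain spaces) last.
    """
    found = []
    for line in top_text.splitlines():
        fields = line.split(None, 10)  # keep COMMAND as one field
        if len(fields) < 11:
            continue  # blank or truncated line
        wcpu, command = fields[9], fields[10]
        if not wcpu.endswith("%"):
            continue  # header row
        if command.startswith("idle"):
            continue  # idle threads are expected to be busy
        pct = float(wcpu.rstrip("%"))
        if pct >= threshold:
            found.append((command, pct))
    return found


if __name__ == "__main__":
    for command, pct in busy_threads(TOP_OUTPUT):
        print(f"{command}: {pct:.2f}%")
```

Run against the sample, this flags only the kernel{igb0 que} line; it says nothing about why the thread is busy.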

Any suggestions?

Harvy66

What does your network activity look like? Could there be a broadcast storm or network loop, and the high utilization is just the network being flooded?
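One rough way to put a number on "network activity" is to take two readings of an interface's byte counter a few seconds apart and convert the difference to a throughput. The Python sketch below does only that arithmetic. The function name `link_utilization`, the sample counter values, and the 1000 Mbit/s link speed are assumptions for illustration; the readings would have to come from whatever counter source you trust (for example the byte columns of netstat), and counter wraparound is not handled.

```python
def link_utilization(bytes_before, bytes_after, interval_s, link_mbps=1000.0):
    """Return (throughput in Mbit/s, percent of link speed).

    Hypothetical helper: link_mbps defaults to an assumed gigabit link,
    and counter wraparound between the two samples is ignored.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta_bytes = bytes_after - bytes_before
    mbps = delta_bytes * 8 / interval_s / 1_000_000  # bytes -> bits -> Mbit/s
    return mbps, 100.0 * mbps / link_mbps


if __name__ == "__main__":
    # Made-up counter readings taken 10 seconds apart.
    mbps, pct = link_utilization(1_000_000_000, 2_175_000_000, 10)
    print(f"{mbps:.1f} Mbit/s ({pct:.1f}% of link)")
```

A sustained value close to 100% would point at plain saturation rather than a stuck thread.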

jptech

@Harvy66:

What does your network activity look like? Could there be a broadcast storm or network loop, and the high utilization is just the network being flooded?

Assuming that process's CPU usage can be driven by network utilization, that's the info I needed to get going in the right direction. It's my (relatively simple) home network, so I'm certain there aren't any network loops. However, something malfunctioning on the network wouldn't be out of the question.

I was also incorrect about a reboot being needed to resolve the issue. Earlier, I saw everything drop back to normal before I rebooted, which seems like another hint that you've pointed me in the right direction. It could be weeks before it happens again, though.

Thanks for the suggestion.

jptech

@Harvy66,

This happened again today and, knowing network utilization was the likely cause, I was able to track it down right away. You definitely put me on the right path. It was as simple as the network being saturated.

Specifically, I had a CrashPlan server on one VLAN and its data store on another VLAN. Whenever CrashPlan ran a (deep) compaction of its data, the traffic between the two VLANs saturated my firewall for the duration.

Thank you again for the help.