High load/interrupt and CPU usage
-
Hello everyone,
I logged into my pfSense box at the datacenter today, noticed higher-than-normal temperatures, and then saw an unusually high load. I SSH'd into the firewall and found the system load averaging around 12. The first thing I did was disable ntopng, since it was using ~20% CPU according to top. After that the load dropped to around 9 and has stayed there.
After running top -aSH I got:
last pid: 74909;  load averages: 10.11, 10.37, 9.96  up 19+12:52:25  19:10:02
256 processes: 14 running, 158 sleeping, 4 zombie, 80 waiting
CPU:  0.7% user,  0.0% nice,  0.4% system, 65.0% interrupt, 33.9% idle
Mem: 33M Active, 1057M Inact, 1015M Wired, 202M Cache, 898M Buf, 5580M Free
Swap: 16G Total, 16G Free

  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   12 root     -92    -     0K  1360K WAIT    3  28.3H  46.78% [intr{irq260: t5nex0:0}]
   12 root     -92    -     0K  1360K CPU4    4  28.1H  45.90% [intr{irq265: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    7  28.0H  43.80% [intr{irq266: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    6  28.3H  43.16% [intr{irq261: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    1  28.2H  43.16% [intr{irq263: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    5  28.2H  41.70% [intr{irq264: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    5  28.3H  38.38% [intr{irq262: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    0  28.1H  36.47% [intr{irq259: t5nex0:0}]
   11 root     155 ki31     0K   128K CPU6    6 419.4H  36.28% [idle{idle: cpu6}]
   11 root     155 ki31     0K   128K RUN     2 418.8H  34.86% [idle{idle: cpu2}]
   11 root     155 ki31     0K   128K RUN     5 419.3H  33.98% [idle{idle: cpu5}]
   11 root     155 ki31     0K   128K CPU0    0 417.2H  33.79% [idle{idle: cpu0}]
   11 root     155 ki31     0K   128K RUN     3 419.2H  33.59% [idle{idle: cpu3}]
   11 root     155 ki31     0K   128K RUN     7 419.5H  32.37% [idle{idle: cpu7}]
   11 root     155 ki31     0K   128K RUN     1 418.8H  32.08% [idle{idle: cpu1}]
   11 root     155 ki31     0K   128K CPU4    4 419.4H  29.98% [idle{idle: cpu4}]
   12 root     -72    -     0K  1360K WAIT    3 817:35  21.68% [intr{swi1: pfsync}]
   12 root     -92    -     0K  1360K WAIT    4 212:37   8.15% [intr{irq310: igb3:que}]
   12 root     -92    -     0K  1360K WAIT    5 214:54   7.08% [intr{irq284: igb0:que}]
   12 root     -92    -     0K  1360K WAIT    3 204:46   6.40% [intr{irq291: igb1:que}]
   12 root     -92    -     0K  1360K WAIT    1 220:40   6.30% [intr{irq280: igb0:que}]
   12 root     -92    -     0K  1360K WAIT    4 206:17   6.30% [intr{irq301: igb2:que}]
   12 root     -92    -     0K  1360K WAIT    3 206:09   5.86% [intr{irq300: igb2:que}]
   12 root     -92    -     0K  1360K WAIT    2 237:13   5.66% [intr{irq299: igb2:que}]
   12 root     -92    -     0K  1360K WAIT    1 226:07   5.66% [intr{irq307: igb3:que}]
   12 root     -92    -     0K  1360K WAIT    5 209:29   5.47% [intr{irq293: igb1:que}]
   12 root     -92    -     0K  1360K WAIT    2 205:22   5.27% [intr{irq290: igb1:que}]
It appears that t5nex0 is using a ton of CPU. I bought the http://store.pfsense.org/C2758/ with the 10Gb Chelsio card, and t5nex0 appears to be the driver for that card. We've had problems in the past where the fiber link on the Chelsio card would go up/down repeatedly, and the only way to resolve it was to physically unplug and replug the cable. That problem would only crop up every month or so. I ended up replacing the twinax cable and it hasn't happened since, but I can't rule out that replacing the cable fixed it entirely. I never noticed this load before, and I'm wondering whether the two are correlated.
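To double-check where the interrupt load is coming from, the per-device rates can be totaled from vmstat -i output. A minimal sketch, with two illustrative sample lines standing in for real output (on the box itself you would pipe vmstat -i straight into the awk filter instead):

```shell
# Sum the per-queue interrupt rate (last column of `vmstat -i`) for the
# Chelsio t5nex device. The sample lines below are made up for illustration.
sample='irq259: t5nex0:0:0   123456789   8037
irq260: t5nex0:0:1   134567890   8179'
printf '%s\n' "$sample" | awk '/t5nex/ { total += $NF } END { print total }'
# Prints 16216 for this sample (8037 + 8179)
```

The same filter with `/igb/` instead of `/t5nex/` would total the igb queue rates for comparison.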
systat -vmstat 1
CPU: 2.7% Sys, 68.0% Intr, 2.0% User, 0.0% Nice, 27.4% Idle
Mem: 5711808K free; disks idle (0 tps, 0.00 MB/s on md0/ada0/pass0)

Interrupts (rate):
 286k total
    4 uart0
   23 ehci0
 1123 cpu0:timer
      t5nex0:evt
 8179 t5nex0:0.0
 7986 t5nex0:0.1
 8210 t5nex0:0.2
 8378 t5nex0:0.3
 7663 t5nex0:0.4
 7918 t5nex0:0.5
 8406 t5nex0:0.6
 8037 t5nex0:0.7
 6824 igb0:que 0
 6668 igb0:que 1
 6490 igb0:que 2
 6552 igb0:que 3
 6708 igb0:que 4
 6774 igb0:que 5
 6638 igb0:que 6
 6676 igb0:que 7
      igb0:link
 6204 igb1:que 0
 6513 igb1:que 1
 6436 igb1:que 2
 6669 igb1:que 3
 6495 igb1:que 4
netstat -m
131257/16253/147510 mbufs in use (current/cache/total)
131221/3287/134508/1000000 mbuf clusters in use (current/cache/total/max)
131221/3122 mbuf+clusters out of packet secondary zone in use (current/cache)
8128/751/8879/13291 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/3938 9k jumbo clusters in use (current/cache/total/max)
64/0/64/2215 16k jumbo clusters in use (current/cache/total/max)
328792K/13641K/342433K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
ps auxw | grep '[p]hp'
root  7815  1.1 0.4 228324 32464 - S  7:54PM  0:00.17 php-fpm: pool lighty (php-fpm)
root   243  0.0 0.3 224228 21308 - Ss 16Sep15 1:16.62 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
root 23896  0.0 0.5 228324 38064 - S  7:55PM  0:00.07 php-fpm: pool lighty (php-fpm)
Some notes:
We aren't doing any unusual traffic today. I have not rebooted anything yet since this machine is in production; I can try a reboot later, after hours. Our failover firewall has no load (as expected).

Thanks for the help!
-
It's your NICs interrupting the CPU. Try going to System - Advanced - Networking. In Network Interfaces, uncheck Enable device polling.
-
Thanks for the reply.
After going to System > Advanced > Networking, I found that Enable device polling was already unchecked. Should I try something else?
-
Sorry, I think I had it backwards. If you have high interrupts, try enabling polling to see if it calms down somewhat.
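For a quick test, polling can also be toggled per-interface from the shell without committing the GUI setting. This assumes the kernel was built with DEVICE_POLLING support, and igb0 below is just an example interface name:

```shell
# Temporarily enable polling on one NIC (assumes DEVICE_POLLING kernel support)
ifconfig igb0 polling
# Watch top -aSH for a change in interrupt load, then revert:
ifconfig igb0 -polling
```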
-
i personally haven't had much luck by enabling polling. it just slowed things down in my case.
have you tried some of the tuning tips here? http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr
i don't own 10gbe equipment, but i don't think it can hurt to try.
-
@KOM:
> Sorry, I think I had it backwards. If you have high interrupts, try enabling polling to see if it calms down somewhat.

I enabled it and the load didn't move too much in either direction. Thank you for the suggestion.

> i personally haven't had much luck by enabling polling. it just slowed things down in my case.
> have you tried some of the tuning tips here? http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr
> i don't own 10gbe equipment, but i don't think it can hurt to try.

I will have to take a look at this. Thanks for providing it.
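For reference, the kind of loader tuning that page walks through looks roughly like the fragment below. Treat it as a sketch, not a drop-in config: the tunable names come from the cxgbe(4) driver and may differ between FreeBSD versions, and the values are assumptions for an 8-core box.

```shell
# /boot/loader.conf.local -- illustrative values only; confirm the exact
# tunable names against cxgbe(4) for your FreeBSD release before applying.
hw.cxgbe.nrxq10g=8          # rx queues per 10G port (here: one per core)
hw.cxgbe.ntxq10g=8          # tx queues per 10G port
hw.cxgbe.toecaps_allowed=0  # a firewall doesn't need the TCP offload engine
```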
On another note, the load appears to be around 5-6 now. I believe this is just lower usage: toward the end of the business day our traffic tapers off significantly, and we only do ~300 Mbps at peak anyway.