High load/interrupt and CPU usage
-
Hello everyone,
I logged into my pfSense box at the datacenter today, noticed higher-than-normal temperatures, and then saw an unusually high load. I SSH'd into the firewall and found the system load averaging around 12. The first thing I did was disable ntopng, since it was using ~20% CPU according to top. After that the load dropped to around 9 and has stayed there.
After running top -aSH I got:
last pid: 74909;  load averages: 10.11, 10.37, 9.96  up 19+12:52:25  19:10:02
256 processes: 14 running, 158 sleeping, 4 zombie, 80 waiting
CPU:  0.7% user,  0.0% nice,  0.4% system, 65.0% interrupt, 33.9% idle
Mem: 33M Active, 1057M Inact, 1015M Wired, 202M Cache, 898M Buf, 5580M Free
Swap: 16G Total, 16G Free

  PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
   12 root     -92    -     0K  1360K WAIT    3  28.3H  46.78% [intr{irq260: t5nex0:0}]
   12 root     -92    -     0K  1360K CPU4    4  28.1H  45.90% [intr{irq265: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    7  28.0H  43.80% [intr{irq266: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    6  28.3H  43.16% [intr{irq261: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    1  28.2H  43.16% [intr{irq263: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    5  28.2H  41.70% [intr{irq264: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    5  28.3H  38.38% [intr{irq262: t5nex0:0}]
   12 root     -92    -     0K  1360K WAIT    0  28.1H  36.47% [intr{irq259: t5nex0:0}]
   11 root     155 ki31     0K   128K CPU6    6 419.4H  36.28% [idle{idle: cpu6}]
   11 root     155 ki31     0K   128K RUN     2 418.8H  34.86% [idle{idle: cpu2}]
   11 root     155 ki31     0K   128K RUN     5 419.3H  33.98% [idle{idle: cpu5}]
   11 root     155 ki31     0K   128K CPU0    0 417.2H  33.79% [idle{idle: cpu0}]
   11 root     155 ki31     0K   128K RUN     3 419.2H  33.59% [idle{idle: cpu3}]
   11 root     155 ki31     0K   128K RUN     7 419.5H  32.37% [idle{idle: cpu7}]
   11 root     155 ki31     0K   128K RUN     1 418.8H  32.08% [idle{idle: cpu1}]
   11 root     155 ki31     0K   128K CPU4    4 419.4H  29.98% [idle{idle: cpu4}]
   12 root     -72    -     0K  1360K WAIT    3 817:35  21.68% [intr{swi1: pfsync}]
   12 root     -92    -     0K  1360K WAIT    4 212:37   8.15% [intr{irq310: igb3:que}]
   12 root     -92    -     0K  1360K WAIT    5 214:54   7.08% [intr{irq284: igb0:que}]
   12 root     -92    -     0K  1360K WAIT    3 204:46   6.40% [intr{irq291: igb1:que}]
   12 root     -92    -     0K  1360K WAIT    1 220:40   6.30% [intr{irq280: igb0:que}]
   12 root     -92    -     0K  1360K WAIT    4 206:17   6.30% [intr{irq301: igb2:que}]
   12 root     -92    -     0K  1360K WAIT    3 206:09   5.86% [intr{irq300: igb2:que}]
   12 root     -92    -     0K  1360K WAIT    2 237:13   5.66% [intr{irq299: igb2:que}]
   12 root     -92    -     0K  1360K WAIT    1 226:07   5.66% [intr{irq307: igb3:que}]
   12 root     -92    -     0K  1360K WAIT    5 209:29   5.47% [intr{irq293: igb1:que}]
   12 root     -92    -     0K  1360K WAIT    2 205:22   5.27% [intr{irq290: igb1:que}]
It appears that t5nex0 is using a ton of CPU. I bought the http://store.pfsense.org/C2758/ with the 10Gb Chelsio card, and t5nex0 appears to be the driver for that card. We've had problems in the past where the fiber link on the Chelsio card would go up/down repeatedly, and the only way to resolve it was to physically unplug and replug the cable. That problem would only crop up every month or so. I ended up replacing the twinax cable and it hasn't happened since, but I can't rule out that replacing the cable fixed it entirely. I never noticed this load before, and I'm wondering whether the two are correlated.
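To double-check where the interrupt load is coming from, the per-device rates can be totaled from vmstat -i output. A minimal sketch, with two illustrative sample lines standing in for real output (on the box itself you would pipe vmstat -i straight into the awk filter instead):

```shell
# Sum the per-queue interrupt rate (last column of `vmstat -i`) for the
# Chelsio t5nex device. The sample lines below are made up for illustration.
sample='irq259: t5nex0:0:0   123456789   8037
irq260: t5nex0:0:1   134567890   8179'
printf '%s\n' "$sample" | awk '/t5nex/ { total += $NF } END { print total }'
# Prints 16216 for this sample (8037 + 8179)
```

The same filter with `/igb/` instead of `/t5nex/` would total the igb queue rates for comparison.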
systat -vmstat 1
CPU: 2.7% Sys, 68.0% Intr, 2.0% User, 0.0% Nice, 27.4% Idle
Mem: 5711808K free; disks idle (0 tps, 0.00 MB/s on md0/ada0/pass0)

Interrupts (rate):
 286k total
    4 uart0
   23 ehci0
 1123 cpu0:timer
      t5nex0:evt
 8179 t5nex0:0.0
 7986 t5nex0:0.1
 8210 t5nex0:0.2
 8378 t5nex0:0.3
 7663 t5nex0:0.4
 7918 t5nex0:0.5
 8406 t5nex0:0.6
 8037 t5nex0:0.7
 6824 igb0:que 0
 6668 igb0:que 1
 6490 igb0:que 2
 6552 igb0:que 3
 6708 igb0:que 4
 6774 igb0:que 5
 6638 igb0:que 6
 6676 igb0:que 7
      igb0:link
 6204 igb1:que 0
 6513 igb1:que 1
 6436 igb1:que 2
 6669 igb1:que 3
 6495 igb1:que 4
netstat -m
131257/16253/147510 mbufs in use (current/cache/total)
131221/3287/134508/1000000 mbuf clusters in use (current/cache/total/max)
131221/3122 mbuf+clusters out of packet secondary zone in use (current/cache)
8128/751/8879/13291 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/3938 9k jumbo clusters in use (current/cache/total/max)
64/0/64/2215 16k jumbo clusters in use (current/cache/total/max)
328792K/13641K/342433K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
ps auxw | grep '[p]hp'
root  7815  1.1 0.4 228324 32464 - S  7:54PM  0:00.17 php-fpm: pool lighty (php-fpm)
root   243  0.0 0.3 224228 21308 - Ss 16Sep15 1:16.62 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
root 23896  0.0 0.5 228324 38064 - S  7:55PM  0:00.07 php-fpm: pool lighty (php-fpm)
Some notes:
We aren't doing any unusual traffic today. I have not rebooted anything yet since this machine is in production; I can try a reboot later, after hours. Our failover firewall has no load (as expected).

Thanks for the help!
-
It's your NICs interrupting the CPU. Try going to System - Advanced - Networking. In Network Interfaces, uncheck Enable device polling.
-
Thanks for the reply.
After going to System > Advanced > Networking, I found that Enable device polling was already unchecked. Should I try something else?
-
Sorry, I think I had it backwards. If you have high interrupts, try enabling polling to see if it calms down somewhat.
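For a quick test, polling can also be toggled per-interface from the shell without committing the GUI setting. This assumes the kernel was built with DEVICE_POLLING support, and igb0 below is just an example interface name:

```shell
# Temporarily enable polling on one NIC (assumes DEVICE_POLLING kernel support)
ifconfig igb0 polling
# Watch top -aSH for a change in interrupt load, then revert:
ifconfig igb0 -polling
```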
-
i personally haven't had much luck by enabling polling. it just slowed things down in my case.
have you tried some of the tuning tips here? http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr
i don't own 10gbe equipment, but i don't think it can hurt to try.
-
@KOM:
> Sorry, I think I had it backwards. If you have high interrupts, try enabling polling to see if it calms down somewhat.

I enabled it and the load didn't move too much in either direction. Thank you for the suggestion.

> i personally haven't had much luck by enabling polling. it just slowed things down in my case.
> have you tried some of the tuning tips here? http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr
> i don't own 10gbe equipment, but i don't think it can hurt to try.

I will have to take a look at this. Thanks for providing it.
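For reference, the kind of loader tuning that page walks through looks roughly like the fragment below. Treat it as a sketch, not a drop-in config: the tunable names come from the cxgbe(4) driver and may differ between FreeBSD versions, and the values are assumptions for an 8-core box.

```shell
# /boot/loader.conf.local -- illustrative values only; confirm the exact
# tunable names against cxgbe(4) for your FreeBSD release before applying.
hw.cxgbe.nrxq10g=8          # rx queues per 10G port (here: one per core)
hw.cxgbe.ntxq10g=8          # tx queues per 10G port
hw.cxgbe.toecaps_allowed=0  # a firewall doesn't need the TCP offload engine
```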
On another note, the load appears to be around 5-6 now. I believe this is just lower usage: toward the end of the business day our traffic tapers off significantly, and we only do ~300 Mbps at peak anyway.