High Load Troubleshooting - RCC-VE 2440 2.2.5
-
Hi,
I read https://doc.pfsense.org/index.php/High_Load_Troubleshooting as I am trying to track down an elusive CPU usage problem on a new RCC-VE2440 / 2.2.5 build I'm using at home. Once in a while after some as-yet-undetermined set of events, the CPU usage on the dashboard starts hovering around 20% and not dropping down. It will dip to 18, up to 21 etc but generally just stay "high" even when not under any load.
I have tried looking at output from "top -aSH" as well as "systat -vmstat 1" and I honestly don't know how to track down the culprit (nothing stands out).
I have PowerD enabled (HiAdaptive) and also the NUT package installed attached to an APC UPS via USB (I suspected NUT/USB drivers but even stopping the nut daemon and unplugging the USB cord the CPU usage remained high)
Is there any more detail on how to "drill in" to what's causing this high-cpu issue? Do long-term detailed CPU monitoring? thanks
-
I have tried looking at output from "top -aSH" as well as "systat -vmstat 1" and I honestly don't know how to track down the culprit (nothing stands out).
Post the output of those and we can tell you. Otherwise no way to even guess.
-
sorry- I had to wait for the problem to occur again
here is some output:top -aSH
last pid: 42450; load averages: 0.10, 0.07, 0.02 122 processes: 3 running, 93 sleeping, 26 waiting CPU: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle Mem: 40M Active, 73M Inact, 150M Wired, 167M Buf, 3636M Free Swap: 8192M Total, 8192M Free PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND 11 root 155 ki31 0K 32K RUN 1 179:29 100.00% [idle{idle: cpu1}] 11 root 155 ki31 0K 32K CPU0 0 178:18 100.00% [idle{idle: cpu0}] 3828 root 52 0 233M 32824K accept 1 0:00 0.59% php-fpm: pool lighty (php-fpm) 0 root -16 0 0K 288K swapin 1 0:36 0.00% [kernel{swapper}] 12 root -92 - 0K 416K WAIT 0 0:31 0.00% [intr{irq259: igb1:que}] 12 root -92 - 0K 416K WAIT 0 0:22 0.00% [intr{irq256: igb0:que}] 15 root -16 - 0K 16K - 0 0:13 0.00% [rand_harvestq] 5 root -16 - 0K 16K pftm 1 0:10 0.00% [pf purge] 12 root -60 - 0K 416K WAIT 0 0:10 0.00% [intr{swi4: clock}] 22002 root 20 0 12456K 2172K select 0 0:09 0.00% /usr/local/sbin/apinger -c /var/etc/apinger.conf 86962 root 20 0 18984K 2688K select 0 0:09 0.00% /usr/local/libexec/nut/usbhid-ups -a ups 91411 root 20 0 18984K 2688K select 0 0:08 0.00% /usr/local/libexec/nut/usbhid-ups -a ups 12 root -92 - 0K 416K WAIT 1 0:08 0.00% [intr{irq257: igb0:que}] 12 root -92 - 0K 416K WAIT 1 0:08 0.00% [intr{irq260: igb1:que}] 71277 root 20 0 55628K 6108K select 0 0:06 0.00% sshd: admin@pts/0 (sshd) 16 root -68 - 0K 64K - 1 0:05 0.00% [usb{usbus0}] 50575 root 20 0 14540K 1992K select 0 0:04 0.00% /usr/sbin/powerd -b hadp -a hadp -n hadp 12 root -88 - 0K 416K WAIT 0 0:04 0.00% [intr{irq22: ehci0}] 90883 root 20 0 18844K 2488K select 1 0:03 0.00% /usr/local/sbin/upsd -u root 48457 root 52 20 17136K 2436K wait 0 0:03 0.00% /bin/sh /var/db/rrd/updaterrd.sh 10588 root 20 0 28344K 18104K select 1 0:03 0.00% /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.pid{ntpd} 20 root 16 - 0K 16K syncer 1 0:02 0.00% [syncer] 16 root -72 - 0K 64K - 1 0:02 0.00% [usb{usbus0}] 27214 root 20 0 48692K 6036K kqread 0 0:02 0.00% /usr/local/sbin/lighttpd -f /var/etc/lighty-webConfigurator.conf 244 root 44 0 229M 21852K kqread 0 0:02 0.00% php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm) 12 root -72 - 0K 416K WAIT 1 0:01 0.00% [intr{swi1: netisr 1}] 74277 root 20 0 17476K 4056K pause 1 0:01 0.00% /bin/tcsh 97371 uucp 20 0 18832K 2520K nanslp 0 0:01 0.00% /usr/local/sbin/upsmon ups@localhost 7 root -16 - 0K 16K psleep 0 0:01 0.00% [pagedaemon] 14 root -8 - 0K 48K - 0 0:01 0.00% [geom{g_event}] 12 root -88 - 0K 416K WAIT 1 0:01 0.00% [intr{irq268: ahci0}] 16 root -68 - 0K 64K - 0 0:01 0.00% [usb{usbus0}] 17661 root 20 0 16804K 2200K bpf 0 0:01 0.00% /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid 62 root -8 - 0K 16K mdwait 1 0:01 0.00% [md0] 18 root 20 - 0K 32K sdflus 0 0:01 0.00% [bufdaemon{/ worker}] 38529 root 20 0 14532K 2228K select 0 0:01 0.00% /usr/local/sbin/radvd -p /var/run/radvd.pid -C /var/etc/radvd.conf -m syslog 22306 root 20 0 28328K 3004K piperd 0 0:00 0.00% rrdtool - 65334 root 20 0 14656K 2336K select 1 0:00 0.00% /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run/syslog.pid -f /var/etc/syslog.conf 18 root -16 - 0K 32K psleep 0 0:00 0.00% [bufdaemon{bufdaemon}] 19 root -16 - 0K 16K vlruwt 0 0:00 0.00% [vnlru] 84285 proxy 20 0 12448K 2124K kqread 0 0:00 0.00% /usr/sbin/ftp-proxy 4 root -16 - 0K 32K - 0 0:00 0.00% [cam{doneq0}] 99065 unbound 20 0 47036K 14404K kqread 1 0:00 0.00% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound} 90103 root 20 0 21988K 3232K CPU1 1 0:00 0.00% top -aSH 99065 unbound 20 0 47036K 14404K kqread 1 0:00 0.00% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound} 0 root 8 0 0K 288K - 1 0:00 0.00% [kernel{thread taskq}] 27598 root 20 0 16664K 2300K nanslp 0 0:00 0.00% /usr/sbin/cron -s 2007 root 20 0 14560K 2184K nanslp 0 0:00 0.00% /usr/libexec/getty al.Pc ttyv0 9 root -16 ki-1 0K 16K pollid 0 0:00 0.00% [idlepoll] 4 root -16 - 0K 32K - 0 0:00 0.00% [cam{scanner}] 0 root 8 0 0K 288K - 0 0:00 0.00% [kernel{ffs_trim taskq}] 1 root 52 0 9112K 792K wait 1 0:00 0.00% [init] 276 root 20 0 13160K 4476K select 0 0:00 0.00% /sbin/devd -q
systat -vmstat 1
2 users Load 0.08 0.09 0.03 Dec 8 01:04 Mem:KB REAL VIRTUAL VN PAGER SWAP PAGER Tot Share Tot Share Free in out in out Act 125912 9412 1200760 10336 3713544 count All 205768 14056 1261620 59092 pages Proc: Interrupts r p d s w Csw Trp Sys Int Sof Flt ioflt 1645 total 1 51 1688 4850 8767 98 26 5786 915 cow uart1 3 3299 zfod 12 ehci0 22 19.8%Sys 0.4%Intr 17.1%User 0.0%Nice 62.8%Idle ozfod 1128 cpu0:timer | | | | | | | | | | %ozfod 35 igb0:que 0 ==========>>>>>>>>> daefr 10 igb0:que 1 31 dtbuf 1758 prcfr igb0:link Namei Name-cache Dir-cache 85468 desvn 3227 totfr 31 igb1:que 0 Calls hits % hits % 2778 numvn react 11 igb1:que 1 4545 4535 100 2 0 492 frevn pdwak igb1:link 17 pdpgs ahci0 268 Disks md0 da0 ada0 pass0 pass1 intrn 418 cpu1:timer KB/t 0.00 0.00 0.00 0.00 0.00 158452 wire tps 12 0 0 0 0 44712 act MB/s 0.00 0.00 0.00 0.00 0.00 76524 inact %busy 0 0 0 0 0 cache 3713560 free 175472 buf
-
I think it's hard to capture what I'm seeing here in a screen full of text. So I made a small screen recording showing the dashboard with a systat -vmstat 0.2 running on top of it. You can see there are these quick "surges" of activity and the dashboard CPU stays pretty steady around 20% even though the system is 99%+ idle. This is what I am trying to track down if you have any other ideas
-
After 24h the Processor RRD shows almost no CPU usage…
even though dashboard shows high load avgs … 0.11, 0.16, 0.10
mem usage seems normal, state table very small etc.
Not sure what's going on?? -
I think the dashboard is a CPU load generator, not exactly something usable for serious monitoring.
-
Yes I know that the php scripts etc on the dashboard cause some load but 20% continuous is not something I'm seeing on the other systems I've set up. Also, I can observe the same cpu spikes with systat (see video above) so something is definitely going on here.
-
0.1 is not a high load average at all, that's very low. Anything under 1 is generally normal, and values above that can be specious.
Brief small spikes of 20% that are so trivial the 1 minute average is under 0.5% total CPU usage are also normal. The RRD stats gathering for instance will chew a bit of CPU for a fraction of a second once a minute to pull the stats. Other things will do similar depending on what you're running. You're not seeing 20% continuous. Everything there looks perfectly normal.
-
It's weird though because of many many other systems that I've set up, I've not seen one that idles at 20% on the dashboard. They are all at 0-2%. I understand spikes to 20% here and there when pages are loaded or background tasks are run but to have it sit steadily at that level is something I have not seen. Where is the code that generates the load % on the dashboard? Maybe somehow it's not giving accurate readings? I mean in this case if you look at the RRD graph it certainly doesn't seem to correlate with what the dash is showing…
-
Maybe your dashboard widget isn't actually updating? The initial load is always higher but falls off quickly as it updates. If it stays at exactly the same value, it's definitely not updating, which would be your browser's AJAX. Try clearing cache, or a diff browser or machine, see if that's any different.
-
I had thought that too but if you look at the video I posted, you can see the dash value is actually updating, it's just staying steadily elevated. I think I am missing the FreeBSD mojo to dig into what is causing this… interrupts??