Slowly Climbing CPU/Load - updaterrd.sh?
-
Never seen or heard of that, what does your RRD CPU graph look like? The way it's steadily increasing like that makes me think you're graphing the wrong thing as CPU.
-
Yeah, that's what I thought, too, but I checked and double-checked the SNMP OIDs in use. The spike you see in the first graph does correspond with an actual spike in CPU load and is matched by the pfSense-generated RRD graph.
But yes, as you suspect, the CPU utilization plotted in the pfSense RRD graphs looks normal. Those graphs don't match the load average of the system, though, which does match what snmpd/Cacti is reporting. This leads me to think that the two are in fact measuring slightly different things.
I suspect the pfSense RRD graphs measure CPU usage at the instant of the polling interval, while the SNMP daemon could be reporting the difference between the idle system process and 100%.
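One way to test that theory (a rough sketch on a FreeBSD shell; kern.cp_time is the cumulative tick counter, ordered user/nice/sys/intr/idle) would be to diff the raw counters over an interval and see which graph the result matches:

# Sample the cumulative CPU tick counters twice, ten seconds apart,
# then compute busy% the way an idle-based measurement would.
a=$(sysctl -n kern.cp_time)
sleep 10
b=$(sysctl -n kern.cp_time)
echo "$a $b" | awk '{
    dt = ($6+$7+$8+$9+$10) - ($1+$2+$3+$4+$5)  # delta of total ticks
    di = $10 - $5                              # delta of idle ticks
    printf "busy over interval: %.1f%%\n", 100 * (1 - di / dt)
}'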
I do notice that top reports a value higher than the value the idle process in
top -S
shows, and sometimes significantly higher, typically when the pagezero process also shows a bit of CPU time. Could it be system I/O or something?
I've also restarted the RRD process and CPU usage as reported by SNMP hasn't shot back up. Not sure what's going on…
-
Compare it to 'top' run via an SSH session. This is almost certainly an SNMP bug or a problem with your setup, not CPU usage that's really increasing. That's almost guaranteed since the RRD graph does not show the same increase; there's no way it could show something significantly different from a trend SNMP is showing.
-
Chris, that's what I'm saying:
1. Load average reported by top matches CPU utilization reported by SNMP
2. CPU utilization reported by the internal RRD graphs does not match the load reported by top and SNMP
Could it be that this happens because the RRD tool polls for data and then updates the RRD files, and that update is what causes the increase in load - an increase the RRD tool itself never notices, because it doesn't sample at the same time the RRD data is being updated?
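To make the aliasing I mean concrete, here's a toy loop (hypothetical, not the real updaterrd.sh): if the script samples CPU right before doing the heavy update, its own samples always land in the quiet part of the cycle, while snmpd, polling on its own schedule, catches the spikes.

#!/bin/sh
# Hypothetical illustration only - not the shipped updaterrd.sh.
while true; do
    # the poller reads the CPU counters here, before the heavy work...
    sysctl -n kern.cp_time > /tmp/cpu-sample-before-work

    # ...and the expensive part (the rrdtool updates in the real script)
    # runs after the sample, so this loop never records its own spike
    dd if=/dev/zero of=/dev/null bs=1m count=500 2>/dev/null

    sleep 60
done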
-
What process does top show is actually causing this then?
-
Still can't replicate and I haven't personally seen this on any system I've ever been logged into (which is a bunch).
Unless we can get specific instructions on how to replicate, or access to a system displaying this behavior, this ticket will be closed in a couple months.
If anyone in this thread can still replicate this, please contact me (cmb at pfsense dot org).
-
Well, I'm still seeing it on 1.2.2. Attached are two graphs - one generated by Cacti, which monitors the pfSense box over SNMP, and the other from the internally generated RRD graphs.
You can see that after a month of operation, Nice CPU utilization started going up significantly. Mysteriously, it has recently dropped.
Watching top, I see spikes of nice CPU utilization every minute. Here is a snapshot from top (had to hit
C
to turn off cumulative weighting to make the load more visible):

last pid: 37003;  load averages: 0.38, 0.43, 0.35    up 60+18:20:10  15:47:17
51 processes: 2 running, 47 sleeping, 2 zombie
CPU states:  2.9% user, 51.4% nice, 45.7% system,  0.0% interrupt,  0.0% idle
Mem: 74M Active, 18M Inact, 58M Wired, 344K Cache, 31M Buf, 93M Free

  PID USERNAME PRI NICE   SIZE    RES STATE    TIME    CPU COMMAND
37003 root     117   20  3276K  1584K RUN      0:00  7.96% pfctl
 1630 root       8   20 10652K  8964K wait    17.3H  6.98% sh
36992 root      -8   20 10652K  8964K piperd   0:00  4.98% sh
36994 root     117   20  3244K  1016K select   0:00  4.98% ping
36972 root      44   20  3244K  1016K select   0:00  3.96% ping
36970 root      -8   20 10652K  8964K piperd   0:00  2.98% sh
A bunch of niced processes, pings and shs. I managed to run a
ps auxfw
at the same time and found these processes:

USER    PID %CPU %MEM   VSZ  RSS  TT  STAT STARTED         TIME COMMAND
root     10 62.0  0.0     0    8  ??  RL   9Jan09   76242:43.65 idle
root   1630  7.0  3.5 10652 8964  d0- SN   9Jan09    1040:10.32 /bin/sh /var/db/rrd/updaterrd.sh
root  36992  5.0  3.5 10652 8964  ??  SN   3:47PM       0:00.09 /bin/sh /var/db/rrd/updaterrd.sh
root  36994  5.0  0.4  3244 1016  ??  SN   3:47PM       0:00.02 ping -c 5 -q x.x.x.x
root  36970  3.0  3.5 10652 8964  ??  SN   3:47PM       0:00.08 /bin/sh /var/db/rrd/updaterrd.sh
root  36972  3.0  0.4  3244 1016  ??  SN   3:47PM       0:00.02 ping -c 5 -q x.x.x.x
root      0  0.0  0.0     0    0  ??  WLs  9Jan09       0:00.00 swapper
So it looks like it's the updaterrd.sh program causing the increase in CPU time.
Best thing I can think of right now would be to add some logging to /var/db/rrd/updaterrd.sh so we can figure out what's taking up the most time.
Right now I can see that the CPU is pegged for about 10 seconds out of every 60 by nice processes, which corresponds with the current charts. Looks like it would have been nice to catch this while CPU utilization was averaging 50%.
Any suggestions on how to add logging to the updaterrd.sh script?
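As a starting point, something like this might work (a sketch only; the real /var/db/rrd/updaterrd.sh may be laid out differently, and the log path and the example rrdtool call are just placeholders):

# Helper to drop near the top of updaterrd.sh, then wrap each heavy call with it.
log_step() {
    label=$1; shift
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    echo "$(date '+%H:%M:%S') $label took $((end - start))s (rc=$rc)" \
        >> /var/log/updaterrd_timing.log
    return $rc
}

# Example (hypothetical - substitute the script's actual commands):
# log_step "update wan traffic" rrdtool update /var/db/rrd/wan-traffic.rrd ...

That would at least show which rrdtool/ping invocation is eating the time each minute.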
-
I'm seeing the same thing currently on an ALIX 2C3, and for the second time (see the attached graph). Last time I did nothing, and it crept back down on its own, as the graph shows, but it's certainly weird. There's no significant change in traffic during either of these times. I'm running 1.2-embedded. I haven't checked top to see which process it is, but I do know from watching the GUI tools that this CPU use is short spikes, not continuous. It hasn't negatively impacted network performance (why I mostly ignored it last time - didn't even try rebooting) but I'd love to know what's going on.
-
I'm experiencing the climbing CPU/Load problem running 1.2.2 embedded on a Soekris Net5501-70. It seemed to trigger after about 3 weeks, with the load going from around 10 to around 50 over 2 weeks. Loading on this box isn't huge; it handled around 100GB of traffic over the month, split between 2 ADSL lines.
I found that disabling and re-enabling the RRD graphs seems to have brought the usage right back down.
Martin
-
Just seen this happen again on my Net5501 box; it seems to be about once a month here.
Martin