Slowly Climbing CPU/Load - updaterrd.sh?
-
Never seen or heard of that, what does your RRD CPU graph look like? The way it's steadily increasing like that makes me think you're graphing the wrong thing as CPU.
-
Yeah, that's what I thought, too, but I checked and double-checked the SNMP OIDs in use. The spike you see in the first graph does correspond with an actual spike in CPU load and is matched by the pfSense-generated RRD graph.
But yes, as you suspect, the CPU utilization plotted in the pfSense RRD graphs looks normal. Those graphs don't match the load average of the system, though, which does match what snmpd/Cacti is reporting. This leads me to think that the two are in fact measuring slightly different things.
I suspect the pfSense RRD graphs measure CPU usage at the instant of the polling interval, while the SNMP daemon could be reporting the difference between the idle system process and 100%.
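One way to test that theory (a rough sketch on a FreeBSD shell; kern.cp_time is the cumulative tick counter, ordered user/nice/sys/intr/idle) would be to diff the raw counters over an interval and see which graph the result matches:

# Sample the cumulative CPU tick counters twice, ten seconds apart,
# then compute busy% the way an idle-based measurement would.
a=$(sysctl -n kern.cp_time)
sleep 10
b=$(sysctl -n kern.cp_time)
echo "$a $b" | awk '{
    dt = ($6+$7+$8+$9+$10) - ($1+$2+$3+$4+$5)  # delta of total ticks
    di = $10 - $5                              # delta of idle ticks
    printf "busy over interval: %.1f%%\n", 100 * (1 - di / dt)
}'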
I do notice that top reports a value higher than the value the idle process in
top -S
shows, and sometimes significantly higher, typically when the pagezero process also shows a bit of CPU time. Could it be system I/O or something?
I've also restarted the RRD process and CPU usage as reported by SNMP hasn't shot back up. Not sure what's going on…
-
Compare it to 'top' run via an SSH session. This is almost certainly an SNMP bug or a problem with your setup, not CPU usage that's really increasing. That's almost guaranteed since the RRD graph does not show the same increase; there's no way it could show something significantly different from a trend SNMP is showing.
-
Chris, that's what I'm saying:
1. Load average reported by top matches CPU utilization reported by SNMP
2. CPU utilization reported by the internal RRD graphs does not match the load reported by top and SNMP
Could it be that this happens because the RRD tool polls for data and then updates the RRD files, and that update is what causes the increase in load - an increase the RRD tool itself never notices, because it doesn't sample at the same time the RRD data is being updated?
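To make the aliasing I mean concrete, here's a toy loop (hypothetical, not the real updaterrd.sh): if the script samples CPU right before doing the heavy update, its own samples always land in the quiet part of the cycle, while snmpd, polling on its own schedule, catches the spikes.

#!/bin/sh
# Hypothetical illustration only - not the shipped updaterrd.sh.
while true; do
    # the poller reads the CPU counters here, before the heavy work...
    sysctl -n kern.cp_time > /tmp/cpu-sample-before-work

    # ...and the expensive part (the rrdtool updates in the real script)
    # runs after the sample, so this loop never records its own spike
    dd if=/dev/zero of=/dev/null bs=1m count=500 2>/dev/null

    sleep 60
done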
-
What process does top show is actually causing this then?
-
Still can't replicate and I haven't personally seen this on any system I've ever been logged into (which is a bunch).
Unless we can get specific instructions on how to replicate, or access to a system displaying this behavior, this ticket will be closed in a couple months.
If anyone in this thread can still replicate this, please contact me (cmb at pfsense dot org).
-
Well, I'm still seeing it on 1.2.2. Attached are two graphs - one generated by Cacti, which monitors the pfSense box over SNMP, and the other from the internally generated RRD graphs.
You can see that after a month of operation, Nice CPU utilization started going up significantly. Mysteriously, it has recently dropped.
Watching top, I see spikes of nice CPU utilization every minute. Here is a snapshot from top (had to hit
C
to turn off cumulative weighting to make the load more visible):

last pid: 37003;  load averages: 0.38, 0.43, 0.35    up 60+18:20:10  15:47:17
51 processes: 2 running, 47 sleeping, 2 zombie
CPU states:  2.9% user, 51.4% nice, 45.7% system,  0.0% interrupt,  0.0% idle
Mem: 74M Active, 18M Inact, 58M Wired, 344K Cache, 31M Buf, 93M Free

  PID USERNAME PRI NICE   SIZE    RES STATE    TIME    CPU COMMAND
37003 root     117   20  3276K  1584K RUN      0:00  7.96% pfctl
 1630 root       8   20 10652K  8964K wait    17.3H  6.98% sh
36992 root      -8   20 10652K  8964K piperd   0:00  4.98% sh
36994 root     117   20  3244K  1016K select   0:00  4.98% ping
36972 root      44   20  3244K  1016K select   0:00  3.96% ping
36970 root      -8   20 10652K  8964K piperd   0:00  2.98% sh
A bunch of niced processes, pings and shs. I managed to run a
ps auxfw
at the same time and found these processes:

USER    PID %CPU %MEM   VSZ  RSS  TT  STAT STARTED         TIME COMMAND
root     10 62.0  0.0     0    8  ??  RL   9Jan09   76242:43.65 idle
root   1630  7.0  3.5 10652 8964  d0- SN   9Jan09    1040:10.32 /bin/sh /var/db/rrd/updaterrd.sh
root  36992  5.0  3.5 10652 8964  ??  SN   3:47PM       0:00.09 /bin/sh /var/db/rrd/updaterrd.sh
root  36994  5.0  0.4  3244 1016  ??  SN   3:47PM       0:00.02 ping -c 5 -q x.x.x.x
root  36970  3.0  3.5 10652 8964  ??  SN   3:47PM       0:00.08 /bin/sh /var/db/rrd/updaterrd.sh
root  36972  3.0  0.4  3244 1016  ??  SN   3:47PM       0:00.02 ping -c 5 -q x.x.x.x
root      0  0.0  0.0     0    0  ??  WLs  9Jan09       0:00.00 swapper
So it looks like it's the updaterrd.sh program causing the increase in CPU time.
Best thing I can think of right now would be to add some logging to /var/db/rrd/updaterrd.sh so we can figure out what's taking up the most time.
Right now I can see that the CPU is pegged for about 10 seconds out of every 60 by nice processes, which corresponds with the current charts. Looks like it would have been nice to catch this while CPU utilization was averaging 50%.
Any suggestions on how to add logging to the updaterrd.sh script?
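As a starting point, something like this might work (a sketch only; the real /var/db/rrd/updaterrd.sh may be laid out differently, and the log path and the example rrdtool call are just placeholders):

# Helper to drop near the top of updaterrd.sh, then wrap each heavy call with it.
log_step() {
    label=$1; shift
    start=$(date +%s)
    "$@"
    rc=$?
    end=$(date +%s)
    echo "$(date '+%H:%M:%S') $label took $((end - start))s (rc=$rc)" \
        >> /var/log/updaterrd_timing.log
    return $rc
}

# Example (hypothetical - substitute the script's actual commands):
# log_step "update wan traffic" rrdtool update /var/db/rrd/wan-traffic.rrd ...

That would at least show which rrdtool/ping invocation is eating the time each minute.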
-
I'm seeing the same thing currently on an ALIX 2C3, and for the second time (see the attached graph). Last time I did nothing, and it crept back down on its own, as the graph shows, but it's certainly weird. There's no significant change in traffic during either of these times. I'm running 1.2-embedded. I haven't checked top to see which process it is, but I do know from watching the GUI tools that this CPU use is short spikes, not continuous. It hasn't negatively impacted network performance (why I mostly ignored it last time - didn't even try rebooting) but I'd love to know what's going on.
-
I'm experiencing the climbing CPU/Load problem running 1.2.2 embedded on a Soekris Net5501-70. It seemed to trigger after about 3 weeks, with the load going from around 10 to around 50 over 2 weeks. Loading on this box isn't huge; it handled around 100GB of traffic over the month, split between 2 ADSL lines.
I found that disabling and re-enabling the RRD graphs seems to have brought the usage right back down.
Martin
-
Just seen this happen again on my Net5501 box; it seems to be about once a month here.
Martin