Slowly Climbing CPU/Load - updaterrd.sh?
-
We've been using pfSense for about 6 months now on an ALIX 2C3 as a load balancing gateway for a T1 and DSL line with good results. Running pfSense v1.2 embedded.
We also monitor the machine using Cacti and have noticed that CPU usage / load has been climbing steadily over this period of time from basically negligible to now 40% CPU. At this rate, the CPU will be pegged in another 6-8 months or so.
Logging into the machine via the shell and running top shows a load average similar to the CPU utilization, so SNMP does appear to be reporting the correct value:
load averages: 0.60, 0.69, 0.67
Watching top also shows that there are periodic spikes in nice/system load every minute or so:
last pid: 53724;  load averages: 0.87, 0.73, 0.68    up 111+22:54:13  12:57:45
53 processes: 2 running, 49 sleeping, 2 zombie
CPU states:  1.5% user, 50.0% nice, 40.8% system,  7.7% interrupt,  0.0% idle
Mem: 65M Active, 24M Inact, 61M Wired, 34M Buf, 92M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
53712 root        1  -8   20 10304K  9816K piperd   0:00  7.69% sh
95330 root        1  -8   20 10304K  9816K piperd  42.6H  5.62% sh
53694 root        1  -8   20 10304K  9816K piperd   0:00  4.20% sh
53714 root        1 132   20  1464K   888K select   0:00  3.00% ping
53696 root        1 124   20  1464K   888K select   0:00  0.35% ping
Process ID 95330 is /var/db/rrd/updaterrd.sh; the other shell scripts shown above appear to be instances of the same script.
So the question is: why do the RRD scripts appear to be taking more and more CPU time? I thought RRD files only hold a fixed amount of data? The files in /var/db/rrd/ don't seem to be climbing in size.
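For what it's worth, here's a quick way to confirm the databases really are fixed-size (a rough sketch; the quality file name below is just an example and may differ on your install):

# RRD databases are round-robin: each archive holds a fixed number of rows,
# so the files should never grow after they are created.
ls -l /var/db/rrd/                                    # sizes should stay constant from run to run
rrdtool info /var/db/rrd/wan-quality.rrd | grep rows  # shows the fixed row count per archive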
Any ideas?
-
I've also noticed this behavior. I use pfSense on a Toshiba Tecra laptop, and lately I hear the fan running quite a bit and the temperature of the laptop has risen over the past few weeks. The RRD update script is also my culprit.
-
What type of configuration are you using? How many interfaces? How fast is your hardware?
Just trying to get some more information, I'll likely open a bug report soon.
-
OK, I disabled RRD graph generation and now CPU utilization is flat around 25% instead of 55%.
It's still quite a bit higher than when the system was first installed, when CPU utilization was around 10%.
-
I entered a ticket: Ticket #1811
-
Never seen or heard of that. What does your RRD CPU graph look like? The way it's steadily increasing like that makes me think you're graphing the wrong thing as CPU.
-
Yeah, that's what I thought too, but I checked and double-checked the SNMP OIDs in use. The spike you see in the first graph does correspond to an actual spike in CPU load and is matched by the pfSense-generated RRD graph.
But yes, as you suspect, the CPU utilization plotted in the pfSense RRD graphs looks normal. Those graphs don't match the load average of the system, however, which does match what snmpd/Cacti is reporting. This leads me to think that the two are in fact measuring slightly different things.
I suspect that the pfSense RRD graphs measure CPU usage at the instant of polling, while the SNMP daemon may be reporting 100% minus the idle process's share.
I do notice that top reports a value higher than what the idle process in
top -S
shows, and sometimes significantly higher, typically when the pagezero process also shows a bit of CPU time. Could it be system I/O or something?
I've also restarted the RRD process and CPU usage as reported by SNMP hasn't shot back up. Not sure what's going on…
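If anyone wants to compare the two measurements directly, here's a rough sketch I put together (not anything pfSense ships) that derives a busy percentage over a whole interval from FreeBSD's kern.cp_time tick counters, which should be closer to what snmpd reports than an instantaneous top sample:

#!/bin/sh
# Sketch: CPU busy % over a 60-second window, computed from kern.cp_time deltas.
# kern.cp_time returns cumulative ticks as "user nice sys intr idle".
before=$(sysctl -n kern.cp_time)
sleep 60
after=$(sysctl -n kern.cp_time)
set -- $before; u1=$1 n1=$2 s1=$3 i1=$4 id1=$5
set -- $after;  u2=$1 n2=$2 s2=$3 i2=$4 id2=$5
total=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) + (i2 - i1) + (id2 - id1) ))
idle=$(( id2 - id1 ))
echo "busy over interval: $(( 100 * (total - idle) / total ))%"

If that number tracks the SNMP/Cacti figure while an instantaneous top reading doesn't, that would support the measuring-different-things theory.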
-
Compare it to 'top' run via an SSH session. This is almost certainly an SNMP bug or a problem with your setup, not the CPU usage really increasing. That's almost guaranteed since the RRD graph does not show the same increase; there's no way it could show something significantly different from a trend SNMP is showing.
-
Chris, that's what I'm saying:
1. Load average reported by top matches CPU utilization reported by SNMP
2. CPU utilization reported by the internal RRD graphs does not match the load reported by top and SNMP.
Could it be that this happens because the RRD tool polls for data and then updates the RRD files, and that update is what causes the extra load, which the graphs don't pick up because the measurement isn't taken while the update is running?
-
What process does top show is actually causing this then?
-
Still can't replicate and I haven't personally seen this on any system I've ever been logged into (which is a bunch).
Unless we can get specific instructions on how to replicate, or access to a system displaying this behavior, this ticket will be closed in a couple months.
If anyone in this thread can still replicate this, please contact me (cmb at pfsense dot org).
-
Well, I'm still seeing it on 1.2.2. Attached are two graphs: one generated by Cacti, which monitors the pfSense box using SNMP, and another from the internally generated RRD graphs.
You can see that after a month of operation, Nice CPU utilization started going up significantly. Mysteriously, it has recently dropped.
Watching top, I see spikes of nice CPU utilization every minute. Here is a snapshot from top (I had to hit
C
to turn off cumulative weighting to make the load more visible):

last pid: 37003;  load averages: 0.38, 0.43, 0.35    up 60+18:20:10  15:47:17
51 processes: 2 running, 47 sleeping, 2 zombie
CPU states:  2.9% user, 51.4% nice, 45.7% system,  0.0% interrupt,  0.0% idle
Mem: 74M Active, 18M Inact, 58M Wired, 344K Cache, 31M Buf, 93M Free

  PID USERNAME PRI NICE   SIZE    RES STATE    TIME    CPU COMMAND
37003 root     117   20  3276K  1584K RUN      0:00  7.96% pfctl
 1630 root       8   20 10652K  8964K wait    17.3H  6.98% sh
36992 root      -8   20 10652K  8964K piperd   0:00  4.98% sh
36994 root     117   20  3244K  1016K select   0:00  4.98% ping
36972 root      44   20  3244K  1016K select   0:00  3.96% ping
36970 root      -8   20 10652K  8964K piperd   0:00  2.98% sh
A bunch of niced processes, pings and shs. I managed to run a
ps auxfw
at the same time and found these processes:

USER    PID %CPU %MEM   VSZ  RSS  TT  STAT STARTED        TIME COMMAND
root     10 62.0  0.0     0    8  ??  RL   9Jan09  76242:43.65 idle
root   1630  7.0  3.5 10652 8964  d0- SN   9Jan09   1040:10.32 /bin/sh /var/db/rrd/updaterrd.sh
root  36992  5.0  3.5 10652 8964  ??  SN   3:47PM      0:00.09 /bin/sh /var/db/rrd/updaterrd.sh
root  36994  5.0  0.4  3244 1016  ??  SN   3:47PM      0:00.02 ping -c 5 -q x.x.x.x
root  36970  3.0  3.5 10652 8964  ??  SN   3:47PM      0:00.08 /bin/sh /var/db/rrd/updaterrd.sh
root  36972  3.0  0.4  3244 1016  ??  SN   3:47PM      0:00.02 ping -c 5 -q x.x.x.x
root      0  0.0  0.0     0    0  ??  WLs  9Jan09      0:00.00 swapper
So it looks like it's the updaterrd.sh program causing the increase in CPU time.
The best thing I can think of right now would be to add some logging to /var/db/rrd/updaterrd.sh so we can figure out what's taking up the most time.
Right now I can see that the CPU is pegged for about 10 seconds out of every 60 by niced processes, which corresponds with the current charts. It would have been nice to catch this back when CPU utilization was averaging 50%.
Any suggestions on how to add logging to the updaterrd.sh script?
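Not having the script open in front of me, a generic approach would be to timestamp each section of its loop and append the timings to a log, roughly like this (a sketch only: the section name and log path are placeholders, and if pfSense regenerates updaterrd.sh on configuration changes, hand edits may not survive):

# Near the top of /var/db/rrd/updaterrd.sh:
LOG=/var/log/updaterrd-timing.log

# Then wrap each block you want to measure, e.g. the quality/ping section:
start=$(date +%s)
# ... existing ping / rrdtool update commands for this section ...
end=$(date +%s)
echo "$(date '+%Y-%m-%d %H:%M:%S') quality section took $((end - start))s" >> $LOG

Tailing that log for a few minutes should show which section's runtime is actually growing.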
-
I'm seeing the same thing currently on an ALIX 2C3, and for the second time (see the attached graph). Last time I did nothing, and it crept back down on its own, as the graph shows, but it's certainly weird. There's no significant change in traffic during either of these times. I'm running 1.2-embedded. I haven't checked top to see which process it is, but I do know from watching the GUI tools that this CPU use is short spikes, not continuous. It hasn't negatively impacted network performance (why I mostly ignored it last time - didn't even try rebooting) but I'd love to know what's going on.
-
I'm experiencing the climbing CPU/load problem running 1.2.2 embedded on a Soekris Net5501-70. It seemed to kick in after 3 weeks, with the load going from around 10 to around 50 over 2 weeks. Load on this box isn't huge; it handled around 100GB of traffic over the month, split between 2 ADSL lines.
I found that disabling and re-enabling RRD graphs seems to have brought the usage right back down.
Martin
-
Just saw this happen again on my Net5501 box; it seems to be about once a month here.
Martin