Slowly Climbing CPU/Load - updaterrd.sh?
-
We've been using pfSense for about 6 months now on an ALIX 2C3 as a load balancing gateway for a T1 and DSL line with good results. Running pfSense v1.2 embedded.
We also monitor the machine using Cacti and have noticed that CPU usage / load has been climbing steadily over this period of time from basically negligible to now 40% CPU. At this rate, the CPU will be pegged in another 6-8 months or so.
Logging into the machine via the shell and running top shows a load average similar to the CPU utilization, so SNMP does appear to be reporting the correct value:
load averages: 0.60, 0.69, 0.67
Watching top also shows that there are periodic spikes in nice/system load every minute or so:
last pid: 53724;  load averages: 0.87, 0.73, 0.68    up 111+22:54:13  12:57:45
53 processes: 2 running, 49 sleeping, 2 zombie
CPU states:  1.5% user, 50.0% nice, 40.8% system,  7.7% interrupt,  0.0% idle
Mem: 65M Active, 24M Inact, 61M Wired, 34M Buf, 92M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
53712 root        1  -8   20 10304K  9816K piperd   0:00  7.69% sh
95330 root        1  -8   20 10304K  9816K piperd  42.6H  5.62% sh
53694 root        1  -8   20 10304K  9816K piperd   0:00  4.20% sh
53714 root        1 132   20  1464K   888K select   0:00  3.00% ping
53696 root        1 124   20  1464K   888K select   0:00  0.35% ping
Process ID 95330 is /var/db/rrd/updaterrd.sh; the other shell scripts shown above appear to be instances of the same script.
So the question is: why do the RRD scripts appear to be taking more and more CPU time? I thought RRD files only hold a fixed amount of data? The files in /var/db/rrd/ don't seem to be climbing in size.
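For what it's worth, here's a quick way to confirm the databases really are fixed-size (a rough sketch; the quality file name below is just an example and may differ on your install):

# RRD databases are round-robin: each archive holds a fixed number of rows,
# so the files should never grow after they are created.
ls -l /var/db/rrd/                                    # sizes should stay constant from run to run
rrdtool info /var/db/rrd/wan-quality.rrd | grep rows  # shows the fixed row count per archive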
Any ideas?
-
I've also noticed this behavior. I use pfSense on a Toshiba Tecra laptop, and lately I hear the fan running quite a bit and the temperature of the laptop has risen over the past few weeks. The RRD update script is also my culprit.
-
What type of configuration are you using? How many interfaces? How fast is your hardware?
Just trying to get some more information, I'll likely open a bug report soon.
-
OK, I disabled RRD graph generation and now CPU utilization is flat around 25% instead of 55%.
It's still quite a bit higher than when the system was first installed, when CPU utilization was around 10%.
-
I entered a ticket: Ticket #1811
-
Never seen or heard of that. What does your RRD CPU graph look like? The way it's steadily increasing like that makes me think you're graphing the wrong thing as CPU.
-
Yeah, that's what I thought too, but I checked and double-checked the SNMP OIDs in use. The spike you see in the first graph does correspond to an actual spike in CPU load and is matched by the pfSense-generated RRD graph.
But yes, as you suspect, the CPU utilization plotted in the pfSense RRD graphs looks normal. Those graphs don't match the load average of the system, however, which does match what snmpd/Cacti is reporting. This leads me to think that the two are in fact measuring slightly different things.
I suspect that the pfSense RRD graphs measure CPU usage at the instant of polling, while the SNMP daemon may be reporting 100% minus the idle process's share.
I do notice that top reports a value higher than what the idle process in
top -S
shows, and sometimes significantly higher, typically when the pagezero process also shows a bit of CPU time. Could it be system I/O or something?
I've also restarted the RRD process and CPU usage as reported by SNMP hasn't shot back up. Not sure what's going on…
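If anyone wants to compare the two measurements directly, here's a rough sketch I put together (not anything pfSense ships) that derives a busy percentage over a whole interval from FreeBSD's kern.cp_time tick counters, which should be closer to what snmpd reports than an instantaneous top sample:

#!/bin/sh
# Sketch: CPU busy % over a 60-second window, computed from kern.cp_time deltas.
# kern.cp_time returns cumulative ticks as "user nice sys intr idle".
before=$(sysctl -n kern.cp_time)
sleep 60
after=$(sysctl -n kern.cp_time)
set -- $before; u1=$1 n1=$2 s1=$3 i1=$4 id1=$5
set -- $after;  u2=$1 n2=$2 s2=$3 i2=$4 id2=$5
total=$(( (u2 - u1) + (n2 - n1) + (s2 - s1) + (i2 - i1) + (id2 - id1) ))
idle=$(( id2 - id1 ))
echo "busy over interval: $(( 100 * (total - idle) / total ))%"

If that number tracks the SNMP/Cacti figure while an instantaneous top reading doesn't, that would support the measuring-different-things theory.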
-
Compare it to 'top' run via an SSH session. This is almost certainly an SNMP bug or a problem with your setup, not the CPU usage really increasing. That's almost guaranteed since the RRD graph does not show the same increase; there's no way it could show something significantly different from a trend SNMP is showing.
-
Chris, that's what I'm saying:
1. Load average reported by top matches CPU utilization reported by SNMP
2. CPU utilization reported by the internal RRD graphs does not match the load reported by top and SNMP.
Could it be that this happens because the RRD tool polls for data and then updates the RRD files, and that update is what causes the extra load, which the graphs don't pick up because the measurement isn't taken while the update is running?
-
What process does top show is actually causing this then?
-
Still can't replicate and I haven't personally seen this on any system I've ever been logged into (which is a bunch).
Unless we can get specific instructions on how to replicate, or access to a system displaying this behavior, this ticket will be closed in a couple months.
If anyone in this thread can still replicate this, please contact me (cmb at pfsense dot org).
-
Well, I'm still seeing it on 1.2.2. Attached are two graphs: one generated by Cacti, which monitors the pfSense box using SNMP, and another from the internally generated RRD graphs.
You can see that after a month of operation, Nice CPU utilization started going up significantly. Mysteriously, it has recently dropped.
Watching top, I see spikes of nice CPU utilization every minute. Here is a snapshot from top (I had to hit
C
to turn off cumulative weighting to make the load more visible):

last pid: 37003;  load averages: 0.38, 0.43, 0.35    up 60+18:20:10  15:47:17
51 processes: 2 running, 47 sleeping, 2 zombie
CPU states:  2.9% user, 51.4% nice, 45.7% system,  0.0% interrupt,  0.0% idle
Mem: 74M Active, 18M Inact, 58M Wired, 344K Cache, 31M Buf, 93M Free

  PID USERNAME PRI NICE   SIZE    RES STATE    TIME    CPU COMMAND
37003 root     117   20  3276K  1584K RUN      0:00  7.96% pfctl
 1630 root       8   20 10652K  8964K wait    17.3H  6.98% sh
36992 root      -8   20 10652K  8964K piperd   0:00  4.98% sh
36994 root     117   20  3244K  1016K select   0:00  4.98% ping
36972 root      44   20  3244K  1016K select   0:00  3.96% ping
36970 root      -8   20 10652K  8964K piperd   0:00  2.98% sh
A bunch of niced processes, pings and shs. I managed to run a
ps auxfw
at the same time and found these processes:

USER    PID %CPU %MEM   VSZ  RSS  TT  STAT STARTED        TIME COMMAND
root     10 62.0  0.0     0    8  ??  RL   9Jan09  76242:43.65 idle
root   1630  7.0  3.5 10652 8964  d0- SN   9Jan09   1040:10.32 /bin/sh /var/db/rrd/updaterrd.sh
root  36992  5.0  3.5 10652 8964  ??  SN   3:47PM      0:00.09 /bin/sh /var/db/rrd/updaterrd.sh
root  36994  5.0  0.4  3244 1016  ??  SN   3:47PM      0:00.02 ping -c 5 -q x.x.x.x
root  36970  3.0  3.5 10652 8964  ??  SN   3:47PM      0:00.08 /bin/sh /var/db/rrd/updaterrd.sh
root  36972  3.0  0.4  3244 1016  ??  SN   3:47PM      0:00.02 ping -c 5 -q x.x.x.x
root      0  0.0  0.0     0    0  ??  WLs  9Jan09      0:00.00 swapper
So it looks like it's the updaterrd.sh program causing the increase in CPU time.
The best thing I can think of right now would be to add some logging to /var/db/rrd/updaterrd.sh so we can figure out what's taking up the most time.
Right now I can see that the CPU is pegged for about 10 seconds out of every 60 by niced processes, which corresponds with the current charts. It would have been nice to catch this back when CPU utilization was averaging 50%.
Any suggestions on how to add logging to the updaterrd.sh script?
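Not having the script open in front of me, a generic approach would be to timestamp each section of its loop and append the timings to a log, roughly like this (a sketch only: the section name and log path are placeholders, and if pfSense regenerates updaterrd.sh on configuration changes, hand edits may not survive):

# Near the top of /var/db/rrd/updaterrd.sh:
LOG=/var/log/updaterrd-timing.log

# Then wrap each block you want to measure, e.g. the quality/ping section:
start=$(date +%s)
# ... existing ping / rrdtool update commands for this section ...
end=$(date +%s)
echo "$(date '+%Y-%m-%d %H:%M:%S') quality section took $((end - start))s" >> $LOG

Tailing that log for a few minutes should show which section's runtime is actually growing.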
-
I'm seeing the same thing currently on an ALIX 2C3, and for the second time (see the attached graph). Last time I did nothing, and it crept back down on its own, as the graph shows, but it's certainly weird. There's no significant change in traffic during either of these times. I'm running 1.2-embedded. I haven't checked top to see which process it is, but I do know from watching the GUI tools that this CPU use is short spikes, not continuous. It hasn't negatively impacted network performance (why I mostly ignored it last time - didn't even try rebooting) but I'd love to know what's going on.
-
I'm experiencing the climbing CPU/load problem running 1.2.2 embedded on a Soekris Net5501-70. It seemed to kick in after 3 weeks, with the load going from around 10 to around 50 over 2 weeks. Load on this box isn't huge; it handled around 100GB of traffic over the month, split between 2 ADSL lines.
I found that disabling and re-enabling RRD graphs seems to have brought the usage right back down.
Martin
-
Just saw this happen again on my Net5501 box; it seems to be about once a month here.
Martin