Netgate Discussion Forum

    Slowly Climbing CPU/Load - updaterrd.sh?

    General pfSense Questions · 15 Posts · 5 Posters · 7.4k Views
    • drees

      We've been using pfSense for about 6 months now on an ALIX 2C3 as a load balancing gateway for a T1 and DSL line with good results. Running pfSense v1.2 embedded.

      We also monitor the machine using Cacti and have noticed that CPU usage / load has been climbing steadily over this period of time from basically negligible to now 40% CPU. At this rate, the CPU will be pegged in another 6-8 months or so.

      Logging into the machine via the shell and running top shows a load average similar to the CPU utilization, so SNMP does appear to be reporting the correct value:

      load averages: 0.60,  0.69,  0.67
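
      (For a sanity check, the same figure can be pulled straight over SNMP. A minimal example, assuming the UCD-SNMP load table that Cacti's stock templates poll; the host address and community string here are placeholders:)

      # query the 1-, 5- and 15-minute load averages via net-snmp
      snmpget -v 2c -c public 192.168.1.1 \
          UCD-SNMP-MIB::laLoad.1 \
          UCD-SNMP-MIB::laLoad.2 \
          UCD-SNMP-MIB::laLoad.3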

      Watching top also shows that there are periodic spikes in nice/system load every minute or so:

      last pid: 53724;  load averages:  0.87,  0.73,  0.68 up 111+22:54:13 12:57:45
      53 processes:  2 running, 49 sleeping, 2 zombie
      CPU states:  1.5% user, 50.0% nice, 40.8% system,  7.7% interrupt,  0.0% idle
      Mem: 65M Active, 24M Inact, 61M Wired, 34M Buf, 92M Free
      
        PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
      53712 root        1  -8   20 10304K  9816K piperd   0:00  7.69% sh
      95330 root        1  -8   20 10304K  9816K piperd  42.6H  5.62% sh
      53694 root        1  -8   20 10304K  9816K piperd   0:00  4.20% sh
      53714 root        1 132   20  1464K   888K select   0:00  3.00% ping
      53696 root        1 124   20  1464K   888K select   0:00  0.35% ping
      

      Process ID 95330 is /var/db/rrd/updaterrd.sh; the other shell scripts running above appear to be the same script.

      So the question is: why do the RRD scripts appear to be taking more and more CPU time? I thought that RRD files only kept a fixed amount of data? The files in /var/db/rrd/ don't seem to be climbing in size.
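
      (One quick way for anyone to check that on their own box; these are the stock pfSense paths:)

      # RRD files are fixed-size ring buffers, so a healthy file should not grow
      ls -lh /var/db/rrd/*.rrd
      # total disk usage of the RRD directory
      du -sh /var/db/rrd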

      Any ideas?
      [attachment: pfsense-cpu.png]

      • thekurgan

        I've also noticed this behavior.  I use pfSense on a Toshiba Tecra laptop, and now I hear the fan running quite a bit and the temperature of the laptop has risen over the past few weeks.  The RRD update script is also my culprit.

        • drees

          What type of configuration are you using? How many interfaces? How fast is your hardware?

          Just trying to get some more information; I'll likely open a bug report soon.

          • drees

            OK, I disabled RRD graph generation and now CPU utilization is flat at around 25% instead of 55%.

            It's still quite a bit higher than when the system was first installed, when CPU utilization was around 10%.

            [attachment: pfsense-cpu.png]

            • drees

              I entered a ticket: Ticket #1811

              • cmb

                Never seen or heard of that. What does your RRD CPU graph look like? The way it's steadily increasing like that makes me think you're graphing the wrong thing as CPU.

                • drees

                  Yeah, that's what I thought too, but I checked and double-checked the SNMP OIDs in use. The spike you see in the first graph does correspond with an actual spike in CPU load and is matched by the pfSense-generated RRD graph.

                  But yes, as you suspect, the CPU utilization plotted in the pfSense RRD graphs looks normal. Those graphs don't match the load average of the system, though, which does match what snmpd/Cacti is reporting. This leads me to think that the two are in fact measuring slightly different things.

                  I suspect that the pfSense RRD graphs measure CPU usage at the instant of the polling interval, while the SNMP daemon could be reporting 100% minus the time accumulated by the idle system process.
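
                  (The raw counters behind that calculation can be inspected directly. A rough sketch, assuming the UCD MIB is available on the box; Cacti derives its percentages from the deltas of these cumulative tick counters between polls:)

                  # cumulative CPU ticks since boot; poll twice and diff to get percentages
                  snmpget -v 2c -c public 192.168.1.1 \
                      UCD-SNMP-MIB::ssCpuRawUser.0 \
                      UCD-SNMP-MIB::ssCpuRawNice.0 \
                      UCD-SNMP-MIB::ssCpuRawSystem.0 \
                      UCD-SNMP-MIB::ssCpuRawIdle.0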

                  I do notice that top reports a value higher than what the idle process shown by top -S would account for, sometimes significantly higher, typically when the pagezero process also shows a bit of CPU time.

                  Could it be system I/O or something?

                  I've also restarted the RRD process and CPU usage as reported by SNMP hasn't shot back up. Not sure what's going on…

                  • cmb

                    Compare it to 'top' run via an SSH session. This is almost certainly an SNMP bug or a problem with your setup, not CPU usage that really is increasing. That's almost guaranteed since the RRD graph does not show the same increase; there's no way it could show a trend significantly different from the one SNMP is showing.
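
                    (For example, something along these lines gives a quick non-interactive snapshot, assuming SSH access to the firewall:)

                    # two display iterations of top in batch mode; the second display's
                    # CPU states reflect activity during the sampling interval
                    ssh root@192.168.1.1 'top -S -b -d 2'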

                    • drees

                      Chris, that's what I'm saying:

                      1. Load average reported by top matches CPU utilization reported by SNMP
                      2. CPU utilization reported by internal RRD graphs does not match load reported by top and SNMP

                      Could it be that the RRD tool polls for data and then updates the RRD files, and that the update itself is what causes the increase in load? The RRD graphs wouldn't notice it because the polling doesn't run at the same time the RRD data is being updated.

                      • cmb

                        What process does top show is actually causing this then?

                        • cmb

                          Still can't replicate it, and I haven't personally seen this on any system I've ever been logged into (which is a bunch).

                          Unless we can get specific instructions on how to replicate, or access to a system displaying this behavior, this ticket will be closed in a couple months.

                          If anyone in this thread can still replicate this, please contact me (cmb at pfsense dot org).

                          • drees

                            Well, I'm still seeing it on 1.2.2.  Attached are two graphs - one generated by Cacti, which monitors the pfSense box using SNMP, and one from the internally generated RRD graphs.

                            You can see that after a month of operation, Nice CPU utilization started going up significantly.  Mysteriously, it has recently dropped.

                            Watching top, I see spikes of nice CPU utilization every minute.  Here is a snapshot from top (I had to hit C to turn off cumulative weighting to make the load more visible):

                            last pid: 37003;  load averages:  0.38,  0.43,  0.35   up 60+18:20:10  15:47:17
                            51 processes:  2 running, 47 sleeping, 2 zombie
                            CPU states:  2.9% user, 51.4% nice, 45.7% system,  0.0% interrupt,  0.0% idle
                            Mem: 74M Active, 18M Inact, 58M Wired, 344K Cache, 31M Buf, 93M Free
                            
                              PID USERNAME PRI NICE   SIZE    RES STATE    TIME    CPU COMMAND
                            37003 root     117   20  3276K  1584K RUN      0:00  7.96% pfctl
                             1630 root       8   20 10652K  8964K wait    17.3H  6.98% sh
                            36992 root      -8   20 10652K  8964K piperd   0:00  4.98% sh
                            36994 root     117   20  3244K  1016K select   0:00  4.98% ping
                            36972 root      44   20  3244K  1016K select   0:00  3.96% ping
                            36970 root      -8   20 10652K  8964K piperd   0:00  2.98% sh
                            

                            A bunch of niced processes, pings and shs.  I managed to run ps auxfw at the same time and found these processes:

                            USER     PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND
                            root      10 62.0  0.0     0     8  ??  RL    9Jan09 76242:43.65 idle
                            root    1630  7.0  3.5 10652  8964  d0- SN    9Jan09 1040:10.32 /bin/sh /var/db/rrd/updaterrd.sh
                            root   36992  5.0  3.5 10652  8964  ??  SN    3:47PM   0:00.09 /bin/sh /var/db/rrd/updaterrd.sh
                            root   36994  5.0  0.4  3244  1016  ??  SN    3:47PM   0:00.02 ping -c 5 -q x.x.x.x
                            root   36970  3.0  3.5 10652  8964  ??  SN    3:47PM   0:00.08 /bin/sh /var/db/rrd/updaterrd.sh
                            root   36972  3.0  0.4  3244  1016  ??  SN    3:47PM   0:00.02 ping -c 5 -q x.x.x.x
                            root       0  0.0  0.0     0     0  ??  WLs   9Jan09   0:00.00 swapper
                            

                            So it looks like it's the updaterrd.sh program causing the increase in CPU time.

                            Best thing I can think of right now would be to add some logging to /var/db/rrd/updaterrd.sh so we can figure out what's taking up the most time.

                            Right now I can see that the CPU is pegged for about 10 seconds out of every 60 by niced processes, which corresponds with the current charts.  It would have been nice to catch this while CPU utilization was still averaging 50%.

                            Any suggestions on how to add logging to the updaterrd.sh script?
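
                            Something like this is what I have in mind (a rough sketch only; the log path and checkpoint labels are made up, and a mark call would go around each of the script's collection blocks):

                            #!/bin/sh
                            # hypothetical timing instrumentation for /var/db/rrd/updaterrd.sh
                            LOG=/var/log/updaterrd-timing.log

                            mark() {
                                # append a checkpoint label with the current epoch time
                                echo "$(date +%s) $1" >> "$LOG"
                            }

                            mark "cycle start"
                            # ... existing pfctl / ping / rrdtool update commands go here ...
                            mark "after traffic stats"
                            # ... rest of the existing script ...
                            mark "cycle end"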

                            [attachment: pfsense-cpu-rrd.png]
                            [attachment: pfsense-cpu-cacti.png]

                            • BenHead

                              I'm currently seeing the same thing on an ALIX 2C3, and for the second time (see the attached graph).  Last time I did nothing and it crept back down on its own, as the graph shows, but it's certainly weird.  There's no significant change in traffic during either of these periods.  I'm running 1.2-embedded.  I haven't checked top to see which process it is, but I do know from watching the GUI tools that this CPU use comes in short spikes, not continuously.  It hasn't negatively impacted network performance (which is why I mostly ignored it last time; I didn't even try rebooting), but I'd love to know what's going on.

                              [attachment: 1yearcpu.PNG]

                              • martinw

                                I'm experiencing the climbing CPU/load problem running 1.2.2 embedded on a Soekris Net5501-70. It seemed to kick in after 3 weeks, with the load going from around 10 to around 50 over 2 weeks. Load on this box isn't huge; it handled around 100GB of traffic over the month, split between 2 ADSL lines.

                                I found that disabling and re-enabling RRD graphs brought the usage right back down.
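
                                (If it happens again, a rough shell-level equivalent is to kill the long-running updater before re-enabling graphing in the GUI; matching on the script path with pkill is an assumption about how it's launched:)

                                # kill the long-running updater script; match on the full command line
                                pkill -f /var/db/rrd/updaterrd.sh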

                                Martin

                                • martinw

                                  Just seen this happen again on my Net5501 box; it seems to happen about once a month here.

                                  Martin
