RRD updates stalled by top …



  • I brought up a pfSense box to play with a few weeks ago as a replacement for a Soekris OpenVPN appliance we'd been using for several years.

    The unit is an Intel ISP1100, 850MHz, 256MB, 10GB, on pfSense 1.2.  Unit had previously been doing other duties and is known to work well with FreeBSD.

    The OpenVPN configuration is a bridging config, which incidentally seems to work very well, though it took a little fudging to fit it into the existing pfSense framework, and we hit some minor versioning annoyances.

    One of the potentially most useful features was the built-in RRD support, which provides traffic graphing.  This was working well for weeks.  I logged into the box this morning, after a long time away, and found blank graphs for the last three or four days.  A quick guess had me ps for rrd, which led me …

    ps agxuww|grep rrd

    root  40373  0.0  1.6  3648  3088  ??  IN  Sat08PM  0:00.02 /bin/sh /var/db/rrd/updaterrd.sh
    root    862  0.0  1.6  3648  3088  d0- IN  26May08 104:20.04 /bin/sh /var/db/rrd/updaterrd.sh

    ps agxlww|grep 40373

    0 40373  862  4  8 20  3648  3088 wait  IN    ??    0:00.02 /bin/sh /var/db/rrd/updaterrd.sh
        0 40374 40373  0  20 20  2268  1448 pause  IN    ??    0:00.02 /usr/bin/top -d 2 -s 1 0
        0 40375 40373  0  -8 20  1564  1028 piperd IN    ??    0:00.01 [awk]
        0 39209  1533  1  96  0  3948  2724 -      RV    d0    0:00.00 grep 40373 (tcsh)

    Okay, that looks wrong …

    kill -9 40374

    Fixed it.  I do know that years ago, we gave up on trying to pull data out of top via scripts because of problems that seem to resemble this (though that was via SNMP scripts in ucd-snmp).  Has anyone else seen this sort of problem with pfSense?  Obviously, it's not a real big deal or anything, but having graphs is nice.

    In any case, excellent job on pfSense.  It's a really slick package.



  • Ah, looks like you might have found the cause of a problem that a few folks have run into but that we haven't been able to replicate or track down. Thanks for posting this! We'll check into it. Sounds like we need to find a way to get that info without top.



  • Do you have any suggestions on how we could get load averages, idle, sys and interrupt time, in a way that won't cause similar issues?



  • My graphs have been stopping quite a bit lately but I do not have 2 instances of the rrd update script nor do I see top running.  Think we are chasing multiple problems here potentially.



  • @sullrich:

    My graphs have been stopping quite a bit lately but I do not have 2 instances of the rrd update script nor do I see top running.  Think we are chasing multiple problems here potentially.

    Quite possibly. I do know at least a handful of people have seen the 2 instances running, though. This is the first I've heard of it stopping data gathering without running 2 instances.



  • @cmb:

    Do you have any suggestions on how we could get load averages, idle, sys and interrupt time, in a way that won't cause similar issues?

    Use the underlying sysctls. Load average is easy - that's vm.loadavg.
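
    As a rough sketch of the load-average half: this assumes sysctl -n vm.loadavg prints the three averages wrapped in braces, e.g. "{ 0.42 0.31 0.25 }". The function name parse_loadavg is made up for illustration and is not part of pfSense or updaterrd.sh.

```shell
#!/bin/sh
# Hypothetical helper: extract the 1-, 5- and 15-minute load averages
# from a vm.loadavg reading passed in as a single string.
# Assumed input format: "{ 0.42 0.31 0.25 }"
parse_loadavg() {
    # Drop the braces, then rely on word splitting to isolate the fields.
    set -- $(printf '%s\n' "$1" | tr -d '{}')
    printf '%s %s %s\n' "$1" "$2" "$3"
}

# On a FreeBSD/pfSense box this would be called as:
#   parse_loadavg "$(sysctl -n vm.loadavg)"
```

    Taking the reading as an argument keeps the parsing separate from the sysctl call, so it can be checked with a canned string on any machine.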

    The other stuff can be accessed via the kern.cp_time sysctl. This consists of five values in the order user nice system interrupt idle. One of these values is incremented every tick of the statistics clock (see clocks(7)). A simple algorithm is to take a reading, sleep for a period of a few seconds, then take another reading. Subtract the first reading from the second for each counter, add up the five differences to get the total number of ticks elapsed, then divide each difference by that total to get its percentage.
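
    The two-reading approach could look something like the sh sketch below. It is only an illustration: cp_time_percent is a hypothetical name, the output labels are my own, and it assumes each reading is the five space-separated counters that sysctl -n kern.cp_time prints. It does not try to handle counter wraparound.

```shell
#!/bin/sh
# Hypothetical helper: turn two kern.cp_time readings into CPU percentages.
# Each reading holds five tick counters: user nice system interrupt idle.
#   $1 = earlier reading, $2 = later reading
cp_time_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 1; i <= 5; i++) first[i] = $i }
        NR == 2 {
            total = 0
            # Per-counter difference between the readings, plus their sum.
            for (i = 1; i <= 5; i++) { delta[i] = $i - first[i]; total += delta[i] }
            # No ticks elapsed (or counters went backwards): report failure.
            if (total <= 0) exit 1
            printf "user=%.1f nice=%.1f sys=%.1f intr=%.1f idle=%.1f\n",
                100 * delta[1] / total, 100 * delta[2] / total,
                100 * delta[3] / total, 100 * delta[4] / total,
                100 * delta[5] / total
        }'
}

# On a FreeBSD/pfSense box this would be driven roughly like so:
#   a=$(sysctl -n kern.cp_time); sleep 5; b=$(sysctl -n kern.cp_time)
#   cp_time_percent "$a" "$b"
```

    Because nothing here blocks on a terminal or a pipe from top, a stuck child like the one in the ps output above should not be able to hold up the update loop.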

    For more information, head over to /usr/src/usr.bin/top and get reading ;)



  • @David_W:

    @cmb:

    Do you have any suggestions on how we could get load averages, idle, sys and interrupt time, in a way that won't cause similar issues?

    Use the underlying sysctls. […] For more information, head over to /usr/src/usr.bin/top and get reading ;)

    That's approximately correct.  The continuing evolution towards sysctl and away from random kernel symbol groveling is making this somewhat easier as time goes on.  Under FreeBSD 2 and 3, it was a real bear, as you'd essentially need to take any "interesting" statistics programs, strip them down, and massage as needed in order to get useful raw numbers.



  • These days, top uses the sysctls I mentioned, so if you want to see how top uses the sysctls, you just have to read the C code. I was looking on a 6.3-RELEASE i386 machine to try to get as near as possible to pfSense.

    If I have missed anything from my answer or it's inaccurate, do correct it.

