RRD stops updating; traced the problem to check_reload_status?



  • I've been having a problem on 2.0-RC1 where the RRD graphs stop working correctly. The image files update and show the correct timestamp, but all the data points after a certain point in time are missing in every category. I determined that this happens 100% of the time after clicking 'Save' on System > Advanced. It fixes itself after clicking 'Save' on Status > RRD Graphs > Settings.

    The graphs stop showing data because, according to the timestamps in /var/db/rrd, none of the *.rrd files are being updated. Possibly related, this behavior is always accompanied by two instances of the '/bin/sh /var/db/rrd/updaterrd.sh' process running (normally, when it's working properly, there is only one).
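    For anyone who wants to check for the same two symptoms, here is a rough sketch of how I'd look for them from a shell. The helper names (count_procs, stale_rrds) and the 5-minute threshold in the usage note are my own choices for illustration, not anything that ships with pfSense.

```shell
#!/bin/sh
# Hypothetical helpers for spotting the symptoms described above.

# Print how many running processes have a command line matching $1.
count_procs() {
    # pgrep -f matches against the full command line, one PID per line
    pgrep -f "$1" | wc -l | tr -d ' '
}

# Print every *.rrd file under directory $1 that was last modified
# more than $2 minutes ago.
stale_rrds() {
    find "$1" -name '*.rrd' -mmin +"$2"
}
```

    Running `count_procs updaterrd.sh` should print 1 on a healthy box, and `stale_rrds /var/db/rrd 5` should print nothing while the files are being updated.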

    This bugged me, so I traced through the code to figure out what happens differently between the "breaking" and "fixing" actions. When saving settings on System > Advanced, this happens:

    1. /usr/local/www/system_advanced_admin.php calls /etc/inc/util.inc:send_event() with the message "service restart webgui"

    2. send_event() writes the message to /var/run/check_reload_status, which appears to be some sort of socket

    This is where things get a bit hazy. I understand that /usr/local/sbin/check_reload_status runs as a daemon which constantly checks the socket, but I don't know the actual specifics of what it does internally. It's a binary and I don't know where I can find its source code. But I ran 'strings' on it and took an educated guess…

    3. check_reload_status seems to run /usr/local/bin/php /etc/rc.restart_webgui

    4. rc.restart_webgui calls /etc/inc/rrd.inc:enable_rrd_graphing()

    5. enable_rrd_graphing() writes a fresh copy of /var/db/rrd/updaterrd.sh, kills all running instances via /bin/pkill -f updaterrd.sh, and runs the new copy (/usr/bin/nice -n20 /bin/sh /var/db/rrd/updaterrd.sh) in the background

    6. updaterrd.sh runs in an infinite loop, updating the *.rrd files and sleeping.

    When saving settings on the RRD Settings page, enable_rrd_graphing() is called directly, jumping into step 5 above. So somewhere in steps 1-4, two updaterrd.sh processes are somehow started.

    I think I've nailed it down to something in check_reload_status, for the following reason. Every time I run '/usr/local/bin/php /etc/rc.restart_webgui' through SSH, only one updaterrd.sh instance is started and the graphs update properly. Conversely, I whipped up the following script that (I think) pretty accurately distills what happens when clicking Save on System > Advanced:

    
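    <?php
      // Connect to the check_reload_status control socket
      $fd = @fsockopen('unix:///var/run/check_reload_status');
      if ($fd) {
        // Send the same message that send_event() sends
        fwrite($fd, 'service restart webgui');
        // Print whatever the daemon replies
        echo fread($fd, 4096);
        fclose($fd);
      }
    ?>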
    

    The script echoes the string 'OK', but ps shows two updaterrd.sh instances running (and the *.rrd files stop updating).

    Now, the thing that baffles me is how two instances of updaterrd.sh can run at once. Unless there's another call hidden in a binary somewhere, the script is only started in rrd.inc, and old instances are pkill'd away immediately before. Is it possible, somehow, that enable_rrd_graphing() is running twice, in two separate threads, which both pkill at the same exact instant, and then both spawn their own copy of updaterrd.sh?
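    To make the theory concrete, here is a toy model of the ordering I have in mind. It does not touch real processes: a temp file stands in for the process table, and kill_all, spawn, and count are made-up names that mimic the pkill and the background start in enable_rrd_graphing(). It only shows that the interleaving would leave two copies behind, not that this is what actually happens.

```shell
#!/bin/sh
# Toy model: a text file stands in for the process table,
# one line per running updaterrd.sh instance.
TABLE=$(mktemp)

kill_all() { : > "$TABLE"; }                    # models: pkill -f updaterrd.sh
spawn()    { echo "updaterrd.sh" >> "$TABLE"; } # models: starting a new copy
count()    { wc -l < "$TABLE" | tr -d ' '; }    # instances left running

# Caller A finishes its kill+spawn before caller B begins.
serialized() { kill_all; spawn; kill_all; spawn; count; }

# Both callers run their kill first, then both spawn.
interleaved() { kill_all; kill_all; spawn; spawn; count; }

echo "serialized:  $(serialized) instance(s)"
echo "interleaved: $(interleaved) instance(s)"
```

    The serialized order leaves one instance and the interleaved order leaves two, which matches what ps shows after saving on System > Advanced.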

    That's my crackpot theory, anyway. Anybody got anything better? ;)



