New Package: Service Watchdog


  • Rebel Alliance Developer Netgate

    Last night I added a very basic package for restarting services if they are detected down. It's still pretty "young" as packages go, but the basics all work.

    It only works on pfSense 2.1, and only works with services registered in pfSense, meaning the ones that show up under Status > Services.

    Given its newness, it will probably have bugs here and there.

    It's simple to use. After install, go to Services > Service Watchdog, click +, pick a service to monitor, click Add. Every minute it will check to see if the services in the list are running, and if they aren't, they will be restarted.

    You can reorder the services in the list in case there are dependencies to worry about (for example, service X should be started before service Y).
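
As a rough illustration of the per-minute check described above (not the package's actual code, which is PHP and uses pfSense's built-in service functions), the loop can be sketched in Python. The names `check_services`, `is_running` and `restart` are hypothetical stand-ins; the point is only that the list is walked in its configured order, so a dependency placed earlier is restarted first.

```python
from typing import Callable, Iterable, List


def check_services(
    watch_list: Iterable[str],
    is_running: Callable[[str], bool],
    restart: Callable[[str], None],
) -> List[str]:
    """Walk the watch list in order and restart anything that is down.

    Returns the names that were restarted, in the order they were handled.
    """
    restarted = []
    for name in watch_list:
        if not is_running(name):
            restart(name)
            restarted.append(name)
    return restarted


if __name__ == "__main__":
    # Fake service table standing in for the real status check (assumption).
    state = {"dhcpd": False, "ntpd": True, "openvpn": False}
    order = ["dhcpd", "ntpd", "openvpn"]  # dhcpd is handled before openvpn
    handled = check_services(
        order,
        lambda name: state[name],
        lambda name: state.update({name: True}),
    )
    print(handled)
```

A cron job would call something like this once per minute; services already running are left alone.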

    Known Issues:
    Does not work properly with the dhcprelay6 service until version 1.1.
    Does not work properly with OpenVPN services until version 1.1 and a pfSense 2.1 snapshot from after Aug 28 9:00AM EST.
    May show blank services or empty descriptions in some cases before version 1.2.



  • I don't care what all those people say about you.  You are the bomb…
    Is this a 2.1 thing, or does it work on 2.0.3 too?


  • Rebel Alliance Developer Netgate

    @kejianshi:

    I don't care what all those people say about you.  You are the bomb…
    Is this a 2.1 thing, or does it work on 2.0.3 too?

    2.1 only.



  • Yeah - You already posted that…  I should learn to read.
    Thanks again.  I'll install and see how it goes.


  • Rebel Alliance Developer Netgate

    @kejianshi:

    Yeah - You already posted that…  I should learn to read.
    Thanks again.  I'll install and see how it goes.

    Maybe I did, or maybe I edited the OP to state that afterward.  ;D



  • Great package, this is much appreciated.

    A few cosmetic issues. I'm on the latest snapshot (AMD64) and both 0.1 and 0.2 don't show the Cron service in the dropdown, only a ":"

    Also, iperf and pfflowd show the service name but not the description.


  • Rebel Alliance Developer Netgate

    Odd, cron shows fine here for me on the latest snapshot (also amd64). I may have to blacklist cron since it doesn't really make sense for it to be handled via this script. (if cron isn't running, the script would never run)

    I haven't tried many packages. I'll have to install them and see what is different there.


  • Banned

    @jimp:

    Odd, cron shows fine here for me on the latest snapshot (also amd64).

    Ditto, and yeah, cron makes no sense there. :D

    As for packages, seems like almost all of them are missing descriptions. (gwled, blinkled, nut, darkstat). The only thing I have installed and can see the description is unbound.


  • Rebel Alliance Developer Netgate

    The packages apparently aren't setting their own/proper service description. The status page pulled their package descriptions instead. I just pushed a fix for that (and to skip cron and empty services)



  • Looks good - I can stop services from status services, and they spring back to life 1 minute later. Looking forward to trying listing multiple OpenVPN servers (a site-to-site and a road-warrior) once the necessary new snapshot comes.
    One thought: cron is configured and started during the boot process, so the Service Watchdog job might run while the boot is still in progress. If so, it should not do anything, as the boot might still be getting things up and running. Should the code check whether $g['booting'] is set and bail out in that case?


  • Rebel Alliance Developer Netgate

    Yeah you're right. I just pushed a new rev to make it do nothing if it's booting.


  • Banned

    You can as well blacklist unbound I'd say, it has its own "watchdog" script - /usr/local/bin/unbound_monitor.sh. Though, not completely sure whether we'd not be better off dropping the looping shell script from unbound and using this package instead. :)



  • Squid also has its own sqp_monitor process. That function could now be done by this package and the special sqp_monitor code removed, if anyone cares or thinks it is a good idea.


  • Banned

    Just a quick thought…Snort running a heavy set of rules can take minutes to start and running this every minute could cause Snort to start multiple times... Would it be a thing to make a 5 minute penalty period after a boot before the script begins to monitor packages??



  • 2.1-RC1 (i386)
    built on Wed Aug 28 16:55:08 EDT 2013
    FreeBSD 8.3-RELEASE-p10

    I created 2 OpenVPN servers on a test system. They both appear in the dropdown list of services to add for Service Watchdog. After adding the 1st server, the 2nd server no longer appears in the dropdown list, so I can't add it as well.

    I stopped NTPD and both OpenVPN server services from status services. Waited a few minutes. NTPD  and Test Server 1 restarted.

    Does the dropdown list need a bit more tweaking to allow multiple individual OpenVPN servers to be added to the watch list?



  • @jimp:

    Yeah you're right. I just pushed a new rev to make it do nothing if it's booting.

    We also have situations where the boot script itself gets "killed: out of swap space". In that case, the /var/run/bootup file never gets removed, so it always looks like the system is booting. I submitted pull requests so that Service Watchdog will start doing its thing anyway 15 minutes after boot time. See what you think.
    I engineered it to put a new function get_uptime_sec into the base system for the benefit of anything that cares to call it. But that means that the base system change has to go into 2.1 branch, and people have to have it in their snapshot to use my changes to Service Watchdog. So feel free to engineer it however…
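
The proposal in this post could be sketched roughly as follows. This is only an illustration in Python, not the code from the pull requests: `should_skip_check` is a hypothetical name, and the uptime is passed in as a plain number rather than read through the `get_uptime_sec` function mentioned above.

```python
import os

BOOT_FLAG = "/var/run/bootup"   # flag file named in the post above
STALE_AFTER_SEC = 15 * 60       # the 15-minute cutoff proposed above


def should_skip_check(flag_exists: bool, uptime_sec: float,
                      stale_after: float = STALE_AFTER_SEC) -> bool:
    """Return True when the watchdog should do nothing this run.

    The run is skipped only while the boot flag is present AND the system
    has been up for less than the cutoff. A flag that is still around
    after the cutoff is treated as stale and ignored.
    """
    if not flag_exists:
        return False
    return uptime_sec < stale_after


if __name__ == "__main__":
    # 120 seconds of uptime is an example value, not a measured one.
    print(should_skip_check(os.path.exists(BOOT_FLAG), uptime_sec=120))
```

Keeping the decision in a pure function makes the stale-flag case easy to test without touching the real flag file.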


  • Rebel Alliance Developer Netgate

    As I mentioned on the pull request, that isn't a good workaround. There are other things that will break if that flag file is left around too long, and the package shouldn't have to care about that.

    It would be better to find a way to clean that up automatically in the base system, but that isn't really a discussion for this thread since it's unrelated.


  • Rebel Alliance Developer Netgate

    @Supermule:

    Just a quick thought…Snort running a heavy set of rules can take minutes to start and running this every minute could cause Snort to start multiple times... Would it be a thing to make a 5 minute penalty period after a boot before the script begins to monitor packages??

    The "snort" binary would be in the list and it should show that it's running as far as this check is concerned.

    That said, it probably would not work right with snort anyhow, since an instance for one interface would die and this would never know, because of how snort handles its instances. It would only show it down if all instances of snort were dead.


  • Rebel Alliance Developer Netgate

    @phil.davis:

    2.1-RC1 (i386)
    built on Wed Aug 28 16:55:08 EDT 2013
    FreeBSD 8.3-RELEASE-p10

    I created 2 OpenVPN servers on a test system. They both appear in the dropdown list of services to add for Service Watchdog. After adding the 1st server, the 2nd server no longer appears in the dropdown list, so I can't add it as well.

    I stopped NTPD and both OpenVPN server services from status services. Waited a few minutes. NTPD  and Test Server 1 restarted.

    Does the dropdown list need a bit more tweaking to allow multiple individual OpenVPN servers to be added to the watch list?

    Yes that still needs some work.


  • Rebel Alliance Developer Netgate

    OpenVPN and captive portal instance matching should be fixed in 1.4, up now.


  • Banned

    Thanks again for the work on this, this feature hopefully should get to the base install once the whole stuff gets polished.


  • Rebel Alliance Developer Netgate

    @doktornotor:

    Thanks again for the work on this, this feature hopefully should get to the base install once the whole stuff gets polished.

    Seems better as a package for me. Not everyone needs/wants it, and it can react to changes faster as a package. After it's fairly well set it may not change much, but it still seems like a better fit as an add-on.



  • There are certain conditions that, in the past, have meant that racoon was essentially offline (not able to connect anyone properly), even though the process still shows as running.

    If those conditions haven't all been cleared up, is there any chance of using this package to restart it when racoon goes buggy?


  • Rebel Alliance Developer Netgate

    Not likely, if the process is running this would believe it to be up.

    That kind of check would add a whole mess of code that would be irrelevant to anything else it does. Seems maybe a better fit as some sort of dedicated racoon watchdog that is capable of more than a running/not running check.



  • Another little tweak:

    function is_service_enabled($service_name)
    

    servicewatchdog_check_services() could check is_service_enabled and only bother to try and start it if it is both enabled and not running.
    It is probably best to allow people to add whatever services they like to the Service Watchdog watch list, as it is now. Then they can be in the list ready to be watched, even if they happen to be disabled at any particular time.
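
A minimal Python sketch of that suggestion, under the assumption that each watched service exposes an enabled flag and a running flag. The `Service` class and `services_to_restart` are hypothetical names for illustration; they are not pfSense functions.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Service:
    name: str
    enabled: bool   # would come from something like is_service_enabled()
    running: bool   # would come from the normal status check


def services_to_restart(watch_list: Iterable[Service]) -> List[str]:
    """Pick only services that are both enabled and not running.

    Disabled services stay in the watch list but are skipped, so they are
    ready to be watched again as soon as they are re-enabled.
    """
    return [s.name for s in watch_list if s.enabled and not s.running]


if __name__ == "__main__":
    watched = [
        Service("ntpd", enabled=True, running=False),
        Service("openvpn", enabled=False, running=False),
        Service("dhcpd", enabled=True, running=True),
    ]
    print(services_to_restart(watched))
```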



  • @jimp:

    OpenVPN and captive portal instance matching should be fixed in 1.4, up now.

    I can "watchdog" multiple OpenVPN instances now, and Watchdog restarts them if I stop them. A great thing.


  • Rebel Alliance Developer Netgate

    @phil.davis:

    Another little tweak:

    function is_service_enabled($service_name)
    

    servicewatchdog_check_services() could check is_service_enabled and only bother to try and start it if it is both enabled and not running.
    It is probably best to allow people to add whatever services they like to the Service Watchdog watch list, as it is now. Then they can be in the list ready to be watched, even if they happen to be disabled at any particular time.

    I can add that, but that function only works for packages, not for base system services. And even then, only for packages that actually support an enable option using the exact option name that the function checks for.

    Seems safer to put the burden on the user to only watch services they know they need to stay active.



  • Hi,

    I have not installed this package yet, but while reading the thread I wondered whether it would be possible to add a GUI option to say "if a service hasn't been running for 3 x 1 min, then try to restart it". Something similar to what apinger does for the WAN interfaces to check whether a gateway is down.

    So if there is a service not running and the package detects that then it will count 1.
    If the same service isn't running one minute later then counts +1.
    And if it is still not running a minute after that, the package restarts the service, because it was detected as not running three minutes in a row.

    If this value would be configurable for every service I think this could be helpful.
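
The counting idea above could look something like this in Python. It is a sketch under assumptions: since the check runs from cron rather than as a persistent process, the counters are kept in a JSON state file between runs, and the names `update_counts`, `load_counts` and `save_counts` are hypothetical. A single threshold is used here; a per-service value would replace it with a lookup.

```python
import json


def update_counts(counts, status, threshold=3):
    """Advance the consecutive-down counters by one check.

    counts: dict of service name -> consecutive "down" checks so far.
    status: dict of service name -> True if running right now.
    Returns (new_counts, names_to_restart).
    """
    new_counts = {}
    to_restart = []
    for name, running in status.items():
        if running:
            new_counts[name] = 0      # any successful check resets the count
            continue
        n = counts.get(name, 0) + 1
        if n >= threshold:
            to_restart.append(name)
            n = 0                     # start counting again after a restart
        new_counts[name] = n
    return new_counts, to_restart


def load_counts(path):
    """Read the counters saved by the previous run, or start empty."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_counts(path, counts):
    """Persist the counters for the next run."""
    with open(path, "w") as f:
        json.dump(counts, f)
```

Each cron run would load the counters, call `update_counts` with the current status, restart whatever is returned, and save the counters again.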


  • Rebel Alliance Developer Netgate

    It might be nice but it would probably triple the size of the code if not more. I'm trying to keep it simple.

    It's not a persistent daemon, so it would have to somehow store the values of those checks somewhere, read them in, increment them, etc, etc. Not trivial to accomplish. Also that would require making an edit screen rather than an add screen, which makes it a bit different to work with, even more code, etc.



  • Can you explain how this plugin works?

    How does it monitor and start processes?

    Is there any reason it would still be working after I completely removed the package from pfsense?


  • Rebel Alliance Developer Netgate

    @webdawg:

    Can you explain how this plugin works?

    It uses the functions built into pfSense to check the status and control services.

    @webdawg:

    How does it monitor and start processes?

    It sets up a cron job that runs once per minute to inspect the service status, and if it detects a monitored service as down, it restarts it.

    @webdawg:

    Is there any reason it would still be working after I completely removed the package from pfsense?

    Only if somehow its files and cron job were not removed.

    Keep in mind some other packages like squid have their own monitoring which is not related to this package.



  • Can I ask what account that the cronjob is created under?


  • Rebel Alliance Developer Netgate

    There is no concept of that. If you want to view the cron job, install the Cron package or inspect the config.xml file. It's probably best to start your own thread for help diagnosing your actual issue rather than approaching it this way.



  • No, this is good!

    I forgot how pfSense did it.  That is all I needed, thanks.



  • Hi friends.

    I just want to ask whether there is some way to add the redis server to the watchdog list. I have installed the ntopng package, and the redis server keeps stopping on its own.
    In the watchdog I configured ntopng, but when redis is down, ntopng doesn't start.

    Thanks

    -----------
    EDIT: For the moment I did it with a script and crontab checking the redis pid. Also, the directory /var/run/redis/redis.pid doesn't exist, so maybe that was causing the errors.
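
A pid-file check of the kind described in the edit could be sketched like this in Python. This is an assumption about the general approach, not the poster's actual script; `pidfile_is_running` and its helpers are hypothetical names, and the default path is the one quoted above.

```python
import os


def pid_from_file(path):
    """Return the pid stored in the file, or None if it is missing or invalid."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def is_alive(pid):
    """Check whether a process with this pid exists, without disturbing it."""
    try:
        os.kill(pid, 0)   # signal 0 performs the existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True       # process exists but belongs to another user
    return True


def pidfile_is_running(pid_path="/var/run/redis/redis.pid"):
    """True only if the pid file is readable and names a live process."""
    pid = pid_from_file(pid_path)
    return pid is not None and pid > 0 and is_alive(pid)


if __name__ == "__main__":
    print(pidfile_is_running())
```

A missing pid file is reported as "not running" here, so a cron script built on this would attempt a start in that case as well.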