Unbound regularly crashing, need help creating a service watchdog.

gawainxx

Unbound has been crashing regularly and taking my internet down with it until I can restart it manually, either from the console CLI of the server itself or from the webUI if I happen to be home at the time. Please see the following post for details.
https://forum.netgate.com/topic/148738/dns-crashing-every-36-hours-or-so-and-unbound-has-to-be-restarted/9

What I'm looking for is a temporary bandaid solution that will poll every 5 minutes to verify that Unbound is running, and then restart it if it stays down for longer than 2 minutes.

Are there any specific plugins that could help with this, or perhaps some sort of Cron Job?

Gertjan

@gawainxx said in Unbound regularly crashing, need help creating a service watchdog.:

specific plugins

Like the watchdog package ?

Or the cron job page ?
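
For the cron route, here is a minimal sketch of what such a job could run, matching the "poll every 5 minutes, restart if down for more than 2 minutes" request. The names CHECK_CMD, RESTART_CMD and GRACE_SECONDS are made up for this example, and the default restart command assumes that `pfSsh.php playback svc restart unbound` is available on your pfSense version; verify that before relying on it.

```shell
#!/bin/sh
# Hypothetical watchdog sketch, meant to be called from cron every 5 minutes.
# CHECK_CMD, RESTART_CMD and GRACE_SECONDS are illustrative names, not pfSense settings.
: "${CHECK_CMD:=pgrep -x unbound}"
# Assumption: this pfSense helper restarts the service the same way the GUI does.
: "${RESTART_CMD:=pfSsh.php playback svc restart unbound}"
: "${GRACE_SECONDS:=120}"

unbound_watchdog() {
    # Healthy case: the process exists, nothing to do.
    if $CHECK_CMD >/dev/null 2>&1; then
        echo "unbound is running"
        return 0
    fi
    # Down: wait out the grace period in case a restart is already in progress.
    sleep "$GRACE_SECONDS"
    if $CHECK_CMD >/dev/null 2>&1; then
        echo "unbound recovered on its own"
        return 0
    fi
    echo "unbound still down after ${GRACE_SECONDS}s, restarting"
    $RESTART_CMD
}

# To use it as a script, call the function on the last line:
#   unbound_watchdog
```

Saved under a path of your choice and scheduled with `*/5 * * * *`, this only reacts when the process is gone for the whole grace period. It only checks that the process exists, not that it answers queries.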

@gawainxx said in Unbound regularly crashing, need help creating a service watchdog.:

temporary bandaid solution

Note : the Watchdog package makes stinking wounds .... amputation will be needed soon.
It's far better to cure the cause.

gawainxx

@Gertjan
Can you please elaborate?
I've set up the watchdog package and am going to use it as a crutch, continuing to monitor the logs for issues with unbound. Goal is to have that fixed within the next month but may take some trial and error...
Watchdog is just to prevent any extended outages that occur before I'm able to restart unbound.

JeGr

Problem is: if you configure the watchdog, how do you "feel" the problem/outage if no one sees anything going wrong? We have already had cases like this in the forums, where it only came out later that someone's unbound was being restarted every 5-10 minutes. Always restarting without fixing the underlying problem is risky, as it makes it harder to get at your logs from the time of the outage.

gawainxx

@JeGr
Thankfully, there is a very distinct error message whenever it does fail that I'm able to search the logs for. I plan to check for it about once a week to see whether there were any new events, and to keep troubleshooting until it has been absent for a few months.

Gertjan

@gawainxx said in Unbound regularly crashing, need help creating a service watchdog.:

there is a very distinct error message whenever it does fail

This one

fatal error: Could not read config file: /unbound.conf. Maybe try unbound -dd, it stays on the commandline to see more errors, or unbound-checkconf

That's the message shown when it fails to start.
At that moment it had already stopped. Most probably some event provoked an unbound restart to take a hardware event into account (a NIC came up or down, some route changed == VPN activated - or whatever).
What you need to look for is just before that moment : when it is instructed to stop - or for that matter : how/why it stops.
Other logs will contain info from other processes.

This is what is shown when unbound stops :

Dec 9 12:09:10 	unbound 	38919:0 	info: service stopped (unbound 1.9.1).

Right after that, a boatload of statistics is dumped to the log (reverse order here) :

Dec 9 12:09:10 	unbound 	38919:0 	info: 32.000000 64.000000 5 
Dec 9 12:09:10 	unbound 	38919:0 	info: 1.000000 2.000000 8
Dec 9 12:09:10 	unbound 	38919:0 	info: 0.524288 1.000000 3
Dec 9 12:09:10 	unbound 	38919:0 	info: 0.262144 0.524288 8
Dec 9 12:09:10 	unbound 	38919:0 	info: 0.131072 0.262144 4
.......
Dec 9 12:09:10 	unbound 	38919:0 	info: histogram of recursion processing times
Dec 9 12:09:10 	unbound 	38919:0 	info: average recursion processing time 18.671069 sec
Dec 9 12:09:10 	unbound 	38919:0 	info: server stats for thread 0: requestlist max 26 avg 12.4819 exceeded 0 jostled 0
Dec 9 12:09:10 	unbound 	38919:0 	info: server stats for thread 0: 103 queries, 20 answers from cache, 83 recursions, 0 prefetch, 0 rejected by ip ratelimiting

Then the restart is shown :

Dec 9 12:09:14 	unbound 	41526:0 	info: start of service (unbound 1.9.1). 
.....

Right after that moment, your error log line should pop up.
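
To find those moments without scrolling through the statistics, a small filter like the sketch below can help. The function name unbound_events is made up for this example; the patterns are taken from the log lines quoted above.

```shell
#!/bin/sh
# Keep only the unbound lifecycle lines (stop, start, fatal error) from a log read on stdin.
# unbound_events is an illustrative name, not an existing tool.
unbound_events() {
    grep -E 'info: service stopped|info: start of service|fatal error:'
}

# Example, assuming the resolver log is the circular log /var/log/resolver.log
# read with clog, as on pfSense 2.4.x (check the path on your install):
#   clog /var/log/resolver.log | unbound_events
```

Whatever was logged just before each "service stopped" line, by unbound or by other processes, is the part worth reading.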

Btw :
Consider

fatal error: Could not read config file: /unbound.conf. Maybe try unbound -dd, it stays on the commandline to see more errors, or unbound-checkconf

Why not do what unbound proposes ?

First, I shut down unbound with the GUI.
Click on the Stop button :

Then in the console/ssh (option 8) access :

I 'cd' to the unbound working directory :

[2.4.4-RELEASE][admin@pfsense.brit-hotel-fumel.net]cd /var/unbound

I use 'unbound-checkconf' to check my config :

[2.4.4-RELEASE][admin@pfsense.brit-hotel-fumel.net]/var/unbound: unbound-checkconf
unbound-checkconf: no errors in /usr/local/etc/unbound/unbound.conf

Then I start unbound in debugging mode :

[2.4.4-RELEASE][admin@pfsense.brit-hotel-fumel.net]/var/unbound: unbound -dd
[1576165229] unbound[94442:0] notice: init module 0: validator
[1576165229] unbound[94442:0] notice: init module 1: iterator
[1576165229] unbound[94442:0] info: start of service (unbound 1.9.1).
....

Type Ctrl-C to stop unbound gracefully.

I guess you will see the same lines, because the problem isn't unbound itself - nor the stopping.
It's when unbound gets restarted by pfSense : the unbound process isn't simply started like that : first, the working environment is set up :
The chroot dir (/var/unbound) is created.
Needed files like the unbound.conf file are copied in place.
Other sub folders and files are created / put in place.
And some more things are done.

Somewhere in the creation of that 'environment' a failure happens. The result : unbound gets started and can't find its config file (see error). As I said before : a file system error ?
At that moment, don't do anything but log in over ssh and check whether the folder /var/unbound exists.
Check whether the file unbound.conf exists in that folder.
Run the command 'top' at that moment : is memory full ?
Run the 'df' command : the first line shows the primary partition : is it full ?
This could even be related to the reason why unbound was stopped : was there an OOM failure ?
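
The file and disk checks can be bundled into one function so they can be run quickly while the failure is still present. This is only a sketch : check_unbound_env is a made-up name, and the default path /var/unbound is the working directory mentioned above.

```shell
#!/bin/sh
# Sketch of the checks described above. check_unbound_env is an illustrative name.
# Optional argument: the directory to inspect (defaults to /var/unbound).
check_unbound_env() {
    dir="${1:-/var/unbound}"

    if [ -d "$dir" ]; then
        echo "OK: $dir exists"
    else
        echo "MISSING: $dir"
    fi

    if [ -f "$dir/unbound.conf" ]; then
        echo "OK: $dir/unbound.conf exists"
    else
        echo "MISSING: $dir/unbound.conf"
    fi

    # Disk usage of the filesystem holding the directory (or / if it is gone).
    if [ -d "$dir" ]; then
        df -h "$dir"
    else
        df -h /
    fi
}
```

Memory is deliberately left out : look at 'top' by hand at that moment, since its batch options differ between systems.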

Another radical solution could be :
Take a copy of the config.
Then : remove all packages.
Then : console / ssh access : reset to default (option 4).
Then : set up WAN access - if needed. If you really have to, change the LAN network.
Then : ... no, no more 'then' ; stop here : pfSense works.

Now, test ... and wait .... the longer the better.
If it happens again : you're probably looking at a hardware issue.
No more issues after a week or so : do not import your config backup .... redo all your settings - step by step (wait a day or so between each step) - and reinstall all packages ... slowly. As soon as the error comes back, you will know where to look.

edit :

The error has already been discussed in this forum :
https://forum.netgate.com/topic/111784/solved-unbound-fails-on-restart-after-pfblockerng-updates

Two issues were found : unbound needs a lot of time to stop when pfBlockerNG is also present, so it was restarted too quickly. The poster also included files outside of the chroot .... that will fail as well.
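
If you restart unbound from your own script, one way to avoid the "restarted too quickly" case is to wait until the old process is really gone before starting a new one. A minimal sketch, with wait_for_exit as a made-up helper name and the timeout as an arbitrary default :

```shell
#!/bin/sh
# Hypothetical helper: wait until no process with the given name is left,
# or give up after a timeout (in seconds, default 60).
wait_for_exit() {
    name="$1"
    timeout="${2:-60}"
    waited=0
    while pgrep -x "$name" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "$name still running after ${timeout}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "$name has exited"
}

# Example: only start unbound again once the old instance has stopped.
#   wait_for_exit unbound 120 && echo "safe to start unbound"
```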