@gawainxx said in Unbound regularly crashing, need help creating a service watchdog.:
there is a very distinct error message whenever it does fail
This one
fatal error: Could not read config file: /unbound.conf. Maybe try unbound -dd, it stays on the commandline to see more errors, or unbound-checkconf
That is the message unbound logs when it fails to start.
At that moment it had already stopped. Most probably some event provoked an unbound restart to take a hardware event into account (a NIC came up or went down, a route changed because a VPN was activated, or whatever).
What you need to look for is what happens just before that moment: when it is instructed to stop or, for that matter, how and why it stops.
Other logs will have information about what other processes were doing at that time.
This is what is shown when unbound stops :
Dec 9 12:09:10 unbound 38919:0 info: service stopped (unbound 1.9.1).
Right after that, a boatload of statistics is dumped to the log (reverse order here) :
Dec 9 12:09:10 unbound 38919:0 info: 32.000000 64.000000 5
Dec 9 12:09:10 unbound 38919:0 info: 1.000000 2.000000 8
Dec 9 12:09:10 unbound 38919:0 info: 0.524288 1.000000 3
Dec 9 12:09:10 unbound 38919:0 info: 0.262144 0.524288 8
Dec 9 12:09:10 unbound 38919:0 info: 0.131072 0.262144 4
.......
Dec 9 12:09:10 unbound 38919:0 info: histogram of recursion processing times
Dec 9 12:09:10 unbound 38919:0 info: average recursion processing time 18.671069 sec
Dec 9 12:09:10 unbound 38919:0 info: server stats for thread 0: requestlist max 26 avg 12.4819 exceeded 0 jostled 0
Dec 9 12:09:10 unbound 38919:0 info: server stats for thread 0: 103 queries, 20 answers from cache, 83 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Then the restart is shown :
Dec 9 12:09:14 unbound 41526:0 info: start of service (unbound 1.9.1).
.....
Right after that moment, your error log line should pop up.
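To see that sequence at a glance, something like the following could be used. It is only a sketch: the function name show_unbound_events is made up for illustration, and it assumes the resolver log is available as a plain text file. The paths in the example comment are assumptions too; on 2.4.x the logs are circular, so I assume they are first dumped to text with clog.

```shell
#!/bin/sh
# Sketch only: show_unbound_events is a hypothetical helper, not part of pfSense.
# It prints, with line numbers, every stop, start and fatal-error line found in
# a plain-text log file, so the order of events is easy to read.
show_unbound_events() {
    grep -n -E 'service stopped|start of service|fatal error' "$1"
}

# Example (assumed paths; the circular log is dumped to text first):
#   clog /var/log/resolver.log > /tmp/resolver.txt
#   show_unbound_events /tmp/resolver.txt
```

The lines printed just above a 'fatal error' entry tell you whether unbound was stopped on purpose shortly before.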
Btw :
Consider
fatal error: Could not read config file: /unbound.conf. Maybe try unbound -dd, it stays on the commandline to see more errors, or unbound-checkconf
Why not do what unbound suggests ?
First, I shut down unbound from the GUI by clicking the Stop button.
Then in the console/ssh (option 8) access :
I 'cd' to the unbound working directory :
[2.4.4-RELEASE][admin@pfsense.brit-hotel-fumel.net]cd /var/unbound
I use 'unbound-checkconf' to check my config :
[2.4.4-RELEASE][admin@pfsense.brit-hotel-fumel.net]/var/unbound: unbound-checkconf
unbound-checkconf: no errors in /usr/local/etc/unbound/unbound.conf
Then I start unbound in debugging mode :
[2.4.4-RELEASE][admin@pfsense.brit-hotel-fumel.net]/var/unbound: unbound -dd
[1576165229] unbound[94442:0] notice: init module 0: validator
[1576165229] unbound[94442:0] notice: init module 1: iterator
[1576165229] unbound[94442:0] info: start of service (unbound 1.9.1).
....
Type Ctrl-C to stop unbound gracefully.
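I guess you will see the same lines, because the problem isn't unbound itself, and it isn't the stopping either.
One caveat : run without an argument, unbound-checkconf checked /usr/local/etc/unbound/unbound.conf above. To check the file pfSense actually puts in place, pass it explicitly : unbound-checkconf /var/unbound/unbound.conf
The problem occurs when unbound gets restarted by pfSense. The unbound process isn't simply started like that : first, the working environment is set up :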
The chroot dir (/var/unbound) is created.
Needed files like the unbound.conf file are copied in place.
Other sub folders and files are created / put in place.
And some more things are done.
Somewhere in the creation of that 'environment' a failure happens. The result : unbound gets started and can't find its config file (see the error). As I said before : a file system error ?
When that happens, don't do anything except log in over SSH and check whether the folder /var/unbound exists.
Check whether the file unbound.conf exists in that folder.
Run the command 'top' at that moment : is memory full ?
Run the 'df' command : the first line shows the primary partition. Is it full ?
This could even be related to the reason why unbound was stopped : was there an OOM failure ?
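The file and disk checks above could be bundled into one small function, so they can be run quickly while the failure is still present. This is a rough sketch, not something pfSense ships: the name check_unbound_env is hypothetical, and the default directory is simply the /var/unbound mentioned above. Memory is left out on purpose; 'top' remains a manual check.

```shell
#!/bin/sh
# Sketch only: check_unbound_env is a hypothetical helper name.
# Usage: check_unbound_env [DIR]   (DIR defaults to /var/unbound)
# Returns 0 when DIR and DIR/unbound.conf both exist, 1 otherwise.
check_unbound_env() {
    dir="${1:-/var/unbound}"
    status=0
    if [ -d "$dir" ]; then
        echo "OK: directory $dir exists"
    else
        echo "MISSING: directory $dir"
        status=1
    fi
    if [ -f "$dir/unbound.conf" ]; then
        echo "OK: $dir/unbound.conf exists"
    else
        echo "MISSING: $dir/unbound.conf"
        status=1
    fi
    # Free space on the file system holding DIR, or on / if DIR is gone.
    df -h "$dir" 2>/dev/null || df -h /
    return "$status"
}
```

Run it as 'check_unbound_env' right after the fatal error shows up; a MISSING line points at the environment setup rather than at unbound.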
Another, more radical, solution could be :
Take a copy of the config.
Then : remove all packages.
Then : console / ssh access : reset to default (option 4).
Then : set up WAN access - if needed. If you really have to, change the LAN network.
Then : ... no, no more 'then' ; stop here : pfSense works.
Now, test ... and wait .... the longer the better.
If it happens again : you are probably looking at a hardware issue.
If there are no more issues after a week or so : do not import your config backup. Redo all your settings step by step (wait a day or so between each step) and reinstall all packages ... slowly. As soon as the error comes back, you will know where to look.
edit :
The error has already been discussed in this forum :
https://forum.netgate.com/topic/111784/solved-unbound-fails-on-restart-after-pfblockerng-updates
Two issues were found : unbound needs a lot of time to stop when pfBlockerNG is also present, and it was restarted too quickly. The poster also included files from outside the chroot .... that will fail as well.
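For a watchdog, the "restarted too quickly" part suggests waiting until the old process is really gone before starting a new one. A minimal sketch of that idea, assuming the watchdog knows the PID of the old unbound process : the helper name wait_for_exit and the 30 second default are my own choices, and the pid file path in the example comment is an assumption.

```shell
#!/bin/sh
# Sketch only: wait_for_exit is a hypothetical helper name.
# Usage: wait_for_exit PID [TIMEOUT_SECONDS]   (timeout defaults to 30)
# Returns 0 as soon as PID is gone, 1 if it is still there after the timeout.
# Note: kill -0 only tests whether the process can be signalled, so run this
# as root (or as the owner of the process).
wait_for_exit() {
    pid="$1"
    remaining="${2:-30}"
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Example (assumed pid file path):
#   wait_for_exit "$(cat /var/run/unbound.pid)" 60 && echo "unbound is gone"
```

Only when this returns 0 would the watchdog go on to start unbound again; on a timeout it is better to log the fact than to start a second instance.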