Unbound restarting 20-30 times a day without any reason

Piktor10

System data:
pfSense 2.7.2-RELEASE (amd64), unbound 1.19.1
Average memory status:
Mem: 299M Active, 738M Inact, 426M Wired, 215M Buf, 2439M Free
Swap: 4096M Total, 4096M Free
__

My unbound is restarting in intervalls between 2min to 260min, 20 to 30 times a day!

Checking the log, nothing looks obvious, looks like the restart is done under some control:
The stop and start message appears everytime in the logfile. Its always a "service stopped" message followed by "start of service", so nothing pointing to a crash.

Jun  4 00:33:26 XXXX unbound[96489]: [96489:0] info: service stopped (unbound 1.19.1).
Jun  4 00:33:43 XXXX unbound[64466]: [64466:0] info: start of service (unbound 1.19.1).

The section between service stopped and start of service looks like this:

Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: service stopped (unbound 1.19.1).
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: server stats for thread 0: 5324 queries, 4863 answers from cache, 461 recursions, 1688 prefetc
h, 0 rejected by ip ratelimiting
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: server stats for thread 0: requestlist max 7 avg 0.377385 exceeded 0 jostled 0
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: average recursion processing time 0.103678 sec
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: histogram of recursion processing times
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: [25%]=0.0328249 median[50%]=0.0590507 [75%]=0.119681
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: lower(secs) upper(secs) recursions
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.000000    0.000001 30
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.000512    0.001024 1
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.001024    0.002048 2
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.002048    0.004096 5
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.004096    0.008192 3
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.008192    0.016384 3
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.016384    0.032768 71
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.032768    0.065536 144
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.065536    0.131072 105
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.131072    0.262144 61
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.262144    0.524288 27
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.524288    1.000000 6
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    1.000000    2.000000 3
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: server stats for thread 1: 3572 queries, 3274 answers from cache, 298 recursions, 1166 prefetc
h, 0 rejected by ip ratelimiting
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: server stats for thread 1: requestlist max 6 avg 0.30123 exceeded 0 jostled 0
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: average recursion processing time 0.127478 sec
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: histogram of recursion processing times
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: [25%]=0.0366778 median[50%]=0.0644189 [75%]=0.138044
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: lower(secs) upper(secs) recursions
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.000000    0.000001 19
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.004096    0.008192 1
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.016384    0.032768 44
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.032768    0.065536 88
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.065536    0.131072 69
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.131072    0.262144 47
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.262144    0.524288 20
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.524288    1.000000 6
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    1.000000    2.000000 3
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    2.000000    4.000000 1
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: server stats for thread 2: 1478 queries, 1340 answers from cache, 138 recursions, 490 prefetch
, 0 rejected by ip ratelimiting
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: server stats for thread 2: requestlist max 6 avg 0.16242 exceeded 0 jostled 0
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: average recursion processing time 0.152079 sec
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: histogram of recursion processing times
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: [25%]=0.0340964 median[50%]=0.0646504 [75%]=0.160563
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: lower(secs) upper(secs) recursions
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.000000    0.000001 16
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.016384    0.032768 17
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.032768    0.065536 37
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.065536    0.131072 29
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.131072    0.262144 20
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.262144    0.524288 13
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.524288    1.000000 1
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    1.000000    2.000000 5
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: server stats for thread 3: 4884 queries, 4395 answers from cache, 489 recursions, 1517 prefetc
h, 0 rejected by ip ratelimiting
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: server stats for thread 3: requestlist max 5 avg 0.352443 exceeded 0 jostled 0
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: average recursion processing time 0.117536 sec
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: histogram of recursion processing times
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: [25%]=0.03404 median[50%]=0.0589213 [75%]=0.119998
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: lower(secs) upper(secs) recursions
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.000000    0.000001 21
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.000512    0.001024 2
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.002048    0.004096 1
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.004096    0.008192 3
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.008192    0.016384 6
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.016384    0.032768 83
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.032768    0.065536 161
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.065536    0.131072 108
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.131072    0.262144 62
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.262144    0.524288 28
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    0.524288    1.000000 7
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    1.000000    2.000000 4
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info:    2.000000    4.000000 3
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] notice: Restart of unbound 1.19.1.
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] info: [pfBlockerNG]: pfb_unbound.py script exiting
Jun  4 20:46:19 XXXX unbound[64466]: [64466:0] notice: init module 0: python
Jun  4 20:46:20 XXXX unbound[64466]: [64466:0] info: [pfBlockerNG]: pfb_unbound.py script loaded
Jun  4 20:46:34 XXXX unbound[64466]: [64466:0] info: [pfBlockerNG]: init_standard script loaded
Jun  4 20:46:34 XXXX unbound[64466]: [64466:0] notice: init module 1: iterator

reporting results from 3-5 threads.

Every 15 minutes I get the message:

merge_config: configuration reload

followed by all my firewall named tables

That is all, I see in the logfiles. No reason, why unbound is restarting so often, nothing about what is causing it.

Has someone any idea?

Furthermore, how can I configure, that this configuration reload is done only every 6 or 12 hours?

bmeeks

Do you have automatic DNS registration of DHCP leases enabled?

If so, that results in unbound being given a restart command each time a DHCP client renews its lease. Since most clients renew at the half-way point of their lease duration, that can result in a lot of unbound restarts on a moderately populated network.

This is a known limitation of the current unbound implementation on pfSense. Work is underway to remedy that with later versions of the Kea DHCP replacement (but it's not ready yet).

Not sure where that second message about "merge_config: configuration reload" is coming from.

Piktor10

@bmeeks Thanks for that tip! Yes I have DHCP client DNS registration enabled. Will disable this, so that hint is very valuable!
Will watch, how things are look like after that.

Gertjan

Added to what is mentioned above :

@Piktor10 said in Unbound restarting 20-30 times a day without any reason:

That is all, I see in the logfiles. No reason, why unbound is restarting so often, nothing about what is causing it.

Has someone any idea?

Knowing what you know now :
read again :

and this is exactly what you see happening.
This is the reason.

By default, this option in not checked.
'Then, some one came along, and checked it ..."

Take note ; the future replacement of the ISC DHCP, KEA, will also have this functionality, and the DHCP host name will get communicated to unbound (DNS° without a DNS restart.
Until that moment : for all your important 'server type' devices, devices you have to name by host name, us a static DHCP MAC lease.

And Check this one :

as this option will not restart unbound, as they are ... static ^^

Piktor10

"For me reloading and flushing a resolution cache" was not a synonym for restarting the resolver!
Even, when I was reading that after unchecking, I have not understand this as a warning, that the whole resolver restarts! The only possible delay I saw in this warning, that the resolver can't answer the next queries from its cache, as that was flushed!

Gertjan

@Piktor10 said in Unbound restarting 20-30 times a day without any reason:

"For me reloading and flushing a resolution cache" was not a synonym for restarting the resolver!
Even, when I was reading that after unchecking, I have not understand this as a warning, that the whole resolver restarts!

Actually, unbound is send a reload signal.
A couple of years ago, it was discoverer (open source !) that unbound, upon receiving a 'reload' actually ... restarts.

Piktor10

@Gertjan Thats, what I figured out now as well. Seeing in the dhcpd.log the HUPs beeing sent.
So I would recommend, to change the wording in the warning to the current behaviour, and write restart instead of reload.

Next question: How that shall be fixed by using kea instead of dhcpd? Unbound still needs to get this data, without restarting.

I am also now further, why this happens to my pfSense since some weeks, even when that configuration was in place for years. My network got recently at least 4 mobile devices using new mac addresses for connection, once they were connected anywhere else. They don't remember mac addresses for certain WLANs or router macs. Because of this, I had to throw away static associations for those mobile devices. This is in timely coincidence, to the growing number of restarts.

Gertjan

@Piktor10 said in Unbound restarting 20-30 times a day without any reason:

Next question: How that shall be fixed by using kea instead of dhcpd? Unbound still needs to get this data, without restarting.

Believe it or not, after 10+ years of waiting, this is exactly one of the reasons KEA was introduced ! (IMHO)
A couple of weeks ago, a Netgate software développer acknowledged here on this forum that he started the implantation of just this very issue.
Kea that signals unbound the presence of a (new) lease without restarting unbound.

@Piktor10 said in Unbound restarting 20-30 times a day without any reason:

My network got recently at least 4 mobile devices using new mac addresses for connection, once they were connected anywhere else. They don't remember mac addresses for certain WLANs or router macs. Because of this, I had to throw away static associations for those mobile devices. This is in timely coincidence, to the growing number of restarts.

Not entirely pfSense's fault.
When, you, the pfSense admin, decides to add a DHCP Static MAC lease, it would be wise to check if the MAC in question is a randomized MAC, or not. The owner of the device should be willing to shut down this randomization on your network. So, see it as a win-win situation, check that randomisation is off.

bmeeks

@Piktor10 said in Unbound restarting 20-30 times a day without any reason:

Next question: How that shall be fixed by using kea instead of dhcpd? Unbound still needs to get this data, without restarting.

unbound has a control channel app that can update some parameters without restarting the daemon. But that control channel app needs to be called from within the DHCP server code. Currently the ISC DHCP daemon is not coded to do that. The new Kea DHCP daemon can be modified to do that, and that's what pfSense intends to do. Since the current ISC DHCP daemon will eventually be obsoleted, there is no point to expending developer resources to modify it. Better to put the effort into the replacement product - Kea.

Gertjan

@bmeeks

Some one, and yes, even me, could have taken this https://github.com/NoThrowForwardIt/pfsense-tools/blob/master/pfPorts/dhcpleases/files/dhcpleases.c as a starting point, and rewrite it so it takes in account the "unbound-control" parameters, so things could haven been done 'properly' ages ago.

The actual solution : check if the ISC DHCPd 'dhcpleases' file was changed, and if so, shoot at unbound.
It did pass the KIS test

michmoor

@bmeeks
Curious. I typically run pfSense in small-ish networks is it worth it keeping DHCP Registration enabled? I haven't noticed a performance hit but also the only benefit is knowing what IP is what when i nslookup.

bmeeks

@michmoor said in Unbound restarting 20-30 times a day without any reason:

@bmeeks
Curious. I typically run pfSense in small-ish networks is it worth it keeping DHCP Registration enabled? I haven't noticed a performance hit but also the only benefit is knowing what IP is what when i nslookup.

The decision point would be the number of DHCP clients and thus the number of DHCP lease renewals per interval (say per hour or per day). Each lease renewal results in an unbound restart with the current design when "Register DHCP leases in the DNS Resolver" is enabled. Each restart means DNS is unavailable until the unbound daemon gets back up and running. With a plain vanilla install that can be fairly quick, but if you use DNSBL, then the restart time can be much longer. Even still, enough restarts per time interval can still be very disruptive on a network as you have no DNS for an appreciable interval of time.

I, too, like to have my local DHCP clients registered in DNS. Because I have a Windows AD domain internally, I just use that for both DHCP and DNS. My unbound setup only resolves for pfSense itself and has a domain override for my AD domain that points back to my AD DNS servers so pfSense can resolve my local LAN hosts.

This problem with restarting unbound is not a good design choice in my view, but it is what it is. I have a pfSense install on an SG-6100 at a local church, and I had to disable the DHCP lease registration option there otherwise DNS became essentially unusable due to the frequent restarts initiated by lease allocations and renewals on the network.

michmoor

@bmeeks said in Unbound restarting 20-30 times a day without any reason:

This problem with restarting unbound is not a good design choice in my view, but it is what it is. I have a pfSense install on an SG-6100 at a local church, and I had to disable the DHCP lease registration option there otherwise DNS became essentially unusable due to the frequent restarts initiated by lease allocations and renewals on the network.

how many devices on the network?

bmeeks

@michmoor said in Unbound restarting 20-30 times a day without any reason:

how many devices on the network?

There is no universal precise number. It will depend primarily on the DHCP lease time. And then the number of devices comes into play. If you have a DHCP lease time of 1 hour with 1 device on the network, unbound would be seeing a restart once every 30 minutes (assuming the single device renewed its lease at the 50% point of the lease duration). Double the number of devices and the restarts could double in that 30 minute window. Depends on how far apart each lease start time was. Say you have 10 devices getting DHCP addresses and they all start up at staggered times. Then each of the 10 would be renewing its lease at a slightly different minute during the 1 hour lease duration. That could result in lots of unbound restarts over the hour.

Play different scenarios in your head and you can quickly see how this can become more and more of a problem. Of course 1 hour is a ridiculously low lease time, but 4 or 6 hours may not be unheard of for say a Guest Wireless network or some other network where user devices frequently come and go and you don't have a huge scope available.

Run the DHCP lease time up to days and the problem gets smaller but still does not disappear. If at any instance in time unbound is restarting when a client requests a DNS lookup, that lookup is going to fail.

michmoor

@bmeeks
alright alright...i suppose i have to give myself extra work and spin up a separate DNS server at home...

But seriously, i do like having all my pieces in one place , one firewall. All depends on the use case.

Gertjan

@michmoor said in Unbound restarting 20-30 times a day without any reason:

i have to give myself extra work and spin up a separate DNS server at home...

The situation existed for ... many years.
In a couple of weeks, months, with the next version of pfSense, this will be addressed.

I've been using static MAC leases for my devices that I wanted to know by 'name' like printers, NAS etc.

Register DHCP leases in the DNS Resolver" has been set to default value : NOT set.

Never had any DHCP/DNS issues.

michmoor

@Gertjan
Oh i was joking when i wrote that. Im not going to do that. haha.

Just to show how often the HUP process was bouncing on my system until i turned off DNS registration. The picture below is expanded over 5 days but rest assured it was restarting every few minutes. Small network so no impact but i can see it being a problem with several 100 endpoints.

Gertjan

@michmoor said in Unbound restarting 20-30 times a day without any reason:

but i can see it being a problem with several 100 endpoints.

Things will get even better when you 'some' Wifi connected devices that roam among several APs.
DHCP will fire away every x seconds .... and as much unbound restarts.

Quickly, unbound is more busy 'restarting' as actually doing 'DNS' for you.