Unbound seems to be restarting frequently

doktornotor

In many files a function "services_unbound_configure()" is called when starting or re-starting unbound and also in various other scenarios. This function is defined in the services.inc.
The first and worst thing that always happens when calling this function:
// kill any running Unbound instance
if (file_exists("{$g['varrun_path']}/unbound.pid"))
sigkillbypid("{$g['varrun_path']}/unbound.pid", "TERM");
That means: Cache is lost. Each time this function is called.

The cache is lost no matter what, even on the "graceful" unbound-control reload (per unbound docs). It would flush it anyway.

man unbound-control

reload Reload the server. This flushes the cache and reads the config file fresh.

I cannot see any solution here to losing the cache beyond crude hacks (restore the cache every time…) until the retarded design is fixed upstream.

ConfusedUser

@doktornotor:

The cache is lost no matter what, even on the "graceful" unbound-control reload (per unbound docs). It would flush it anyway. I cannot see any solution here to losing the cache beyond crude hacks (restore the cache every time…) until the retarded design is fixed upstream.

Completely correct, that's why I wrote

Also the cache could be stored and recovered afterwards. But I can't see this happening here.

Regarding the "reload" I see this simply as a more elegant way than killing a process. That's why I mentioned it.

doktornotor

Did you try this instead?


 if (file_exists("{$g['varrun_path']}/unbound.pid"))
      sigkillbypid("{$g['varrun_path']}/unbound.pid", "HUP");

This would need to be moved just before the final


return $return;

obviously - I don't get the "logic" behind killing the service first and then doing some reconfiguration. :o ::)

ConfusedUser

@doktornotor:

Did you try this instead?
 if (file_exists("{$g['varrun_path']}/unbound.pid"))
      sigkillbypid("{$g['varrun_path']}/unbound.pid", "HUP");
This would need to be moved just before the final
return $return;
obviously - I don't get the "logic" behind killing the service first and then doing some reconfiguration. :o ::)

I haven't tried because I don't want/need the process to be killed or restarted at all.
Since this function is called to start and to restart the process I'm checking if the process is running with this line "if (!(get_service_status(find_service_by_name('unbound'))))" and if it is already running I simply do nothing.

In my opinion the first thing that should happen is to split the start and the restart function and also to add a check box "keep cache when restarting" to the GUI which causes an "unbound-control dump_cache" and after restarting it runs an "unbound-control load_cache".
But changing the scripts to accomplish that goes far beyond my skills.
Also the fact that things are called currently that don't exist means for me someone really needs to look into the whole implementation.

doktornotor

@ConfusedUser:

I haven't tried because I don't want/need the process to be killed or restarted at all.
Since this function is called to start and to restart the process I'm checking if the process is running with this line "if (!(get_service_status(find_service_by_name('unbound'))))" and if it is already running I simply do nothing.

Well, this hack obviously is not acceptable, since it breaks things like the DHCP registration… So, we should move to testing something that might get accepted instead. :P

ConfusedUser

@doktornotor:

Well, this hack obviously is not acceptable, since it breaks things like the DHCP registration… So, we should move to testing something that might get accepted instead. :P

Sure… This is really just a workaround and for me it works like a charm. Obviously this can't go into release this way.

I was posting this because it might help others, too, who don't use dynamic WAN IPs and DHCP registration in unbound.

What really needs to be happen would be to fix AND clean up unbound.inc and services.inc.
I would love to help (obviously I'm not too lazy, the investigation and the workaround took me a few hours) but my skills are simply not good enough to provide a "proper" solution even though it doesn't look too complicated to me so I might give it a try...

For now (and for anyone who wants to use it): The workaround works perfect for me. But don't blame me if it kills your box. :)

doktornotor

Actually all the code to dump/restore cache is there. Just the checkbox to enable the dumpcache var went AWOL. This was there with the 2.1 package.

https://github.com/pfsense/pfsense-packages/blob/master/config/unbound/unbound.xml#L179

EDIT: Filed a bug about the missing cache save/restore - https://redmine.pfsense.org/issues/4667

Gertjan

I saw the code in unbound.inc, that handles the dump_cache and load_cache.
It might be the solution.
So I activated the 'dump_cache' facilities, so that before a 'reload', the cache was dumped.

Have a look at this:

function unbound_control($action) {
	global $config, $g;

	$cache_dumpfile = "/var/tmp/unbound_cache";

	switch ($action) {
	case "start":
.....
	case "stop":
.....
	case "reload":
.....
	case "dump_cache":
		// Dump Unbound's Cache
		if ($config['unbound']['dumpcache'] == "on")
			do_as_unbound_user("dump_cache");
		break;
	case "restore_cache":
		// Restore Unbound's Cache
		if ((is_service_running("unbound")) && ($config['unbound']['dumpcache'] == "on")) {
			if (file_exists($cache_dumpfile) && filesize($cache_dumpfile) > 0)
				do_as_unbound_user("load_cache < /var/tmp/unbound_cache");
		}
		break;
	default:
		break;

	}
}

Be carefully : $config['unbound']['dumpcache'] isn't defined neither declared in the GUI.
And doing something like this: do_as_unbound_user("load_cache < /var/tmp/unbound_cache");
will not work. The function do_as_unbound_user() works not that way.
Anyway, I had my unbound cache being dumped before restart.

And then I started to think … after reading this:

       load_cache
	      The contents of the cache	is loaded from stdin.  Uses  the  same
	      format as	dump_cache uses.  Loading the cache with old, or wrong
	      data can result in old or	wrong data returned to clients.	 Load-
	      ing data into the	cache in this way is supported in order	to aid
	      with debugging.

Consider the situation: unbound starts, and read all the files its need, like /var/etc/hosts, the DHCP leases file, etc.
Then we instruct it to load the cache file, /var/tmp/unbound_cache
It doesn't take long to discover that the internal working cache (with the new local info) in unbound is being replaced by what has been written in /var/tmp/unbound_cache.
F*ck.
Doing so makes it completely useless to restart unbound to begin with …... >:(
As said in the doc: dump_cache and load_cache exists for debugging purposes.

Leaves me with one idea: ditch unbound, even if it is the most secured name server, and make friends with bind...... (one of the most used name server on the planet !?!), which, I'm pretty sure, doesn't 'work' like this.

edit: hopping through the unbound source code make me think: it 'reloads' when reading in a cache file …. :o
Well .....

doktornotor

Yeah I had a look at the source code and I have hard time understanding the design. Also, the "reload" which is something presumed to be graceful in the rest of the world is pretty much equal to restart and disrupts the service. Unless this gets fixed quickly upstream, I frankly would suggest removing the DHCP registration features from the DNS Resolver GUI altogether. With this broken design, it is not usable.

kejianshi

It seems to work ok IF you are using static DHCP entries for the clients, like I do.

Of course this isn't possible if you are providing DHCP for a mall full of wireless clients.

But then again, why the hell would you want a mall full of wireless clients registered in DNS?

So maybe register the static entries but not the dynamic entries.

ConfusedUser

I saw host entries can be removed with unbound-control at runtime w/o restarting the service. But I can't see an option to add a host using unbound-control. If that is possible this could eliminate the issues when using DHCP.

I'm registering ALL my leases at a Windows Server DNS Service (since there is an AD running on it). So even DHCP is not updating the local DNS it still keeps restarting it. So I'm using unbound only for Internet DNS loolup really but everything locally is done by Microsoft DNS service. So definitely no need for restarting anything since hosts file will never change.

Well, "…keeps restarting..." is wrong in fact, "...kept restarting..." is correct in my case since I removed the ability to restart unbound from the code a couple of days ago (which I posted here). Even it's a very dirty workaround it works perfect so far without any adverse effects for my usage. Couldn't be happier that I've done that.

kevindd992002

Is there any progress to this issue?

Liath.WW

yeah hoping myself that this is fixed, and soon. I don't have any dns mappings done in DHCP (do have some statics), but I keep gettign the restart issue every flippin time a phone is turned on and connects. It's getting old fast.

CiscoKid85

In its current implementation, unbound is unusable for me. The restarts are causing the cache to clear so I've had to use forwarder to make everything work.

kevindd992002

It's clearly happening to very many users but is it already acknowledged as a real "issue"? No pun intended.

doktornotor

@kevindd992002:

It's clearly happening to very many users but is it already acknowledged as a real "issue"? No pun intended.

Perhaps take this to upstream mailing list? As discussed many times above, there's no graceful reload anywhere, the code is braindead.

ConfusedUser

Just want to mention that my "solution" https://forum.pfsense.org/index.php?topic=89589.msg517047#msg517047 is working great on 6 production systems. :)

cmb

@doktornotor:

@kevindd992002:

It's clearly happening to very many users but is it already acknowledged as a real "issue"? No pun intended.

Perhaps take this to upstream mailing list? As discussed many times above, there's no graceful reload anywhere, the code is braindead.

That's what I was planning on doing post-2.2.4. I dug into it a few days ago looking at something else to find basically what's been discussed previously. Sending it a -HUP is a full stop/start rather than a reload, and same for unbound-control reload. There is no way to just reload the config without doing a full stop/start. It is certainly braindead.

Anyone else is welcome to report upstream, just please post back a link to the thread here.

ky41083

How many people having this issue are using a config that used to use the forwarder rather than the resolver? And how many of those people still have the top 3 DHCP options checked in the forwarder, even though the forwarder is disabled?

If this is the case, humor me here… go to the DNS forwarder page, check the top enable box to enable access to the 3 DHCP options boxes below it, uncheck all 3 DHCP options boxes, then uncheck the top enable box (so we don't unwantingly enable the forwarder), click the save button, verify that all 3 DHCP options boxes for the forwarder are now unchecked and disabled once the page reloads, and reboot pfSense for sanity.

Please post back as to what effect this has on Unbound reloads… either the same, less frequently, or eliminates them all together.

jpns

I have the same problem with unbound constantly reloading and having no DNS resolution for about 30 seconds every few minutes. I did have both the dynamic and static DHCP clients options ticked for DNS but I disabled them both and it didn't fix it. I have come to the conclusion that unbound is so broken that I've had to switch back to dnsmasq and now it's all working fine. I was using forwarding in unbound anyway (I am running multi-WAN with failover), so it doesn't really bother me having to switch.