Unbound seems to be restarting frequently

johnpoz

@stompro said in Unbound seems to be restarting frequently:

Minimum RRSet TTL setting of unbound, to make the miniumum TTL 2 hours. That does seem to help quite a bit.

I have ran a min ttl of 1 hour for many many years - I have not seen any sort of issues with doing so.. Its not that I trying to actually lower the number of queries - but sites that use excessively low ttls bug the shit out of me ;)

There is no sane reason to have a ttl of 60 seconds - unless your were in the actual process of changing the record to point elsewhere.. And 60 seconds, 5 minutes seem to be a growing common thing with dns hosted by dns services. I think they are on purpose trying to increase the number of queries sent to them.. Either just bumping their numbers up, or wanting to track users more on how long they are staying on sites, how often they go there, etc.

Sure if the record is for a dynamic host, ie ddns sure you wouldn't want that ttl to be like a day or something...

So I would be curious how low they set the ttl of something.xyz.com that are blocking, will it maybe be unblock 5 minutes from now ;)

stompro

@johnpoz

This is getting off topic also, but I'm also exploring sending out a list of the top 100-500 domains in a domain override list to bypass cisco umbrella resolution for the most heavily requested domains... it seems like I can just use our ISPs or googles dns servers for those entries. And Unbound has some nice options for domain overrides that I didn't know about. I can set multiple servers for each domain, you can say to fall back to the system forwarders if the configured override forwarders are not accepting requests for fault tolerance. And I'm reading up on on dumping and restoring the cache to make unbound restarts retain the rrset cache. But this whole case of paying per lookup essentially is probably quite a niche situation.

And after reading more about how Cisco Umbrella calculates usage, it looks like it is a 30 day daily average, so you get some credit for slow weekends and the like.

johnpoz

@stompro said in Unbound seems to be restarting frequently:

This is getting off topic also

Yeah but quite often that is where the fun happens ;) A specific question can often lead to great discussions, be that always on point and specific the original question would be boring..

If threads were always question : answer and that is all - I doubt I would spend as much time here as I do..

I could see quite a few ways to reduce the number of queries sent to them ;) Overrides for stuff you know is on the bad list, but yeah I like your idea of known good domains being forwarded to somewhere else that doesn't charge you for the query.. I mean how likely is it for example for www.google.com to get put on their bad list ;) So why ever ask them for www.google.com.. And have that query count against your use.

The only problem I see with that is if they did for some reason put something on the block list that you were forwarding somewhere.. But there is prob a large list of specific fqdn or domains that are highly highly unlikely to be listed as bad by them.

stompro

@johnpoz said in Unbound seems to be restarting frequently:

I could see quite a few ways to reduce the number of queries sent to them ;) Overrides for stuff you know is on the bad list, but yeah I like your idea of known good domains being forwarded to somewhere else that doesn't charge you for the query.. I mean how likely is it for example for www.google.com to get put on their bad list ;) So why ever ask them for www.google.com.. And have that query count against your use.

I'm not too worried about caching the bad stuff, that should be a very small minority of requests, and once someone hits the block page, I doubt they will keep trying enough for it to be a problem.

www.google.com is actually one I wouldn't want to add, since Cisco Umbrella enforces safe search for google search results. Although it is possible to do that locally(setting a domain override for google.com to a certain ip), I'm not sure how they are doing it, so I would rather leave that on them. I think google uses sane ttls so they don't seem to be a problem.

I'm going to stick with bypassing for service related stuff, not content related. windowsupdate,msedge,connectioncheck.ubuntu.com,lencrpt.com.

stompro

@gertjan said in Unbound seems to be restarting frequently:

Consider the situation: unbound starts, and read all the files its need, like /var/etc/hosts, the DHCP leases file, etc.
Then we instruct it to load the cache file, /var/tmp/unbound_cache
It doesn't take long to discover that the internal working cache (with the new local info) in unbound is being replaced by what has been written in /var/tmp/unbound_cache.
F*ck.
Doing so makes it completely useless to restart unbound to begin with …... >:(
As said in the doc: dump_cache and load_cache exists for debugging purposes.

Hello, I was trying to track down why the unbound dump_cache and load_cache where not fully implemented, and saw your comments from 2015. I'm wondering if maybe something has changed in the unbound code... because I cannot verify that local data is included in the unbound cache dump.

When I do a cache dump, I cannot find any of my locally setup domains in the data.... and if I understand your rant correctly, you are saying that old stale local data from the dump would overwrite new data, making the dump and restore unusable.

In my testing with unbound 1.12 in 21.05.2, I don't notice any incorrect data after a dump and restore. Could you tell me what to look for? Maybe NXDOMAIN entries?

The negative catch entries don't seem to cause any problems... I just tried the following steps.

dig printer9.mylocaldomain.org, no results
dumped the cache
fgrep printer9 unboundcache.file
msg entry exists "msg printer9.mylocaldomain.org. IN A 33155 1 169 3 0 1 0" (I'm ignorant about what that actually means though, is that a negative cache entry?)
Added a host override for that host.
reloaded the dumped cache
dig printer9.mylocaldomain.org returns the correct results

Maybe the problem is with existing dns entries, that are now being overriden?

dig testsite.mylocaldomain.org, returns current A record.
dump unbound cache
fgrep cache file "testsite.mylocaldomain.org. 864 IN A 18.22.82.21"
Add domain override for testsite.mylocaldomain.org
Restore the dump.
dig testsite.mylocaldomain.org returns the correct info, it wasn't overwritten by the restore.

So maybe it would make sense to now revive that feature?

swixo

@johnpoz said in Unbound seems to be restarting frequently:

@swixo said in Unbound seems to be restarting frequently:

This is a workaround and in some cases undesirable.

I don't think anyone would disagree with that. It has been a sore point for a long time.

Agree it not very desirable for unbound to restart and loose its cache on every dhcp, etc.

Still hoping for a fix here. Another big time-waster network issue emerged that came down to this issue. Certain MAC clients getting stuck if they make a DNS request during the reload time.

Who do I have to pay to get this fixed properly?

Gertjan

@swixo said in Unbound seems to be restarting frequently:

Who do I have to pay to get this fixed properly?

That's a classic one.
Most ISP's, if not all, explain : first, power up our modem router. As soon as it started, power up your LAN devices. This is even more true when there is a modem before pfSense : first the modem, then pfSense, then the rest.
Some 'dumb' devices start way faster as our routers, and are ready to go, when pfSense or any router has an entire OS to boot.

First : ditch the device that have a broken client DHCP. If the device's DHCP clients starts, it's ok there is no answer from the DHCP server (= pfSense). The protocol supports a 60+ delay just fine.
Solution : apply the known 'power' rule, or get yourself a fast (like very fast) router device.
A 32 Mhz low power ARM device doesn't boot with the same speed as a 3 Ghz I9 core ;)

@swixo said in Unbound seems to be restarting frequently:

Agree it not very desirable for unbound to restart and loose its cache on every dhcp, etc.

While waiting :

and add DHCP MAC leases for the few (or all) all your LAN devices.
Done.

Patch

@swixo said in Unbound seems to be restarting frequently:

Still hoping for a fix here.

I know nothing about how it is implemented internally however if it was possible, rather than restarting unbound, it would be nice if a second instance of unbound was started and initialised. After which the old could be killed and the new connected to the interfaces.

Probably not practical however if it was it may reduce system down time. The equivalent of running up a live spare.

stompro

@swixo said in Unbound seems to be restarting frequently:

Who do I have to pay to get this fixed properly?

It seems like there is a fix, dhcpleases needs to be fixed to not restart unbound. There is code at https://github.com/pfsense/FreeBSD-ports/pull/751 but it needs more work. You may want to post a bounty there and see if anyone would be willing to work on polishing it up so it can be accepted.

thearamadon

@stompro what about it needs to be "polished"? Just the fact that there are merge conflicts?

luckman212

Nothing to add right now, other than: count me in as someone who hopes this gets addressed. The closest we've come appears to still be this draft PR from 2+ years ago.

I personally don't use the "register DHCP leases" option but most customers expect stuff like "a device named LAPTOP_3f7ea4 connects to the network, then try to connect to smb://LAPTOP_3f7ea4 should work"...