Help in understanding Unbound's host cache limit

chickendog

I tried my best to Google this but haven't been able to get a clear answer.

I've been watching the msg.cache.count metric via this command:

unbound-control -c /var/unbound/unbound.conf stats_noreset | egrep 'total.num|cache.count'

I have host cache num-hosts set to the default of 10000 but the msg.cache.count is exceeding this number. Should Unbound not be evicting records once 10k is reached?

The output of the command is:

total.num.queries=200478
total.num.queries_ip_ratelimited=0
total.num.queries_cookie_valid=0
total.num.queries_cookie_client=0
total.num.queries_cookie_invalid=0
total.num.cachehits=181155
total.num.cachemiss=19323
total.num.prefetch=82558
total.num.queries_timed_out=0
total.num.expired=82558
total.num.recursivereplies=18665
total.num.dnscrypt.crypted=0
total.num.dnscrypt.cert=0
total.num.dnscrypt.cleartext=0
total.num.dnscrypt.malformed=0
msg.cache.count=13625
rrset.cache.count=9984
infra.cache.count=2
key.cache.count=0
dnscrypt_shared_secret.cache.count=0
dnscrypt_nonce.cache.count=0
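
For anyone wanting to track these numbers over time, here is a rough Python sketch that parses the name=value lines and works out the cache hit ratio. The function names are made up for illustration, and the sample is just a subset of the output above pasted in as a string; in practice you would feed it the text from unbound-control.

```python
def parse_stats(text):
    """Parse 'name=value' lines from unbound-control stats output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep:  # skip anything that is not a name=value pair
            stats[name] = float(value)
    return stats


def hit_ratio(stats):
    """Fraction of queries answered from the cache (hits / (hits + misses))."""
    hits = stats["total.num.cachehits"]
    total = hits + stats["total.num.cachemiss"]
    return hits / total if total else 0.0


# Subset of the output shown above.
sample = """\
total.num.queries=200478
total.num.cachehits=181155
total.num.cachemiss=19323
msg.cache.count=13625
rrset.cache.count=9984
"""

stats = parse_stats(sample)
print(f"cache hit ratio: {hit_ratio(stats):.1%}")  # prints "cache hit ratio: 90.4%"
```

So roughly 9 out of 10 queries were answered from the cache in this sample.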

I have recently turned on serve-expired and set serve-expired-ttl: 86400 but I'm not sure that matters in this case? As the cache limit should still be in effect?

johnpoz

@chickendog said in Help in understanding Unbound's host cache limit:

rrset.cache.count=9984

Pretty sure that is your host cache count.. The msg count would be more like validation results, rcodes, etc.. I.e. the headers, so to speak, from a query..

I take it you're forwarding and not resolving.. because your infra count is super low

As to serve zero counting against your host cache - hmmm, never looked into that.. Just an off the cuff guess, since even if the ttl is zero it would still be in the cache.. So I would think it would be purged as you hit your limit.. You could set your host cache to something super low and do some testing.. but 10k hosts cached is a lot of records ;)

If you're concerned, bump it up.

chickendog

@johnpoz I see thanks for enlightening me.

Yep I am forwarding.

10k is indeed a lot but that's just over one day in my environment with serving expired turned on. I ended up restarting unbound for something else so I will monitor the rrset.cache.count and see what it does.

I think I'll leave it at 10k for now. I want a fairly up-to-date cache where records for frequently used domains are fresh but stale records don't stay around for more than a day. Serve-expired in tandem with serve-expired-ttl seems to do what I want.

Prefetch doesn't work that well because you need to hit the record within the last 10% of its TTL. For low-TTL records this isn't that good: devices may request a given record frequently for some time, but if that domain then goes unused for, say, 5 mins and the TTL is 5 mins, the record expires and is removed from the cache. Many domains have low TTLs so this happens a lot.

Take a scenario where a device is doing X on a website at 9am, then stops, comes back at 11am. All records are expired and gone so new ones must be acquired.
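
To put numbers on that window, here is a rough Python sketch. It assumes the simplified rule described above: a prefetch is only triggered by a query that arrives while the record is still valid and inside the last 10% of its original TTL. The function name and the `window` parameter are made up for illustration and are not part of Unbound.

```python
def would_prefetch(original_ttl, seconds_since_cached, window=0.10):
    """Return True if a query arriving now would trigger a prefetch.

    Simplified model (an assumption, not Unbound's actual code): prefetch
    only fires for a query that hits a still-valid record during the last
    `window` fraction of its original TTL.
    """
    remaining = original_ttl - seconds_since_cached
    if remaining <= 0:
        return False  # already expired, nothing left to prefetch
    return remaining <= original_ttl * window


TTL = 300  # a 5 minute record

print(would_prefetch(TTL, 100))   # False: 200s remaining, outside the last 30s
print(would_prefetch(TTL, 280))   # True: 20s remaining, inside the last 30s
print(would_prefetch(TTL, 7200))  # False: cached at 9am, queried at 11am
```

With a 5 minute TTL the window is only 30 seconds wide, so a domain has to be requested pretty much continuously for prefetch to keep it alive.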

I know what you're thinking: the other thing I could do is set the minimum TTL higher, say one day. BUT the downside there is that, assuming these domain owners have set the TTL low for some reason, my record will be even more stale than with serve-expired, where the record is served and then refreshed then and there. So it's kind of the best of both worlds.

Anyway I have gone off on a tangent and yes I am probably over-engineering it but these settings are there for a reason ;)

Thanks again @johnpoz

johnpoz

@chickendog said in Help in understanding Unbound's host cache limit:

need to hit the record within 10% of the TTL.

I don't think that is the way it works, it's worded a bit funny - where they say it could increase your dns traffic by 10%

total.num.prefetch=7349

is what I show for the number of times unbound has prefetched..

Not a fan of very low ttls - it's pointless.. Unless you were about to change your record to point to a new IP, having ttls in the 30 and 60 second range does nothing but increase dns traffic - for what purpose, other than them wanting to figure out how long maybe you've been on some site?

I set my min ttl to 3600 seconds, yeah it's not good practice to mess with ttls set by the owners - but it's stupid to have to do a query every 30 freaking seconds for something.. Defeats the whole point of a "cache" if you ask me.

I serve zero, prefetch and min ttl 3600.. And I have yet to run into any sort of issue that I am aware of or have noticed anything odd, etc.

chickendog

@johnpoz I read that this is the way it works in the Unbound docs here: https://unbound.docs.nlnetlabs.nl/en/latest/topics/core/serve-stale.html#serve-expired

total.num.prefetch is having its count incremented just from having serve-expired turned on. I don't have prefetch turned on anymore. It makes sense though, it technically is a prefetch that it does after it serves the record. You can check the code to validate this; the above docs also describe this.

And yeah I have minimum TTL set to 5min from the upstream DNS provider NextDNS - so I catch that silly situation as well.

Gertjan

@chickendog

From what I make of it, when prefetch is activated, records won't expire anymore, and are refreshed when needed.
So "serve-expired" becomes irrelevant.
Your local unbound dns cache slowly fills up with the DNS names you most often use, no more waiting for DNS.

chickendog

@Gertjan No that's not correct. You need to read the Unbound docs, see my link above.
Or better yet sift through the code.

Gertjan

@chickendog

What is incorrect ?
Prefetching ?

The description seems fine to me :

When records already present in the cache are refreshed at 90 % of their TTL - so not expired yet - and updated within a second, these records can't expire anymore (except if the TTL was less than 10 seconds ^^)

My own measurements (munin script running unbound-control to query the unbound stats) show me that nearly all requests are handled with data available in the cache.

chickendog

@Gertjan Not correct as in serve-expired is not irrelevant.
Your case might work ok for you, as it depends on how many clients you have and what domains they are requesting.
But if a domain is not requested within 10% of the TTL then it will not be prefetched.

If you don't believe me you can check the code, or ask the dev
https://github.com/search?q=repo%3ANLnetLabs%2Funbound%20prefetch&type=code

So in a scenario where a record is fetched, not reused within its TTL, and then expires (and is thereby removed from the cache), it has to be fetched again even with prefetch enabled.
Say you are only using prefetch....with TTLs of 5 min or less (or even 30 min or less) being the norm today, you can have scenarios where peak periods of your network are serviced well by the cache. But if there is another peak period later in the day, the cache has to be almost rebuilt.

With serve-expired you can keep these records in the cache from one peak period to the next. Then use serve-expired-ttl to optimise how long the records are kept. For me 1 day is good so that the peak periods throughout the day are served with an already healthy cache before the device requests it.
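
Here is a rough Python sketch of that difference. It is a simplified model and not Unbound's code: it assumes the record has not been evicted for space reasons, and that serve-expired-ttl counts from the moment the record expires, with 0 meaning no limit. The function and parameter names are made up for illustration.

```python
def answer_source(age, ttl, serve_expired=False, serve_expired_ttl=0):
    """Classify how a query is answered for a record cached `age` seconds ago.

    Simplified model: `ttl` is the record's TTL in seconds and
    `serve_expired_ttl` is how long after expiry a stale answer may still
    be served (0 is treated as "no limit").
    """
    if age < ttl:
        return "fresh from cache"
    expired_for = age - ttl
    if serve_expired and (serve_expired_ttl == 0 or expired_for < serve_expired_ttl):
        return "stale from cache, refreshed afterwards"
    return "cache miss, fetched upstream"


TTL = 300          # 5 minute record
TWO_HOURS = 7200   # fetched at 9am, queried again at 11am

print(answer_source(TWO_HOURS, TTL))
# cache miss, fetched upstream
print(answer_source(TWO_HOURS, TTL, serve_expired=True, serve_expired_ttl=86400))
# stale from cache, refreshed afterwards
print(answer_source(2 * 86400, TTL, serve_expired=True, serve_expired_ttl=86400))
# cache miss, fetched upstream
```

The 11am query is answered from the cache straight away with serve-expired on, while a record that has been unused for two days still ages out.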

Hope that makes sense.