DNS Resolver not caching correct?

johnpoz

Not if the cache has been wiped because unbound has restarted.

When users have issues with unbound, it is normally related to it being restarted all the time - they have it registering dhcp leases, and they have lots of them happening all the time. Or say pfBlocker is restarting it, etc.

Every time unbound restarts the cache is flushed.

Gertjan

And there you have it:
@johnpoz said in DNS Resolver not caching correct?:

maybe you restarted unbound!! Say if you have it registering dhcp, every time new dhcp lease it will restart unbound and you will lose your cache.
A restart of unbound will clear the cache.

Not a real issue, but very few people are aware of this.

With pfSense on default settings, the more clients you have on LAN, the more DHCP requests there will be.
When registering a new lease, the DHCP daemon will kick (restart) the DNS cache = unbound. Which implies: cache lost.

Solution: do nothing with your devices.
Declare static DHCP entries as much as possible on your LAN(s) using their MAC address.
When done, uncheck the DHCP Registration option in the DNS Resolver settings.

edit : true, other processes can also restart unbound. pfBlocker might be one of them.
The DNS logs will tell you how often it restarts.

johnpoz

do this command
unbound-control -c /var/unbound/unbound.conf stats_noreset | grep rrset.cache

What do you show for records in your cache... Then restart unbound and do it again

before restart
rrset.cache.count=8549

Then restarted and looked a few moments later
[2.4.4-RELEASE][admin@sg4860.local.lan]/: unbound-control -c /var/unbound/unbound.conf stats_noreset | grep rrset.cache
rrset.cache.count=736

It has already started building up cache because things on the network are always doing queries - but on a restart of unbound the cache is flushed.

You can also look at the infra cache value
infra.cache.count=5194

That was before, and then after restart
infra.cache.count=621

So yes, if unbound is restarting often, you lose the cache.. And then stuff has to be resolved again.
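To compare the counters before and after a restart without eyeballing them, something like the following could be used. It is only a sketch: `parse_unbound_stats` and `cache_dropped` are made-up helper names, and the sample text is just the values quoted above. A shrinking counter suggests a flush, but it is not proof, since entries also expire on their own.

```python
def parse_unbound_stats(text):
    """Parse 'key=value' lines (as printed by stats_noreset) into a dict."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skip shell prompts and blank lines
        key, _, value = line.partition("=")
        try:
            stats[key] = float(value) if "." in value else int(value)
        except ValueError:
            continue  # not a numeric counter, ignore it
    return stats


def cache_dropped(before, after, key="rrset.cache.count"):
    """True if the counter shrank between two snapshots."""
    return after[key] < before[key]


# Values taken from the post above (before and shortly after a restart).
before = parse_unbound_stats("rrset.cache.count=8549\ninfra.cache.count=5194")
after = parse_unbound_stats("rrset.cache.count=736\ninfra.cache.count=621")

print(cache_dropped(before, after))                      # rrset cache
print(cache_dropped(before, after, "infra.cache.count"))  # infra cache
```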

KOM

@johnpoz Thanks, John. I didn't know that unbound could override the default TTL from an authoritative server.

johnpoz

I wouldn't suggest normal people do it ;) But if you're fully aware of how dns actually works - and you're sick of seeing queries for shit because they have a ridiculously low ttl.. 60 freaking seconds - give me a gawd damn break! ;) And I wouldn't suggest you set it too high.. An hour to me seems like a good min ttl for those that are under.. 60 seconds, 5 minutes, etc. Those are just too freaking low unless you're about to switch to a different IP, etc.

I think the amazon dns defaults to that on purpose to be honest, because they want more queries because they charge you per query ;) hehehe

I get it cdn stuff can move around - but put the shit behind some load balancers for freak sake vs such low ttls..

Altering the min ttl can lead to issues - so you should be well aware of what you're doing before you change it.. And know how to troubleshoot if that is what might be causing some issue you run into.

It's not going to help if your cache keeps getting cleared ;) which I would guess is the OP's problem

edit: The other thing that ticks me off is the lack of local cache on most of these iot devices.. Since they have no local cache, every time they want to look up something, which can be every freaking minute - they have to do a query for it, so for local queries it doesn't matter how high the ttl is. They are normally only looking for a couple of things, so the local cache doesn't have to be large - shit, cache say 10 records or something, running a minimal local cache can be done with almost zero resources.. etc.

Synology nas is like this, so I installed the dns package - just so I could point it to itself for dns (127.0.0.1) so it's not doing a query every time it needs to look up something. It now has its own local cache. And only forwards to pfsense when something is not in its local cache.

Same thing goes for the usg from unifi.. It runs no local cache.. stupid ass shit!

mrsunfire

@johnpoz said in DNS Resolver not caching correct?:

unbound-control -c /var/unbound/unbound.conf stats_noreset | grep rrset.cache

I don't have the DHCP registration option set because I use static DHCP entries. My unbound doesn't restart. Well today it does because I changed something in the settings.

unbound-control -c /var/unbound/unbound.conf stats_noreset | grep rrset.cache

shows me

rrset.cache.count=3875

If I use a client to connect to twitter.com and shortly after do a dig twitter.com, it also shows me 15 ms or more. Shouldn't it be 0 ms because the client already asked for that name? Before that I cleared my DNS cache on that client.

I think I can enable the option "Serve Expired", or is this a problem?

How can I see all entries that are cached?

johnpoz

And maybe it's 15ms because that is how long it took to query it from cache.. That seems like a really fast response all the way from roots.. Once unbound has looked up say host.domain.tld, and then it looks for otherhost.domain.tld, it will already have the authoritative ns cached, and only has to ask them directly and not walk down from roots.

You need to validate via the run time query I did above to see how long unbound has been running, there are other things that can reload it.. For example pfblocker.

You can look up specifics, or dump the whole cache if you want.

   dump_cache
          The contents of the cache is printed in a text format to stdout.
          You can redirect it to a file to store the cache in a file.

Use the lookup command and it will tell you what is cached for that and what it would use to lookup something.

Or you can just grep in the full cache for some specific record.

[2.4.4-RELEASE][admin@sg4860.local.lan]/: unbound-control -c /var/unbound/unbound.conf dump_cache | grep www.google.com
www.google.com. 1275    IN      A       172.217.1.36
msg www.google.com. IN A 32896 1 1275 3 1 0 0
www.google.com. IN A 0
[2.4.4-RELEASE][admin@sg4860.local.lan]/:

So in there you can see what the TTL is in the cache, and that it has a 0 set so it will respond even if the other cache entry has expired.

unbound will return from cache, unless that entry has been flushed, or the whole cache has been flushed. 15 ms sure seems pretty quick for a full resolve from roots. So either it only talked to the authoritative server it already had cached, or it served it up from cache and it was a bit slow doing that.

So looking up specific entries per the above command example will show you if the record is in cache, and what is left on the ttl, etc.
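If grep gets unwieldy, the record lines of a dump_cache listing can be picked apart with a few lines of Python. This is a rough sketch that only assumes the line layout shown in the output above (owner, remaining ttl, class, type, data); `cached_records` is a hypothetical helper, not part of unbound.

```python
def cached_records(dump_text, name):
    """Return (type, ttl_remaining, rdata) for record lines matching name."""
    wanted = name.rstrip(".").lower() + "."
    records = []
    for line in dump_text.splitlines():
        parts = line.split()
        # Record lines look like: owner ttl class type rdata...
        # The "msg ..." lines and the short "name IN A 0" lines do not match.
        if (len(parts) >= 5 and parts[0].lower() == wanted
                and parts[1].isdigit() and parts[2] == "IN"):
            records.append((parts[3], int(parts[1]), " ".join(parts[4:])))
    return records


# Sample text copied from the dump_cache output above.
sample_dump = (
    "www.google.com. 1275    IN      A       172.217.1.36\n"
    "msg www.google.com. IN A 32896 1 1275 3 1 0 0\n"
    "www.google.com. IN A 0\n"
)

for rtype, ttl, rdata in cached_records(sample_dump, "www.google.com"):
    print(rtype, ttl, rdata)
```

An empty result means the name is not in the dumped cache at all, so the next query for it has to be resolved.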

mrsunfire

@johnpoz said in DNS Resolver not caching correct?:

unbound-control -c /var/unbound/unbound.conf dump_cache | grep www.google.com

If it's cached, it's always 0 ms. I think PCI-E SSD and Core i7 should be fast enough :)

I will test around a bit and see if it's better now with the Serve Expired setting.

www.google.com.	243	IN	A	172.217.21.196
www.google.com.	243	IN	AAAA	2a00:1450:4001:808::2004
msg www.google.com. IN A 32896 1 243 3 1 0 0
www.google.com. IN A 0
msg www.google.com. IN AAAA 32896 1 243 3 1 0 0
www.google.com. IN AAAA 0

johnpoz

I worded that a bit wrong, I meant that I have reply with 0 ttl set, and still shows in the cache, etc. with the ttl counting down. Bad wording on my part.

Example of the ttl counting down

www.cnn.com.    3595    IN      CNAME   turner-tls.map.fastly.net.
msg www.cnn.com. IN A 32896 1 3595 3 2 1 0
www.cnn.com. IN CNAME 0
[2.4.4-RELEASE][admin@sg4860.local.lan]/: unbound-control -c /var/unbound/unbound.conf dump_cache | grep www.cnn.com
www.cnn.com.    3496    IN      CNAME   turner-tls.map.fastly.net.
msg www.cnn.com. IN A 32896 1 3496 3 2 1 0
www.cnn.com. IN CNAME 0
[2.4.4-RELEASE][admin@sg4860.local.lan]/:

What I would do is say query for something with a short ttl.. Say 60 seconds or something... Now just keep doing that query every couple seconds.. Do you get fast response? you see the ttl counting down.. You should see responses in 1 or 2 ms..

mrsunfire

After some hours I came back home and tested some names again, and now they all show me 0 ms. I think the Serve Expired option solved my problem. Even twitter.com now resolves in 0 ms after 3 hours.

KOM

Maybe I missed something in the dozen+ posts on this topic, but why does it matter? 0ms vs 12ms is barely noticeable and it only applies to lookups.

johnpoz

0 vs 12 is not an issue.. But serving up something from cache in 0ms or 12ms can make a difference vs say 500ms having to resolve it..

Much of it is more of a tech thing vs a hey-I-can-notice-it's-slower thing as well ;) Even if going to site xyz took 500ms to resolve, it's unlikely someone could actually notice the page loading slower if it was .5 seconds slower..

It can be: hey, I query this from the cmd line, why does it take 500ms when it should be cached locally and take 1ms..

In the big picture I think resolving is the better solution, as long as your cache is working as it should - users are never going to notice anything. And you are now getting the info from the horse's mouth so to speak.. And in the long run you can end up doing fewer queries, since you're always going to get the full ttl from the authoritative ns vs something that was cached, where you only got a partial ttl and had to do another query later, only to again get a less than full ttl. So while a forwarded query might be a few ms shorter, you're going to end up doing more queries in the long run..

To actually make a decision you would have to do some real analysis on your overall types of queries and amount of queries and the ttls you are getting back if you forward, vs resolving, etc. But normally resolving is going to be the better option. But there are always going to be one offs.. Most users don't understand how it all works, and it comes down to: I ask google for host.domain.tld and get an answer in X ms, vs I resolve it and get it in Y ms.. where X<Y the gut reaction is that forwarding is better.. When in the big picture it's prob not.

KOM

Got it.

johnpoz

I could talk about this stuff for hours and hours and hours ;) It's a bit of a hobby/passion with me - my dream job would be just dealing with dns all day.. Vs now only now and then ;) I had a cool project a while back trying to host over 3000 some domains for a major player, etc. Trying to explain to them how it's not worth it to try and do such a thing on your own - and how it's not cost effective for the bandwidth required and the equipment required, and how you can not do it from only 2 locations and provide actually good service - that it needs to be global, etc..

It was a fun project even though it came to nothing in the long run and they hosted it elsewhere - and prob cost my company money.. Hosting dns is not a business we wanted to get into, when there are majors with global anycast networks and it's just better to host with them, etc.

I will say this, I would never go back to forwarding my queries anywhere... I will run a resolver on my own thank you very much.. It gives me the control and the info to do what I want, how I want to do it vs just sending all my queries to X and trusting their responses.. But that could just be me, others are very happy just asking x.x.x.x for host.domain.tld and being happy with what they get back.. That is not what I want - and I would think most people that have taken the step to moving to pfsense vs your off the shelf soho router like that ability as well.

They can run a resolver, they can forward, they can run a full blown bind with a nice gui if they want, etc. This is one of the best things about pfsense - it gives you options!!! And the ability to use such options without having to dive into the nitty gritty of conf files..

Sorry for the rant - but I love this topic, and I am like 6 beers in already.. Stopped for a few after work with a buddy ;)

mrsunfire

@KOM said in DNS Resolver not caching correct?:

Maybe I missed something in the dozen+ posts on this topic, but why does it matter? 0ms vs 12ms is barely noticeable and it only applies to lookups.

Because more than 0ms shows that it's forwarding to root servers and not resolving from cache. That's the reason I use unbound.

@johnpoz
I can follow you. I also don't want anyone else resolving my names. I want to do as much as I can myself. That's why I'm running a home server and pfSense.

johnpoz

I doubt 12 is from roots, from the authoritative ns ok.. But if you're walking all the way down from roots in 12 ms.. Gawd damn that would be freaking quick ;)

Keep in mind that once you have looked up the NS for say .com, those are cached and you do not have to ask "." again.. You just need to ask them for the ns of domain.com.

And once the ns are cached for domain.com, I don't have to talk to the .com servers again.. just the ns for domain.com, asking for host.domain.com

So if the specific record has ttl expired, or has never been looked up before - just have to directly talk to ns for domain.com and ask for host.domain.com

My guess on the 12 ms vs 1-2ms response would be either a slow responding cache? Or it just had to talk to a close authoritative ns for domain.com.. Maybe unbound was busy, which is why it took 12 ms vs the typical 1 or 2ms? Maybe the ttl on this record is a stupid 60 seconds or something.
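The walk described above can be sketched as a toy model: given which zones already have their NS cached, which servers are left to ask? This assumes every label is its own zone cut, which is a simplification, and `servers_to_ask` is a made-up name for illustration, not how unbound is implemented.

```python
def servers_to_ask(qname, ns_cache):
    """List the zones whose name servers still have to be asked for qname.

    ns_cache is the set of zones whose NS records are already cached.
    The root "." is always known (root hints), so it is the fallback.
    """
    labels = qname.rstrip(".").split(".")
    # Zones from the root down, e.g. ".", "com.", "domain.com."
    zones = ["."] + [".".join(labels[i:]) + "."
                     for i in range(len(labels) - 1, 0, -1)]
    # Start at the deepest zone that is already cached.
    start = 0
    for i, zone in enumerate(zones):
        if zone in ns_cache:
            start = i
    return zones[start:]


print(servers_to_ask("host.domain.com", set()))                     # cold cache
print(servers_to_ask("host.domain.com", {"com."}))                  # .com NS cached
print(servers_to_ask("host.domain.com", {"com.", "domain.com."}))   # both cached
```

With both levels cached, only one round trip to the domain's own name servers is left, which is how a cache miss can still come back in something like 12 ms.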

mrsunfire

If it's from cache it's always 0ms. I sniffed the traffic to check that.

johnpoz

If you're local to the cache ok, but you're not always going to see 0 ms if you're a client on the network.. Even a local lan introduces some delay ;) Or there is some small delay with the cache answering

;; ANSWER SECTION:
www.google.com. 3346 IN A 172.217.1.36

;; Query time: 0 msec
;; SERVER: 192.168.9.253#53(192.168.9.253)
;; WHEN: Fri Aug 30 04:32:20 Central Daylight Time 2019

Next query
;; ANSWER SECTION:
www.google.com. 3344 IN A 172.217.1.36

;; Query time: 1 msec
;; SERVER: 192.168.9.253#53(192.168.9.253)
;; WHEN: Fri Aug 30 04:32:22 Central Daylight Time 2019

My point is it is possible to see a delay in the response time, even when it is served from cache.

It could be possible, even if you're local to the cache - to see a delay if the machine is busy, or unbound is busy, etc. etc. Just because you see some small amount of delay does not mean it wasn't served from cache.

If you get back anything other than the full ttl - it was served from cache.

If you're doing the query over wireless - that could also introduce delay.. Or if your path to the dns is routed/firewalled locally, etc. A better indication of served from cache or resolved would be the ttl you get back

When you're seeing this 12ms response - what was the ttl returned?

mrsunfire

I never saw more than 30% cpu usage and never more than 0ms. How can I check that better?

Where do I see the ttl? I will check that again.

johnpoz

when you do a dig, you will see the ttl

;; ANSWER SECTION:
www.google.com. 1038 IN A 172.217.1.36

See the 1038, that is the TTL returned, clearly that is not the full TTL of that record.. Nobody would set such an ODD ttl ;)

So it was clearly returned from cache. If you see a whole number, 60, 300, 1800, 3600, 86400 for example, then that was resolved and you received the full ttl from the authoritative ns. You can always check what the full ttl is by doing a query direct to one of the authoritative NS for that domain.

Mind you, I have a min ttl set of 3600 on my unbound... So if the ttl from the authoritative ns is less than 3600, unbound will use 3600.. But it will then count down from that, so if I see 3600 returned as the ttl - pretty sure it was resolved, vs from cache.. Unless on the off chance you did the query at exactly when the ttl had counted down to that value ;) So while you might see a whole number - it still could have been from cache - you just got amazingly lucky and queried exactly when say the ttl had counted down to 1800 ;)
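As a rough model of what a minimum ttl does (in unbound.conf the option is cache-min-ttl), the stored ttl is raised to the minimum and then counts down from there. `cached_ttl` below is a hypothetical helper for illustration only; a return value of 0 just means "expired" in this sketch.

```python
def cached_ttl(authoritative_ttl, seconds_since_cached, min_ttl=3600):
    """TTL a client would see for a cached record when a minimum TTL applies."""
    stored = max(authoritative_ttl, min_ttl)   # raise short ttls to the minimum
    remaining = stored - seconds_since_cached  # then count down as usual
    return max(remaining, 0)                   # 0 means the entry has expired


print(cached_ttl(60, 0))       # a 60 second record is stored with 3600
print(cached_ttl(60, 1800))    # half an hour later it has counted down
print(cached_ttl(86400, 100))  # ttls above the minimum are left alone
```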

So if your delay is something other than a couple of ms, and you have a nice whole number ttl - you can be pretty sure it was resolved, and not returned from cache.. Even if you see say 12 ms, but the ttl was like 1432 or something - you would assume that was returned to you from cache - and something else caused the delay.
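That rule of thumb could be written down roughly as follows. It is a heuristic sketch only: the set of "round" ttls and the 2 ms threshold are assumptions picked from the examples in this thread, and a more reliable check is to compare against the ttl that the authoritative ns itself hands out.

```python
# Assumed set of typical full ttls, taken from the examples in the thread.
COMMON_TTLS = {60, 300, 1800, 3600, 86400}


def guess_source(ttl, query_ms, fast_ms=2, full_ttls=COMMON_TTLS):
    """Guess whether an answer came from cache, based on ttl and query time."""
    if ttl not in full_ttls:
        return "cache"      # a ttl that is counting down came from cache
    if query_ms > fast_ms:
        return "resolved"   # full ttl plus a noticeable delay
    return "unclear"        # full-looking ttl but fast: could be a lucky hit


print(guess_source(1432, 12))   # odd ttl, so cache even with 12 ms delay
print(guess_source(3600, 139))  # whole ttl and slow, so resolved
print(guess_source(1800, 1))    # whole ttl but instant, can't tell
```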

edit:
Another stat you might be interested in is the cache hit numbers..

[2.4.4-RELEASE][admin@sg4860.local.lan]/root: unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total.num
total.num.queries=14557
total.num.queries_ip_ratelimited=0
total.num.cachehits=12593
total.num.cachemiss=1964
total.num.prefetch=2263
total.num.zero_ttl=2318
total.num.recursivereplies=1964

So you can see the total number of queries that unbound has gotten since its last restart.. And the total number of hits for the cache.. And how many misses, how many prefetches done, etc., how many returns from 0 ttl (since I have that set) etc.. If you're not seeing a large % of cache hits.. then yeah, you're doing more resolving than returning from cache.. I am pretty happy with an 86% cache hit ratio.

Means 86% of the time when a client asked for something - it got returned from cache vs having to resolve it.
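The ratio is just hits divided by hits plus misses. A small sketch using the counters from the output above (`cache_hit_ratio` is an illustrative helper name, not something unbound provides):

```python
def cache_hit_ratio(stats):
    """Fraction of queries answered from cache, from total.num.* counters."""
    hits = stats["total.num.cachehits"]
    misses = stats["total.num.cachemiss"]
    total = hits + misses
    return hits / total if total else 0.0


# Counters copied from the stats_noreset output above.
stats = {
    "total.num.queries": 14557,
    "total.num.cachehits": 12593,
    "total.num.cachemiss": 1964,
}

ratio = cache_hit_ratio(stats)
print(f"{ratio:.1%}")  # 86.5%
```

Note that the counters restart from zero whenever unbound restarts, so a low ratio shortly after a restart does not say much.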

edit: People seem to miss the whole point of the cache.. To the local client, if your record is returned from cache it's going to be a couple of ms to look up whatever.domain.tld, so what does it matter if resolving takes 100ms and just asking google takes 30ms.. Once it's cached, your client will be seeing 1ms..

In the big picture resolving can be faster and better, because when forwarding you have to ask googledns every time for something that is not in your cache, and that might be 30ms (if they have it cached).. whereas your own resolve might only take 15ms to ask the authoritative ns for the record.. All depends on where the authoritative ns is in relation to you, etc. And since you're always going to get back the full TTL, you could actually need to do fewer queries than always asking googledns..

The only time forwarding gains you anything is if they already have it cached.. If you're asking for something that is not.. Then it has to be resolved, and you just added the query time to googledns, and then waiting for them to resolve it on top of the time of your latency to them, etc. So what, you save a handful of ms here and there? Nobody is going to notice the difference between getting an answer in 30ms vs 200 ;) and that only ever comes into play if it's not already cached anyway.. So 1 of your clients might have to wait a couple of extra ms for something to be resolved, everyone else on your network will get the cached copy. And if you're doing prefetch - the common domains will be kept active with nobody ever seeing the few ms delay to actually resolve it.

If you have the ability to run your own resolver - it's just always a better option if you ask me.

here.. I resolved this locally in 139 ms

; <<>> DiG 9.14.4 <<>> www.whatever.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15212
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.whatever.com.              IN      A

;; ANSWER SECTION:
www.whatever.com.       14400   IN      CNAME   whatever.com.
whatever.com.           14400   IN      A       198.57.151.250

;; Query time: 139 msec
;; SERVER: 192.168.3.10#53(192.168.3.10)
;; WHEN: Fri Aug 30 05:49:31 Central Daylight Time 2019
;; MSG SIZE  rcvd: 75

I asked googledns for it - and took 99ms

; <<>> DiG 9.14.4 <<>> @8.8.8.8 www.whatever.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49654
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;www.whatever.com.              IN      A

;; ANSWER SECTION:
www.whatever.com.       14399   IN      CNAME   whatever.com.
whatever.com.           14399   IN      A       198.57.151.250

;; Query time: 99 msec
;; SERVER: 8.8.8.8#53(8.8.8.8)
;; WHEN: Fri Aug 30 05:50:07 Central Daylight Time 2019
;; MSG SIZE  rcvd: 75

So you think a client could ever notice 40 whole ms?? .04 of a second ;)

And that is only the first client to ask for it, after that its just served from cache.
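For comparing runs like the two above, the query time and answer ttls can be pulled out of saved dig output. A minimal sketch that assumes the output layout shown in this thread; `parse_dig` is a made-up helper name and the sample strings are trimmed copies of the two runs.

```python
import re


def parse_dig(output):
    """Extract the query time in ms and the answer-section ttls from dig output."""
    match = re.search(r";; Query time: (\d+) msec", output)
    query_ms = int(match.group(1)) if match else None
    ttls = []
    in_answer = False
    for line in output.splitlines():
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            parts = line.split()
            if len(parts) >= 5 and parts[1].isdigit():
                ttls.append(int(parts[1]))
            else:
                in_answer = False  # blank line or next section ends the answers
    return query_ms, ttls


local_run = (
    ";; ANSWER SECTION:\n"
    "www.whatever.com.       14400   IN      CNAME   whatever.com.\n"
    "whatever.com.           14400   IN      A       198.57.151.250\n"
    "\n"
    ";; Query time: 139 msec\n"
)
google_run = (
    ";; ANSWER SECTION:\n"
    "www.whatever.com.       14399   IN      CNAME   whatever.com.\n"
    "whatever.com.           14399   IN      A       198.57.151.250\n"
    "\n"
    ";; Query time: 99 msec\n"
)

local_ms, local_ttls = parse_dig(local_run)
google_ms, google_ttls = parse_dig(google_run)
print(local_ms - google_ms, "ms difference")
print(local_ttls, google_ttls)
```

The ttls tell the same story as the timings: the local resolve returned the full 14400, while the 14399 from googledns had already counted down by a second.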