DNS Reply Resolver vs Reply and cache

johnpoz

@jrey said in DNS Reply Resolver vs Reply and cache:

implies the "resolver" (localhost) doesn't cache results for itself -- correct ?

Huh? Yes, unbound caches.. Not really sure what you're confused about..

Yes a local host also runs its own local cache.. And will cache what it got from unbound for the length of the ttl.

So let's say a client asks unbound for www.domain.tld, and unbound resolves this from the authoritative NS.. Let's say the ttl on this record from the authoritative ns is 3600 seconds.

Client should get the IP with 3600 second ttl..

Now client 2 comes and asks unbound for the same fqdn www.domain.tld, say 60 seconds later. This client would get the response from unbound with a ttl of 3540 seconds..

client 3 comes in say 3000 seconds after unbound first looked it up, that 3rd client would get an answer with 600 seconds left on the ttl.

Now any of the clients that want to look up www.domain.tld again will just pull it from their local cache until that ttl has expired.. They would then need to go ask unbound again.
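To make the countdown concrete, here is a minimal Python sketch of a cache that hands out the remaining ttl rather than the original one. It is only a toy model of the behaviour described above, not how unbound is implemented; the `TtlCache` class and the 192.0.2.10 address are made up for illustration.

```python
class TtlCache:
    """Toy cache (hypothetical) that returns the remaining TTL of an entry."""

    def __init__(self):
        # name -> (value, absolute expiry time in seconds)
        self._entries = {}

    def store(self, name, value, ttl, now):
        self._entries[name] = (value, now + ttl)

    def lookup(self, name, now):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        remaining = expires_at - now
        if remaining <= 0:
            # Expired: drop it, the caller has to resolve again.
            del self._entries[name]
            return None
        return value, remaining


cache = TtlCache()
# The first lookup at t=0 stores the record with the authoritative ttl of 3600.
cache.store("www.domain.tld", "192.0.2.10", ttl=3600, now=0)

# Clients asking at 0, 60, 3000 and 3600 seconds after the first lookup.
for t in (0, 60, 3000, 3600):
    print(t, cache.lookup("www.domain.tld", now=t))
```

The clients at 0, 60 and 3000 seconds get 3600, 3540 and 600 seconds left, and the one at 3600 seconds gets nothing from cache, so the name has to be looked up again.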

Now some devices, especially say iot devices, don't have their own local cache, and any time the device wants to talk to www.domain.tld it will ask unbound.. But windows and linux, and pretty much any actual OS, will run its own local cache.. Browsers will also normally run their own cache, which is separate from the local machine's own cache.. For example, if you want to view what is in firefox's own cache:

about:networking#dns

unbound be it in resolver mode, or forwarder mode will cache what it looks up, and will answer from cache with whatever is left on the ttl for any client asking it for that..

If you want to see what is in the cache of unbound..

[24.03-RELEASE][admin@sg4860.home.arpa]/: unbound-control -c /var/unbound/unbound.conf dump_cache

You can also see which name servers unbound would use to look up a specific name, and what it has cached for them.

[24.03-RELEASE][admin@sg4860.home.arpa]/: unbound-control -c /var/unbound/unbound.conf lookup forum.netgate.com
The following name servers are used for lookup of forum.netgate.com.
;rrset 584 3 0 7 0
netgate.com.    584     IN      NS      ns3.netgate.com.
netgate.com.    584     IN      NS      ns2.netgate.com.
netgate.com.    584     IN      NS      ns1.netgate.com.
;rrset 584 1 0 3 0
ns1.netgate.com.        584     IN      A       208.123.73.80
;rrset 584 1 0 3 0
ns2.netgate.com.        584     IN      A       208.123.73.90
;rrset 584 1 0 3 0
ns3.netgate.com.        584     IN      A       34.197.184.5
Delegation with 3 names, of which 3 can be examined to query further addresses.
It provides 3 IP addresses.
34.197.184.5            not in infra cache.
208.123.73.90           not in infra cache.
208.123.73.80           not in infra cache.
[24.03-RELEASE][admin@sg4860.home.arpa]/:

If you want to see what unbound has in its cache for a specific record

[24.03-RELEASE][admin@sg4860.home.arpa]/: unbound-control -c /var/unbound/unbound.conf dump_cache | grep forum.netgate.com
forum.netgate.com.      484     IN      A       208.123.73.71
msg forum.netgate.com. IN A 32896 1 484 0 1 1 3 -1 
forum.netgate.com. IN A 0
[24.03-RELEASE][admin@sg4860.home.arpa]/:

If you want to see for example what windows has in its local cache

ipconfig /displaydns

Gertjan

@jrey said in DNS Reply Resolver vs Reply and cache:

in the response what is the subtle difference between
resolver vs. reply

Unbound receives a request.
If it's a host name that unbound already knows about, the answer is given right away.
For example: host overrides or static DHCP leases.
If not, the cache is checked, and if it's found and the TTL is still valid => bingo.
This is what normally happens most of the time:

Grey = hit.
Some other color : resolving took place, so this took some time.

Btw, There is always more to it.
Like this one :

The first and fourth are obvious ... and make caching even faster as expired stuff gets renewed 'if ever needed again' so the cache will grow ....

jrey

@johnpoz

what is in the local cache for both cases is
unbound-control -c /var/unbound/unbound.conf lookup sample_in_question
"The following name servers are used for lookup of ...." and it lists the upstreams

This is the same for both cases, and so is the dig.

however,

Starting with no record upstream

on localhost query, cache record is created on the upstream
delete the upstream cache
query again the subsequent response, cache record is created on the upstream

when the query is made from localhost, it does not return "cache" and always creates the record on the upstream if I delete the cache there in between the queries, i.e. it goes to the upstream every single time the request is made from localhost. It looks like this:

Screen Shot 2024-08-29 at 12.53.39 PM.png
(for clarity this made 2 trips upstream)

Now the Client Test

again starting with no record upstream

take exactly the same query and run it on a client,

netgate returns "reply" for the first one
creates upstream cache (ie it went out to get it)
Delete the upstream cache
query again: the response is "cache" for subsequent queries, with nothing being touched upstream (that is, no record added). (As the reply indicates, this is coming from the netgate "cache", or it would have created the record upstream if it had reached out?)

After the ttl expires, the next query from a client will create the cache record upstream again ("reply"), followed by "cache" for subsequent client queries, again until it expires.

Screen Shot 2024-08-29 at 1.16.01 PM.png
(this made 1 upstream trip and 2 cache)

because the localhost query will recreate the upstream cache record every time
and the client query will create it the first time, but not on subsequent queries

I might conclude the localhost queries are not cached on the Netgate as it is going upstream every time (resolver), whereas client queries are getting cache hits from the netgate (reply/cache) and not going upstream until the local expires. then the next query is reply again, followed by cache etc.

A packet capture does confirm that localhost has traffic upstream with every request, while client traffic goes upstream only on the first "reply"; subsequent client queries create no traffic upstream.

johnpoz

@jrey said in DNS Reply Resolver vs Reply and cache:

I might conclude the localhost queries are not cached on the Netgate as it is going upstream every time (resolver)

huh? What are you actually doing a query for, the name localhost? from where?

Where are you getting this info?

When you say localhost doing a query for some fqdn, are you talking about pfsense itself? Are you using the dns lookup gui page? It is always going to ask all the ns you have listed in general..

Sorry but I am having a hard time understanding what you're even asking about.. And what you're referring to when you say localhost.. Is this pfsense, or is this some other device on your network asking unbound?

jrey

@johnpoz

the client shown in both screen captures is the client IP making the request ..

most often the query is from the netgate itself, so in the case of 127.0.0.1 (resolver) that is the netgate making the request (typically, for example, when pfblocker downloads a list of files from the same source, or there is a reverse (PTR) record it has to look up)

the 192.168.0.19 again is the client going through the netgate (my workstation actually)

my testing was done by actually ssh into the netgate and doing a dig me_a_name (some external name, not internal)
the same dig as on my workstation, giving the same results but as (reply/cache)

There is nothing wrong with the reply in either case - just seems that localhost (the netgate) goes upstream for every single query it makes itself and therefore doesn't cache the result

where as the client going through the same resolver - gets the same result but caches it.

its like to the localhost (the netgate) this "the following name servers are used for lookup of" means "go upstream every time",
but when a client hits the same netgate it means "go upstream and cache the result"

ie they should either both go upstream every time, based on the "unbound-control -c /var/unbound/unbound.conf lookup sample_in_question"
and response of
"the following name servers are used for lookup of"

or they should both cache.

Clearly, by the packet capture:

when the netgate does the dig me_a_name it goes upstream every time
dig me_a_name (round trip /resolver)
dig me_a_name (round trip /resolver)

when the client going through the netgate does the same dig me_a_name
dig me_a_name (round trip/reply)
dig me_a_name (cache)
dig me_a_name (cache)

dig me_a_name is always external

when you are on the netgate the query would be from itself -> to itself via the local host address 127.0.0.1 -> follow the path upstream

queries on the netgate go upstream every time (packet capture)

when a client makes a request the only difference would be
192.168.0.x (or 19) in my sample which has the DNS of 192.168.0.1 (the netgate) ->follow the path upstream

this clearly does not - only sending the first query upstream the rest from cache.. (packet capture)

It's not the end of the world, just curious. There are no DNS issues as such; everything/everyone is getting the right / same answer (the netgate itself or a client on the netgate).

if the packet capture had either (for the same query regardless of that source )

a) traffic for every query upstream that would be ok
or
b) traffic where the first query goes upstream and the second query does not (cache), that would be ok too

But that is not the case -- we see actually a lot of "a" (all traffic by the netgate to the netgate) , and little of "b" (clients) doing this all the time.

the data is the DNS-Reply records generated by unbound / pfblocker.
in this data the "resolver" records are always the netgate doing the query to unbound
and the "reply/cache" records always the clients

it's the traffic path... to the upstream / cache and why they are different that is the curiosity.

johnpoz

@jrey maybe I am having a bad day, where are you getting cache reply from? I am having a real hard time understanding what you're concerned with.. Or why you think it's not being returned from cache?

Do you have prefetch enabled? It will in the background do a refresh for something whose ttl is almost expired, etc.

What specific log are you looking at, or where are you getting the cache reply info from, like in your screenshot?

If I do a dig on pfsense for something.. that is cached, it sure isn't having to look that up..

[24.03-RELEASE][admin@sg4860.home.arpa]/: dig forum.netgate.com

; <<>> DiG 9.18.20 <<>> forum.netgate.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47027
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;forum.netgate.com.             IN      A

;; ANSWER SECTION:
forum.netgate.com.      3586    IN      A       208.123.73.71

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1) (UDP)
;; WHEN: Thu Aug 29 15:08:38 CDT 2024
;; MSG SIZE  rcvd: 62

[24.03-RELEASE][admin@sg4860.home.arpa]/:

Do you really think unbound went out and talked to the authoritative ns and gave me back an answer in 0 ms? ie less than 1..

Please post the output of your dig command.. For example here is one where it's not from cache. I then look it up again, and you can see I got an answer in 0 ms vs the 264 ms it took the first time, when unbound had to resolve.

[24.03-RELEASE][admin@sg4860.home.arpa]/: dig www.yahoo.com

; <<>> DiG 9.18.20 <<>> www.yahoo.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40981
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.yahoo.com.                 IN      A

;; ANSWER SECTION:
www.yahoo.com.          3600    IN      CNAME   me-ycpi-cf-www.g06.yahoodns.net.
me-ycpi-cf-www.g06.yahoodns.net. 3600 IN A      69.147.65.251
me-ycpi-cf-www.g06.yahoodns.net. 3600 IN A      69.147.65.252

;; Query time: 264 msec
;; SERVER: 127.0.0.1#53(127.0.0.1) (UDP)
;; WHEN: Thu Aug 29 15:12:46 CDT 2024
;; MSG SIZE  rcvd: 119

[24.03-RELEASE][admin@sg4860.home.arpa]/: dig www.yahoo.com

; <<>> DiG 9.18.20 <<>> www.yahoo.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.yahoo.com.                 IN      A

;; ANSWER SECTION:
www.yahoo.com.          3592    IN      CNAME   me-ycpi-cf-www.g06.yahoodns.net.
me-ycpi-cf-www.g06.yahoodns.net. 3592 IN A      69.147.65.252
me-ycpi-cf-www.g06.yahoodns.net. 3592 IN A      69.147.65.251

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1) (UDP)
;; WHEN: Thu Aug 29 15:12:54 CDT 2024
;; MSG SIZE  rcvd: 119

[24.03-RELEASE][admin@sg4860.home.arpa]/:

Along with response time, you can also almost always tell when something was served from cache vs having to be resolved because the ttl is some odd number as it has been counting down.. Notice in my first query the ttl is 3600, which is what the authoritative NS has set.
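A rough Python sketch of that check, pulling the query time and answer ttls out of dig's text output and comparing two runs. The helper names are made up, the parsing only handles plain dig output like the samples in this thread, and the result is only a hint: when the upstream is very close, response time alone says little.

```python
import re


def summarize_dig(output):
    """Return (query time in ms, list of answer-section TTLs) from dig text."""
    query_time = None
    ttls = []
    in_answer = False
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith(";; ANSWER SECTION:"):
            in_answer = True
            continue
        if in_answer:
            # The answer section ends at a blank line or the next ';' line.
            if not line or line.startswith(";"):
                in_answer = False
            else:
                fields = line.split()
                if len(fields) >= 5 and fields[1].isdigit():
                    ttls.append(int(fields[1]))
                continue
        match = re.match(r";; Query time: (\d+) msec", line)
        if match:
            query_time = int(match.group(1))
    return query_time, ttls


def second_looks_cached(first, second):
    """Heuristic only: the TTL counted down and the answer was not slower."""
    (t1, ttls1), (t2, ttls2) = first, second
    if t1 is None or t2 is None or not ttls1 or not ttls2:
        return False
    return max(ttls2) < max(ttls1) and t2 <= t1


# Trimmed from the two www.yahoo.com lookups shown above.
FIRST = """\
;; ANSWER SECTION:
www.yahoo.com.          3600    IN      CNAME   me-ycpi-cf-www.g06.yahoodns.net.
me-ycpi-cf-www.g06.yahoodns.net. 3600 IN A      69.147.65.251

;; Query time: 264 msec
"""

SECOND = """\
;; ANSWER SECTION:
www.yahoo.com.          3592    IN      CNAME   me-ycpi-cf-www.g06.yahoodns.net.
me-ycpi-cf-www.g06.yahoodns.net. 3592 IN A      69.147.65.252

;; Query time: 0 msec
"""

print(summarize_dig(FIRST))
print(summarize_dig(SECOND))
print(second_looks_cached(summarize_dig(FIRST), summarize_dig(SECOND)))
```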

edit: here I started a sniff on my wan for port 53.. Did a dig for yahoo, which is cached: no outside queries. Then did one for cnn, and you can clearly see where that was resolved.

But you don't see anything for yahoo in the sniff.

You can even see the resolve process, where it asked a gtld server for the ns of cnn.com and then asked that ns and got back a cname that it followed. But nothing is listed for yahoo, because it didn't need to look up anything: it served it from cache.

It didn't have to ask roots for gtld servers for .com, because it had those cached as well.

jrey

@johnpoz said in DNS Reply Resolver vs Reply and cache:

Or why you think its not being returned from cache?

Well, because I can see the hit on the upstream DNS that I also control, sitting right in front of me (different screen), and a packet capture says it is going there and creating a cache record there (if I delete the cache record there between queries) every time I or the netgate itself queries 127.0.0.1 (itself, which is the default). You actually have to select localhost on the list for this.

but this is interesting

dig microsoft.com using ssh (this is how most packages look stuff up, pfblocker for example)

(the DNS gui page does the same thing) defaults to using 127.0.0.1 (as you mentioned)

so everything by default does in fact go there, and IMHO there is no cache going on here (round trip traffic upstream with every query it makes against itself, 127.0.0.1)

change nothing, still ssh'd into the netgate, and then query targeting the IP of the LAN interface (which is still the netgate's unbound), so
dig microsoft.com @192.168.0.1
same response from server except reply/cache (1 round trip upstream / 1 cache)

a DNS Lookup from the web page and a dig all default to 127.0.0.1 and all queries have created traffic to the upstream
Screen Shot 2024-08-29 at 4.24.18 PM.png

the same dig from the same ssh session but with the server specified as @192.168.0.1 (still the netgate's resolver)
Screen Shot 2024-08-29 at 4.25.16 PM.png
(1 upstream, 1 cache)

because of the proximity and speed of the upstreams, the response time difference is negligible.
However, a query against 127.0.0.1 returns "resolver" and IMHO does not cache (only because of the traffic in a packet capture that says it went next door for the answer with every query),
while a query to 192.168.0.1 returns reply/cache (with only 1 trip next door for the reply, and 0 trips next door on the query that logged the cache)

jrey

@johnpoz

I think the subtle difference might be that you allow your netgate to go directly to root server, where as mine are specifically named internal

johnpoz

@jrey said in DNS Reply Resolver vs Reply and cache:

where as mine are specifically named internal

Huh? again where are you seeing this??

And you're forwarding in unbound.. What are you doing a query for? microsoft - from where?? This is some client asking unbound on pfsense... Where are you seeing this log that says resolver in it???

Where is the output from your dig on pfsense?

This is not an output of dig - that is not any sort of log that I am aware of in pfsense?

Is that some pfblocker log you're looking at?? Where exactly in the gui of pfsense are you grabbing those screenshots from?

There is no difference between asking loopback or the IP unbound is listening on.

first one no cache, second one is cached.

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;microsoft.com.                 IN      A

;; ANSWER SECTION:
microsoft.com.          3600    IN      A       20.76.201.171
microsoft.com.          3600    IN      A       20.70.246.20
microsoft.com.          3600    IN      A       20.236.44.162
microsoft.com.          3600    IN      A       20.112.250.133
microsoft.com.          3600    IN      A       20.231.239.246

;; Query time: 32 msec
;; SERVER: 192.168.9.253#53(192.168.9.253) (UDP)
;; WHEN: Thu Aug 29 17:32:56 CDT 2024
;; MSG SIZE  rcvd: 122

[24.03-RELEASE][admin@sg4860.home.arpa]/: dig @192.168.9.253 microsoft.com

; <<>> DiG 9.18.20 <<>> @192.168.9.253 microsoft.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49554
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;microsoft.com.                 IN      A

;; ANSWER SECTION:
microsoft.com.          3597    IN      A       20.70.246.20
microsoft.com.          3597    IN      A       20.236.44.162
microsoft.com.          3597    IN      A       20.112.250.133
microsoft.com.          3597    IN      A       20.231.239.246
microsoft.com.          3597    IN      A       20.76.201.171

;; Query time: 0 msec
;; SERVER: 192.168.9.253#53(192.168.9.253) (UDP)
;; WHEN: Thu Aug 29 17:32:59 CDT 2024
;; MSG SIZE  rcvd: 122

unbound doesn't care if it resolves or forwards to some other ns.. Once it gets an answer and something else asks it for that same fqdn, it will return its cache entry for that name.. It's not going to go asking for it again until that cache entry has expired. Or if you have prefetch set and something asks for a record when there is only some small amount of time left on the ttl, then it will answer from cache and then go and refresh its cache in the background. You would have to look at the unbound specifics for exactly when it will refresh its cache, etc..
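The decision described here can be sketched in a few lines of Python. This is only an illustration of the idea, not unbound's code: the `answer_with_prefetch` function is made up, and the 10% threshold is an assumption for the example, so check the unbound documentation for when a prefetch is actually triggered.

```python
def answer_with_prefetch(original_ttl, age, prefetch_fraction=0.10):
    """Decide how a cached record of a given age (seconds) would be handled.

    prefetch_fraction is an assumed threshold for illustration only.
    """
    remaining = original_ttl - age
    if remaining <= 0:
        # Entry expired: nothing to serve, has to be resolved or forwarded.
        return ("resolve", None)
    if remaining <= original_ttl * prefetch_fraction:
        # Answer the client from cache now, refresh in the background.
        return ("cache+refresh", remaining)
    return ("cache", remaining)


for age in (60, 3000, 3300, 3600):
    print(age, answer_with_prefetch(3600, age))
```

With a 3600 second ttl, a query at 60 or 3000 seconds is a plain cache hit, one at 3300 seconds is answered from cache and also kicks off a refresh, and one at 3600 seconds has to go out again.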

But I am still at a loss as to where you're seeing what you're posting.. I am not aware of any log in pfsense that would show something like what you're showing..

If you enable query and reply logs in unbound in the custom options box

log-queries: yes
log-replies: yes
log-tag-queryreply: yes
log-servfail: yes
log-local-actions: yes

You get stuff like this in the log

Where I did a query on pfsense to its own address 192.168.9.253, and a query for microsoft.com from my pc at 192.168.9.100

you can see on pfsense this was no cache

[24.03-RELEASE][admin@sg4860.home.arpa]/: dig @192.168.9.253 microsoft.com

; <<>> DiG 9.18.20 <<>> @192.168.9.253 microsoft.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21138
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;microsoft.com.                 IN      A

;; ANSWER SECTION:
microsoft.com.          3600    IN      A       20.76.201.171
microsoft.com.          3600    IN      A       20.70.246.20
microsoft.com.          3600    IN      A       20.236.44.162
microsoft.com.          3600    IN      A       20.112.250.133
microsoft.com.          3600    IN      A       20.231.239.246

;; Query time: 23 msec
;; SERVER: 192.168.9.253#53(192.168.9.253) (UDP)
;; WHEN: Thu Aug 29 18:07:14 CDT 2024
;; MSG SIZE  rcvd: 122

And then when I did it from my pc, it was clearly a cached response

$ dig @192.168.9.253 microsoft.com                                       
                                                                         
; <<>> DiG 9.16.50 <<>> @192.168.9.253 microsoft.com                     
; (1 server found)                                                       
;; global options: +cmd                                                  
;; Got answer:                                                           
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40345                
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1     
                                                                         
;; OPT PSEUDOSECTION:                                                    
; EDNS: version: 0, flags:; udp: 4096                                    
;; QUESTION SECTION:                                                     
;microsoft.com.                 IN      A                                
                                                                         
;; ANSWER SECTION:                                                       
microsoft.com.          3586    IN      A       20.70.246.20             
microsoft.com.          3586    IN      A       20.236.44.162            
microsoft.com.          3586    IN      A       20.112.250.133           
microsoft.com.          3586    IN      A       20.231.239.246           
microsoft.com.          3586    IN      A       20.76.201.171            
                                                                         
;; Query time: 0 msec                                                    
;; SERVER: 192.168.9.253#53(192.168.9.253)                               
;; WHEN: Thu Aug 29 18:07:28 Central Daylight Time 2024                  
;; MSG SIZE  rcvd: 122

Notice the difference in the response time, and how the ttl was less than when I did the query on pfsense.

jrey

@johnpoz

Q. where are you seeing this??

I'm actually seeing that in Graylog, but the source of the data is the pfSense / pfblocker Unified log, sent in real time. The data is not wrong, I was just trying to clarify the distinction of those named resolver. Sample screen shot of the log file on pfSense:

Screen Shot 2024-08-29 at 6.35.57 PM.png

Q. And your forwarding in unbound.
Yes, I'm forwarding in unbound.

Q. What are you doing a query for? microsoft - from where??
A1. for microsoft.com (but it could be and is anything)
A2. in this case from the pfSense box with and without the server specified.

the response showing as resolver is the default
dig microsoft.com
the response is to and from the 127 address and does a round trip next door (server in this reply shows as 127.0.0.1)

the second screen capture with the reply/cache is simply
dig microsoft.com @192.168.0.1
The responding server is just that, 192.168.0.1, which is what clients would hit. In this case only the reply caused a query next door; the other came from cache and didn't even open the door.

Q. Where are you seeing this log that says resolver in it???
A. see first answer unless you mean something else.

Statement: there is no difference between asking loopback or the IP unbound is listening on.
A. clearly there is

unbound-control -c /var/unbound/unbound.conf lookup forum.netgate.com
The following name servers are used for lookup of....

and I said (and maybe not clearly enough) all of mine are

unbound-control -c /var/unbound/unbound.conf lookup sample_in_question
"The following name servers are used for lookup of ...." and it lists the upstreams

the list of upstreams is different from the responding servers format you are showing.

so then
What I think the definition for the original question Resolver vs. Reply/Cache is:
when you specify a forward-to (and maybe only if it is local), a query on the netgate against localhost (or 127.0.0.1) will always resolve by reaching upstream when the query is against itself on that interface, and it reports that as "resolver" (perhaps because in the unified log 127.0.0.1 simply implies "I'm going to resolve this", no questions asked about or regarding cache)

However, when you hit the same unbound on the non-local IP (so 192.168.0.1), it says: ah, here is a query, let me look that up for you; I don't have it, so go upstream, get it, cache the result. Next query: I have that in cache.

the dig structure as shown above is exactly the same except for explicit server on the 192.
no magic query or anything like that. The results are exactly the same except one says it came from 127.0.0.1 and the other from 192.168.0.1 -- queries against 127.0.0.1 in this setup are most assuredly going upstream every time (but I'm not worried about the response time; it's actually not any better or worse than for those answered on 192.168.0.1 from cache).

the only reason for the question in the first place was to determine the correct filters on the Graylog dashboard. Not really a question about the DNS query or answer. Just to confirm what the difference was. After talking to you and testing, I'm okay with the answer I think I have.

Screen Shot 2024-08-29 at 7.12.17 PM.png

As you can see, the pfSense box does a bunch of queries by and for itself on its own (stuff running on the box, be that pfblocker or pretty much anything else that runs its queries against 127.. (itself)).

the queries by and for clients are exactly that: 55.1% go upstream and 38.6% are cache.
with the pfsense queries in there the numbers are whacked.
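The dashboard filter idea boils down to leaving the 127.0.0.1 records out before working out the shares. A small Python sketch, where the (reply type, source IP) records are invented for illustration and are not the actual pfblocker log format or my real numbers:

```python
from collections import Counter

# Hypothetical, simplified records: (reply type, source IP).
records = [
    ("resolver", "127.0.0.1"),
    ("resolver", "127.0.0.1"),
    ("reply", "192.168.0.19"),
    ("cache", "192.168.0.19"),
    ("cache", "192.168.0.19"),
]


def type_shares(records, exclude_sources=()):
    """Percentage of records per reply type, skipping the excluded sources."""
    counts = Counter(kind for kind, src in records if src not in exclude_sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: round(100 * n / total, 1) for kind, n in counts.items()}


# Everything, including the box talking to itself.
print(type_shares(records))
# Clients only: drop the queries the netgate makes against 127.0.0.1.
print(type_shares(records, exclude_sources=("127.0.0.1",)))
```

With the localhost records filtered out, the reply vs. cache split reflects only what the clients are doing.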

Clearly in this setup, I'm actually ok with the stuff to and from localhost being isolated and in fact talking upstream every time.. There is no issue with the performance and none of that pfsense dns traffic actually counts against a specific client IP anyway.. That is just the various pfsense bits doing their thing looking stuff up as they need to.

Even if my definition is wrong, and therefore based solely on the observation of what goes up and by whom, I'm really ok with the way it is working.

Thanks, even though you may not realize it a couple of things you said gave me some clues of things to look at. I needed that sounding board. So much appreciated.