Serve Expired - Clarification :)

Taz79

Hello!

I have some questions about the "Serve Expired" feature. It might be basic DNS knowledge, but I have tried googling it without finding much about exactly how it is handled.

After 2 days my DNS statistics looks like this:

unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total.num
total.num.queries=131025
total.num.queries_ip_ratelimited=0
total.num.cachehits=127505
total.num.cachemiss=3520
total.num.prefetch=15811
total.num.zero_ttl=15027
total.num.recursivereplies=3520

With Cache count:

unbound-control -c /var/unbound/unbound.conf stats_noreset |grep cache.count
msg.cache.count=4722
rrset.cache.count=10998
infra.cache.count=3620
key.cache.count=680

So what I gather from this is that more than 10% of the queries (about 11.5%) end up being answered from a DNS entry which has TTL=0. I also seem to have very few cache misses: only 2.7%.
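
For anyone who wants to reproduce those percentages, here is a minimal Python sketch. It assumes the `key=value` format printed by `unbound-control stats_noreset` above; the helper names `parse_stats` and `percent` are made up for this example and are not part of Unbound.

```python
def parse_stats(text):
    """Parse 'key=value' lines, as printed by unbound-control stats, into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip lines that do not contain '='
            stats[key] = float(value)
    return stats


def percent(part, whole):
    """Return part as a percentage of whole (0.0 if whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


# Counters copied from the first stats output in this post.
SAMPLE = """\
total.num.queries=131025
total.num.cachehits=127505
total.num.cachemiss=3520
total.num.zero_ttl=15027
"""

stats = parse_stats(SAMPLE)
zero_ttl_pct = percent(stats["total.num.zero_ttl"], stats["total.num.queries"])
miss_pct = percent(stats["total.num.cachemiss"], stats["total.num.queries"])
print(f"zero_ttl: {zero_ttl_pct:.1f}%  cachemiss: {miss_pct:.1f}%")
```

In practice the text could come from the command output piped into the script instead of the hard-coded sample.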

So my questions are:
How does Serve Expired work? Will the records stay in the DNS cache with TTL=0 forever until they get a hit again? Or will the TTL=0 entries eventually be purged by some setting?

This is what i have found in the documentation from Unbound regarding the different statistic topics:

num.queries
number of queries received by thread

num.cachehits
number of queries that were successfully answered using a cache lookup

num.cachemiss
number of queries that needed recursive processing

num.prefetch
number of cache prefetches performed. This number is included
in cachehits, as the original query had the unprefetched answer
from cache, and resulted in recursive processing, taking a slot
in the requestlist. Not part of the recursivereplies (or the
histogram thereof) or cachemiss, as a cache response was sent.

num.zero_ttl
number of replies with ttl zero, because they served an expired
cache entry.

num.recursivereplies
The number of replies sent to queries that needed recursive
processing. Could be smaller than threadX.num.cachemiss if due to
timeouts no replies were sent for some queries.

msg.cache.count
The number of items (DNS replies) in the message cache.

rrset.cache.count
The number of RRsets in the rrset cache. This includes rrsets
used by the messages in the message cache, but also delegation
information.

infra.cache.count
The number of items in the infra cache. These are IP addresses
with their timing and protocol support information.

key.cache.count
The number of items in the key cache. These are DNSSEC keys,
one item per delegation point, and their validation status.

Taz79

I have been monitoring this since yesterday and I cannot see the cache.count values declining at all. So it seems all the TTL=0 records stay in the cache?

[2.4.4-RELEASE][admin@Fenix.localdomain]/root: unbound-control -c /var/unbound/unbound.conf stats_noreset | egrep 'total.num|cache.count'

15/4-2019 10:30
total.num.queries=138229
total.num.queries_ip_ratelimited=0
total.num.cachehits=134153
total.num.cachemiss=4076
total.num.prefetch=17233
total.num.zero_ttl=16396
total.num.recursivereplies=4076
msg.cache.count=5893
rrset.cache.count=13071
infra.cache.count=4319
key.cache.count=884

15/4-2019 23:11
total.num.queries=178540
total.num.queries_ip_ratelimited=0
total.num.cachehits=173816
total.num.cachemiss=4724
total.num.prefetch=23519
total.num.zero_ttl=22422
total.num.recursivereplies=4724
msg.cache.count=6518
rrset.cache.count=13949
infra.cache.count=4848
key.cache.count=957

16/4-2019 08:11
total.num.queries=203688
total.num.queries_ip_ratelimited=0
total.num.cachehits=198712
total.num.cachemiss=4976
total.num.prefetch=25949
total.num.zero_ttl=24683
total.num.recursivereplies=4976
msg.cache.count=6774
rrset.cache.count=14133
infra.cache.count=5119
key.cache.count=961
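
To make the trend between those three snapshots easier to see, here is a small Python sketch that computes the per-interval deltas. The numbers are copied from the outputs above; the short dictionary keys and the `interval_report` helper are invented for this example.

```python
# (timestamp, counters) copied from the three snapshots above.
snapshots = [
    ("15/4-2019 10:30", {"queries": 138229, "zero_ttl": 16396, "msg_cache": 5893, "rrset_cache": 13071}),
    ("15/4-2019 23:11", {"queries": 178540, "zero_ttl": 22422, "msg_cache": 6518, "rrset_cache": 13949}),
    ("16/4-2019 08:11", {"queries": 203688, "zero_ttl": 24683, "msg_cache": 6774, "rrset_cache": 14133}),
]


def interval_report(snaps):
    """Return one dict of deltas for each pair of consecutive snapshots."""
    report = []
    for (t0, a), (t1, b) in zip(snaps, snaps[1:]):
        new_queries = b["queries"] - a["queries"]
        new_zero_ttl = b["zero_ttl"] - a["zero_ttl"]
        report.append({
            "interval": f"{t0} -> {t1}",
            "new_queries": new_queries,
            "new_zero_ttl": new_zero_ttl,
            "zero_ttl_pct": 100.0 * new_zero_ttl / new_queries,
            "msg_cache_growth": b["msg_cache"] - a["msg_cache"],
            "rrset_cache_growth": b["rrset_cache"] - a["rrset_cache"],
        })
    return report


report = interval_report(snapshots)
for row in report:
    print(row)
```

Both cache counters grow in each interval, which matches the observation that nothing appears to be purged.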

Taz79

I found some more configuration options for serve-expired, and these parameters explain it all: the TTL=0 entries will stay in the cache as long as these options are not set. That is what I was looking for. :) Case closed! :)

   serve-expired-ttl: <seconds>
          Limit serving of expired responses to configured seconds after
          expiration. 0 disables the limit. This option only applies when
          serve-expired is enabled. The default is 0.

   serve-expired-ttl-reset: <yes or no>
          Set the TTL of expired records to the serve-expired-ttl value
          after a failed attempt to retrieve the record from upstream.
          This makes sure that the expired records will be served as long
          as there are queries for it. Default is "no".
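
As a rough illustration of the `serve-expired-ttl` rule quoted above, here is a toy Python check. It only mirrors the wording of the documentation (a limit in seconds after expiration, with 0 meaning no limit); it is not Unbound's code. The function name is invented, and whether the boundary is inclusive is an assumption.

```python
def may_serve_expired(seconds_since_expiry, serve_expired_ttl=0):
    """Toy model: may an expired record still be served?

    serve_expired_ttl == 0 means "no limit", matching the documented default.
    The inclusive boundary (<=) is an assumption made for this sketch.
    """
    if serve_expired_ttl == 0:
        return True
    return seconds_since_expiry <= serve_expired_ttl


# With the default of 0, a record that expired a week ago may still be served.
print(may_serve_expired(7 * 86400))
# With a one-hour limit, a record that expired 3601 seconds ago may not.
print(may_serve_expired(3601, serve_expired_ttl=3600))
```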

chrcoluk

I am the one who got this feature added to pfSense.

So basically.

The reason it was added is that on the modern internet many mainstream services use DNS to route their traffic, and because of things like maintenance, DDoS attacks and so forth, they use extremely low TTL values so they can reroute very quickly if required.

TTL values of 30 seconds or less are now fairly common.

As you can imagine, having to do a new DNS lookup so often has a performance hit.

The issue with the prefetch feature is that it only kicks in if a DNS lookup arrives when less than 10% of the TTL is left. So with a 30 second TTL, if you don't do another lookup within the last 3 seconds of the TTL, prefetch isn't providing you any benefit. Its operating scope is too narrow.
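
The arithmetic behind that 3 second window can be sketched in a few lines of Python. This is only an illustration of the 10% rule as described in this post, not Unbound's implementation; the function names are invented and the exact boundary handling is an assumption.

```python
def prefetch_window_seconds(ttl, percent=10):
    """Length of the window at the end of the TTL in which a hit triggers prefetch."""
    return ttl * percent / 100


def triggers_prefetch(ttl, age, percent=10):
    """True if a cache hit at `age` seconds falls in the last `percent` of the TTL."""
    remaining = ttl - age
    if remaining <= 0:
        return False  # already expired, so prefetch no longer applies
    # Integer arithmetic avoids floating point surprises at the boundary.
    return remaining * 100 < ttl * percent


print(prefetch_window_seconds(30))   # window length for a 30 second TTL
print(triggers_prefetch(30, 28))     # 2 seconds left: inside the window
print(triggers_prefetch(30, 20))     # 10 seconds left: outside the window
```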

So Unbound implemented serve expired. When a record expires, it stays in the cache with a TTL value of 0. If another lookup comes in from the LAN (or from whatever networks your Unbound is serving), it is answered from the cached record for performance. At the same time, however, Unbound initiates a new lookup to the authoritative server, so that the next lookup will be served a newer record.

So it's important to note that the same expired record isn't served forever; it's only served once, and then a new one is fetched.
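
The behaviour described above can be sketched as a toy cache in Python. This is a simplified model for illustration only, not how Unbound is implemented: the class name, the `resolve` callback and the explicit `now` argument are all invented here, and the refresh happens synchronously, whereas a real resolver would do it in the background.

```python
class ToyServeExpiredCache:
    """Toy model: serve an expired entry once with TTL 0, then refresh it."""

    def __init__(self, resolve):
        self.resolve = resolve  # hypothetical upstream lookup: name -> (answer, ttl)
        self.entries = {}       # name -> (answer, expires_at)

    def _refresh(self, name, now):
        answer, ttl = self.resolve(name)
        self.entries[name] = (answer, now + ttl)
        return answer, ttl

    def lookup(self, name, now):
        """Return (answer, ttl, how) where how is 'miss', 'hit' or 'expired'."""
        entry = self.entries.get(name)
        if entry is None:
            answer, ttl = self._refresh(name, now)
            return answer, ttl, "miss"
        answer, expires_at = entry
        if now < expires_at:
            return answer, expires_at - now, "hit"
        # Expired: answer from the stale entry with TTL 0, and fetch a new one.
        self._refresh(name, now)
        return answer, 0, "expired"


# Fake upstream that hands out a different address on the second lookup.
answers = iter([("192.0.2.1", 30), ("192.0.2.2", 30)])
cache = ToyServeExpiredCache(lambda name: next(answers))

r1 = cache.lookup("example.com", now=0)    # first lookup goes upstream
r2 = cache.lookup("example.com", now=10)   # answered from cache
r3 = cache.lookup("example.com", now=100)  # stale answer, TTL 0, refresh triggered
r4 = cache.lookup("example.com", now=101)  # refreshed record
for r in (r1, r2, r3, r4):
    print(r)
```

The third lookup still returns the old address with TTL 0, and the fourth already gets the new one.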

Newer versions of Unbound allow this to be tweaked further, and the good news is that the latest stable build of pfSense has the newer version (it was updated for security). I am considering getting another commit done to take advantage of it. For example, if you are uncomfortable using a cached record that might have been sitting there for a day, there is now an option to set an effective expiry on the cached record itself using the more granular controls now available. I will see if I can get that field added to the UI as well.

chrcoluk

Since I cannot edit (I cannot fix the typos sorry).

But also to clarify: there is a reason this is off by default. As you can imagine, it is down to the admin whether they are OK with records being served from the cache after they have expired upstream :)

I tried to make the description in pfSense as understandable as possible while keeping it short, so it wasn't bloating the interface.