Serve Expired - Clarification :)
I have some questions about the "Serve Expired" feature. It might be basic DNS knowledge, but I have tried googling it without finding much about how exactly it is handled.
After 2 days my DNS statistics look like this:
unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total.num
total.num.queries=131025
total.num.queries_ip_ratelimited=0
total.num.cachehits=127505
total.num.cachemiss=3520
total.num.prefetch=15811
total.num.zero_ttl=15027
total.num.recursivereplies=3520
With Cache count:
unbound-control -c /var/unbound/unbound.conf stats_noreset | grep cache.count
msg.cache.count=4722
rrset.cache.count=10998
infra.cache.count=3620
key.cache.count=680
So what I could gather from this is that more than 10% of the queries end up being answered with a DNS entry which has TTL=0. I seem to have very few cache misses, only 2.7%.
So my questions are:
How does Serve Expired work? Will the records stay in the DNS cache with TTL=0 forever until they get a hit again? Or will the TTL=0 entries eventually be purged by some setting?
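The percentages above can be recomputed from the counters. A minimal sketch in Python, assuming the `key=value` lines printed by `stats_noreset` are pasted in as text; the helper names `parse_stats` and `ratio` are made up for illustration and are not part of Unbound:

```python
# Counters copied from the stats_noreset output quoted in this post.
STATS_TEXT = """
total.num.queries=131025
total.num.cachehits=127505
total.num.cachemiss=3520
total.num.prefetch=15811
total.num.zero_ttl=15027
"""


def parse_stats(text):
    """Turn whitespace-separated 'key=value' items into a dict of integers."""
    stats = {}
    for item in text.split():
        key, _, value = item.partition("=")
        stats[key] = int(value)
    return stats


def ratio(stats, counter):
    """Share of all received queries accounted for by one counter, in percent."""
    return 100.0 * stats[counter] / stats["total.num.queries"]


stats = parse_stats(STATS_TEXT)
for name in ("cachehits", "cachemiss", "prefetch", "zero_ttl"):
    print(f"{name}: {ratio(stats, 'total.num.' + name):.1f}%")
```

With these numbers, roughly 11.5% of the queries were answered with TTL=0 and roughly 2.7% were cache misses.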
This is what I have found in the Unbound documentation regarding the different statistics:
threadX.num.queries: number of queries received by thread
threadX.num.cachehits: number of queries that were successfully answered using a cache lookup
threadX.num.cachemiss: number of queries that needed recursive processing
threadX.num.prefetch: number of cache prefetches performed. This number is included in cachehits, as the original query had the unprefetched answer from cache, and resulted in recursive processing, taking a slot in the requestlist. Not part of the recursivereplies (or the histogram thereof) or cachemiss, as a cache response was sent.
threadX.num.zero_ttl: number of replies with ttl zero, because they served an expired cache entry.
threadX.num.recursivereplies: The number of replies sent to queries that needed recursive processing. Could be smaller than threadX.num.cachemiss if due to timeouts no replies were sent for some queries.
msg.cache.count: The number of items (DNS replies) in the message cache.
rrset.cache.count: The number of RRsets in the rrset cache. This includes rrsets used by the messages in the message cache, but also delegation information.
infra.cache.count: The number of items in the infra cache. These are IP addresses with their timing and protocol support information.
key.cache.count: The number of items in the key cache. These are DNSSEC keys, one item per delegation point, and their validation status.
I have been monitoring this since yesterday and I cannot see the cache.count values declining at all. So it seems all the TTL=0 records stay in the cache?
[2.4.4-RELEASE][admin@Fenix.localdomain]/root: unbound-control -c /var/unbound/unbound.conf stats_noreset | egrep 'total.num|cache.count'
15/4-2019 10:30
total.num.queries=138229
total.num.queries_ip_ratelimited=0
total.num.cachehits=134153
total.num.cachemiss=4076
total.num.prefetch=17233
total.num.zero_ttl=16396
total.num.recursivereplies=4076
msg.cache.count=5893
rrset.cache.count=13071
infra.cache.count=4319
key.cache.count=884

15/4-2019 23:11
total.num.queries=178540
total.num.queries_ip_ratelimited=0
total.num.cachehits=173816
total.num.cachemiss=4724
total.num.prefetch=23519
total.num.zero_ttl=22422
total.num.recursivereplies=4724
msg.cache.count=6518
rrset.cache.count=13949
infra.cache.count=4848
key.cache.count=957

16/4-2019 08:11
total.num.queries=203688
total.num.queries_ip_ratelimited=0
total.num.cachehits=198712
total.num.cachemiss=4976
total.num.prefetch=25949
total.num.zero_ttl=24683
total.num.recursivereplies=4976
msg.cache.count=6774
rrset.cache.count=14133
infra.cache.count=5119
key.cache.count=961
I found some more configuration options for serve-expired, and these parameters explain it all. The TTL=0 entries will stay in the cache if these settings are not used. That is what I was looking for. :) Case closed! :)
serve-expired-ttl: <seconds>
Limit serving of expired responses to configured seconds after expiration. 0 disables the limit. This option only applies when serve-expired is enabled. The default is 0.

serve-expired-ttl-reset: <yes or no>
Set the TTL of expired records to the serve-expired-ttl value after a failed attempt to retrieve the record from upstream. This makes sure that the expired records will be served as long as there are queries for it. Default is "no".
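To make the effect of the limit concrete, here is a small Python sketch of the rule as the quoted description reads. It is only an illustration, not Unbound's code: the function name `may_serve_expired` is made up, and whether the boundary itself is inclusive is an assumption.

```python
def may_serve_expired(seconds_since_expiry, serve_expired_ttl=0):
    """Return True if an expired record is still eligible to be served.

    serve_expired_ttl mirrors the option of the same name: 0 means
    "no limit", which is the default according to the documentation.
    """
    if serve_expired_ttl == 0:
        return True  # limit disabled: expired records stay eligible
    # Assumption: a record expired for exactly the limit is still served.
    return seconds_since_expiry <= serve_expired_ttl


# No limit configured: a record that expired a week ago is still eligible.
print(may_serve_expired(7 * 24 * 3600))
# Limit of one day, record expired two hours ago.
print(may_serve_expired(2 * 3600, serve_expired_ttl=86400))
# Limit of one day, record expired two days ago.
print(may_serve_expired(2 * 86400, serve_expired_ttl=86400))
```

With the default of 0, nothing ever becomes ineligible because of age alone, which matches the cache counts that never decline.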
I am the reason this feature was added to pfSense.
It was added because, on the modern internet, many mainstream services use DNS to route their traffic, and because of things like maintenance, DDoS attacks and so forth, they use extremely low TTL values so they can reroute very quickly if required.
TTL values of 30 seconds or less are now fairly common.
As you can imagine, having to do a new DNS lookup so often has a performance hit.
The issue with the prefetch feature is that it only works if you do a DNS lookup when less than 10% of the TTL is left. So with a 30 second TTL, if you don't do another lookup within the last 3 seconds of the TTL, prefetch isn't providing you any benefit. Its operating scope is too narrow.
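The arithmetic behind that 3 second figure, as a short Python sketch. The 10% threshold is taken from the description above; the function names and the exact handling of the window boundaries are assumptions for illustration only.

```python
def prefetch_window(ttl_seconds, fraction=0.10):
    """Return (start, end) in seconds after caching.

    A client lookup arriving inside this window finds less than
    `fraction` of the TTL remaining and would trigger a prefetch.
    """
    start = ttl_seconds * (1.0 - fraction)
    return start, float(ttl_seconds)


def triggers_prefetch(ttl_seconds, age_seconds):
    """True if a lookup at `age_seconds` after caching falls in the window."""
    start, end = prefetch_window(ttl_seconds)
    return start <= age_seconds < end


for ttl in (30, 300, 3600):
    start, end = prefetch_window(ttl)
    print(f"TTL {ttl}s: window is {end - start:.0f}s long, "
          f"from {start:.0f}s to {end:.0f}s after caching")
```

For a 30 second TTL the window is only 3 seconds long, while a one hour TTL gives 6 minutes, which is why prefetch helps much less with very short TTLs.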
So Unbound implemented serve expired. When a record expires, it stays in the cache with a TTL of 0. If another lookup comes in from the LAN (or from whatever networks your Unbound is serving), the expired record is served from the cache for performance. At the same time, Unbound initiates a new lookup to the authoritative server, so a later lookup will be served a newer record.
So it's important to note that the same expired record isn't served forever; it's served once, and then a new one is fetched.
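The behaviour described above can be mimicked with a toy cache in Python. This is a simplified model, not Unbound's implementation: `ToyCache`, `fake_upstream` and the example addresses are invented for illustration, time is passed in by hand, and the refresh happens synchronously here, whereas the real resolver answers the client first and refreshes in the background.

```python
class ToyCache:
    """Toy model of serve-expired: stale answer once, then a fresh one."""

    def __init__(self, resolve):
        self.resolve = resolve  # hypothetical upstream: name -> (value, ttl)
        self.entries = {}       # name -> (value, expires_at)

    def _refresh(self, name, now):
        value, ttl = self.resolve(name)
        self.entries[name] = (value, now + ttl)
        return value, ttl

    def lookup(self, name, now):
        """Return (value, ttl_in_reply, source) for a client query at `now`."""
        entry = self.entries.get(name)
        if entry is None:
            value, ttl = self._refresh(name, now)
            return value, ttl, "recursion"
        value, expires_at = entry
        if now < expires_at:
            return value, expires_at - now, "cache"
        # Expired: reply with the old value and TTL 0, and fetch a new
        # record so that the next query gets fresh data.
        self._refresh(name, now)
        return value, 0, "expired"


answers = iter(["192.0.2.1", "192.0.2.2"])


def fake_upstream(name):
    """Stand-in for the authoritative server; always hands out a 30s TTL."""
    return next(answers), 30


cache = ToyCache(fake_upstream)
print(cache.lookup("example.com", now=0))    # first lookup, resolved upstream
print(cache.lookup("example.com", now=10))   # still fresh, served from cache
print(cache.lookup("example.com", now=100))  # expired, old value with TTL 0
print(cache.lookup("example.com", now=101))  # refreshed record
```

The third lookup is the kind of reply that the zero_ttl counter counts; the fourth already sees the new address.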
Newer versions of Unbound allow this to be tweaked further, and the good news is that the latest stable build of pfSense has the newer version (it was updated for security). I am considering getting another commit done to take advantage of it. For example, if you are uncomfortable using a cached record that might have been sitting there for a day, you can now set an effective expiry on the cached record itself using the more granular controls. I will see if I can get that field added to the UI as well.
I cannot edit my post, so sorry for any typos.
But also to clarify, there is a reason this is off by default: as you can imagine, it is down to the admin whether they are OK with records being served from a cache after they have expired upstream :)
I tried to make the description in pfSense as understandable as possible while keeping it as short as possible, so it wasn't bloating the interface.