Serve Expired - Clarification :)



  • Hello!

    I have some questions about the "Serve Expired" feature. It might be basic DNS knowledge, though. I have tried googling it without finding much about how exactly it is handled.


    After 2 days my DNS statistics look like this:

    unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total.num
    total.num.queries=131025
    total.num.queries_ip_ratelimited=0
    total.num.cachehits=127505
    total.num.cachemiss=3520
    total.num.prefetch=15811
    total.num.zero_ttl=15027
    total.num.recursivereplies=3520
    

    And the cache counts:

    unbound-control -c /var/unbound/unbound.conf stats_noreset |grep cache.count
    msg.cache.count=4722
    rrset.cache.count=10998
    infra.cache.count=3620
    key.cache.count=680
    

    So what I could gather from this is that more than 10% of the queries end up being answered from a DNS entry which has TTL=0. I seem to have very few cache misses, only 2.7%.
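    As a rough check of those percentages, here is a minimal Python sketch that parses the `total.num.*` counters pasted above and computes the ratios. The helper names (`parse_stats`, `percent`) are made up for illustration and are not part of Unbound; the numbers are simply the ones from the first stats dump.

```python
# Counters copied from the first stats dump in this post.
STATS_TEXT = """\
total.num.queries=131025
total.num.queries_ip_ratelimited=0
total.num.cachehits=127505
total.num.cachemiss=3520
total.num.prefetch=15811
total.num.zero_ttl=15027
total.num.recursivereplies=3520
"""


def parse_stats(text):
    """Turn 'name=value' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and value.isdigit():
            stats[name] = int(value)
    return stats


def percent(part, whole):
    """Share of `whole` taken by `part`, in percent, to one decimal place."""
    return round(100.0 * part / whole, 1)


stats = parse_stats(STATS_TEXT)
queries = stats["total.num.queries"]

hit_pct = percent(stats["total.num.cachehits"], queries)
miss_pct = percent(stats["total.num.cachemiss"], queries)
zero_ttl_pct = percent(stats["total.num.zero_ttl"], queries)

print("cache hits: ", hit_pct, "%")
print("cache miss: ", miss_pct, "%")
print("zero TTL:   ", zero_ttl_pct, "%")
```

    With these counters the zero-TTL share comes out at about 11.5% of all queries and the miss rate at about 2.7%, which matches the figures above.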

    So my questions are:
    How does Serve Expired work? Will the records stay in the DNS cache with TTL=0 forever until they get a hit again? Or will the TTL=0 entries eventually be purged by some setting?

    This is what I have found in the Unbound documentation regarding the different statistics counters:

    num.queries
    number of queries received by thread

    num.cachehits
    number of queries that were successfully answered using a cache
    lookup

    num.cachemiss
    number of queries that needed recursive processing

    num.prefetch
    number of cache prefetches performed. This number is included
    in cachehits, as the original query had the unprefetched answer
    from cache, and resulted in recursive processing, taking a slot
    in the requestlist. Not part of the recursivereplies (or the
    histogram thereof) or cachemiss, as a cache response was sent.

    num.zero_ttl
    number of replies with ttl zero, because they served an expired
    cache entry.

    num.recursivereplies
    The number of replies sent to queries that needed recursive
    processing. Could be smaller than threadX.num.cachemiss if due to
    timeouts no replies were sent for some queries.

    msg.cache.count
    The number of items (DNS replies) in the message cache.

    rrset.cache.count
    The number of RRsets in the rrset cache. This includes rrsets
    used by the messages in the message cache, but also delegation
    information.

    infra.cache.count
    The number of items in the infra cache. These are IP addresses
    with their timing and protocol support information.

    key.cache.count
    The number of items in the key cache. These are DNSSEC keys,
    one item per delegation point, and their validation status.



  • I have been monitoring this since yesterday and I cannot see the cache.count values declining at all. So it seems all the TTL=0 records stay in the cache?

    [2.4.4-RELEASE][admin@Fenix.localdomain]/root: unbound-control -c /var/unbound/unbound.conf stats_noreset | egrep 'total.num|cache.count'

    15/4-2019 10:30
    total.num.queries=138229
    total.num.queries_ip_ratelimited=0
    total.num.cachehits=134153
    total.num.cachemiss=4076
    total.num.prefetch=17233
    total.num.zero_ttl=16396
    total.num.recursivereplies=4076
    msg.cache.count=5893
    rrset.cache.count=13071
    infra.cache.count=4319
    key.cache.count=884
    
    15/4-2019  23:11
    total.num.queries=178540
    total.num.queries_ip_ratelimited=0
    total.num.cachehits=173816
    total.num.cachemiss=4724
    total.num.prefetch=23519
    total.num.zero_ttl=22422
    total.num.recursivereplies=4724
    msg.cache.count=6518
    rrset.cache.count=13949
    infra.cache.count=4848
    key.cache.count=957
    
    16/4-2019 08:11
    total.num.queries=203688
    total.num.queries_ip_ratelimited=0
    total.num.cachehits=198712
    total.num.cachemiss=4976
    total.num.prefetch=25949
    total.num.zero_ttl=24683
    total.num.recursivereplies=4976
    msg.cache.count=6774
    rrset.cache.count=14133
    infra.cache.count=5119
    key.cache.count=961
    


  • I found some more configuration entries for serve-expired, and these parameters explain it all. The TTL=0 entries will stay in the cache if these settings are not used. That is what I was looking for. :) Case closed! :)

       serve-expired-ttl: <seconds>
              Limit serving of expired responses to configured seconds after
              expiration. 0 disables the limit. This option only applies when
              serve-expired is enabled. The default is 0.
    
       serve-expired-ttl-reset: <yes or no>
              Set the TTL of expired records to the serve-expired-ttl value
              after a failed attempt to retrieve the record from upstream.
              This makes sure that the expired records will be served as long
              as there are queries for it. Default is "no".


  • I am the reason this feature was added to pfSense.

    So basically.

    The reason it was added is that on the modern internet many mainstream services use DNS to route their traffic, and because of things like maintenance, DDoS attacks and so forth, they use extremely low TTL values so they can reroute very quickly if required.

    TTL values of 30 seconds or less are now fairly common.

    As you can imagine, having to do a new DNS lookup so often has a performance hit.

    The issue with the prefetch feature is that it only triggers if a DNS lookup arrives when less than 10% of the TTL is left. So with a 30 second TTL, if you don't do another lookup within the last 3 seconds of the TTL, prefetch isn't providing you any benefit. Its operating scope is too narrow.
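    To put numbers on that window, here is a small Python sketch of the arithmetic. It assumes the 10% threshold described above; the function names are invented for illustration and this is not Unbound's actual code.

```python
def prefetch_window_seconds(ttl_seconds, percent=10):
    """Length of the final slice of the TTL in which a lookup triggers prefetch."""
    return ttl_seconds * percent / 100


def triggers_prefetch(ttl_seconds, age_seconds, percent=10):
    """True if a lookup arriving `age_seconds` after caching falls in the window.

    Assumption: the record is still valid (some TTL remains) and the
    remaining TTL is at most `percent` of the original TTL.
    """
    remaining = ttl_seconds - age_seconds
    return 0 < remaining <= prefetch_window_seconds(ttl_seconds, percent)


# A 30 second TTL leaves only a 3 second window.
print(prefetch_window_seconds(30))
# A lookup 28 seconds in (2 seconds left) lands in the window.
print(triggers_prefetch(30, 28))
# A lookup 10 seconds in (20 seconds left) does not.
print(triggers_prefetch(30, 10))
# A lookup after expiry gets nothing from prefetch.
print(triggers_prefetch(30, 31))
```

    By the same arithmetic a one hour TTL gives a six minute window, which is why prefetch helps much more with long TTLs than with short ones.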

    So Unbound implemented serve expired. When a record expires, it stays in the cache with a TTL value of 0. If another lookup comes in from the LAN (or from whatever networks your Unbound is serving), the expired record is served from the cache for performance. At the same time, Unbound initiates a new lookup to the authoritative server, so a later lookup will be served a newer record.

    So it's important to note that the same expired record isn't served forever; it's only served once, then a new one is fetched.
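    The behaviour described above can be sketched with a toy cache in Python. This is only a simplified model and not how Unbound is implemented: the `ToyCache` class, the fake upstream and the explicit `now` clock are all invented for illustration, and the refresh happens immediately instead of in the background.

```python
class ToyCache:
    """Toy model of a resolver cache with a serve-expired switch."""

    def __init__(self, upstream, serve_expired=True):
        self.upstream = upstream          # callable: name -> (value, ttl)
        self.serve_expired = serve_expired
        self.entries = {}                 # name -> (value, expires_at)

    def _refresh(self, name, now):
        """Fetch a fresh answer from upstream and store it."""
        value, ttl = self.upstream(name)
        self.entries[name] = (value, now + ttl)
        return value, ttl

    def lookup(self, name, now):
        """Return (value, ttl, how) for a client query at time `now`."""
        entry = self.entries.get(name)
        if entry is None:
            value, ttl = self._refresh(name, now)
            return value, ttl, "miss"
        value, expires_at = entry
        if now < expires_at:
            return value, expires_at - now, "hit"
        if self.serve_expired:
            # Answer with the stale record at TTL 0, and refresh the
            # cache so that the next client gets the new record.
            self._refresh(name, now)
            return value, 0, "expired"
        value, ttl = self._refresh(name, now)
        return value, ttl, "miss"


# Hypothetical upstream that hands out a new address on every fetch.
answers = iter(["192.0.2.1", "192.0.2.2", "192.0.2.3"])


def fake_upstream(name):
    return next(answers), 30


cache = ToyCache(fake_upstream)
first = cache.lookup("example.com", now=0)     # nothing cached yet
second = cache.lookup("example.com", now=10)   # still within the TTL
third = cache.lookup("example.com", now=100)   # expired, served stale
fourth = cache.lookup("example.com", now=105)  # refreshed record
for result in (first, second, third, fourth):
    print(result)
```

    In this model the third lookup is the only one that gets the old address with TTL 0; the fourth already sees the refreshed record.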

    Newer versions of Unbound allow this to be tweaked further, and the good news is that the latest stable build of pfSense has the newer version (it was updated for security). I am considering getting another commit done to take advantage of it. For example, if you are uncomfortable using a cached record that might have been sitting there for a day, you can now set an effective expiry on the cached record itself using the more granular controls available. I will see if I can get that field added to the UI as well.



  • I cannot edit my post, so I cannot fix the typos, sorry.

    But also to clarify: there is a reason this is off by default. As you can imagine, it is down to the admin whether they are OK with records being served from a cache after they have expired upstream :)

    I tried to make the description in pfSense as understandable as possible while keeping it as short as possible, so it wasn't bloating the interface.

