Ideas on diagnosing intermittent DNS forward?

MrPete

We like to use filtered DNS service, and have enabled that. Until recently, no issues at all.
Long story short, I've discovered that the IPv6 DNS (at OpenDNS) intermittently times out before returning a result, with potentially serious cascading impact***.

I'm trying to track down what exactly is failing, or more to the point, where it's failing. This is proving to be quite difficult:

- If I run a frequent IPv6 DNS check, the issue goes away (pretty sure that's because it keeps the DNS server cache active).
- If I monitor DNS at a 2-hour interval, most of the time it's clear that the cache has been cleared and a full lookup is performed (results come back in a few hundred ms instead of tens of ms).
- BUT, it's also clear that various things not under my control often cause the cache(s) to be refreshed as well.
- One more complication: I can't be at all certain the issue is at OpenDNS. It could be at any hop between here and there.
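
For anyone wanting to reproduce this kind of monitoring, here is a minimal sketch of a probe that sends one AAAA query over UDP and logs whether it timed out, looked cached, or looked like a full lookup. The server address, the test name, the 100 ms cutoff and the helper names (`build_query`, `probe`, `classify`) are all assumptions for illustration, not anything from the setup described above. Substitute your own forwarder address.

```python
import socket
import struct
import time


def build_query(name, qtype=28, txid=0x1234):
    """Build a minimal DNS query packet (qtype 28 = AAAA, 1 = A)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class IN


def probe(server, name, timeout=2.0):
    """Return the round-trip time in ms, or None on timeout or network error."""
    family = socket.AF_INET6 if ":" in server else socket.AF_INET
    query = build_query(name)
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (server, 53))
            data, _ = sock.recvfrom(4096)
        except OSError:  # includes socket timeouts and unreachable networks
            return None
        if data[:2] != query[:2]:  # reply does not match our transaction id
            return None
        return (time.monotonic() - start) * 1000.0


def classify(rtt_ms, cached_below_ms=100.0):
    """Rough label: the cutoff between 'tens of ms' and 'hundreds' is a guess."""
    if rtt_ms is None:
        return "timeout"
    return "cached" if rtt_ms < cached_below_ms else "full lookup"


if __name__ == "__main__":
    SERVER = "2620:119:35::35"  # assumption: replace with your IPv6 forwarder
    NAME = "example.com"        # assumption: any name you want to test
    INTERVAL_S = 2 * 60 * 60    # the 2-hour interval mentioned above
    while True:
        rtt = probe(SERVER, NAME)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        shown = "-" if rtt is None else f"{rtt:.0f} ms"
        print(f"{stamp} {SERVER} {NAME} {classify(rtt)} {shown}", flush=True)
        time.sleep(INTERVAL_S)
```

Running the same probe against both the firewall's own resolver address and the upstream forwarder, and comparing the two logs, may help narrow down where the timeout happens.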

So my question: does anyone have experience diagnosing intermittent DNS and/or path-cache failure, resulting in failed DNS lookup? I'd love any hints on digging into this.

***Cascading Impact, probably TMI/tldr ;)
(Please don't let the following distract from the real purpose of this question!)

In general, we tend to assume that most computer systems either are working, or are failed... whether DNS, databases, Internet links, etc.
Based on that assumption, we then build other security and health checks.
But if the first assumption is false -- such as due to intermittency -- then the whole structure can become a house of cards.

Example: email security / authentication

A variety of checks are commonly built into email servers that receive connections from the outside world, including:
- Is the claimed "HELO" FQDN valid?
- Is the claimed "From" FQDN valid?
- Is the connecting IP address authorized to send this message?
- Is there a valid rDNS for the connecting IP?
- (Stricter, often fails: does the rDNS name match the HELO name?)
DNS failures in any/all of these traditionally cause a temporary failure. Not an issue.
HOWEVER, today we have intense spam activity, so systems like exponential fail2ban ensure that bad actors don't overwhelm our systems.
The result: certain DNS failures produce identical log entries for bad actors, and for good citizens that happen to have hit an intermittent DNS failure.
And unfortunately, it is quite possible for such a failure (or a pair of them) to produce a ban.
Result: good email gets blocked for a while, sometimes quite a while.
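
To make the distinction concrete, here is a small sketch (not how any particular mail server or fail2ban filter actually works) that separates "the name does not exist" from "the lookup itself failed". The function name `lookup_verdict` and the three labels are made up for illustration. The idea is that only an "invalid" verdict would be logged in a way that can count toward a ban.

```python
import socket

# Resolver error codes treated as "try again later" rather than as a verdict.
TEMPORARY_CODES = {socket.EAI_AGAIN}


def lookup_verdict(hostname, resolve=socket.getaddrinfo):
    """Return 'valid', 'invalid', or 'unknown' for a claimed hostname."""
    try:
        resolve(hostname, None)
        return "valid"
    except socket.gaierror as exc:
        # The first argument of gaierror is the EAI_* error code.
        if exc.args and exc.args[0] in TEMPORARY_CODES:
            return "unknown"   # temporary DNS failure: no evidence either way
        return "invalid"       # the resolver gave a definite negative answer
    except OSError:
        return "unknown"       # other network trouble: also not a verdict
```

The `resolve` parameter only exists so the function can be exercised without touching the network.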

SteveITS

@mrpete If you are on 23.01 there are a few long threads about DNS. Short version, if you are forwarding in the DNS Resolver settings:

- disable DNSSEC (per Quad9 it can cause failures while forwarding)
- if that doesn't work, disable DNS over TLS

If you're not forwarding: if DHCP lease registration is enabled, it will restart Unbound whenever a lease renews.

Gertjan

@mrpete

I have the resolver doing its resolver thing. Seems to work pretty well.

The "Unbound memory usage" stats show the Unbound restarts/reloads pretty clearly.
Mine restarts at 00:15, as I told it to:

Firewall>pfBlockerNG :

Check yours :

grep 'start' /var/log/resolver.log

because frequent restarts mean a small DNS outage every time.
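
To see when the restarts cluster, the same grep can be turned into a per-hour count. This is only a sketch: it assumes each log line carries an HH:MM:SS timestamp, and the helper name `restarts_per_hour` is made up here. Your log format may differ.

```python
import re
from collections import Counter

# First HH:MM:SS found on the line; the log format is an assumption.
TIME_RE = re.compile(r"\b(\d{2}):\d{2}:\d{2}\b")


def restarts_per_hour(lines, needle="start"):
    """Count lines containing `needle`, grouped by hour of day."""
    counts = Counter()
    for line in lines:
        if needle in line:
            match = TIME_RE.search(line)
            counts[match.group(1) if match else "??"] += 1
    return counts


if __name__ == "__main__":
    with open("/var/log/resolver.log", errors="replace") as log:
        for hour, count in sorted(restarts_per_hour(log).items()):
            print(f"{hour}h: {count} restart(s)")
```

One restart a day at a scheduled time is expected. Many restarts spread over the day would point at something else (DHCP registration, for example) bouncing Unbound.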

MrPete

@gertjan and @SteveITS -- Good hints.
Definitely am forwarding in the resolver settings.

I did play with DNSSEC briefly, without success.
That Quad9 link only claims issues if we don't have ipv6. Ours is working fine...

BUT: that brings up a good question. If something happens to our ipv6, we'd be in some trouble, correct?

Seems like there ought to be frequent checking and disabling of sensitive services if something like that happens.
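
Something like the following is what I have in mind, as a rough sketch only: a gate that flips to "unhealthy" after several consecutive failed checks and back to "healthy" after a couple of good ones, so a single blip doesn't disable anything. The class name `HealthGate` and the thresholds are invented for illustration. What actually gets disabled is left out.

```python
class HealthGate:
    """Track consecutive probe results with simple hysteresis."""

    def __init__(self, fail_after=3, recover_after=2):
        self.fail_after = fail_after        # consecutive failures before tripping
        self.recover_after = recover_after  # consecutive successes before recovery
        self.healthy = True
        self._fails = 0
        self._oks = 0

    def record(self, ok):
        """Record one probe result and return the current healthy state."""
        if ok:
            self._fails = 0
            self._oks += 1
            if not self.healthy and self._oks >= self.recover_after:
                self.healthy = True
        else:
            self._oks = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fail_after:
                self.healthy = False
        return self.healthy
```

Each periodic IPv6 check would call `record(True)` or `record(False)`, and sensitive services would only be switched off while `healthy` is false.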

SteveITS

@mrpete the part further down:

Disable "Enable DNSSEC Support" if enabled.
DNSSEC is already enforced by Quad9, and enabling DNSSEC at the forwarder level can cause false DNSSEC failures.