Major DNS Bug 23.01 with Quad9 on SSL

SteveITS

@robbiett So if you remove the IPv6 (or v4) servers from the DNS (forwarding) list that cuts the time more or less in half?

RobbieTT

@steveits said in Major DNS Bug 23.01 with Quad9 on SSL:

@robbiett So if you remove the IPv6 (or v4) servers from the DNS (forwarding) list that cuts the time more or less in half?

Yep, that seems to be the case.

I'd like someone else to check my work though. I think I have a bog-standard pfSense resolver setup (albeit now with the ASLR unset) but until it is peer-reviewed by someone nothing is proven.


RobbieTT

@stephenw10 said in Major DNS Bug 23.01 with Quad9 on SSL:

Hmm, well I guess that explains why using IPv6 servers makes it more likely to hit this.

Indeed, especially if reply latency had been compounded already. Throw in multiple near-simultaneous requests, say when rendering a typical 'noisy' webpage, and you probably have to be thankful that it works at all.

The man page for Unbound does list some optional parameters that may help but are not currently used in the pfSense version, such as:

fast-server-permil: <number>
Specify how many times out of 1000 to pick from the set of fastest servers. 0 turns the feature off. A value of 900 would pick from the fastest servers 90 percent of the time, and would perform normal exploration of random servers for the remaining time. When prefetch is enabled (or serve-expired), such prefetches are not sped up, because there is no one waiting for it, and it presents a good moment to perform server exploration. The fast-server-num option can be used to specify the size of the fastest servers set. The default for fast-server-permil is 0.
fast-server-num: <number>
Set the number of servers that should be used for fast server selection. Only use the fastest specified number of servers with the fast-server-permil option, that turns this on or off. The default is to use the fastest 3 servers.

I've no direct experience with these options though and I've not found anything that suggests an option to send ipv4 and ipv6 concurrently.
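
For anyone who wants to experiment, here is a minimal and untested sketch of how those two options might be set. It assumes they are accepted inside the `server:` clause of the pfSense DNS Resolver 'Custom options' box; the value of 900 is just the example from the man page text above, not a recommendation.

```
server:
  # Illustrative values only (assumption, not tested on pfSense).
  # Pick from the fastest servers 90% of the time; the default of 0 turns this off.
  fast-server-permil: 900
  # Size of the "fastest servers" set; 3 is the documented default.
  fast-server-num: 3
```

Setting fast-server-permil back to 0 should restore the current behaviour, so it ought to be easy to back out.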

Still learning.


jimp

It would be best to split off any non-ASLR performance/tuning discussion to a new thread so this can stay relevant to the central underlying problem here.

RobbieTT

@jimp said in Major DNS Bug 23.01 with Quad9 on SSL:

It would be best to split off any non-ASLR performance/tuning discussion to a new thread so this can stay relevant to the central underlying problem here.

It's your house so happy to do whatever but my only caution is that these issues are already intertwined. ASLR became a partial fix but perhaps not the whole story on the DNS issues observed by the OP and others.


jimp

It's hard to know for sure since there are multiple discussions happening in this one thread. The original failures seem to be solved by disabling ASLR. Any slowness/performance issues where it's not acting as fast as you expect are not failures. If disabling ASLR is degrading performance (which is unlikely) that is still a separate discussion because it's still working, not failing to resolve.

RobbieTT

It's hard to know Jim, especially until we have some more verified proof.

My observations and issues were as the OP described, with things timing out, failing to load, becoming intermittent and then suddenly ok again, for no apparent reason.

Now that we are deeper in, I am positive that the ASLR change made a significant difference but was not an outright fix, especially for those running ipv6. I think we are closer to working out why cases such as mine are still hovering at the 'timing-out', 'intermittent failure' cliff-edge. The raw DNS performance is there, but it doesn't take much to tip the pfSense / Unbound combination into failure, certainly with the way things are working right now. DNS being slow can in itself cause a failure to resolve.

Latency amplification through TLS, using a slow server over a faster one and only running ipv6 look-ups when ipv4 has been completed don't appear to be ideal, even when ASLR-unset collectively moved us all a bit further back from that cliff-edge.

Again, I'm still learning as this has thrown a few surprises along the way.


jimp

It's hard to know that your situation is even the same or similar to OP's in this case. You can't properly isolate things by changing so many variables at the same time in multiple different environments and chasing all these different potential threads.

There are multiple confirmations that disabling ASLR has corrected the original reported problem behavior for people (between here and the other various reports), even on FreeBSD 13.2 directly where the only real relevant change was that ASLR was turned on by default.

Anything else you're observing is unlikely to be directly relevant to that change. There is likely room for performance improvement in your environment in various ways but it's unlikely to be the same root cause here.

RobbieTT

@jimp Ok, I'm back in my box.


jimp

If you want to keep discussing various ways to optimize the resolver, feel free to do so, just in a new thread where others can join in who maybe were not even hitting this original issue but might have other relevant observations.

RobbieTT

@jimp That's ok, I'll just drop the subject so no need for another thread.


A Former User

@johnpoz
So. Even though I understand and agree with your opinion, how do you explain that many users, including me, are still sticking with a DoT configuration? Aren't you, then, preaching in the desert?
If you could convince me to drop this setting, it would be remarkable. Otherwise, your opinion is like the saying: "everybody has an opinion, and it's like an ass****, everybody has one and it stinks". Please enlighten us further?

johnpoz

@marchand-guy said in Major DNS Bug 23.01 with Quad9 on SSL:

DoT configuration? Aren't you, then, preaching in the desert?

I couldn't care less if I'm preaching to nobody - you go ahead and send all your info to whoever you want, I have no desire ever to forward.. I will resolve thank you very much ;)

I see no point forwarding - it sure isn't hiding anything from anyone, and it has its own complications.. If you had some isp that was intercepting your dns ok.. I would then run my own vps, and then resolve from there and forward to my vps.

If you like the filtering they do - hey, you're more than welcome to trust them.. but you sure are not hiding where you're going from your isp like you think you are. Until such time as ech is everywhere (ie the sni is encrypted), since esni is dead, you're not hiding anything from your isp if they want to see it.

To each their own.. These guys are good sales folks and love to scaremonger, etc.. If you think sending all your dns to company X is in your best interest.. Have at it.. I don't really care where you send your dns, I know where I am not going to send it ;) I will just talk to the owning NSs for the domains and tlds I want to look up..

If my isp was messing with my dns, I would for starters be looking for another isp. If I was in some country where they all did it, then I would use a vpn, and that vpn wouldn't be any of these services, it would be my own vps that I run a vpn to.

jimp

If you want to discuss the merits/worth of DoT that should also be moved to a new thread. It's not relevant to solving this problem. Let's keep this on topic.

A Former User

@jimp Yessir! I'm done though.

haraldinho

There seems to be some good news:

"Jaap Akkerhuis 2023-06-01 12:41:18 UTC
A fix is developed by upstairs. There will be a new release within weeks with this fix. For the inpatients among us, a prerelease is made available https://github.com/NLnetLabs/unbound/issues/887#issuecomment-1570136710."

RobbieTT

@jimp

A potential upstream 'fix' or improvement for ASLR:

https://www.freebsd.org/security/advisories/FreeBSD-EN-23:15.sanitizer.asc

II.  Problem Description

Some of the Sanitizers cannot work correctly when ASLR is enabled. Therefore, at
the initialization of such Sanitizers, ASLR is detected via procctl(2). If ASLR
is enabled, it is first disabled, and then the main executable containing the
Sanitizer is re-executed, after printing an appropriate message.

However, the Sanitizers work by intercepting various function calls, and by
mistake the already-intercepted procctl(2) function was used. This causes an
internal error, which usually results in a segfault.

VI.  Correction details

This issue is corrected as of the corresponding Git commit hash or Subversion
revision number in the following stable and release branches:

Branch/path                             Hash                     Revision
-------------------------------------------------------------------------
stable/14/                              1e4798e9677f    stable/14-n265803
releng/14.0/                            78b4c762b20b  releng/14.0-n265381
stable/13/                              7c25a53a2cb9    stable/13-n256726
-------------------------------------------------------------------------


jimp

While we are likely to include the patch from that EN in future builds, it isn't relevant to Unbound.

They only use those sanitizers for debug/test builds and not for normal/production builds.