Major DNS Bug 23.01 with Quad9 on SSL
-
@nononono Same issue here on a NG2100 running 23.01.
Running fine with DNSSEC disabled & SSL/TLS enabled on 22.05 but I've had to disable SSL/TLS on 23.01 to avoid intermittent DNS failures with Quad9.
This is on IPv4/IPv6 with the patch applied for redmine #13851.
-
Check the output of
sockstat | grep unbound
when it works and when it doesn't. I thought they fixed it, but a while back unbound had an issue where it couldn't reuse SSL connections on the same open sockets, so in some cases they kept piling up.
EDIT: This is what I was thinking of, but it's been fixed/closed for a couple years now: https://github.com/NLnetLabs/unbound/issues/47
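To make the sockstat comparison easier to repeat, here is a rough Python sketch that counts how many unbound sockets point at port 853 (the standard DNS over TLS port). It assumes the default FreeBSD sockstat column layout (USER COMMAND PID FD PROTO LOCAL FOREIGN); the function names are my own, not anything shipped with pfSense.

```python
import subprocess

DOT_PORT = 853  # standard DNS over TLS port


def count_dot_sockets(sockstat_output, command="unbound", port=DOT_PORT):
    """Count sockets owned by `command` whose foreign address ends in :port.

    Assumes the default sockstat layout:
    USER COMMAND PID FD PROTO LOCAL-ADDRESS FOREIGN-ADDRESS
    """
    suffix = ":%d" % port
    count = 0
    for line in sockstat_output.splitlines():
        fields = line.split()
        # Skip the header, unix sockets, and lines from other processes.
        if len(fields) < 7 or fields[1] != command:
            continue
        if fields[6].endswith(suffix):
            count += 1
    return count


def read_sockstat():
    """Run sockstat on the firewall and return its raw text output."""
    result = subprocess.run(
        ["sockstat"], capture_output=True, text=True, check=True
    )
    return result.stdout


# On the firewall itself: print(count_dot_sockets(read_sockstat()))
```

If the count keeps climbing while DNS is failing and stays flat while it works, that would point toward the connection pile-up described above. If it stays flat in both cases, the cause is probably elsewhere.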
-
This bug is still a problem, for Cloudflare DNS users as well. DNS stopped working on all clients immediately after updating to 23.01 this morning (Mar 15, 2023).
After finding this thread, I UNchecked only the following setting in the DNS Resolver settings, and DNS started working again:
This is clearly a bug because the current pfSense documentation itself advises that this setting be checked:
https://docs.netgate.com/pfsense/en/latest/recipes/dns-over-tls.html
-
That document is about configuring that specific feature; it's not "advising" that the setting be checked in a general fashion for everyone.
It's working for many people and breaking for a few, but it's still not clear why.
-
@jimp Bullet number four of the previously included screenshot explicitly directs the user to check that setting. The pfSense documentation is actually more than advising. It is directing.
-
@isotope1842 said in Major DNS Bug 23.01 with Quad9 on SSL:
@jimp Bullet number four of the previously included screenshot explicitly directs the user to check that setting. The pfSense documentation is actually more than advising. It is directing.
The page you quoted is a document about configuring DNS over TLS -- it's saying to check that if you want DNS over TLS. That's what that entire document is for.
It's not a part of the general setup or DNS resolver docs and so on. It's a recipe for users who are interested in that feature and want to know how to set it up.
There is nothing saying users should be following all of those recipes, they're there for reference for things users may want to do.
-
@jimp The context in which the instructions are provided is exactly for users who want to use an upstream DNS over TLS provider. The original poster here was reporting a bug in using Quad9 and I reported the same bug when using Cloudflare.
See the top of the same linked documentation:
"Pick a DNS over TLS upstream provider, such as a private upstream DNS server or a public service like Cloudflare, Quad9, or Google public DNS."
Following the instructions prior to 23.01 worked. Immediately after upgrading to 23.01, DNS fails until unchecking a single setting.
-
Yes, and? They still work for that and for many providers and users who want to enable that feature.
But you're trying to imply this is something the docs have told everyone they should be doing which isn't true. They don't advise everyone to do it, just people who are interested in that feature.
But none of this is helpful. We still need more information about how and why it's failing. We have yet to be able to reproduce this in a lab environment, and there are plenty of us running DNS over TLS without problems.
-
@jimp Steps to reproduce the problem:
- Install pfSense 22.05 on a netgate device.
- Configure DNS over TLS with Cloudflare.
- Upgrade netgate device to pfSense 23.01.
- Observe broken DNS for downstream clients.
-
@isotope1842 said in Major DNS Bug 23.01 with Quad9 on SSL:
@jimp Steps to reproduce the problem:
- Install pfSense 22.05 on a netgate device.
- Configure DNS over TLS with Cloudflare.
- Upgrade netgate device to pfSense 23.01.
- Observe broken DNS for downstream clients.
It is not that simple.
I have multiple lab VMs using DNS over TLS to Cloudflare and Quad9 that successfully resolve and have no problems.
-
Mmm, I assume you are seeing the same intermittent behaviour as other users? It's not failing for every query with that configuration?
-
@jimp I just re-checked that single setting. DNS appears to continue to work. Curious to see whether it starts to fail again at some point.
-
@isotope1842 said in Major DNS Bug 23.01 with Quad9 on SSL:
@jimp I just re-checked that single setting. DNS appears to continue to work. Curious to see whether it starts to fail again at some point.
Since you hit it once it's likely to fail again at some point, but nobody has yet been able to pinpoint exactly when/why it happens.
I've been periodically checking my lab systems and they all just keep resolving no matter what I do. But they are lab systems so the load is considerably lower than it would be in a live environment.
-
@isotope1842 There are a few threads on this topic, or variations thereof, and in another one someone posted that their problem seemed likely to happen when opening a group/folder of bookmarks/favorites at once, implying a higher number of simultaneous requests might trigger it.
I was also unable to replicate my issue by simply (re)checking the DNSSEC option, but I left it off as recommended.
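For anyone who wants to test the "many simultaneous requests" theory from a LAN client, a minimal Python sketch follows. It fires a batch of lookups in parallel and reports which names failed. The function names, the worker count, and the host list are placeholders I made up; use names the resolver is unlikely to have cached, otherwise the queries never reach the upstream.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def default_resolve(name):
    """Resolve one name through the OS resolver (which should point at pfSense)."""
    socket.getaddrinfo(name, 443)


def burst_resolve(names, resolve=default_resolve, workers=32):
    """Resolve all names concurrently and return the names that failed."""

    def attempt(name):
        try:
            resolve(name)
            return None
        except OSError:  # socket.gaierror is a subclass of OSError
            return name

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(attempt, names))
    return [name for name in results if name is not None]


# Example (hypothetical host list):
# failed = burst_resolve(["example.com", "example.net", "example.org"])
# print(len(failed), "lookups failed:", failed)
```

This only approximates what a browser does when it opens a folder of bookmarks, so a clean run does not rule the theory out.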
-
After playing a lot more - it might be an issue with Quad9's TLS DNS limiting responses more than anything.
While there is nothing helpful in the pfSense logs, Quad9 just appears to stop replying and then start responding again - almost as if there is a limit being imposed by Quad9 on requests - but of course pfSense must have some role, as it never occurred before 23.01.
This network is a very high traffic network, so maybe others that see the same thing manage high traffic networks as well - either way, the only long term solution has been doing TLS DNS through Cloudflare.
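If anyone wants to check whether the stop/start pattern looks like a rate limit, one way is to probe a name at a fixed interval, record (timestamp, ok) for each probe, and then collapse the failures into outage windows. A small Python sketch of that last step is below; the sample format and the function name are assumptions of mine, not output from any pfSense tool.

```python
def outage_windows(samples):
    """Group consecutive failed probes into (first_failure, last_failure) windows.

    `samples` is an iterable of (timestamp, ok) pairs in time order, where
    `ok` is True when the probe query got an answer.
    """
    windows = []
    start = None
    last_fail = None
    for timestamp, ok in samples:
        if not ok:
            if start is None:
                start = timestamp  # first failure of a new window
            last_fail = timestamp
        elif start is not None:
            windows.append((start, last_fail))
            start = None
    if start is not None:
        # The log ended while probes were still failing.
        windows.append((start, last_fail))
    return windows
```

Windows of similar length that recur at similar intervals would fit the rate-limit idea; irregular ones would suggest something else is going on.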
As another point - DHCP lease registrations is definitely not fixed as claimed in 23.01, unbound still likes to reboot too much to consider enabling it - as such at the most I am still only registering the clients that need it via static mapping.
-
Hmm, that would explain why we haven't been able to replicate it in test setups without the load a production box has.
-
There were memory leaks and a segfault that were fixed. There is still ongoing work to eliminate the unbound reloads associated with lease registration events. I'm hopeful it will make it in 23.05.
-
@cmcdonald said in Major DNS Bug 23.01 with Quad9 on SSL:
There were memory leaks and a segfault that were fixed. There is still ongoing work to eliminate the unbound reloads associated with lease registration events. I'm hopeful it will make it in 23.05.
That is great to hear! Hopefully that is finally solved.
-
@nononono said in Major DNS Bug 23.01 with Quad9 on SSL:
DHCP lease registrations is definitely not fixed as claimed in 23.01
Where is that stated? What it says is a "crash" was fixed when unbound restarting
https://redmine.pfsense.org/issues/11316?#note-79
Where does it say that unbound is not going to restart on DHCP registrations? The issue with unbound restarting every time there is something going on with DHCP is OK for some users, but if you have clients renewing every few minutes then you're still going to have issues with unbound and DHCP - even if they fixed some crash that could happen.
-
@johnpoz said in Major DNS Bug 23.01 with Quad9 on SSL:
What it says is a "crash" was fixed when unbound restarting
It may be referring to https://docs.netgate.com/pfsense/en/latest/releases/23-01.html "A long-standing difficult-to-reproduce crash in Unbound...It is now safe again to enable DHCP registration alongside Unbound Python mode in pfBlockerNG." ...which is talking about pfBlocker vs DHCP, not that DHCP registration won't restart Unbound anymore. (I've seen others post about this sentence as well...)