Major DNS Bug 23.01 with Quad9 on SSL

Gertjan

@n8rfe said in Major DNS Bug 23.01 with Quad9 on SSL:

Please ignore my post above. TLS DNS is working just fine for me on 23.01

I'm pretty sure that even @nononono will post back in the future here with a "solved" statement.

The initial subject of the thread "Major DNS Bug 23.01 with Quad9 on SSL " boils down to .... nothing.

Cylosoft

@gertjan said in Major DNS Bug 23.01 with Quad9 on SSL:

@n8rfe said in Major DNS Bug 23.01 with Quad9 on SSL:

Please ignore my post above. TLS DNS is working just fine for me on 23.01

I'm pretty sure that even @nononono will post back in the future here with a "solved" statement.

The initial subject of the thread "Major DNS Bug 23.01 with Quad9 on SSL " boils down to .... nothing.

I don't think I'd go as far as saying it's nothing. I see people having issues here, Reddit, Twitter. Settings that work in 22 don't work in 23. The defaults with forwarding don't work. With 22 you could turn on forwarding and turn on TLS and it worked. Now not so much. We have done several dozen 22 to 23 updates for customers now with different ISPs and different setups and every time we end up having to sort out DNS issues.

I think the biggest issue is we have a few customers that need to have DHCP entries included in DNS. It worked fine with 22. With 23 the restarts eventually either crash unbound entirely or it just stops responding to DNS for periods of time. We have had to set up a separate DNS server for these customers.

JeGr

@cylosoft said in Major DNS Bug 23.01 with Quad9 on SSL:

I think the biggest issue is we have a few customers that need to have DHCP entries included in DNS. It worked fine with 22. With 23 the restarts eventually either crash unbound entirely or it just stops responding to DNS for periods of time. We have had to set up a separate DNS server for these customers.

Sorry, but it has been documented for far longer than 22.x that DHCP registration with unbound is an "avoid!" situation, because unbound restarts every time a DHCP client registers its name. Either use dnsmasq (the forwarder) or don't use DHCP registration. That has been written in the docs since long before 23.01 came out: https://docs.netgate.com/pfsense/en/latest/services/dns/resolver-config.html
Besides, a fully dynamic DHCP address shouldn't ever be needed as a name at all. A client doesn't need to have dns resolution of its name. And if it does need it because it runs services under a FQDN/hostname, it should have a static/DHCP-set static IP so it won't accidentally change and thus would have a dns name, that can be read at runtime by unbound by the "static DHCP" option which is fine and doesn't cause problems.

Cylosoft

@jegr said in Major DNS Bug 23.01 with Quad9 on SSL:

@cylosoft said in Major DNS Bug 23.01 with Quad9 on SSL:

I think the biggest issue is we have a few customers that need to have DHCP entries included in DNS. It worked fine with 22. With 23 the restarts eventually either crash unbound entirely or it just stops responding to DNS for periods of time. We have had to set up a separate DNS server for these customers.

Sorry, but it has been documented for far longer than 22.x that DHCP registration with unbound is an "avoid!" situation, because unbound restarts every time a DHCP client registers its name. Either use dnsmasq (the forwarder) or don't use DHCP registration. That has been written in the docs since long before 23.01 came out: https://docs.netgate.com/pfsense/en/latest/services/dns/resolver-config.html

Regardless of that. It works with 22. It doesn't with 23. So to our customers 23 doesn't work.

Gertjan

@cylosoft said in Major DNS Bug 23.01 with Quad9 on SSL:

I see people having issues here, Reddit, Twitter.

Sure.
For me, issues can be found here : Issues exist as those can be reproduced.

@cylosoft said in Major DNS Bug 23.01 with Quad9 on SSL:

Settings that work in 22 don't work in 23

What settings ?

@cylosoft said in Major DNS Bug 23.01 with Quad9 on SSL:

The defaults with forwarding don't work. With 22 you could turn on forwarding and turn on TLS and it worked.

So this (last change July 2022) DNS Forwarder Configuration or the TLS equivalent Configuring DNS over TLS or How to Enable DNS over TLS on pfSense with Cloudflare are not correct ?
They look OK to me. I've used that page to set up forwarding to 1.1.1.1 and the IPv6 equivalent, with the cert host names.
https://1.1.1.1/help confirmed that all was OK - I was forwarding.

I'm pretty sure my ISP isn't DNS bugging on me.
My pfSense can create enough entropy (see above), so it can start many (like a lot) TCP connections over TLS, but I once found out what happens with a web server (and mail etc) if entropy isn't available any more. TLS goes down fast....
Maybe this entropy thing is a non issue. I don't know.

@cylosoft said in Major DNS Bug 23.01 with Quad9 on SSL:

that need to have DHCP entries included in DNS.

That one is known for years now.
Fast solution : DHCP Mac static lease.

Next best :

@cmcdonald said in Major DNS Bug 23.01 with Quad9 on SSL:

I'm hopeful it will make it in 23.05.

@cylosoft said in Major DNS Bug 23.01 with Quad9 on SSL:

Reddit, Twitter.

No, thanks.
I'm not a #metoo guy. I prefer the #menot.
I don't like issues for myself, neither for others.
If I find one, I'll try to find out the 'why' part.

Btw : I was posting in the scope of this thread

So we all know what our choices are asap.

JeGr

@cylosoft We have clients that use the option just fine. So is it a 23.01 problem or a "your configuration/setup" problem?
Besides - that isn't the context of this thread where it's about a forwarding DNS problem that seems to disappear when I read those responses and seems more related to problems of Quad9.

But if that's the biggest problem you see, I'd say let's open a separate thread for the DHCP-register / crash problem. Last time I looked at those reports here, most cases were related to pfBlocker+Unbound in combination though. You're running pfB in those configs, too? Just trying to analyze the context/root of the problem :)

Cheers

Cylosoft

@jegr said in Major DNS Bug 23.01 with Quad9 on SSL:

@cylosoft We have clients that use the option just fine. So is it a 23.01 problem or a "your configuration/setup" problem?
Besides - that isn't the context of this thread where it's about a forwarding DNS problem that seems to disappear when I read those responses and seems more related to problems of Quad9.

But if that's the biggest problem you see, I'd say let's open a separate thread for the DHCP-register / crash problem. Last time I looked at those reports here, most cases were related to pfBlocker+Unbound in combination though. You're running pfB in those configs, too? Just trying to analyze the context/root of the problem :)

Cheers

Yeah, I've gone down this road. I understand all the context. I've been posting on a bunch of these threads, including this one. We have "fixed" all of our customer configurations except the DHCP. I think the main issue is that the defaults don't work with forwarding and TLS. So generally speaking everyone with forwarding likely has a similar config and is now going down slightly different variations trying to find a stable setup.

We had a lot of customers with Quad9, TLS, forwarding. As in 24+ customers when we paused the v23 upgrades. We thought it was Quad9 originally. But switching to CF DNS didn't fix it. We also have some customers running Quad9 with TLS on a standalone DNS server taking most of the DNS traffic, with the PF box also doing Quad9 with TLS. The PF box will stop while the standalone DNS box has no issues, even though it uses the same IP and its traffic literally goes out through the PF box.

Turning off forwarding or disabling TLS with forwarding seems to be a fix. But for those that want forwarding and TLS we have had to disable DNSSEC, enable serve expired, and increase the message cache size. We had several customers where unbound would completely crash about once a week up until we increased message cache size.

stephenw10

Hmm, something odd there. I use DoT forwarding to Google DNS on my own firewall and I've not seen any issues.

SteveITS

@gertjan said in Major DNS Bug 23.01 with Quad9 on SSL:

What settings ?

https://forum.netgate.com/post/1091473

If you're trying to test with forwarding enabled, leave it on for a few days. It doesn't happen immediately. In the past couple of months several have said they have to turn off TLS as well, but that was not my experience, nor is that noted in the Quad9 KB... however, I do not recall where they were forwarding to.

Cylosoft

@stephenw10 said in Major DNS Bug 23.01 with Quad9 on SSL:

Hmm, something odd there. I use DoT forwarding to Google DNS on my own firewall and I've not seen any issues.

It's not easy to troubleshoot. Especially when customers have limited patience because it will essentially take the internet down for users. It doesn't show up right away. Most seem to be within a day or two. But for a few customers it would take a week. We haven't sorted out if it's volume based or if it's time based or what.

If you were doing home network volume and had 15 min outages of DNS every couple of days I'm not sure most people would notice. Spin up a docker of uptime kuma or something similar and have it generate DNS queries and see if you really are up 100%.
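For anyone who doesn't want to spin up a full monitoring stack, here is a minimal sketch of such a check using only the Python standard library. It builds a plain UDP DNS query by hand and reports whether the resolver answered in time. The resolver address `192.168.1.1`, the test name and the 30 second interval are assumptions for illustration, not values from this thread.

```python
import socket
import struct
import time

QUERY_ID = 0x1234  # arbitrary ID, only used to match the reply to our query


def build_query(name: str, qtype: int = 1) -> bytes:
    """Build a minimal DNS query (RFC 1035) for `name`; qtype 1 is an A record."""
    # Header: ID, flags (only "recursion desired" set), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", QUERY_ID, 0x0100, 1, 0, 0, 0)
    # QNAME: every label is prefixed by its length, the name ends with a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN


def probe(server: str, name: str, timeout: float = 2.0):
    """Send one UDP query to `server`; return (ok, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(4096)
    except OSError:  # covers timeouts and unreachable-network errors
        return False, time.monotonic() - start
    if len(reply) < 4:
        return False, time.monotonic() - start
    # A usable reply echoes our ID and has RCODE 0 (low 4 bits of the flags).
    reply_id, flags = struct.unpack("!HH", reply[:4])
    return reply_id == QUERY_ID and (flags & 0x000F) == 0, time.monotonic() - start


def monitor(server: str, name: str, interval: float = 30.0) -> None:
    """Probe forever and print a timestamped line for every failed query."""
    while True:
        ok, elapsed = probe(server, name)
        if not ok:
            print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} DNS FAIL after {elapsed:.2f}s")
        time.sleep(interval)
```

Calling `monitor("192.168.1.1", "example.com")` from a LAN client and leaving it running for a week should show whether short outages really happen. Querying a name that is already cached may hide upstream problems, so rotating through a few different names is probably the better test.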

We did add DNS check monitoring and more logging to a bunch of customer locations so we can try to sort things out better.

SteveITS

@steveits I'm just going to point out, if this takes a while to appear then on any router where unbound IS restarting regularly, that may hide it. At my home the pfBlocker overnight update restarts it because DNSBL is on.

joedan

Well unfortunately this is still an issue for me despite my efforts to change ISP and restore / rebuild my configuration.

Although I am not using Quad9 but Cloudflare DNS over TLS (without DNSSEC enabled).

So back to plain old port 53 forwarding to Cloudflare.

Whilst I had some quiet time I did run some dnsperf load testing with an average of 81 queries per second from an Ubuntu VM and managed to recreate what I intermittently see when this issue occurs. DNS over TLS breaks fairly quickly and completely stops DNS queries. I have to restart unbound to fix it.

Errors in the syslog are as before when it occurred randomly..

[54217:3] debug: tcp error for address 1.1.1.1 port 853
[54217:0] debug: outnettcp got tcp error -1
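To get a feel for how often these errors occur and against which upstream, the log can be tallied with a few lines of Python. This is only a sketch: the regular expression is based on the two log lines quoted here, and the log file path in the usage note is an assumption.

```python
import re
from collections import Counter

# Matches lines such as:
#   [54217:3] debug: tcp error for address 1.1.1.1 port 853
TCP_ERROR = re.compile(r"tcp error for address (\S+) port (\d+)")


def count_tcp_errors(lines):
    """Count 'tcp error for address' log lines per (address, port)."""
    counts = Counter()
    for line in lines:
        match = TCP_ERROR.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


sample = [
    "[54217:3] debug: tcp error for address 1.1.1.1 port 853",
    "[54217:0] debug: outnettcp got tcp error -1",
]
for (address, port), hits in count_tcp_errors(sample).items():
    print(f"{address} port {port}: {hits} tcp error(s)")
```

For a real log, something like `count_tcp_errors(open("/var/log/resolver.log"))` would do, assuming that is where the unbound log ends up on the box in question. Comparing the counts between a healthy period and a broken one should show whether the errors are just background noise.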

I did bump up some of the unbound cache settings, buffers and queries per thread but it still borks up quite quickly.

Running the same forwarding to Cloudflare without DNS over TLS I can run around 165 queries per second, it eventually slows down but out of the 3 or 4 times I only managed to break it once where DNS just stopped, otherwise it appeared to recover.

If there is anything I can adjust or commands run, I am happy to give it a go if this test can lead us closer to figuring out what’s going on?

Gertjan

@joedan

Interesting.

The common factor is "1.1.1.1".
'Normal' DNS uses port 53, and most of that traffic is probably UDP. Request and answer will usually each fit in just one Ethernet packet.
When using port 853, it uses DNS over TLS, so traffic can only be TCP. I wonder if just one Ethernet packet can handle the entire DNS sequence.
Still, both methods used would be a fraction of your WAN throughput.

When you visit a web site with your PC, or your mail client drops a mail into your SMTP mail server, or when it retrieves a mail, all traffic will use TLS also.
Small difference : this traffic isn't generated by pfSense, pfSense just passes the 'packets' not knowing that it is 'TLS' stuff

Did someone do this test :
Instead of having pfSense act as the forwarder and contact "1.1.1.1" on behalf of the entire LAN, what about setting up DNS (statically) on the clients so they all get their DNS from "1.1.1.1" using TLS (port 853) ? I understood that Windows 11 can do DNS over TLS (Windows 10 needs an extra program to be able to do DNS over TLS).
If the issue isn't local or 'pfSense' but somewhere on the link (ISP) or with "1.1.1.1", the issue should be the same.
If the issue goes away : that's a big finger pointing to pfSense::Unbound.
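A minimal sketch of that test from a single LAN client, assuming Python 3 and nothing but the standard library. It opens a TLS connection to port 853 itself, so unbound on pfSense is not involved at all. The TLS host name `cloudflare-dns.com` used in the usage note is an assumption to be checked against the upstream's own documentation. The sketch also shows why one packet is never enough here: a TCP handshake and a TLS handshake come first, and only then does the length-prefixed DNS message travel.

```python
import socket
import ssl
import struct


def build_query(name: str, query_id: int = 0x4242) -> bytes:
    """Minimal DNS query for an A record with 'recursion desired' set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def frame(message: bytes) -> bytes:
    """DNS over TCP/TLS prefixes each message with a 2-byte length (RFC 1035, 4.2.2)."""
    return struct.pack("!H", len(message)) + message


def read_exact(sock, count: int) -> bytes:
    """Read exactly `count` bytes; TCP may deliver a reply in several pieces."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("upstream closed the connection")
        data += chunk
    return data


def dot_query(server: str, tls_hostname: str, name: str, timeout: float = 5.0) -> int:
    """Send one DoT query straight to `server` and return the reply's RCODE."""
    context = ssl.create_default_context()  # verifies the certificate and host name
    with socket.create_connection((server, 853), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=tls_hostname) as tls:
            tls.sendall(frame(build_query(name)))
            (length,) = struct.unpack("!H", read_exact(tls, 2))
            reply = read_exact(tls, length)
    if len(reply) < 4:
        raise ValueError("reply too short to contain a DNS header")
    (flags,) = struct.unpack("!H", reply[2:4])
    return flags & 0x000F  # 0 means NOERROR
```

Something like `dot_query("1.1.1.1", "cloudflare-dns.com", "example.com")` in a loop, at the same rate that breaks forwarding through pfSense, would be the comparison: if this keeps answering while unbound stalls, the upstream and the ISP link look less likely as the cause.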

I tend to think : no way .... as my own unbound 1.17.1 (pfSense 23.01), forwarding to 1.1.1.1:853, works just fine. Maybe my LANs are too small / don't have enough devices hammering out DNS requests (a hotel, with many clients connected on the captive portal all day long) to make the issue show up ?

@steveits said in Major DNS Bug 23.01 with Quad9 on SSL:

It was working in 22.05 and earlier for me but problematic in 23.01.

Humm, I didn't test that.
I followed a long and tedious seminar years ago to learn about DNSSEC, as I wanted to make my own DNS server DNSSEC 'ready'. It was pure pain. I really had the impression that they just made the world's most used Internet service (DNS) also the most complicated one.
But it worked out : https://dnsviz.net/d/brit-hotel-fumel.fr/dnssec/
I admit, I don't know what unbound does if it is asked to do 'DNSSEC' while forwarding.
Does it first just take the DNS request, forward "what is the A of 'www.google.com'" to "1.1.1.1" and wait for an answer ? And then, in parallel (?), does it also fetch the RRSIG DNSSEC records of the dot com TLD and then the RRSIG records from the name server of google.com, to rebuild the chain from the root key (preloaded and initialized at unbound startup - it's the upper "DNSKEY alg=8, id=20326 - 2048 bits" key you can see in the DNSSEC trust chain, see my link) down to the bottom and check its validity ?
I don't know if 1.1.1.1 will honor these RRSIG DNSSEC requests.
No ... wait ....
These RRSIG requests will get sent over '53' to the corresponding zone ... they can't be sent to 1.1.1.1.
And if they were, 1.1.1.1 would surely ditch them.

What I do know : this means unbound is still resolving .... while forwarding, and that doesn't make sense.
Also : the root key "20326" is already known at unbound start, it doesn't change often (key roll over is a DNS admin's nightmare).
But when unbound contacts the TLD name server (like dot com) it will ask for the NS record of the google.com domain name server AND, at the same moment, the zone RRSIG that signs the dot com zone.
The same thing will happen when the domain name server is asked for an A record under google.com. The entire trust chain is calculated from top to bottom, and if OK, the answer, the A record, is made available to the requesting client.

So, yeah, doing DNSSEC while forwarding is like putting gasoline in a Tesla.
You would still be able to drive the Tesla, but things would become very stinky in the long run.
I even wonder where you would put the gasoline.
The slightest spark could even blow up the car, as there is no gas tank to hold it.

It doesn't make sense to use DNSSEC when forwarding, as all these RRSIG requests will totally annihilate the gain of time. As far as I understand, all DNSSEC traffic is 'non TLS' (DNSSEC isn't available over TLS).
When forwarding, you choose to trust the upstream DNS - "no matter what".

I will re activate forwarding to 1.1.1.1 over TLS (853) WITH DNSSEC enabled.
All bets are open
Will useless DNS traffic skyrocket ? Will my pfSense catch fire ?

Extra : and will unbound really do these RRSIG lookups directly against the TLDs and domain name servers involved, completely bypassing 1.1.1.1 ? I have to check that.

....
Sorry for the complicated rant.
All I know is that I don't know everything

SteveITS

@gertjan Note I’m not saying DNSSEC is supposed to work just that it didn’t have failures for me in earlier versions and does now. Netgate says
https://docs.netgate.com/pfsense/en/latest/services/dns/resolver-config.html
“DNSSEC works best when using the root servers directly, unless the forwarding servers support DNSSEC. Even if the forwarding DNS servers support DNSSEC, the response cannot be fully validated.

If upstream DNS servers do not support DNSSEC in forwarding mode or with domain overrides, DNS queries are known to be intercepted upstream, or clients have issues with large DNS responses, DNSSEC may need to be disabled.”

Quad9’s setup doc says
https://support.quad9.net/hc/en-us/articles/4433380601229-Setup-pfSense-and-DNS-over-TLS
“DNSSEC is already enforced by Quad9, and enabling DNSSEC at the forwarder level can cause false DNSSEC failures.”

All I’m saying is I’ve suggested to maybe 15-20 (a guess) people here having DNS problems after upgrading to 23.01, to disable DNSSEC. I’d guess 2/3 said it fixed their issue and 1/3 had to disable TLS also. (Note that Quad9 doc is how to set up TLS…)

And yes it seems weird any rate limiting on the remote server end would be related to upgrading pfSense, but it’s unlikely multiple DNS providers set up rate limits all at the same time. So I’m guessing probably some change in unbound is related. Which is of course not very scientific! But the tests above do seem to indicate some sort of issue.

stephenw10

@joedan said in Major DNS Bug 23.01 with Quad9 on SSL:

Errors in the syslog are as before when it occurred randomly..
[54217:3] debug: tcp error for address 1.1.1.1 port 853
[54217:0] debug: outnettcp got tcp error -1

What logging level do you have set in Unbound when you see that?

I would expect to be able to see additional logging with those sorts of errors.

joedan

@stephenw10

Unbound Logging Level 3 (although I did manage to change it to 4 very briefly but had to stop due to the amount of data). Snippet of Level 3 logging downloaded from graylog below.
Hopefully I am ingesting everything.

bmeeks

There is a closed bug report on the unbound GitHub site with this exact same TCP error here: https://github.com/NLnetLabs/unbound/issues/535. The unbound developer attributed the TCP error to the other end of the connection (meaning not the unbound side) closing the TCP socket. Maybe this is a clue ???

SteveITS

@gertjan said in Major DNS Bug 23.01 with Quad9 on SSL:

what about setting up the DNS (statically) so they all get their DNS from "1.1.1.1" using TLS (port 853)

@joedan Can you run your test against 1.1.1.1 directly, not using pfSense as the DNS server?

Quoting from @bmeeks' referenced link, "This is normal if the other server restarts for example, or maybe because it wants to manage the TCP connections that it has; possibly with timeouts for how long they can be used."

Gertjan

I've activated these again :

with :

When the log level was set to 3, I saw some

debug: tcp error for address 2606:4700:4700::1111 port 853
and
debug: tcp error for address 2606:4700:4700::1001 port 853
and
debug: tcp error for address 1.1.1.1 port 853
and
debug: tcp error for address 1.0.0.1 port 853

Also some repeating :
....
outnettcp got tcp error -1
outnettcp got tcp error -1
outnettcp got tcp error -1
outnettcp got tcp error -1
....

Recent reading makes me think these are rather harmless.

I'll leave it for the weekend.

When DNSSEC was activated, I saw a lot of DS and other DNSSEC related records were requested. DNS resolving worked just fine, though.

stephenw10

Mmm, it looks like log level 4 might be needed to see any additional logging associated with those errors.