Major DNS Bug 23.01 with Quad9 on SSL

JonH

@joedan Thanks, I'll check it out

Gertjan

@joedan said in Major DNS Bug 23.01 with Quad9 on SSL:

Like the subject of the thread :

but arguably the same issue : 1.1.1.1 or 9.9.9.9, "what is the difference ?", I'm forwarding just to test 'if it works, or not'.
Up until today, I didn't find any issues.

Note that I'm still using

as I presume that error conditions would get logged, if they arrive.
The last log line from unbound tells me that it started a couple of days ago :

I'm going to restart unbound now, and disable address space layout randomization (ASLR), although I just can't wrap my head around this workaround: why would the position in (virtually mapped) memory matter ?
ASLR is used in every modern OS these days.
It's an extra layer of obscurity without any cost or negative side effects and, as far as I know, it only makes the life of a hacker more difficult. Attack vectors that rely on stack or memory (aka buffer) overruns become much harder, as the process uses another layout in memory every time it starts.

Btw : this is what I think. I admit I don't know shit about this ASLR executable option, and was only vaguely aware of the concept.

I also think, or thought, that a coder who writes programs doesn't need to be aware of 'where' the code, data and other segments are placed in memory. We have all been writing relocatable code for decades now without being aware of it, as the compiler and linker take care of all these things.
The unbound issue was first marked as a FreeBSD bug, and they, FreeBSD, said : go ask the unbound author. See post above.
Disabling ASLR is just a stop-gap. (edit : if this is even related to this bug, issue ... we'll see)
IMHO, the real issue is somewhere between unbound and one of its linked libraries, "libcrypto.so.111" and "libssl.so.111", as I presume that the issue arises when forwarding over TLS is used.

The default unbound mode, resolving, doesn't use TLS, so, for me, that explains why the resolver is working fine while resolving.

Anyway, not a pfSense issue, more an unbound issue or even further away, in the way all this interoperates.
The good news : it's still an issue for Netgate, and as they are very FreeBSD aware, they will find out what the real issue is.

[ end of me thinking out loud ]

stephenw10

I would love to see anyone who was hitting this issue repeatedly confirm the ASLR workaround here.

SwissSteph

@stephenw10
I'm testing right now and for the moment it's "OK" .... I just put back my DNS settings like on my 22.05 version (which was working without any problem)

SwissSteph

Gertjan

@swisssteph

You are forwarding : ok
and
using TLS - port 853 ?
Right ?

edit :
I am forwarding to these two over TLS - and most (not all) traffic actually goes over 2620:fe::fe and 2620:fe::9, the IPv6 counterparts of 9.9.9.9 and 149.112.112.112.
I did not do the ASLR patch .... I'm still waiting for it to fail.
As soon as I see the fail, I'll go patch, so I'll know what I don't want to see any more.
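
For anyone who wants to check the TLS path outside unbound, here is a rough Python sketch (mine, not something from pfSense or unbound). It builds a plain DNS query, adds the two-byte length prefix used for DNS over TCP/TLS, and sends it to a forwarder on port 853. The function names are made up for illustration, and the `dns.quad9.net` name used for certificate checking in the usage note is an assumption for the Quad9 addresses; adjust it for other providers.

```python
import socket
import ssl
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query (default: type A, class IN) in wire format."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Each label is encoded as a length byte followed by the label itself.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix the message with its length as two bytes (DNS over TCP/TLS framing)."""
    return struct.pack("!H", len(message)) + message


def _read_exact(sock, count: int) -> bytes:
    """Read exactly `count` bytes or raise if the peer closes early."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise ConnectionError("connection closed before the full reply arrived")
        data += chunk
    return data


def query_over_tls(server: str, tls_name: str, name: str, timeout: float = 5.0) -> bytes:
    """Send one query to `server` on port 853 and return the raw DNS reply."""
    context = ssl.create_default_context()  # verifies the certificate against tls_name
    with socket.create_connection((server, 853), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=tls_name) as tls:
            tls.sendall(frame_for_tcp(build_query(name)))
            (length,) = struct.unpack("!H", _read_exact(tls, 2))
            return _read_exact(tls, length)
```

Something like `query_over_tls("9.9.9.9", "dns.quad9.net", "cnn.com")` should return raw reply bytes if the forwarder is reachable over TLS. It only shows that the upstream answers on 853; it says nothing about what unbound itself is doing.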

SwissSteph

@gertjan

YES

Gertjan

@swisssteph

Close.
You mean :

The "SSL/TLS Listen Port" (your image) is the port unbound uses on the LAN side, so it listens on that port for the DNS requests emitted by the pfSense LAN clients (if you have them; Windows 10 was not capable of doing DNS over TLS, I guess Windows 11 can do it - didn't check).

SwissSteph

@gertjan Sorry

N0m0fud

@gertjan Windows 11 after a certain version supports DoT and DoH

JonH

@stephenw10 The long waits to resolve have plagued me since upgrade to 23.01-Release with python mode & TLS. For the past week+ I've been using unbound/53 with no problems. I updated unbound as soon as I saw Chris's post. For past 2 days I've been back on python mode/853 and it's working well for me. Currently using localhost w/ fallback to dot1 & quad9. Hope this was the 'fix'.

RobbieTT

@stephenw10 said in Major DNS Bug 23.01 with Quad9 on SSL:

I would love to see anyone who was hitting this issue repeatedly confirm the ASLR workaround here.

I don't know the syntax to reverse the ASLR command - anyone?

I did a crude but repeatable test - hammered a load of name servers, including my pfSense resolver which is pointing at Quad9 using DoT:

Before the ASLR hack:

1684002538158-2023-05-13-at-19.08.59-before.png

After the ASLR hack:

1684002587941-2023-05-13-at-19.16.20-after.png

Uncached minimums down from 34ms to 9ms
Uncached maximums down from 663ms to 392ms
Uncached average down from 103ms to 67ms
Uncached SD down from 159ms to 90ms

What's not to like?


[NB capturing the random 'pauses' and 'fail to loads' suffered (as described earlier) is much harder to represent]
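
To put the four figures above in proportion, a small Python sketch can turn them into percentage drops, and shows how the same summary could be computed from raw samples. This is only an illustration: the function names are mine, and I am assuming a sample standard deviation, which may not be what the benchmark tool reports.

```python
import statistics


def summarize(samples_ms):
    """Return min, max, mean and sample standard deviation of latencies in ms."""
    return {
        "min": min(samples_ms),
        "max": max(samples_ms),
        "avg": statistics.mean(samples_ms),
        # Assumption: sample SD; the benchmark tool may use population SD instead.
        "sd": statistics.stdev(samples_ms),
    }


def percent_drop(before_ms, after_ms):
    """Percentage reduction going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms * 100.0


# Uncached figures reported in the post above: (before, after) in ms.
REPORTED = {"min": (34, 9), "max": (663, 392), "avg": (103, 67), "sd": (159, 90)}

if __name__ == "__main__":
    for label, (before, after) in REPORTED.items():
        drop = percent_drop(before, after)
        print(f"{label}: {before} ms -> {after} ms ({drop:.0f}% lower)")
```

By this arithmetic the minimum drops by roughly three quarters and the other three figures by somewhere between a third and a half.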

jimp

@robbiett said in Major DNS Bug 23.01 with Quad9 on SSL:

@stephenw10 said in Major DNS Bug 23.01 with Quad9 on SSL:

I would love to see anyone who was hitting this issue repeatedly confirm the ASLR workaround here.

I don't know the syntax to reverse the ASLR command - anyone?

# elfctl /usr/local/sbin/unbound
File '/usr/local/sbin/unbound' features:
noaslr          'Disable ASLR' is unset.
[...]
# killall -9 unbound
# elfctl -e +noaslr /usr/local/sbin/unbound
# elfctl /usr/local/sbin/unbound
File '/usr/local/sbin/unbound' features:
noaslr          'Disable ASLR' is set.
[...]
# elfctl -e -noaslr /usr/local/sbin/unbound
# elfctl /usr/local/sbin/unbound
File '/usr/local/sbin/unbound' features:
noaslr          'Disable ASLR' is unset.
[...]

RobbieTT

@jimp
Thanks Jim

RobbieTT

@stephenw10

I should probably add that even with the ASLR unset I still get weird looking results when I attempt an individual DNS Lookup on a domain name that I know hasn't been cached:

2023-05-14 at 10.43.36.png

If I understand the pfSense diagnostics screen, when the internal DNS resolver has to use forwarding to answer a query I would expect a similar time to answer the query as the fastest responding name server (2620:fe::fe at 7ms in this example) plus the almost negligible processing delay from checking the cache. Yet it actually takes a snooze-worthy 168ms.

Why does the DNS resolver take 168ms for a simple forwarded (uncached) query when the forwarder itself has an answer from an upstream provider in just 7ms or, in other words, around 24 times slower than expected?
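
For anyone wanting to repeat the comparison from a LAN client, a minimal Python sketch along these lines times a lookup and works out the ratio quoted above. The function names are hypothetical, and `socket.getaddrinfo` goes through the client's own resolver settings (which may add a local cache), so treat the timings as rough.

```python
import socket
import time


def time_lookup(name, resolve=socket.getaddrinfo):
    """Return how long one lookup of `name` takes, in milliseconds."""
    start = time.perf_counter()
    resolve(name, None)  # port is not needed for a plain name lookup
    return (time.perf_counter() - start) * 1000.0


def slowdown(resolver_ms, upstream_ms):
    """How many times slower the resolver is than the upstream answer."""
    return resolver_ms / upstream_ms


if __name__ == "__main__":
    # Figures from the post above: 168 ms via the resolver, 7 ms upstream.
    print(f"{slowdown(168, 7):.0f}x slower than the upstream answer")
```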


MoonKnight

@robbiett

Have been wondering about the same for some time now. It doesn't make sense

And if you do the same lookup just seconds after the first time "The query time" is on 0.
Wait 1 minute then back to 60 msec.

I have been seeing this behavior since 23.01, and maybe on 22.05 as well.

RobbieTT

@moonknight said in Major DNS Bug 23.01 with Quad9 on SSL:

@robbiett
And if you do the same lookup just seconds after first time "The query time" is on 0.
Wait 1 minute then back to 60 msec.

I don't suffer the second part of your observation. Once my query is cached it stays cached until it is removed or reset - it obeys the settings I have given it.

If you stop the resolver for a moment and run the command:

unbound-control -c /var/unbound/unbound.conf dump_cache

...you can poke around and see what is in your cache.


MoonKnight

@robbiett
Thanks for the command, I'm going to test it later.
But I did enable "Serve Expired" and now the lookup stays on 0 msec on 2nd lookup of the same domain.

johnpoz

@moonknight problem with cnn.com is they have the TTL set to 60 seconds..

;; QUESTION SECTION:
;cnn.com.                       IN      A

;; ANSWER SECTION:
cnn.com.                60      IN      A       151.101.67.5
cnn.com.                60      IN      A       151.101.195.5
cnn.com.                60      IN      A       151.101.131.5
cnn.com.                60      IN      A       151.101.3.5

So if you forward to somewhere, the TTL you can cache is going to be something shorter than 60 seconds, could be 59, could be 2..
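
As a rough illustration of that point, this Python sketch reads the TTL out of one of the answer lines above and shows what is left once the record has already sat in the forwarder's cache for a while. The helper names are made up for the example, and the "seconds already cached upstream" value is something you cannot see directly; it is just a parameter here.

```python
def parse_answer_line(line):
    """Split a dig-style answer line into name, ttl, class, type and data."""
    name, ttl, rclass, rtype, rdata = line.split(None, 4)
    return {"name": name, "ttl": int(ttl), "class": rclass, "type": rtype, "data": rdata}


def remaining_ttl(original_ttl, seconds_cached_upstream):
    """TTL left when the forwarder hands over a record it has already cached."""
    return max(original_ttl - seconds_cached_upstream, 0)


if __name__ == "__main__":
    record = parse_answer_line("cnn.com.                60      IN      A       151.101.67.5")
    for age in (1, 58):
        print(f"cached upstream for {age}s -> TTL left: {remaining_ttl(record['ttl'], age)}s")
```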

There is no sane reason for them to have the ttl set so freaking low - other than they want lots of queries.. They charge their customers maybe by queries - that is hosted on aws dns..

;; AUTHORITY SECTION:
cnn.com.                3600    IN      NS      ns-1086.awsdns-07.org.
cnn.com.                3600    IN      NS      ns-1630.awsdns-11.co.uk.
cnn.com.                3600    IN      NS      ns-47.awsdns-05.com.
cnn.com.                3600    IN      NS      ns-576.awsdns-08.net.

So what you can do on your side is yeah allow for serving expired, and you could also set your min ttl.. I do both, have min ttl of 3600, and serve expired..
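
A toy model in Python of what those two settings change, heavily simplified and not unbound's actual logic: a minimum TTL raises the lifetime of a short-lived record, and serving expired means a stale entry still gets answered from cache. All names here are hypothetical; I believe the underlying unbound options are `cache-min-ttl` and `serve-expired`, but check what pfSense writes to unbound.conf.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    stored_at: float  # time the record was cached, in seconds
    ttl: int          # lifetime applied to the record, in seconds


def effective_ttl(record_ttl, min_ttl=0):
    """TTL used for caching once a minimum TTL is enforced."""
    return max(record_ttl, min_ttl)


def answered_from_cache(entry, now, serve_expired=False):
    """True if a lookup at time `now` can be answered without asking upstream."""
    fresh = (now - entry.stored_at) < entry.ttl
    return fresh or serve_expired


if __name__ == "__main__":
    short = CacheEntry(stored_at=0.0, ttl=effective_ttl(59))
    clamped = CacheEntry(stored_at=0.0, ttl=effective_ttl(59, min_ttl=3600))
    print(answered_from_cache(short, now=90.0))                      # expired, goes upstream
    print(answered_from_cache(short, now=90.0, serve_expired=True))  # stale answer served
    print(answered_from_cache(clamped, now=90.0))                    # still within min TTL
```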

MoonKnight

@johnpoz

Thanks for the information :)
I set "Minimum TTL for RRsets and Messages" to 3600 and seems to work :)