Slow DNS after 22.05

Jax

@pcol-it-admin said in Slow DNS after 22.05:

I think many of the folks reporting this issue had said that they had "stock" pfSense DNS resolver settings, as I know I did. The only thing I did was to upgrade pfSense to 22.05, and I went from being problem-free in DNS resolving (for years), to having a problem.

That was precisely my experience: commercial user, Netgate 2100 ARM device, no DNS mods on my part, 18 months trouble-free, suddenly broken on 22.05 upgrade.

BTW ... have we ruled out ARM-specificity for this problem?

johnpoz

@pcol-it-admin said in Slow DNS after 22.05:

said that they had "stock" pfSense DNS resolver settings

I find this is rarely the case, to be honest. You also don't know what device they are on: ARM may be part of the issue, or the problem may be more pronounced on ARM.

And the issue with "stock" settings, if you will, could be DNSSEC when the user changed to forwarding; if you forward, that should really be off. Lots of users love to use TLS forwarding along with leaving DNSSEC on, which is again problematic if you ask me. But they didn't change anything else, so to them it's "stock".

Might be they have no IPv6 at all, or maybe their IPv6 is solid, or the user only thinks they don't have it. Maybe their client is not using it, but the pfSense WAN has it and unbound is using it as transport for queries.

Then we just get users saying they get some error in their browsers - for all we know "stock" in most browsers these days is using doh, and not even using local dns.

They might have "stock" but are using pfBlocker and loading huge lists of blocks via DNSBL. And they are also, with "stock", registering DHCP leases. And now unbound is restarting every 10 minutes or something, and that can present as a problem with DNS. That has been a "stock" problem for a while, but it really isn't an issue per se with unbound at all.

Part of the problem with any sort of DNS issue is that, to be honest, many users don't actually understand how it works, or at least not at a level that allows for detailed troubleshooting. So it's difficult to get actual details of what is going on, versus them saying their browser gives some error and they have to refresh the page. They don't know how to troubleshoot it, etc.

Like I said, I have had zero issues with 22.05 and DNS, and the same goes for others in this thread. So what is different with our setups, or our connections? Clearly it's not a base problem or everyone would be having the issue.

I have had it running now for a bit able to do ipv6 transport queries - and haven't noticed any issues. And I don't show any timeouts in the resolver status page..

I was meaning to setup a 3100 I have here (arm) and run through some base dnsperf test in a loop to see if an issue would present itself to help pinpoint where the issue is.

Currently the most likely issue is the ipv6 transport, but this could have other factors that exacerbate the problem for some users.

I do not see any currently reported related bugs for 22.05 and unbound in redmine; there is something about FQDNs having issues in aliases.

Another variable is a VPN connection; users love to use VPNs on pfSense and force all traffic through them. What is required to try and pin it down is users willing to actually provide details of what they are seeing: is unbound restarting, are they using IPv6? A few have chimed in as of late with good details. In those cases all local stuff was working, so it wasn't like unbound was crashing or hung up completely, etc.

Jax

@johnpoz said in Slow DNS after 22.05:

I was meaning to setup a 3100 I have here (arm) and run through some base dnsperf test in a loop to see if an issue would present itself to help pinpoint where the issue is.

I'm glad to hear that. As a paying user who wishes to have confidence in the system, it would be reassuring to me if a committer (I take you to be a committer) would take the users seriously and do some forensics, as opposed to dismissing multiple reports and trying to ascribe the problem to user error.

Mikymike82

@johnpoz First of all thanks for the effort.
As I did a complete re-install with no back-up restore and only full manual configuration, I can assure you that I have not changed anything on the DNS side of things.
No packages installed (for now)
Running intel atom c3758 (so no ARM), actual same hardware as the SG-7100.
Vlan config for internet access with a pppoe auth
And some port forwarding rules.

Gertjan

@johnpoz said in Slow DNS after 22.05:

I have currently changed my setting to allow access via my HE ipv6 tunnel, and lets see if that causes an issue.

I'm using he.net for many years now. It's my main IPv6 access, as my ISP doesn't have good IPv6 support (bad IPV6 really messes up things). I'm posting on this forum, using the he.net IPv6.
25 % of my incoming & outgoing traffic is IPv6 over he.net. Same thing for DNS related traffic.
It's a bit slow, as I'm limited to the he.net tunnel access point in Paris.
No other issues.

RTT is a bit slow, as I use my server in a data centre near Paris as the dpinger monitoring point.

edit :

@mikymike82 said in Slow DNS after 22.05:

Talking about the so-called thousands of other users that are not experiencing this problem is not helping in resolving this clear issue in version 22.05, as was clearly stated that even with a clean install I can clearly replicate the issue when upgrading to 22.05 from 22.01

I agree.
An approach could be :
Let's enumerate all common - and not-common settings of all those thousands of 22.05 users.
And then compare these findings with yours.
Comparing would be even easier yet if there were users, or several users, using the same ISP - same uplink connection type, and even the same area as you have.

You get it : that's hard to do.

The easy thing would be : it's up to you to tell/show/mention what is different with your location/setup/hardware/uplink connection.
But again, without you really knowing what the other 'thousands' are using.

A test that might shed some more light :
When you re install, you have to login a first time with the admin user, and pfsense default password.
What about this set up :
Do not change the default LAN - keep DHCP etc.
Do not change WAN, keep it on DHCP-client mode (can you ?)
No other changes, do not use the keyboard anymore.
Do not import settings.
No packages.
Nothing.
Just the plain vanilla default Netgate initial config - with one LAN and one WAN assigned.

I understand, this setup might not be really useful for you. It's just for testing.

If DNS fails at this moment, we will all know what's not the issue.
As settings are equal.
Hardware is equal.
LAN side is equal.
I presume that the device you use is a PC with default network settings (== DHCP client).

The only difference will be : your uplink (ISP or what ever you use as a connection).
For example, I would understand that if you said "I use Starlink" then that would explain a lot, as default settings won't be good for such a connection (I'm just guessing).

Also, if you use an ARM device : do you have a small Intel desktop PC in a corner ? Add an extra $5 NIC (no Realtek, please) - slide in an empty, small SSD, throw pfSense at it and retest.
Issues are still the same ? Then Intel <> ARM goes out of the window. You most probably have an uplink / WAN side issue.

Use the mentioned do-ip6: no unbound option.
Remove this check (the "Allow IPv6" setting) :

Remove (disable) IPv6 from your test LAN device.
Now you have a close-to-IPv4-only network.
Retest.

Kempain

@lohphat said in Slow DNS after 22.05:

What is the current tally that disabling IPv6 was implicated in resolving the issue?

Just me so far but it doesn't look like anyone else has tried it yet and re-tested.
Still resolved for me. It honestly made a dramatic difference.

For those experiencing it if you can disable IPv6 then I'd follow the steps below:

1. Run 'unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total' and record your recursion times (they will likely be high if you're experiencing the same issue as me):

total.recursion.time.avg=0.079624
total.recursion.time.median=0.0387577

2. Disable IPv6: Services - DNS Resolver - General Settings. Add the below to Custom Options:

server:
do-ip6: no

3. Run the command in 1. again and hopefully your recursion results will be much improved.

This has resolved all DNS issues for me and I haven't had any issues since.
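
For anyone who wants to compare the before and after numbers without eyeballing them, here is a minimal shell sketch. It assumes the stats output contains a `total.recursion.time.avg=<seconds>` line like the one shown in the steps, and that it is run under /bin/sh. The function names `recursion_avg` and `compare_avg` are made up for this example and are not part of unbound or pfSense.

```shell
#!/bin/sh
# Print the total.recursion.time.avg value (seconds) from
# unbound-control stats output read on stdin.
recursion_avg() {
    awk -F= '$1 == "total.recursion.time.avg" { print $2 }'
}

# Compare two averages in seconds: $1 = before, $2 = after.
# Prints "improved", "worse" or "unchanged".
compare_avg() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        if (after + 0 < before + 0) print "improved"
        else if (after + 0 > before + 0) print "worse"
        else print "unchanged"
    }'
}
```

Usage would be something like `before=$(unbound-control -c /var/unbound/unbound.conf stats_noreset | recursion_avg)` before the change, the same into `after` once `do-ip6: no` is applied, then `compare_avg "$before" "$after"`. Give unbound some time to handle real queries before taking the second reading, since an average over a handful of queries says little.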

Edit: Just seen @Gertjan has posted a more complete guide above so maybe follow that instead

Kempain

@gertjan said in Slow DNS after 22.05:

Remove this check :

Remove (disable) IPv6 from your test LAN device.

Just for reference, I don't actually have Allow IPv6 un-checked, and the do-ip6: no unbound option worked to resolve my issue.

I do however only specify the IPv4 protocol in my allow rules, which deviates from the OOTB config, which I believe has default allow-any rules for both protocols.

Thought this may be part of my issue but it sounds like some are having the issue even with the default config.
If the do-ip6: no unbound setting resolves this for them also then I assume this narrows it down a bit.

johnpoz

@kempain said in Slow DNS after 22.05:

I don't actually have Allow IPv6 un-checked

Yeah that wouldn't really matter for DNS: if you set do-ip6: no then unbound shouldn't use IPv6 as a transport. But unchecking Allow IPv6 would prevent, say, your browser from using IPv6.

That would be just an attempt to make sure there is no IPv6 being used for anything.

Kempain

@johnpoz

Yup makes sense that setting wouldn't matter for DNS.
Seems like the suggestion was to try and rule out any IPv6 issues by taking it out of the equation as much as possible.

Just wanted to let people know I didn't have to do this for it to be resolved for me.

Gertjan

@All

If "do-ip6: no" solves an issue, then I would suspect the IPv6 connectivity.

Here, at the IPv6 Tunnel Broker, everybody can get free IPv6 access.
They will give you a /64 - and, why not, a /48.

edit : if you can prove that you know what IPv6 is, they will give you a free T-Shirt !! That is, they did so in the past.

To use it : deactivate your ISP IPv6.

Set up a "he IPv6 tunnel using the pfSense doc".

You wind up having this :

The question is : is (one of the) unbound issues related to bad IPv6 connectivity ?
Switch your IPv6 connectivity to a known good one, like he.net, and you'll find out.

Btw : most ISPs have now, in 2022, a good IPv4 implementation. ( so now they can start ditching it )
This is not the same for IPv6 : most ISP 'do it wrong', with boatloads of nasty side effect.

IMHO I think Hurricane Electric is one of the rare IPv6 suppliers that implemented IPv6 correctly - the way it should work.

No rocket-science degree needed to implement it locally.

edit :
How to use "do-ip6" :

edit :

For what it's worth :

And I can ping my PC from everywhere on the Internet :

No (ICMP) NAT needed !

( but I'm not sure that this is a good idea ... )

johnpoz

So, during my morning routine of going to my different sites, reading the morning news, viewing threads and looking for stuff, I didn't really have any major issues in general. But I did run into something odd trying to go to support.apple.com

This is a horrible dns setup if you ask me..

;; QUESTION SECTION:
;support.apple.com.             IN      A

;; ANSWER SECTION:
support.apple.com.      3589    IN      CNAME   prod-support.apple-support.akadns.net.
prod-support.apple-support.akadns.net. 3589 IN CNAME support-china.apple-support.akadns.net.
support-china.apple-support.akadns.net. 3589 IN CNAME support.apple.com.edgekey.net.
support.apple.com.edgekey.net. 21589 IN CNAME   e2063.e9.akamaiedge.net.
e2063.e9.akamaiedge.net. 3589   IN      A       23.218.161.64

that is a lot of cnames in the chain..

And a +trace was showing me a failure to get the addresses of the NS for apple.com

couldn't get address for 'a.ns.apple.com': not found
couldn't get address for 'b.ns.apple.com': not found
couldn't get address for 'c.ns.apple.com': not found
couldn't get address for 'd.ns.apple.com': not found
dig: couldn't get address for 'a.ns.apple.com': no more
[22.05-RELEASE][admin@sg4860.local.lan]/root:

And I was getting a SERVFAIL

$ dig support.apple.com @192.168.9.253

; <<>> DiG 9.16.30 <<>> support.apple.com @192.168.9.253
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 61133
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;support.apple.com.             IN      A

;; Query time: 1 msec
;; SERVER: 192.168.9.253#53(192.168.9.253)
;; WHEN: Thu Aug 11 07:15:19 Central Daylight Time 2022
;; MSG SIZE  rcvd: 46

I flipped back on the do-ip6: no setting - and no issues with that site..

Was this some issue with unbound and IPv6? I don't think so - because the dig +trace would have zero to do with unbound..

And doing another +trace it seems to be working fine - and notice those were obtained via IPv6 transport

apple.com.              172800  IN      NS      a.ns.apple.com.
apple.com.              172800  IN      NS      b.ns.apple.com.
apple.com.              172800  IN      NS      c.ns.apple.com.
apple.com.              172800  IN      NS      d.ns.apple.com.
CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 86400 IN NSEC3 1 1 0 - CK0Q2D6NI4I7EQH8NA30NS61O48UL8G5 NS SOA RRSIG DNSKEY NSEC3PARAM
CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 86400 IN RRSIG NSEC3 8 2 86400 20220815042401 20220808031401 32298 com. If06bSVXL7llnV+iyjrWR5yStSfeLZzeCKgsfVsqNQKKnP35dCsaGffz dRWOIGK+WzMKxVCiJZLiNyG5iYR9RybjH9jXMhDyYqro3M8eplcZtHnd DN0XXqhP/UDjMThDkJHxFERmmzraaU1wQLMcse/uOLMzmZVmOalWkbC+ TtXIfd8f4KB+h/M6X23C9UZ7oyYI78gqwS3Rq0fKInDxXA==
S0NU5UHI013IVS6TTUSM4FC3OQJB2EJF.com. 86400 IN NSEC3 1 1 0 - S0NU9MFCV1Q1NSBMAPQMU0H7MHN47JDT NS DS RRSIG
S0NU5UHI013IVS6TTUSM4FC3OQJB2EJF.com. 86400 IN RRSIG NSEC3 8 2 86400 20220818054103 20220811043103 32298 com. S3RZuwhflaFxPtxa1gjp5aRyT31BpAfWm9hdOXwx+JSGmHTyGfDaSCMa coQlorU8pjmvpGdQ4SnQKWkj81DdjIo7aCu9BeKRGG7wDkulttlovYcp vrg2qmHQZIEIP5pdjk8Ht15AyzZIi9vxcx6tg51WzkMV2SApd/AUFMzM 13iULU+d5+tll6yK6iGyD9ZT17SOTGOsZXFroo0CehBOCw==
;; Received 838 bytes from 2001:502:1ca1::30#53(e.gtld-servers.net) in 34 ms

What does this tell us - not sure as of yet. But clearly I had an issue with a +trace pulling info, and this was just for the 1st cname in the long chain required to get to the actual IP for support.apple.com

And unbound is not involved in doing a dig +trace, other than pulling the roots when the client asks loopback.

I will leave do-ip6: no set for now and see if I run into any more issues, and then I will remove it again to allow IPv6 transport and see if the problems come back.

My advice for anyone having issues with DNS currently is to set do-ip6: no for unbound - does this make it better?
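
To make this kind of test repeatable, a rough sketch: query the same name a number of times and count how many answers do not come back as NOERROR. The helper names `dig_status` and `count_failures` are hypothetical, and the name, resolver address and attempt count are whatever you pass in. It only assumes the usual dig header line with `status: ...` as shown in the SERVFAIL output above.

```shell
#!/bin/sh
# Extract the status field (NOERROR, SERVFAIL, ...) from dig output on stdin.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Query NAME at RESOLVER a number of times and print how many attempts
# did not return NOERROR (no header at all, e.g. a timeout, also counts).
# Usage: count_failures NAME RESOLVER ATTEMPTS
count_failures() {
    name=$1 resolver=$2 attempts=$3
    i=0 failures=0
    while [ "$i" -lt "$attempts" ]; do
        status=$(dig "$name" "@$resolver" | dig_status)
        [ "$status" = "NOERROR" ] || failures=$((failures + 1))
        i=$((i + 1))
        sleep 1
    done
    echo "$failures of $attempts queries failed"
}
```

For example `count_failures support.apple.com 192.168.9.253 20`, once with IPv6 transport allowed and once with do-ip6: no. Keep in mind that once a name is cached, repeats are answered from the cache, so this is most telling right after unbound has restarted or when run over a list of different names.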

mvikman

Is the unbound ipv6 setting "do-ip6" or "do-ipv6" ?

I've seen in these forums, that some say "do-ipv6" and others "do-ip6", also google search results mostly "do-ip6"...

Kempain

@mvikman

Maybe both work but I have do-ip6: no

johnpoz

@mvikman it is do-ip6

My bad if I was saying do-ipv6; I guess it's habit, since ipv6 is the common way of writing it.

And seems @Gertjan has it listed as do-ipv6 in his post.

https://nlnetlabs.nl/documentation/unbound/unbound.conf/

I will go through my posts and make sure to fix any of my typos - great catch

Gertjan

@mvikman

Nice catch !
You should not automatically believe what people propose here. Especially me included.
After all, this is unbound land, not really related to pfSense.

Settings, whether mentioned in the pfSense doc or not, could influence your security, so you should do the new thing : fact checking for yourself.

For example : the FreeBSD official man pages : unbound.conf
Or go to the source : the doc from the author : https://unbound.docs.nlnetlabs.nl/en/latest/manpages/unbound.conf.html
Real dingos would check the code and compile from source.

The image I've shown above isn't dangerous. unbound would simply bail out on the syntax error.

So it is

server: 
    do-ip6: no

and check that

edit !

@johnpoz said in Slow DNS after 22.05:

And seems @Gertjan has it listed as do-ipv6 in his post.

Yeah, it was looking just fine, wasn't it ?
I was cutting corners, wanted to do it fast, and didn't really test (apply) it.
As it looks so simple and innocent.
I've tested it now.
This is what I got :

and yes, if parameters are not recognized at the pre config test, then it will show a message.
pfSense uses the unbound config test tools to check first. Nice.
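
As a small extra check before pasting something into the custom options box, here is a sketch that looks for the misspelled option name discussed above. It only does a text match and is no substitute for the real config test; `check_ip6_option` is a name invented for this example.

```shell
#!/bin/sh
# Read unbound options on stdin and report how the IPv6 transport option
# is spelled: "typo" for do-ipv6, "ok" for do-ip6, "absent" if neither
# appears at the start of a line.
check_ip6_option() {
    conf=$(cat)
    if printf '%s\n' "$conf" | grep -Eq '^[[:space:]]*do-ipv6:'; then
        echo "typo"
    elif printf '%s\n' "$conf" | grep -Eq '^[[:space:]]*do-ip6:'; then
        echo "ok"
    else
        echo "absent"
    fi
}
```

Usage: `check_ip6_option < /var/unbound/unbound.conf` (the path used earlier in this thread). For an actual validation, `unbound-checkconf /var/unbound/unbound.conf` is the checker that ships with unbound.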

johnpoz

@gertjan yeah it would scream at you if something wrong in your config options text box.

I think I have edited all my mistakes of using the wrong setting name in actual post text.

As a check: if you look at the DNS Resolver item in the Status menu and you see IPv6 addresses in the cache, then the setting is not in effect. See my above examples showing the IPv6 address in the resolver status..

Now that I have put it back to no, no IPv6 listed there.. And have yet to see any issues in browser having issues with sites, etc. But then again, I only found that 1 issue when I was allowing use of my HE tunnel.. But different users go to different sites, etc. So if there is an issue with transport over ipv6, it might be so pronounced for some users that to them dns is just borked, or could be for others 1 out of 100 sites or something and simple refresh of page and it works so they don't even notice it, etc.

My recommended settings for resolving would be to

turn on prefetch
serve 0

These can help reduce any sort of first-query delay when a resolver has to start from scratch and work all the way down from the roots. Should they be required? No. Are they default? I don't think so. But prefetch allows something that is close to expiring in the cache to get refreshed before someone asks for it, rather than the entry dropping out of the cache and having to be resolved from scratch again.

Same thing with serve 0 ttl - this allows for a client asking for whatever.something.com to get the last entry that was cached for it, and then in the background unbound will resolve it again.

This can also reduce any sort of delay with initial resolve all the way from roots.

Keep in mind even a full resolve all the way from the roots shouldn't take very long - but these settings should help if for whatever reason there are delays in a full resolve: long cname chains, connectivity issues with a specific NS in the path to get to the specific fqdn, etc.

I have been using these for years, and have not seen any problems with them.
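
A minimal sketch for checking whether these two settings made it into the generated config. It assumes they show up in unbound.conf as `prefetch: yes` and `serve-expired: yes`; that mapping from the pfSense checkboxes is an assumption worth verifying on your own box. `option_value` is a hypothetical helper, not an unbound or pfSense command.

```shell
#!/bin/sh
# Print the value of one option from an unbound.conf read on stdin.
# $1 = option name without the trailing colon, e.g. "prefetch".
# If the option appears more than once, the last value found is printed;
# nothing is printed when the option is not present.
option_value() {
    awk -v opt="$1:" '$1 == opt { val = $2 } END { if (val != "") print val }'
}
```

For example `option_value prefetch < /var/unbound/unbound.conf` and `option_value serve-expired < /var/unbound/unbound.conf`. To ask the running daemon instead of the file, `unbound-control -c /var/unbound/unbound.conf get_option prefetch` should give the live value.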

lohphat

Isn't disabling IPv6 in unbound just masking the problem? It may be a great work-around, but it seems a little short-sighted.

This behavior changed between 22.01 and 22.05 due to unbound itself changing. Since pfSense defaults to it instead of BIND, that dependency is clear and the problem needs to be escalated properly for review.

Yes, we're in an IPv6 transition period and yes, today, you MIGHT be able to disable it and walk away, but I would consider it a bit of a head-in-the-sand response.

What is the proper channel for escalating observations to the unbound team?

serbus

@lohphat said in Slow DNS after 22.05:

What is the proper channel for escalating observations to the unbound team?

Hello!

Isn't the fix for the do-ip6 workaround (as derived from https://github.com/NLnetLabs/unbound/issues/670) already in the pfSense development snapshots?

John

Gertjan

@lohphat said in Slow DNS after 22.05:

Isn't disabling IPv6 in unbound just masking the problem

See the "670" link below.
Short answer : yes, of course, it's a sledge hammer solution.
But read the unbound bug thread : it looks like buffers fill up to the max, and outstanding requests get aborted. Reading the proposed patch also makes me think the code wasn't resilient enough. The issue might even be a bad DNS record / bad DNS setup on the DNS server side. And the result is : unbound bails out.
The "670" thread mentions other variables you can set to bigger values so cache memory becomes bigger.

IPv6 isn't always well implemented. Or less tested. And there is more to test.

@serbus said in Slow DNS after 22.05:

Isn't the fix for the do-ip6 workaround (as derived from https://github.com/NLnetLabs/unbound/issues/670) already in the pfSense development snapshots?

If that unbound version can be back ported to 12.3 or whatever FreeBSD it has to use.
Netgate isn't using the latest and greatest. As the latest has also less known bugs ^^

@johnpoz said in Slow DNS after 22.05:

I only found that 1 issue when I was allowing use of my HE tunnel..

In the past, I had way more issues using he.net.
The web site of my own ISP, Netflix, and yes, Apple (the cloud stuff) : when I was using IPv6 (using he.net) pages wouldn't load or stalled, or the CSS was broken.

So I used pfBlockerNG-devel with the No-AAAA option, so I could list the sites I wanted to talk to using IPv4 only.

That's now over :

My Netflix works well over the he.net IPv6 tunnel. More and more sites are IPv6 without any IPv4 pollution.

And remember : the IPv6 of he.net can be considered as a VPN access.
These are the access points : https://tunnelbroker.net/status.php. I took the closest to me of course, but I could take a tunnel server in the States, and then I would have a US IPv6 ?
No need to recall : some sites don't like to be accessed using VPN !? :)

@johnpoz said in Slow DNS after 22.05:

turn on prefetch
serve 0

Yeah, reading the description made me think : these are too good to be true !
I've checked these from day 1.

Kempain

@johnpoz said in Slow DNS after 22.05:

My recommended settings for resolving would be to
turn on prefetch
serve 0

Would you recommend these instead of 'do-ip6: no' or as well as?

Happy to play around with settings and see what impact it has.

Strangely, I lost internet last night and it didn't come back on its own. I had to bounce pfSense and all my other networking gear to get it back for some reason. I did notice the dpinger gateway monitoring service had died.

Suspect all my kit was overheating because it's pretty hot here in the UK at the moment and my makeshift comms cabinet has limited airflow. One of the things on my very long list to sort out! For now I've strung a load of computer fans together and chucked them in there which seems to be doing the job for now