Slow DNS after 22.05

Gertjan

@pcol-it-admin said in Slow DNS after 22.05:

yesterday I applied the "DNS Query Forwarding > Enable Forwarding Mode = yes" option in the Services > DNS Resolver > General Settings GUI, and boom - problem gone...

Before, you (that is, unbound) were querying the 13 worldwide root servers directly.
If one didn't answer, or was slower, another was used.

So, initially, my unbound talks to (I'll pick one out of the 13): 192.58.128.30, also known as j.root-servers.net,
as this is the closest to me.

Now, you're sending all your DNS requests to an upstream resolver.
This upstream resolver does exactly what unbound could be doing in the first place.
With four new possibilities :

Your upstream resolver decides what IP gets sent back - it could be anything from the correct IP to a spoofed one. You will never know.
A resolver can have a safety net against spoofing (DNSSEC) - a forwarder cannot.
Single point of failure! When 8.8.8.8 goes down (to name a known one) your network DNS goes out. This actually happened ... just 48 hours ago.
You become a product.

The Internet itself works only with resolvers. Forwarders were useful in the past, as our ISPs could not give us expensive routers with processors that could run local resolvers. So, every SOHO connection was forwarding.
Those who use pfSense do not have (small) SOHO connections. They, the admins, want the real thing.

IMHO: why do 8.8.8.8, 1.1.1.1, etc. exist today?
Because root, TLD and domain DNS servers are not reachable? If that were the case, consider (a part of) the Internet down. That would be worldwide news.
So, no.
You know why they (still) exist.
It's a big money question for them - and yes, I know, their usage is free ;)

@pcol-it-admin said in Slow DNS after 22.05:

and boom

The boom was probably that you restarted unbound.
And you changed from resolver to forward mode.
A nice test would be: go back to resolver mode - this will restart unbound once more.

Does it still work? If so, you now have solid proof that "resolver" or "forward" mode wasn't the issue, and so wasn't the solution either.
Resolving doesn't work? So root servers etc. are not reachable? Someone is doing MITM above your head? For me, I would go into mayday mode if resolver OR forward mode doesn't work. Both should work out of the box, and if not, I have an urgent issue - or a very doubtful ISP, or whatever else is happening above my connection.

tentpiglet

To follow up on previous replies to my "me too" post, which said I was mistyping the names of the sites I was trying to reach, here's an example of what I encountered today when attempting to access my O365 email from a Chrome session on one of my test clients using the pfSense DNS resolver (as opposed to my pi-hole VM):

Clicking reload a couple of times resolved the problem and brought up the site.

Here's the log file from unbound from when this was occurring.

Aug 10 15:51:09	unbound	69581	[69581:0] notice: Restart of unbound 1.15.0.
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.524288 1.000000 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.131072 0.262144 3
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.065536 0.131072 2
Aug 10 15:51:09	unbound	69581	[69581:0] info: lower(secs) upper(secs) recursions
Aug 10 15:51:09	unbound	69581	[69581:0] info: [25%]=0.114688 median[50%]=0.174763 [75%]=0.240299
Aug 10 15:51:09	unbound	69581	[69581:0] info: histogram of recursion processing times
Aug 10 15:51:09	unbound	69581	[69581:0] info: average recursion processing time 0.238208 sec
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 3: requestlist max 1 avg 0.7 exceeded 0 jostled 0
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 3: 11 queries, 1 answers from cache, 10 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.524288 1.000000 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.262144 0.524288 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.131072 0.262144 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.065536 0.131072 2
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.032768 0.065536 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.000000 0.000001 2
Aug 10 15:51:09	unbound	69581	[69581:0] info: lower(secs) upper(secs) recursions
Aug 10 15:51:09	unbound	69581	[69581:0] info: [25%]=1e-06 median[50%]=0.098304 [75%]=0.262144
Aug 10 15:51:09	unbound	69581	[69581:0] info: histogram of recursion processing times
Aug 10 15:51:09	unbound	69581	[69581:0] info: average recursion processing time 0.170908 sec
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 2: requestlist max 0 avg 0 exceeded 0 jostled 0
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 2: 10 queries, 2 answers from cache, 8 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.262144 0.524288 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.131072 0.262144 3
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.065536 0.131072 3
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.032768 0.065536 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.000000 0.000001 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: lower(secs) upper(secs) recursions
Aug 10 15:51:09	unbound	69581	[69581:0] info: [25%]=0.0709973 median[50%]=0.120149 [75%]=0.207531
Aug 10 15:51:09	unbound	69581	[69581:0] info: histogram of recursion processing times
Aug 10 15:51:09	unbound	69581	[69581:0] info: average recursion processing time 0.122444 sec
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 1: 11 queries, 2 answers from cache, 9 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.524288 1.000000 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.262144 0.524288 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.131072 0.262144 2
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.065536 0.131072 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.032768 0.065536 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.000000 0.000001 3
Aug 10 15:51:09	unbound	69581	[69581:0] info: lower(secs) upper(secs) recursions
Aug 10 15:51:09	unbound	69581	[69581:0] info: [25%]=7.5e-07 median[50%]=0.098304 [75%]=0.24576
Aug 10 15:51:09	unbound	69581	[69581:0] info: histogram of recursion processing times
Aug 10 15:51:09	unbound	69581	[69581:0] info: average recursion processing time 0.182405 sec
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 0: requestlist max 7 avg 0.777778 exceeded 0 jostled 0
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 0: 11 queries, 2 answers from cache, 9 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Aug 10 15:51:09	unbound	69581	[69581:0] info: service stopped (unbound 1.15.0).
Aug 10 15:50:28	unbound	69581	[69581:0] info: generate keytag query _ta-4f66. NULL IN
Aug 10 15:50:27	unbound	69581	[69581:0] info: start of service (unbound 1.15.0).
Aug 10 15:50:27	unbound	69581	[69581:0] notice: init module 1: iterator
Aug 10 15:50:27	unbound	69581	[69581:0] notice: init module 0: validator
Aug 10 15:50:27	unbound	69581	[69581:0] notice: Restart of unbound 1.15.0.

Of course, I cannot replicate this on a regular basis. It is random, and will happen to random sites. Usually clicking the 'reload' button in my browser will properly resolve the next time.

As with other people, if I attempt to do nslookups from my clients, I get an initial time-out error virtually 100% of the time:

C:\Users\tentpiglet>nslookup cnn.com
Server:  pfsense.tentpiglet.XXXXXXXXXXX.org
Address:  192.168.1.254

DNS request timed out.
    timeout was 2 seconds.
Non-authoritative answer:
Name:    cnn.com
Addresses:  2a04:4e42:600::323
          2a04:4e42:400::323
          2a04:4e42::323
          2a04:4e42:200::323
          151.101.193.67
          151.101.65.67
          151.101.129.67
          151.101.1.67
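The pattern in the nslookup output above (a first attempt that times out, then an answer) can be checked from any client by timing a few lookups in a row. This is only a rough sketch: `time_lookups` is a made-up helper name, it goes through the operating system's resolver via `socket.getaddrinfo` (so a client-side cache may hide slowness on repeat attempts), and the `resolve` parameter exists only so the lookup can be swapped out.

```python
import socket
import time

def time_lookups(name, attempts=3, resolve=socket.getaddrinfo):
    """Resolve `name` several times; return a (seconds, ok) pair per attempt."""
    results = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolve(name, None)  # port is irrelevant here, only the lookup matters
            ok = True
        except OSError:          # socket.gaierror is a subclass of OSError
            ok = False
        results.append((time.perf_counter() - start, ok))
    return results

if __name__ == "__main__":
    for seconds, ok in time_lookups("cnn.com"):
        print(f"{seconds:.3f}s {'ok' if ok else 'failed'}")
```

If the first attempt is consistently much slower than the following ones, that matches what is described here: the initial recursion is slow and later answers come from cache.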

johnpoz

@tentpiglet that domain has multiple cnames that need to be followed

; QUESTION SECTION:
;login.microsoftonline.com.     IN      A

;; ANSWER SECTION:
login.microsoftonline.com. 30   IN      CNAME   ak.privatelink.msidentity.com.
ak.privatelink.msidentity.com. 30 IN    CNAME   www.tm.ak.prd.aadg.akadns.net.

If you're having an issue with resolving - maybe because of flaky IPv6 - then yeah, such records would be more problematic than most.
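To make the "CNAMEs that need to be followed" point concrete, here is a small sketch that walks the chain from the answer section above. The dictionary is just a hand-copied view of those two records and `follow_cnames` is a hypothetical helper, not how unbound does it internally.

```python
# Hand-copied from the answer section quoted above (illustration only).
CNAME_RECORDS = {
    "login.microsoftonline.com.": "ak.privatelink.msidentity.com.",
    "ak.privatelink.msidentity.com.": "www.tm.ak.prd.aadg.akadns.net.",
}

def follow_cnames(name, records, max_hops=8):
    """Return the list of names visited while following CNAMEs from `name`."""
    chain = [name]
    while chain[-1] in records:
        if len(chain) > max_hops:
            raise RuntimeError("CNAME chain too long or looping")
        chain.append(records[chain[-1]])
    return chain

chain = follow_cnames("login.microsoftonline.com.", CNAME_RECORDS)
print(" -> ".join(chain))
print("extra names to resolve:", len(chain) - 1)
```

Each hop is another name the resolver has to chase, possibly at different name servers, so one flaky transport has more chances to hurt on records like these.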

edit: also, if you're setting strict qname you could have issues

PCOL IT Admin

@gertjan Just reporting what worked to resolve (no pun intended!) my issue (which was bad & disruptive, and only started after the 22.05 upgrade...) So I am going to try and re-enable it, but one thing I've noticed is that there's a lot of pushback from you and @johnpoz against anything being wrong with Unbound in 22.05... Can you at least accept that there is some issue going on (intermittently, which sucks for t-shooting) post 22.05 upgrade for some of us? Let's not blame the user just because "works on my machine"...

I have 22.05 now running on my 3100 (which was problematic), a 2100 under my admin (at my house of worship), and a 4100 I just deployed at work... (I also have three other Intel-based platforms running pfSense at work as well, would need to check the releases on those.) If we need data from these platforms to assist problem identification efforts, please let me know.

lohphat

What is the current tally that disabling IPv6 was implicated in resolving the issue?

PCOL IT Admin

This from the 2100 gateway that I just upgraded to 22.05-RELEASE over the weekend:

[22.05-RELEASE][admin@pcol-gw.pclawrenceville.lan]/root: unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total
total.num.queries=588600
total.num.queries_ip_ratelimited=0
total.num.cachehits=404414
total.num.cachemiss=184186
total.num.prefetch=0
total.num.expired=0
total.num.recursivereplies=184160
total.requestlist.avg=7.03968
total.requestlist.max=128
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=30
total.requestlist.current.user=7
total.recursion.time.avg=26.928807
total.recursion.time.median=0.0733518
total.tcpusage=0

total.recursion.time.avg looks bad to me... I did a test from a browser to a domain I had never used from this location (www.sniffer.com) and it lagged for ~10 seconds before the page rendered.
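A quick way to read numbers like the ones above is to compute the cache hit ratio and compare the average recursion time with the median. The sketch below is an assumption-laden helper, not part of pfSense or unbound: `parse_stats` is a made-up name, and `SAMPLE` is a subset of the output pasted above.

```python
def parse_stats(text):
    """Parse 'key=value' lines as printed by unbound-control stats_noreset."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # skip anything that is not a key=value line
            stats[key] = float(value)
    return stats

# Subset of the output pasted above.
SAMPLE = """\
total.num.queries=588600
total.num.cachehits=404414
total.num.cachemiss=184186
total.recursion.time.avg=26.928807
total.recursion.time.median=0.0733518
"""

stats = parse_stats(SAMPLE)
hit_ratio = stats["total.num.cachehits"] / stats["total.num.queries"]
skew = stats["total.recursion.time.avg"] / stats["total.recursion.time.median"]
print(f"cache hit ratio: {hit_ratio:.1%}")
print(f"avg/median recursion time: {skew:.0f}x")
```

An average that is hundreds of times the median suggests that most recursions are fast and a small number take extremely long, which would fit an intermittent problem rather than a uniformly slow resolver.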

tentpiglet

@johnpoz said in Slow DNS after 22.05:

edit: also, if you're setting strict qname you could have issues

As previously indicated, I have a fairly "stock", out-of-the-box DNS resolver setup. I think the only two items I checked in the setup pages were the DHCP Registration and Static DHCP options.

Much the same way I run a completely stock pi-hole setup as a VM which has zero issues.

johnpoz

@pcol-it-admin said in Slow DNS after 22.05:

against anything being wrong with Unbound in 22.05

Never said that - actually even pointed out that there could be.. As mentioned in the other thread where do-ip6 was mentioned.. And what version of unbound pfSense is on.

My point is there are lots of variables to take into account. There is currently no bug in the default unbound setup that I am aware of; if there was, the boards would be on fire, and I can assure you that if I was having issues with unbound on pfSense I would have reported it as an actual bug already, with my exact findings and how to duplicate it. But since I am currently having zero issues with it, it points to something specific presenting with a specific configuration or even set of configurations.

If we can present a specific known issue, where everyone with XYZ is seeing the same problem, then we could push for unbound to be updated or rolled back in the current version of pfSense, etc.

But I have yet to be presented with a specific setting or set of settings that causes an issue - if you're having unbound try to use IPv6 and you have flaky IPv6, then yes, that could be problematic.

I have currently changed my setting to allow access via my HE IPv6 tunnel, and let's see if that causes an issue.

I can see it doing queries via IPv6 to different NSers via just the resolver status page - so let that run for a while and see if I notice any issues with unbound..

PCOL IT Admin

@lohphat said in Slow DNS after 22.05:

Well, this failure mode is intermittent -- it hits then resolves, then hits again later, rinse, repeat. It doesn't happen in DNS Forward mode. So I'm guessing whatever it is, is happening in the local cache.
The failure mode affects different devices on different internal networks which are of different architectures: Win11, iPad OS, Roku, Android.
The behavior started after the 22.05 update almost immediately. No other changes other than the base image were made.

I had the same experience... it was affecting the whole family here, on a variety of devices. Importantly, I had the stock DNS Resolver settings (whatever defaults pfSense has) until I started changing some settings (set DNS forwarding, turn off IPv6) to try and resolve the issue, as the fam was unhappy...

PCOL IT Admin

@johnpoz said in Slow DNS after 22.05:

Never said that - actually even pointed out that there could be.. As mentioned in the other thread where do-ip6 was mentioned.. And what version of unbound pfSense is on.

Upon a review (a LOT of messages in this thread!) it seems that you just were asking for more precise detail, and did participate in trying to determine what might be the issue. My apologies.

However, there has been a lot of seeming finger-pointing by some other folks here at the users; it's surely possible that users setting options incorrectly may cause problems for themselves, but I think many of the folks reporting this issue had said that they had "stock" pfSense DNS resolver settings, as I know I did. The only thing I did was to upgrade pfSense to 22.05, and I went from being problem-free in DNS resolving (for years), to having a problem.

Jax

@pcol-it-admin said in Slow DNS after 22.05:

I think many of the folks reporting this issue had said that they had "stock" pfSense DNS resolver settings, as I know I did. The only thing I did was to upgrade pfSense to 22.05, and I went from being problem-free in DNS resolving (for years), to having a problem.

That was precisely my experience: commercial user, Netgate 2100 ARM device, no DNS mods on my part, 18 months trouble-free, suddenly broken on 22.05 upgrade.

BTW ... have we ruled out ARM-specificity for this problem?

johnpoz

@pcol-it-admin said in Slow DNS after 22.05:

said that they had "stock" pfSense DNS resolver settings

I find this is rarely the case to be honest.. You also don't know what device they are on - maybe being on ARM is part of the issue, or makes the problem more pronounced for them?

And an issue with "stock" settings, if you will, could be DNSSEC where the user changed to forwarding; if you forward, that should really be off. Lots of users love to use TLS forwarding, along with leaving DNSSEC on - again problematic if you ask me. But they didn't change anything else - so to them it's "stock".

Might be they have no IPv6 at all, or maybe their IPv6 is solid - and it's only the user that thinks they don't even have it? Maybe their client is not using it, maybe the pfSense WAN has it, and it's using it as transport for queries.

Then we just get users saying they get some error in their browsers - for all we know "stock" in most browsers these days is using DoH, and not even using local DNS.

They might have "stock" but are using pfBlocker and loading huge lists of blocks via DNSBL.. And they are also, with "stock", registering DHCP.. And now unbound is restarting every 10 minutes or something, and that can present as a problem with DNS. That has been a "stock" problem for a while - but really isn't an issue per se with unbound at all.
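Whether unbound is restarting this often can be checked from the resolver log. The sketch below assumes log lines in the same shape as the excerpt posted earlier in this thread (a 15-character syslog-style timestamp first, then "start of service" / "service stopped" messages); `service_events` and `uptimes` are made-up helper names.

```python
from datetime import datetime

def service_events(log_text):
    """Return (timestamp, kind) for unbound start/stop lines, oldest first."""
    events = []
    for line in log_text.splitlines():
        if "start of service" in line:
            kind = "start"
        elif "service stopped" in line:
            kind = "stop"
        else:
            continue
        # Syslog-style stamps carry no year; strptime then defaults to 1900,
        # which is fine for differences within one log excerpt.
        stamp = datetime.strptime(line[:15], "%b %d %H:%M:%S")
        events.append((stamp, kind))
    return sorted(events)

def uptimes(events):
    """Seconds between each start and the stop that follows it."""
    result, started = [], None
    for when, kind in events:
        if kind == "start":
            started = when
        elif started is not None:
            result.append((when - started).total_seconds())
            started = None
    return result

# Two lines in the shape of the log excerpt posted earlier in this thread.
SAMPLE = (
    "Aug 10 15:51:09\tunbound\t69581\t[69581:0] info: service stopped (unbound 1.15.0).\n"
    "Aug 10 15:50:27\tunbound\t69581\t[69581:0] info: start of service (unbound 1.15.0).\n"
)
print(uptimes(service_events(SAMPLE)))
```

Lots of short uptimes would point at the restart problem described above rather than at resolving itself.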

Part of the problem with any sort of DNS issue is, to be honest, that many users don't actually understand how it works. Or at least not at a level that allows for detailed troubleshooting. So it's difficult to get actual details of what is going on vs. they say their browser gives some error and they have to refresh the page.. They don't know how to troubleshoot it, etc.

Like I said, I have had zero issues with 22.05 and DNS - same goes for others in this thread.. So what is different with our setups, or our connections.. Clearly it's not a base problem or everyone would be having the issue.

I have had it running now for a bit able to do ipv6 transport queries - and haven't noticed any issues. And I don't show any timeouts in the resolver status page..

I was meaning to set up a 3100 I have here (ARM) and run through some base dnsperf tests in a loop to see if an issue would present itself to help pinpoint where the issue is.

Currently the most likely issue is the ipv6 transport, but this could have other factors that exacerbate the problem for some users.

I do not see any currently reported bugs for 22.05 and unbound in redmine related - there is something about fqdn having issues in aliases.

Another variable is VPN connections; users love to use VPNs on pfSense and force all traffic through them. What is required to try and pin it down is users willing to actually provide details of what they are seeing: is unbound restarting, are they using IPv6 - a few have chimed in as of late with good details. In those cases all local stuff was working, so it wasn't like unbound was crashing or hung up completely, etc.

Jax

@johnpoz said in Slow DNS after 22.05:

I was meaning to set up a 3100 I have here (ARM) and run through some base dnsperf tests in a loop to see if an issue would present itself to help pinpoint where the issue is.

I'm glad to hear that. As a paying user who wishes to have confidence in the system, it would be reassuring to me if a committer (I take you to be a committer) would take the users seriously and do some forensics, as opposed to dismissing multiple reports and trying to ascribe the problem to user error.

Mikymike82

@johnpoz First of all thanks for the effort.
As I did a complete re-install with no backup restore and only full manual configuration, I can assure you that I have not changed anything on the DNS side of things.
No packages installed (for now).
Running an Intel Atom C3758 (so no ARM), actually the same hardware as the SG-7100.
VLAN config for internet access with PPPoE auth.
And some port forwarding rules.

Gertjan

@johnpoz said in Slow DNS after 22.05:

I have currently changed my setting to allow access via my HE IPv6 tunnel, and let's see if that causes an issue.

I've been using he.net for many years now. It's my main IPv6 access, as my ISP doesn't have good IPv6 support (bad IPv6 really messes things up). I'm posting on this forum using the he.net IPv6.
25 % of my incoming & outgoing traffic is IPv6 over he.net. Same thing for DNS related traffic.
It's a bit slow, as I'm limited to the he.net tunnel access point in Paris.
No other issues.

RTT is a bit slow, as I use my server in a data centre near Paris as the dpinger monitoring point.

edit :

@mikymike82 said in Slow DNS after 22.05:

Talking about the so-called thousands of other users that are not experiencing this problem is not helping in resolving this clear issue in version 22.05, as it is clearly stated that even with a clean install I can clearly replicate the issue when upgrading to 22.05 from 22.01

I agree.
An approach could be :
Let's enumerate all common - and not-so-common - settings of all those thousands of 22.05 users.
And then compare these findings with yours.
Comparing would be even easier if there were one or several users using the same ISP, the same uplink connection type, and even the same area as you.

You get it : that's hard to do.

The easy thing would be : it's up to you to tell/show/mention what is different with your location/setup/hardware/uplink connection.
But again, without you really knowing what the other 'thousands' are using.

A test that might shed some more light :
When you reinstall, you have to log in a first time with the admin user and the pfSense default password.
What about this set up :
Do not change the default LAN - keep DHCP etc.
Do not change WAN, keep it on DHCP-client mode (can you ?)
No other changes, do not use the keyboard anymore.
Do not import settings.
No packages.
Nothing.
Just the plain vanilla default Netgate initial config - with one LAN and one WAN assigned.

I understand, this setup might not be really useful for you. It's just for testing.

If DNS fails at this moment, we will all know what's not the issue.
As settings are equal.
Hardware is equal.
LAN side is equal.
I presume that the device you use is a PC with default network settings (== DHCP client).

The only difference will be : your uplink (ISP or what ever you use as a connection).
For example, I would understand that if you said "I use Starlink" then that would explain a lot, as default settings won't be good for such a connection (I'm just guessing).

Also, if you use an ARM device: do you have a small Intel desktop PC in a corner? Add an extra $5 NIC (no Realtek, please), slide in an empty, small SSD, throw pfSense at it and retest.
Issues are still the same? Then Intel vs. ARM goes out of the window. You most probably have an uplink / WAN side issue.

Use the mentioned no-ipv6 unbound option,
Remove this check :

Remove (disable) IPv6 from your test LAN device.
Now you have a close-to-IPv4-only network.
Retest.

Kempain

@lohphat said in Slow DNS after 22.05:

What is the current tally that disabling IPv6 was implicated in resolving the issue?

Just me so far but it doesn't look like anyone else has tried it yet and re-tested.
Still resolved for me. It honestly made a dramatic difference.

For those experiencing it if you can disable IPv6 then I'd follow the steps below:

1. Run 'unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total' and record your recursion times (they will likely be high if you're experiencing the same issue as me):

total.recursion.time.avg=0.079624
total.recursion.time.median=0.0387577

2. Disable IPv6: Services - DNS Resolver - General Settings. Add the below to Custom Options:

server:
do-ip6: no

3. Run the command in 1. again and hopefully your recursion results will be much improved.

This has resolved all DNS issues for me and I haven't had any issues since.

Edit: Just seen @Gertjan has posted a more complete guide above so maybe follow that instead

Kempain

@gertjan said in Slow DNS after 22.05:

Remove this check :

Remove (disable) IPv6 from your test LAN device.

Just for reference I don't actually have Allow IPv6 un-checked and the no-ipv6 unbound option worked to resolve my issue.

I do however only specify IPv4 protocol in my allow rules which deviates from the OOTB config which I believe has default any rules for both protocols.

Thought this may be part of my issue but it sounds like some are having the issue even with the default config.
If the no-ipv6 unbound setting resolves this for them also then I assume this narrows it down a bit.

johnpoz

@kempain said in Slow DNS after 22.05:

I don't actually have Allow IPv6 un-checked

Yeah that wouldn't really matter for dns, if you set do-ip6 no then unbound shouldn't use IPv6 for a transport. But that would prevent say your browser from using IPv6.

That would be just an attempt to make sure there is no IPv6 being used for anything.

Kempain

@johnpoz

Yup makes sense that setting wouldn't matter for DNS.
Seems like the suggestion was to try and rule out any IPv6 issues by taking it out of the equation as much as possible.

Just wanted to let people know I didn't have to do this for it to be resolved for me.

Gertjan

@All

If "do-ip6: no" solves an issue, then I would suspect the IPv6 connectivity.

Here, at the IPv6 Tunnel Broker, everybody can get free IPv6 access.
They will give you a /64 - and, why not, a /48.

edit : if you can prove that you know what IPv6 is, they will give you a free T-shirt!! That is, they did so in the past.

To use it: deactivate your ISP IPv6.

Set up an HE IPv6 tunnel using the pfSense doc.

You wind up having this :

The question is : is (one of the) unbound issues related to bad IPv6 connectivity ?
Switch your IPv6 connectivity to a known good one, like he.net, and you'll find out.

Btw: most ISPs now, in 2022, have a good IPv4 implementation (so now they can start ditching it).
This is not the same for IPv6: most ISPs 'do it wrong', with boatloads of nasty side effects.

IMHO I think Hurricane Electric is one of the rare IPv6 suppliers that implemented IPv6 correctly - the way it should work.

No rocket-science degree needed to implement it locally.

edit :
How to use "do-ip6" :

edit :

For what it's worth :

And I can ping my PC from everywhere on the Internet :

No (ICMP) NAT needed !

( but I'm not sure that this is a good idea ... )