Slow DNS after 22.05

Cool_Corona

unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total

total.num.queries=516
total.num.queries_ip_ratelimited=0
total.num.cachehits=312
total.num.cachemiss=204
total.num.prefetch=7
total.num.expired=0
total.num.recursivereplies=204
total.requestlist.avg=0.412322
total.requestlist.max=8
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=0
total.requestlist.current.user=0
total.recursion.time.avg=0.688891
total.recursion.time.median=0.073533
total.tcpusage=0

Gertjan

@kempain said in Slow DNS after 22.05:

root servers taking 64ms rather than 28ms which is a bit of a bump

Don't worry.
I have

in 89 ms

as I'm still using a VDSL - 23 Mbits down 2 Mbits up.
John is probably using fibre.

My google name servers ns1 to ns4 tell me

in 55 ms

One of them gave me an IP:

in 26 ms

@kempain said in Slow DNS after 22.05:

Key metric seems to be total recursion time so try:

I've

total.num.queries=17629
total.num.queries_ip_ratelimited=0
total.num.cachehits=11918
total.num.cachemiss=5711
total.num.prefetch=6157
total.num.expired=5857
total.num.recursivereplies=5711
total.requestlist.avg=3.53632
total.requestlist.max=62
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=0
total.requestlist.current.user=0
total.recursion.time.avg=0.437151
total.recursion.time.median=0.105229
total.tcpusage=0

and guess what : I don't care.
I've activated prefetch, so once a host is in the unbound cache, unbound fetches a fresh copy of it shortly before its TTL runs out, ready to be served on my local LANs if needed == no more waiting. I keep the initial '100 ms' delay, and then it's over for that host.
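
For what it's worth, here is a minimal sketch of the prefetch idea in Python. It is a simplified model, not unbound's actual code: as far as I understand the unbound documentation, the refresh is triggered when a client query hits a cached record in roughly the last 10% of its TTL. The names `should_prefetch`, `original_ttl` and `remaining_ttl` are made up for illustration.

```python
def should_prefetch(original_ttl: int, remaining_ttl: int) -> bool:
    """Return True when a cache hit should also trigger a background refresh.

    Assumption: the refresh window is the last 10% of the original TTL.
    """
    if remaining_ttl <= 0:
        # Already expired: this is a normal cache miss, not a prefetch.
        return False
    return remaining_ttl <= original_ttl // 10


# A record with a 300 s TTL: hits in the last 30 s are answered from cache
# and also refreshed, so the next client does not wait for a recursion.
for remaining in (200, 30, 5, 0):
    print(remaining, should_prefetch(300, remaining))
```

In this model a host that nobody asks for during that window still expires normally, which would explain why the very first lookup keeps its delay.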

johnpoz

@gertjan fiber I wish ;) just cable connection

I keep checking for fiber options - what I would really like is the symmetrical connection. Limited to 50mbps up vs my 500 down. More up would help with my plex server to friends and family.. They have a gig plan but it doesn't get you more up, so not worth the extra money - 500 is more than adequate for down for my needs.

Yeah, the prefetch option can help, especially if you're having longer resolve times than typical. I think it pairs really well with the serve 0 option.. I am not a fan of these really short TTLs many records are going with, 30 seconds, 60 seconds - get out of here, there is no need for that other than you wanting to track something.. I have my min TTL set to 3600 seconds (1 hour) and have never run into any issues with it.

PCOL IT Admin

Another report of DNS resolving prob's after 22.05-RELEASE upgrade (on a 3100 unit)... Had absolutely no issues before the upgrade (did last weekend); after the upgrade, many issues with slow/failed website rendering, app launch failures, etc. -- finally yesterday I applied the "DNS Query Forwarding > Enable Forwarding Mode = yes" option in the Services > DNS Resolver > General Settings GUI, and boom - problem gone... Will try the do-ip6: no server option and undo the forwarding, and see how I make out. But I do think some problem got introduced in the 22.05 version of unbound (1.15.0) that got shipped, perhaps only affecting the ARM platform.

Kempain

Really appreciate all the help, thanks @johnpoz , @Gertjan and others.

Still going strong here with really low recursion averages and speedy response times now so fingers and toes crossed it seems to be resolved for me at least.

Still a question as to why it was fine in the previous version but it was probably caused by me mis-configuring IPv6 and not disabling it fully. Need to spend some time learning about IPv6 as I initially disabled it so I didn't have to worry about IPv6 firewall rules between VLANS that I'm trying to segregate. Assume there will be some things that may require IPv6 these days so I probably shouldn't be blocking it anyway.

Kempain

@pcol-it-admin said in Slow DNS after 22.05:

perhaps only affecting the ARM platform

I'm on intel and it impacted me

johnpoz

@kempain yeah those times look way way better.

There are so many variables at play here, it's quite possible there is something in this version that presents itself differently with some flaky IPv6 setup.. There was a thread posted by, I believe, bmeeks that mentioned the do-ip6 no setting in relation to what others (non-pfSense users) were seeing with this version of unbound.

Keep in mind the old version was like 1.13 or 1.12, or it was 1.13 and then backed off to 1.12 again, etc. And now it's 1.15.0 I believe - while I believe current unbound is like 1.16.2

Hopefully we have gotten you to a stable config that works for your environment. IPv6 can introduce more variables into network, you mention firewall for example - there is a learning curve for sure, and things are done differently to be sure. Clients love to use temp IPv6 for their outbound connections, that can be trickier to firewall than just single IPv4 being used, etc.

I do believe IPv6 is the future, and yes sure it would behoove you to get familiar with it - but it's not something that you have to understand today, or even tmrw or next month even. I have yet to have anyone provide a single example of HAVING to have IPv6 on their network - many an ISP doesn't even support it, mine doesn't - I have to get my IPv6 via a HE tunnel for example because my ISP doesn't offer it, and I have not even seen it mentioned as coming soon, etc.

You deciding not to use it at this time, sure isn't going to hold up the global deployment schedule ;) hehehe I have been tinkering with it for like 12 years.. And do have it available on my network - but in a limited fashion for limited devices that I want to play with it on..

Keep in mind setting unbound to not use IPv6 as a transport for doing DNS queries or serving DNS to your clients does not prevent you from using IPv6 on your clients.. It's just that unbound won't use it as a transport is all - it can still serve up AAAA records just fine for your clients to use IPv6 with.
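
To illustrate the difference between record type and transport, here is a minimal Python sketch that asks for an AAAA (IPv6 address) record while talking to the resolver purely over IPv4/UDP. It is a sketch under assumptions, not a full DNS client: the function names `build_query` and `ask_over_ipv4` are made up, there is no reply parsing, and the resolver address in the usage comment is a placeholder.

```python
import socket
import struct

QTYPE_AAAA = 28  # record type for IPv6 addresses
QCLASS_IN = 1    # Internet class


def build_query(name: str, qtype: int = QTYPE_AAAA, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet for `name`."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, QCLASS_IN)
    return header + question


def ask_over_ipv4(resolver_ipv4: str, name: str, timeout: float = 2.0) -> bytes:
    """Send the AAAA query over IPv4/UDP and return the raw reply bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (resolver_ipv4, 53))
        reply, _ = sock.recvfrom(4096)
        return reply


# Usage (placeholder resolver address, replace with your pfSense LAN IP):
# raw_reply = ask_over_ipv4("192.168.1.1", "cnn.com")
```

The socket is `AF_INET` (IPv4) the whole time; only the question inside the packet is about IPv6 addresses.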

Kempain

@johnpoz said in Slow DNS after 22.05:

Keep in mind setting unbound to not use IPv6 as a transport for doing DNS queries or serving DNS to your clients does not prevent you from using IPv6 on your clients.. It's just that unbound won't use it as a transport is all - it can still serve up AAAA records just fine for your clients to use IPv6 with.

Good to know cheers John

It has been interesting working through this and I'm definitely going to f'up my config in the future!
As long as I'm learning in the process that's fine, and I definitely learnt a few things here so thanks for sharing your knowledge.

Gertjan

@pcol-it-admin said in Slow DNS after 22.05:

yesterday I applied the "DNS Query Forwarding > Enable Forwarding Mode = yes" option in the Services > DNS Resolver > General Settings GUI, and boom - problem gone...

Before, you (that is, unbound) were querying one of the 13 worldwide root servers.
If one didn't work, or was slower, another one was used.

So, initially, my unbound talks to (I'll pick one out of 13): 192.58.128.30 or j.root-servers.net,
as this is the closest to me.

Now, you're sending all your DNS requests to an upstream resolver.
This upstream resolver does exactly what unbound could be doing in the first place.
With four new possibilities:

Your upstream resolver decides what IP gets sent back - it could be anything from the correct IP to a spoofed one. You will never know.
A resolver can have a safety net against spoofing (DNSSEC) - a forwarder cannot.
Single point of failure! When 8.8.8.8 goes down (to name a known one), your network's DNS goes out. This actually happened ..... just 48 hours ago.
You become the product.

The Internet itself works only with resolvers. Forwarders were useful in the past, as our ISPs could not give us expensive routers with processors that could run local resolvers. So, every SOHO connection was forwarding.
Those who use pfSense do not have (small) SOHO connections. They, the admins, want the real thing.

IMHO: why do 8.8.8.8, 1.1.1.1 etc. exist today?
Because root, TLD and domain DNS servers are not reachable? If that were the case, consider (a part of) the Internet down. That would be worldwide news.
So, no.
You know why they (still) exist.
It's a big money question for them - and yes, I know, their usage is free ;)

@pcol-it-admin said in Slow DNS after 22.05:

and boom

The boom was probably that you restarted unbound.
And you changed from resolver to forward mode.
A nice test would be: go back to resolver mode - this will restart unbound once more.

Does it still work? If so, you now have solid proof that "resolver" or "forward" mode wasn't the issue, and so not the solution either.
Resolving doesn't work? So root servers etc. are not reachable? Someone is doing MITM above your head? For me, I would go into mayday mode if resolver OR forward mode doesn't work. Both should work out of the box, and if not, I have an urgent issue - or a very doubtful ISP, or whatever else is happening above my connection.

tentpiglet

To follow up on previous replies to my "me too" post suggesting I was mistyping the names of the sites I was attempting to reach, here's an example of what I encountered today when attempting to access my O365 email from a Chrome session on one of my test clients using the pfSense DNS resolver (as opposed to my pi-hole VM):

Clicking reload a couple of times resolved the problem and brought up the site.

Here's the logfile from unbound when this was occurring.

Aug 10 15:51:09	unbound	69581	[69581:0] notice: Restart of unbound 1.15.0.
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.524288 1.000000 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.131072 0.262144 3
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.065536 0.131072 2
Aug 10 15:51:09	unbound	69581	[69581:0] info: lower(secs) upper(secs) recursions
Aug 10 15:51:09	unbound	69581	[69581:0] info: [25%]=0.114688 median[50%]=0.174763 [75%]=0.240299
Aug 10 15:51:09	unbound	69581	[69581:0] info: histogram of recursion processing times
Aug 10 15:51:09	unbound	69581	[69581:0] info: average recursion processing time 0.238208 sec
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 3: requestlist max 1 avg 0.7 exceeded 0 jostled 0
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 3: 11 queries, 1 answers from cache, 10 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.524288 1.000000 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.262144 0.524288 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.131072 0.262144 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.065536 0.131072 2
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.032768 0.065536 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.000000 0.000001 2
Aug 10 15:51:09	unbound	69581	[69581:0] info: lower(secs) upper(secs) recursions
Aug 10 15:51:09	unbound	69581	[69581:0] info: [25%]=1e-06 median[50%]=0.098304 [75%]=0.262144
Aug 10 15:51:09	unbound	69581	[69581:0] info: histogram of recursion processing times
Aug 10 15:51:09	unbound	69581	[69581:0] info: average recursion processing time 0.170908 sec
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 2: requestlist max 0 avg 0 exceeded 0 jostled 0
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 2: 10 queries, 2 answers from cache, 8 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.262144 0.524288 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.131072 0.262144 3
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.065536 0.131072 3
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.032768 0.065536 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.000000 0.000001 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: lower(secs) upper(secs) recursions
Aug 10 15:51:09	unbound	69581	[69581:0] info: [25%]=0.0709973 median[50%]=0.120149 [75%]=0.207531
Aug 10 15:51:09	unbound	69581	[69581:0] info: histogram of recursion processing times
Aug 10 15:51:09	unbound	69581	[69581:0] info: average recursion processing time 0.122444 sec
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 1: 11 queries, 2 answers from cache, 9 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.524288 1.000000 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.262144 0.524288 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.131072 0.262144 2
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.065536 0.131072 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.032768 0.065536 1
Aug 10 15:51:09	unbound	69581	[69581:0] info: 0.000000 0.000001 3
Aug 10 15:51:09	unbound	69581	[69581:0] info: lower(secs) upper(secs) recursions
Aug 10 15:51:09	unbound	69581	[69581:0] info: [25%]=7.5e-07 median[50%]=0.098304 [75%]=0.24576
Aug 10 15:51:09	unbound	69581	[69581:0] info: histogram of recursion processing times
Aug 10 15:51:09	unbound	69581	[69581:0] info: average recursion processing time 0.182405 sec
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 0: requestlist max 7 avg 0.777778 exceeded 0 jostled 0
Aug 10 15:51:09	unbound	69581	[69581:0] info: server stats for thread 0: 11 queries, 2 answers from cache, 9 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Aug 10 15:51:09	unbound	69581	[69581:0] info: service stopped (unbound 1.15.0).
Aug 10 15:50:28	unbound	69581	[69581:0] info: generate keytag query _ta-4f66. NULL IN
Aug 10 15:50:27	unbound	69581	[69581:0] info: start of service (unbound 1.15.0).
Aug 10 15:50:27	unbound	69581	[69581:0] notice: init module 1: iterator
Aug 10 15:50:27	unbound	69581	[69581:0] notice: init module 0: validator
Aug 10 15:50:27	unbound	69581	[69581:0] notice: Restart of unbound 1.15.0.

Of course, I cannot replicate this on a regular basis. It is random, and will happen to random sites. Usually clicking the 'reload' button in my browser will properly resolve the next time.

As with other people, from my clients, if I attempt to do nslookups I do get initial time-out errors virtually 100%:

C:\Users\tentpiglet>nslookup cnn.com
Server:  pfsense.tentpiglet.XXXXXXXXXXX.org
Address:  192.168.1.254

DNS request timed out.
    timeout was 2 seconds.
Non-authoritative answer:
Name:    cnn.com
Addresses:  2a04:4e42:600::323
          2a04:4e42:400::323
          2a04:4e42::323
          2a04:4e42:200::323
          151.101.193.67
          151.101.65.67
          151.101.129.67
          151.101.1.67
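
To put numbers on "initial time-outs" instead of eyeballing nslookup, a small timing loop can help. This is a minimal sketch under assumptions: `time_lookups` is a made-up helper, it uses the client's OS resolver through `socket.gethostbyname` (IPv4 answers only), and the 2-second threshold simply mirrors the nslookup timeout shown above.

```python
import socket
import time
from typing import Callable, Iterable


def time_lookups(names: Iterable[str],
                 resolve: Callable[[str], object] = socket.gethostbyname,
                 slow_after: float = 2.0) -> list[tuple[str, float, bool]]:
    """Resolve each name once and record (name, seconds, slow_or_failed)."""
    results = []
    for name in names:
        started = time.monotonic()
        try:
            resolve(name)
            failed = False
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            failed = True
        elapsed = time.monotonic() - started
        results.append((name, elapsed, failed or elapsed >= slow_after))
    return results


# Usage: run it a few times, a few minutes apart, and compare.
# for name, secs, flagged in time_lookups(["cnn.com", "login.microsoftonline.com"]):
#     print(f"{name:30s} {secs:6.3f}s {'SLOW/FAILED' if flagged else 'ok'}")
```

Passing the resolver in as a parameter keeps the loop usable with another lookup function, for example one that queries the pi-hole instead.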

johnpoz

@tentpiglet that domain has multiple cnames that need to be followed

; QUESTION SECTION:
;login.microsoftonline.com.     IN      A

;; ANSWER SECTION:
login.microsoftonline.com. 30   IN      CNAME   ak.privatelink.msidentity.com.
ak.privatelink.msidentity.com. 30 IN    CNAME   www.tm.ak.prd.aadg.akadns.net.

If you're having an issue with resolving - maybe because of flaky IPv6 - then yeah, such records would be more problematic than most.

edit: also, if you have strict qname minimization set, you could have issues
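
The CNAME chain in the dig output above can be sketched as a simple lookup table. This is only an illustration, assuming the two records shown (the output is truncated, so the final address records are not included); `follow_cnames` is a made-up helper, not how unbound does it internally.

```python
# Records copied from the dig output above; the chain is truncated there,
# so the last name has no entry here either.
CNAMES = {
    "login.microsoftonline.com.": "ak.privatelink.msidentity.com.",
    "ak.privatelink.msidentity.com.": "www.tm.ak.prd.aadg.akadns.net.",
}


def follow_cnames(name: str, cnames: dict[str, str], max_hops: int = 10) -> list[str]:
    """Return the list of names visited while following CNAME records."""
    chain = [name]
    while chain[-1] in cnames:
        if len(chain) > max_hops:
            raise RuntimeError("CNAME chain too long (possible loop)")
        chain.append(cnames[chain[-1]])
    return chain


chain = follow_cnames("login.microsoftonline.com.", CNAMES)
print(" -> ".join(chain))
print("CNAME hops:", len(chain) - 1)
```

Each hop lands in a different zone (microsoftonline.com, msidentity.com, akadns.net), so a resolver may have to walk a separate delegation for each, and with 30 second TTLs those answers fall out of the cache quickly.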

PCOL IT Admin

@gertjan Just reporting what worked to resolve (no pun intended!) my issue (which was bad & disruptive, and only started after the 22.05 upgrade...) So I am going to try and re-enable it, but one thing I've noticed is that there's a lot of pushback from you and @johnpoz against anything being wrong with Unbound in 22.05... Can you at least accept that there is some issue going on (intermittently, which sucks for t-shooting) post 22.05 upgrade for some of us? Let's not blame the user just because "works on my machine"...

I have 22.05 now running on my 3100 (which was problematic), a 2100 under my admin (at my house of worship), and a 4100 I just deployed at work... (I also have three other Intel-based platforms running pfSense at work as well, would need to check the releases on those.) If we need data from these platforms to assist problem identification efforts, please let me know.

lohphat

What is the current tally that disabling IPv6 was implicated in resolving the issue?

PCOL IT Admin

This from the 2100 gateway that I just upgraded to 22.05-RELEASE over the weekend:

[22.05-RELEASE][admin@pcol-gw.pclawrenceville.lan]/root: unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total
total.num.queries=588600
total.num.queries_ip_ratelimited=0
total.num.cachehits=404414
total.num.cachemiss=184186
total.num.prefetch=0
total.num.expired=0
total.num.recursivereplies=184160
total.requestlist.avg=7.03968
total.requestlist.max=128
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=30
total.requestlist.current.user=7
total.recursion.time.avg=26.928807
total.recursion.time.median=0.0733518
total.tcpusage=0

total.recursion.time.avg looks bad to me... I did a test from a browser to a domain I never used from this location (www.sniffer.com) and it did lag for ~10 sec's before the page rendered.
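
A quick way to read those numbers: the sketch below parses a few of the `key=value` lines from the output above and compares average with median recursion time. `parse_stats` is a made-up helper and only a subset of the counters is pasted in.

```python
STATS = """\
total.num.queries=588600
total.num.cachehits=404414
total.num.cachemiss=184186
total.recursion.time.avg=26.928807
total.recursion.time.median=0.0733518
"""


def parse_stats(text: str) -> dict[str, float]:
    """Turn 'key=value' lines from unbound-control into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            stats[key.strip()] = float(value)
    return stats


stats = parse_stats(STATS)
hit_ratio = stats["total.num.cachehits"] / stats["total.num.queries"]
skew = stats["total.recursion.time.avg"] / stats["total.recursion.time.median"]
print(f"cache hit ratio: {hit_ratio:.1%}")
print(f"avg/median recursion time: {skew:.0f}x")
```

A median around 73 ms with an average several hundred times larger suggests that most recursions are fine and a comparatively small number take extremely long, which fits the intermittent lag.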

tentpiglet

@johnpoz said in Slow DNS after 22.05:

edit: also, if you have strict qname minimization set, you could have issues

As previously indicated, I have a fairly "stock", out-of-the-box DNS resolver setup. I think the only two items I checked in the setup pages were the DHCP Registration and Static DHCP options.

Much the same way I run a completely stock pi-hole setup as a VM which has zero issues.

johnpoz

@pcol-it-admin said in Slow DNS after 22.05:

against anything being wrong with Unbound in 22.05

Never said that - actually even pointed out that there could be.. As mentioned in the other thread where do-ip6 was mentioned.. And what version of unbound pfSense is on.

My point is there are lots of variables to take into account. There is currently no default bug in unbound that I am aware of; if there was, the boards would be on fire, and I can assure you if I was having issues with unbound on pfSense I would have reported it as an actual bug already with my exact findings and how to duplicate it. But since I'm currently having zero issues with it, it points to something specific presenting with a specific configuration or even set of configurations.

If we can show a specific known issue, that everyone with XYZ is seeing a specific problem, then we could push for unbound to be updated or rolled back in the current version of pfSense, etc.

But I have yet to be presented with a specific setting or set of settings that causes an issue - if you're having unbound try to use IPv6 and you have flaky IPv6, then yes, that could be problematic.

I have currently changed my setting to allow access via my HE ipv6 tunnel, and lets see if that causes an issue.

I can see it doing queries via IPv6 to different NSers via just the resolver status page - so let that run for a while and see if I notice any issues with unbound..

PCOL IT Admin

@lohphat said in Slow DNS after 22.05:

Well, this failure mode is intermittent -- it hits then resolves, then hits again later, rinse, repeat. It doesn't happen in DNS Forward mode. So I'm guessing whatever it is, is happening in the local cache.
The failure mode affects different devices on different internal networks which are of different architectures: Win11, iPad OS, Roku, Android.
The behavior started after the 22.05 update almost immediately. No other changes other than the base image were made.

I had the same experience... it was affecting the whole family here, on a variety of devices. Importantly, I had the stock DNS Resolver settings (whatever defaults pfSense has) until I started changing some settings (set DNS forwarding, turn off IPv6) to try and resolve the issue, as the fam was unhappy...

PCOL IT Admin

@johnpoz said in Slow DNS after 22.05:

Never said that - actually even pointed out that there could be.. As mentioned in the other thread where do-ip6 was mentioned.. And what version of unbound pfSense is on.

Upon a review (a LOT of messages in this thread!) it seems that you just were asking for more precise detail, and did participate in trying to determine what might be the issue. My apologies.

However, there has been a lot of seeming finger-pointing by some other folks here at the users; it's surely possible that users setting options incorrectly may cause problems for themselves, but I think many of the folks reporting this issue had said that they had "stock" pfSense DNS resolver settings, as I know I did. The only thing I did was to upgrade pfSense to 22.05, and I went from being problem-free in DNS resolving (for years), to having a problem.

Jax

@pcol-it-admin said in Slow DNS after 22.05:

I think many of the folks reporting this issue had said that they had "stock" pfSense DNS resolver settings, as I know I did. The only thing I did was to upgrade pfSense to 22.05, and I went from being problem-free in DNS resolving (for years), to having a problem.

That was precisely my experience: commercial user, Netgate 2100 ARM device, no DNS mods on my part, 18 months trouble-free, suddenly broken on 22.05 upgrade.

BTW ... have we ruled out ARM-specificity for this problem?

johnpoz

@pcol-it-admin said in Slow DNS after 22.05:

said that they had "stock" pfSense DNS resolver settings

I find this is rarely the case to be honest.. You also don't know what device they are on - maybe being on ARM is part of the issue, or makes the problem more pronounced for them?

An issue with "stock" settings, if you will, could be DNSSEC when the user changed to forwarding; if you forward, that should really be off. Lots of users love to use TLS forwarding, along with leaving DNSSEC on - again problematic if you ask me. But they didn't change anything else - so to them it's "stock"

Might be they have no IPv6 at all, or maybe their IPv6 is solid - and it's only the user who thinks they don't even have it? Maybe their client is not using it, maybe the pfSense WAN has it and unbound is using it as transport for queries.

Then we just get users saying they get some error in their browsers - for all we know "stock" in most browsers these days is using doh, and not even using local dns.

They might have "stock" but are using pfblocker and loading huge lists of blocks via DNSBL.. And they are also, with "stock", registering DHCP.. And now unbound is restarting every 10 minutes or something, and that can present as a problem with DNS. That has been a "stock" problem for a while - but really isn't an issue per se with unbound at all.
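
To check whether unbound is restarting a lot, the restart notices in the resolver log can be counted and timed. A minimal sketch, assuming the log format from the excerpt posted earlier in this thread; `restart_gaps` is a made-up helper, and the year is an assumption because syslog lines carry none.

```python
import re
from datetime import datetime

# Two lines taken from the log excerpt earlier in the thread (newest first).
LOG = """\
Aug 10 15:51:09 unbound 69581 [69581:0] notice: Restart of unbound 1.15.0.
Aug 10 15:50:27 unbound 69581 [69581:0] notice: Restart of unbound 1.15.0.
"""

RESTART = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d)\s.*notice: Restart of unbound")


def restart_gaps(log_text: str, year: int = 2022) -> list[float]:
    """Return the seconds between consecutive unbound restart notices."""
    stamps = []
    for line in log_text.splitlines():
        match = RESTART.match(line)
        if match:
            # Syslog lines carry no year, so one is assumed.
            stamps.append(datetime.strptime(f"{year} {match.group(1)}",
                                            "%Y %b %d %H:%M:%S"))
    stamps.sort()
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


print(restart_gaps(LOG))
```

Every restart starts unbound with an empty cache, so short gaps between restarts mean clients keep paying the full recursion time.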

Part of the problem with any sort of DNS issue is, to be honest, that many users don't actually understand how it works. Or at least not at a level that allows for detailed troubleshooting. So it's difficult to get actual details of what is actually going on vs they say their browser gives some error and they have to refresh the page.. They don't know how to troubleshoot it, etc.

Like I said, I have had zero issues with 22.05 and DNS - same goes for others in this thread.. So what is different with our setups, or our connections.. Clearly it's not a base problem or everyone would be having the issue.

I have had it running now for a bit able to do ipv6 transport queries - and haven't noticed any issues. And I don't show any timeouts in the resolver status page..

I was meaning to set up a 3100 I have here (ARM) and run through some basic dnsperf tests in a loop to see if an issue would present itself, to help pinpoint where the issue is.

Currently the most likely issue is the ipv6 transport, but this could have other factors that exacerbate the problem for some users.

I do not see any currently reported bugs for 22.05 and unbound in redmine related - there is something about fqdn having issues in aliases.

Another variable is VPN connections; users love to use VPNs on pfSense and force all traffic through them. What is required to try and pin it down is users willing to actually provide details of what they are seeing: is unbound restarting, are they using IPv6 - a few have chimed in as of late with good details. In those cases all local stuff was working, so it wasn't like unbound was crashing or hung up completely, etc.