Slow DNS after 22.05
-
@johnpoz said in Slow DNS after 22.05:
And what specifically were you looking for? NX is not a DNS failure; it means that what you were looking for doesn't exist.
Randomly, any site I would go to. For example, the last time it happened when I said "enough is enough" and spun up a pihole container on my Proxmox server, it was "www.facebook.com".
I'm pretty sure www.facebook.com exists.
My wife was also reporting random disconnects from an online game she was playing, as well as getting similar site not found errors on her phone while reading reddit. Since moving DNS over to the pihole container, she's experienced no further issues.
-
@tentpiglet said in Slow DNS after 22.05:
I'm pretty sure www.facebook.com exists.
An NX is a DNS response that says the domain doesn't exist. Why you got that error I have no idea, but it is not a failure to talk to DNS; whatever your DNS was doing ended up with NX. Maybe you were doing a query for www.facebook.com.org or www.facebook.com.somethingelse.tld
Your browser trying to help you autocomplete something, etc.
I would expect you to have all kinds of issues if your local DNS is restarting 64 times in less than 24 hours. So yeah, that would be problematic for sure.
-
@johnpoz Like I described earlier, I experienced this myself on duckduckgo.com when set as my primary search engine in the browser. Sometimes it would work, other times not. Reloading a number of times would often work, but that could of course be an illusion if the cause was that some process was restarting.
What would be the preferred test to demonstrate the problem, given its intermittent nature?
-
@johnpoz @provels @Gertjan, maybe this is not a widespread problem, but it seems to be more than a single user/unit problem.
Talking about the supposed thousands of other users who are not experiencing this problem does not help resolve this clear issue in version 22.05. As I stated, even with a clean install I can clearly replicate the issue when upgrading from 22.01 to 22.05: the issue immediately presents itself. The clients obviously can't resolve the domains, hence the client(s) getting the errors @tentpiglet was describing.
@Kempain @tentpiglet: have you tried my suggestion for resolving the issue (although maybe a temporary resolution)? As stated, I have been running for 1.5 weeks without problems at this moment.
-
BTW, I'm not running in a virtualized environment.
Running bare metal on the "same" microserver hardware as the SG-7100: Supermicro board with Atom C3758.
-
And further: as I am not having the problem any more, I was just replying to help others out. But my solution is not to my satisfaction, as this should not be the resolution but rather a workaround for a problem for which we do not have a clear cause (yet).
-
@kvhs said in Slow DNS after 22.05:
preferred test to demonstrate the problem, given its intermittent nature?
How about making sure unbound isn't restarting every 300 seconds, for starters? That is only going to cause trouble when trying to actually find the issue.
Simple stats output could be very enlightening as to what might be going on.
[22.05-RELEASE][admin@sg4860.local.lan]/root: unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total.
total.num.queries=66013
total.num.queries_ip_ratelimited=0
total.num.cachehits=52460
total.num.cachemiss=13553
total.num.prefetch=27990
total.num.expired=24624
total.num.recursivereplies=13553
total.requestlist.avg=0.319452
total.requestlist.max=30
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=0
total.requestlist.current.user=0
total.recursion.time.avg=0.086462
total.recursion.time.median=0.0408701
total.tcpusage=0
[22.05-RELEASE][admin@sg4860.local.lan]/root:
You can see average recursion time, median etc..
There are lots of things that might present themselves just from looking at the stats.
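One number that is quick to pull out of stats like these is the cache hit ratio. The following is only a minimal sketch, assuming the stats are piped in exactly as printed above; `hit_ratio` is a made-up helper name, not an unbound command.

```shell
#!/bin/sh
# hit_ratio: hypothetical helper, not part of unbound.
# Reads "unbound-control ... stats_noreset" output on stdin and prints
# the share of queries answered from cache.
hit_ratio() {
    awk -F= '
        $1 == "total.num.cachehits" { hits = $2 }
        $1 == "total.num.cachemiss" { miss = $2 }
        END {
            total = hits + miss
            if (total == 0) { print "no queries yet"; exit }
            printf "%.1f%% cache hits (%d of %d)\n", 100 * hits / total, hits, total
        }'
}

# Usage on the firewall (not executed here):
#   unbound-control -c /var/unbound/unbound.conf stats_noreset | hit_ratio
```

A resolver that keeps restarting starts each time from an empty cache, so a low ratio together with a small total.num.queries value would be consistent with frequent restarts.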
But you're saying something is causing NX in your browser. OK, do that specific query. www.facebook.com shouldn't come back NX, so why did your browser say that? Did you actually look for www.facebook.com or was it something else?
Let's see a dig +trace so we can see if you're having a connection issue to something in the resolve path. But again, a connection issue wouldn't cause an NX. An NX is a specific response to what you asked for, with some NS saying sorry, that does not exist: be it the root for the TLD, a gTLD server for the domain, or the authoritative NS for the domain telling you that record does not exist, etc.
I don't buy that you were told www.facebook.com was NX. Maybe it was www.facbook.c0m or some other typo, etc. If www.facebook.com came back as NX, let's see the query showing that.
It's hard to get to the bottom of what is going on when users just say "me too" or "having a DNS problem since I went to 22.05", with zero information on what they are doing or trying to do, or what the specific failure actually is. Like I said, for all we know their browser is using DoH, or maybe they are trying to route through a VPN and that VPN is going down, or maybe something they are trying to look up is blocking their VPN connection, etc.
There are loads of things that could be going on..
The only thing I can say for sure is that I have seen zero issues with DNS going from 22.01 to 22.05 - zero!! So if someone is having an issue, we need info to figure it out. It sure is not something generically wrong in unbound, or everyone would be seeing the issue and the board would be aflame with posts complaining that DNS broke on 22.05, when clearly that is not the case.
-
@johnpoz said in Slow DNS after 22.05:
Maybe you were doing a query for www.facebook.com.org or www.facebook.com.somethingelse.tld
Yeah... no, it wasn't that.
-
@mikymike82 can you reiterate what your solution was, I've scrolled back but there's a lot of fluff in here so I can't seem to find it.
-
@johnpoz To answer your question (from my experience): it's not just "facebook.com", it's everything, from apps to normal websites. Random websites not resolving and apps not working. So it is not a specific client, website or browser: mobile, desktop, laptops, narrowcasting, everything that's trying to resolve an address.
Again, my "solution" seems to resolve the problem at hand, but this is not "normal" behaviour in my opinion.
-
@tentpiglet Well, where did it fail? www.facebook.com is a CNAME:
;; ANSWER SECTION:
www.facebook.com. 30 IN CNAME star-mini.c10r.facebook.com.
star-mini.c10r.facebook.com. 30 IN A 157.240.18.35
With a 30 second TTL, etc. Where did you go ask after that? Are you running strict QNAME minimisation? An NX is a specific response from an NS. It's not a timeout or a SERVFAIL; it's a specific response saying hey, what you're asking for doesn't exist.
Even a typo with four w's (wwww) returns an answer, not an NX:
;; QUESTION SECTION:
;wwww.facebook.com. IN A
;; ANSWER SECTION:
wwww.facebook.com. 3600 IN CNAME star.facebook.com.
star.facebook.com. 3600 IN CNAME star.c10r.facebook.com.
star.c10r.facebook.com. 3600 IN A 157.240.18.15
If I ask for some gibberish, then I get back NX, from the AUTHORITATIVE SOA for that domain:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 33101
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;lsjfdsldjf.facebook.com. IN A
;; AUTHORITY SECTION:
facebook.com. 3600 IN SOA a.ns.facebook.com. dns.facebook.com. 220198737 14400 1800 604800 300
So to troubleshoot an NX you are getting, we need to know the specifics. It's not an "oops, unbound is not running currently", or "I'm trying to ask 1.2.3.4 but they are not answering".
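To tell these cases apart when capturing evidence, the status field in dig's header line is the thing to record. A rough sketch, assuming dig output in the format shown above; `rcode_of` is a hypothetical helper name, and NO_RESPONSE is just a label chosen here for output that contains no header at all (as happens when dig times out).

```shell
#!/bin/sh
# rcode_of: hypothetical helper, not part of dig.
# Reads dig output on stdin and prints the status from the ->>HEADER<<- line
# (NOERROR, NXDOMAIN, SERVFAIL, ...), or NO_RESPONSE if no header was seen.
rcode_of() {
    awk '
        /->>HEADER<<-/ {
            for (i = 1; i < NF; i++) {
                if ($i == "status:") {
                    s = $(i + 1)
                    sub(/,$/, "", s)   # drop the trailing comma
                    print s
                    found = 1
                    exit
                }
            }
        }
        END { if (!found) print "NO_RESPONSE" }'
}

# Usage (not executed here; 10.x.x.x stands for the pfSense LAN address):
#   dig @10.x.x.x www.facebook.com A | rcode_of
```

An NXDOMAIN here means some name server answered and said the name does not exist; NO_RESPONSE or SERVFAIL points at a reachability or resolver problem instead.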
-
total.num.queries=77
total.num.queries_ip_ratelimited=0
total.num.cachehits=22
total.num.cachemiss=55
total.num.prefetch=0
total.num.expired=0
total.num.recursivereplies=55
total.requestlist.avg=0.272727
total.requestlist.max=4
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=0
total.requestlist.current.user=0
total.recursion.time.avg=0.127636
total.recursion.time.median=0.0505173
total.tcpusage=0
What do you make of this? Unbound has restarted again...
-
@cool_corona said in Slow DNS after 22.05:
Unbound has restarted again...
Yup - it's going to be horrible as a caching resolver if it restarts every few seconds.
-
@mikymike82 said in Slow DNS after 22.05:
@Kempain @tentpiglet: have you tried my suggestion for resolving the issue (although maybe a temporary resolution)? As stated, I have been running for 1.5 weeks without problems at this moment.
I'm actually not keen on forwarding my DNS unnecessarily but I can understand how it would resolve the issue since it would then bypass unbound.
I was planning on doing a fresh install until I heard you're still experiencing issues after doing that.
Still might, to rule it out for me too, though.
I've been monitoring unbound and it's been going strong for 13371 seconds now!
Memory usage of unbound doesn't seem to be particularly high, although it is increasing slowly, which is probably to be expected.
Just set logging to level 5 to try and capture more info, although I'm not sure the differences between 4 and 5 will help, as they seem to be related to identifying which client is having an issue, and I know I'm experiencing it across device types.
This has meant that unbound has been restarted, so will see how it goes...
-
unbound-control -c /var/unbound/unbound.conf stats_noreset | grep total
total.num.queries=159
total.num.queries_ip_ratelimited=0
total.num.cachehits=61
total.num.cachemiss=98
total.num.prefetch=0
total.num.expired=0
total.num.recursivereplies=98
total.requestlist.avg=1.11224
total.requestlist.max=31
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=0
total.requestlist.current.user=0
total.recursion.time.avg=1.462659
total.recursion.time.median=1.03385
total.tcpusage=0
Unbound has only just restarted so take those with a pinch of salt.
Interesting @tentpiglet mentioned 'DNS_PROBE_FINISHED_NXDOMAIN' errors in the browser because I've also been experiencing those at the same time as having DNS issues so I believe they are in some way related.
-
@kempain Same here, but in my production environment I can only troubleshoot so much. I'm very curious whether you can find anything else, rather than my workaround.
-
Mine is at home fortunately not in a corporate environment although I do have the wife's complaints to contend with
I have the image on USB ready to go with my backup in conf so I should be able to get back up and running pretty quickly once I finally decide to bite the bullet and do a re-install.
Just a bit wary of blowing out my settings because I'm using HAProxy and bunch of certs for internal services.
Relatively new to pfSense so don't want to F it up and spend all night fixing it.
-
Could it be an issue with cache?
-
Doing more nslookups from client during the issue I noticed a few things that seem pretty consistent.
My initial request/s to pfSense seem to timeout despite my client knowing the IP of pfSense.
If I keep placing requests, eventually I get a response, and usually only to IPv6 first.
Then in the next response both IPv6 and IPv4 after more timeouts.
After it does fully resolve, subsequent requests seem ok for a while.
It seems like it takes a few tries to resolve some un-cached addresses sometimes.
Unbound is not restarting at this time as I can see it's been running for a while now.
nslookup youtu.be
Server: pfsense.localdomain
Address: 10.x.x.x
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
*** Request to pfsense.localdomain timed-out

nslookup youtu.be
Server: pfsense.localdomain
Address: 10.x.x.x
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
Name: youtu.be
Address: 2a00:1450:4009:81e::200e

nslookup youtu.be
Server: pfsense.localdomain
Address: 10.x.x.x
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
DNS request timed out.
timeout was 2 seconds.
Name: youtu.be
Address: 2a00:1450:4009:81e::200e
142.250.180.14
version: 1.15.0
verbosity: 5
threads: 4
modules: 2 [ validator iterator ]
uptime: 19646 seconds
options: control(ssl)
unbound (pid 96286) is running...

total.num.queries=10304
total.num.queries_ip_ratelimited=0
total.num.cachehits=2806
total.num.cachemiss=7498
total.num.prefetch=0
total.num.expired=0
total.num.recursivereplies=7497
total.requestlist.avg=3.25433
total.requestlist.max=39
total.requestlist.overwritten=0
total.requestlist.exceeded=0
total.requestlist.current.all=3
total.requestlist.current.user=1
total.recursion.time.avg=9.417250
total.recursion.time.median=0.423254
total.tcpusage=0
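Since the timeouts come and go, one way to demonstrate them is to repeat the same lookup at a fixed interval and log a timestamp with each result, so failures can later be lined up against the resolver log. This is only a sketch: `probe` is a hypothetical helper, and it assumes the lookup command passed in exits non-zero when the lookup fails (the `host` utility is assumed to behave that way; check the tool you actually use).

```shell
#!/bin/sh
# probe: hypothetical helper for logging intermittent lookup failures.
#   $1 = number of probes, $2 = seconds between probes,
#   remaining arguments = the lookup command to run each time.
# Assumes the lookup command exits non-zero on failure.
probe() {
    n=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$n" ]; do
        if "$@" >/dev/null 2>&1; then
            result=ok
        else
            result=FAIL
        fi
        echo "$(date '+%H:%M:%S') $result"
        i=$((i + 1))
        if [ "$i" -lt "$n" ]; then
            sleep "$delay"
        fi
    done
}

# Usage (not executed here; 10.x.x.x stands for the pfSense LAN address):
#   probe 720 5 host -W 2 youtu.be 10.x.x.x | tee dns-probe.log
```

Querying a name that is unlikely to be cached makes each probe exercise recursion rather than the cache.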
-
IMHO, this says to me:
nslookup tries to contact a first DNS server and, after the timeout, decides it can't.
A next DNS server is tried. It can't either.
A third one is tried (pfSense?) and this time there is an answer.
What do you have here (Dashboard, System Information)?
You have unbound running with maximum log detail?
OK for debugging, but think about putting that back to default as soon as possible.
Maximum log detail will overflow the (small) maximum log file size, so it will get rotated often, which means even more system and disk resources used.
Run on the command line:
grep 'start' /var/log/resolver.log
and try the settings I showed above: under Services > DNS Resolver > General Settings, remove the check from:
DHCP Registration
OpenVPN Clients
These two should be unchecked if you use pfBlockerNG-devel anyway.
Wait a day or so and run the command again.
unbound will also restart on interface events, like a WAN that changes its IP, or some other interface going down and up. These events can be seen in the main system log.
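To compare restart counts before and after changing those settings, the matching log lines can be tallied per day. A minimal sketch under two assumptions that should be checked against your own log: that each line begins with a BSD-style syslog timestamp (month, then day), and that unbound logs a "start of service" message on each start. `restarts_per_day` is a hypothetical helper name.

```shell
#!/bin/sh
# restarts_per_day: hypothetical helper, not a pfSense or unbound command.
# Reads resolver log lines on stdin and prints "<month> <day> <count>" for
# each day that has "start of service" lines, in the order the days appear.
# Assumes BSD-style syslog timestamps (month and day are fields 1 and 2).
restarts_per_day() {
    awk '
        /start of service/ {
            day = $1 " " $2
            if (!(day in count)) order[++days] = day
            count[day]++
        }
        END {
            for (i = 1; i <= days; i++) print order[i], count[order[i]]
        }'
}

# Usage on the firewall (not executed here):
#   restarts_per_day < /var/log/resolver.log
```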
@tentpiglet said in Slow DNS after 22.05:
My wife was also reporting random disconnects from an online game she was playing,
This might be a red flag.
Game playing involves no DNS interaction.
It's her PC/device against the game server. If this connection gets interrupted, then the issue is: you have a bad connection.
It could be local, like : the wifi is plain bad. Easy to test : that issue goes away as soon as you roll out a cable.
Or worse, your ISP uplink isn't as good as you think it is.
A bad uplink would also explain unreachable remote DNS servers.