DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times
-
Finally got one!!!
I pasted everything in the log back a few minutes here in case the totality of it is usefulhttps://pastebin.com/w2SGh8P0
@johnpoz Sorry for the delay in getting this I swear i was trying the whole time. thanks for the patience.
-
-
-
@johnpoz got another one! though it does seem to be happening with a lot less frequency for some reason, i've just gotten better at catching them during the quick window of opportunity:
-
This one got a NXDOMAIN error:
-
@RickyBaker said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
This one got a NXDOMAIN error:
That is a URL not a hostname so it should fail. Remove the /apps/staff (as shown in the prior post).
Searching for "exceeded the maximum number of sends" looks like DNSSEC...:
https://community.ipfire.org/t/servfail-exceeded-the-maximum-number-of-sends/7645
https://www.reddit.com/r/pihole/comments/11hqrco/intermittent_servfail_when_using_unbound/this one talks about not using UDP for DNS...?!
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270824This one talks about torrenting and DNSSEC:
https://www.reddit.com/r/opnsense/comments/1cinuyn/unbound_dns_issues_freezes_randomly/ -
About :
This very issue (or whatever it is) has its own thread on NLnetLabs (the author of unbound) exceeded the maximum nameserver nxdomains.
One of the authors of unbound is also answering.
Some tips are present.Btw : this is DNS at its finest. I'll take this one home tonight, need to read it again.
Latest posts in that thread are just hours ago.
Here you go :
server: qname-minimisation: no aggressive-nsec: no infra-keep-probing: yes infra-cache-max-rtt: 2000 infra-host-ttl: 0 outbound-msg-retry: 32 max-sent-count: 128
dono what the impact will be ....
I've never seen this "exceeded the maximum nameserver nxdomains" message myself. -
@Gertjan said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
Here you go :
is the suggestion to throw this into the custom options section of the dns resolver? I'll check out all the links, was just looking to confirm the suggestion you had forwarded on...
-
Exact.
Like this :
-
@Gertjan awesome, thanks for clarification. It's been added. I'll read up on all these threads while I wait for it to fail...
-
Just to link that other thread, which two of us linked above, to this one:
https://forum.netgate.com/topic/188297/sporadic-dns-issues-cryptic-error-in-logs/ -
@SteveITS said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
this one talks about not using UDP for DNS...?!
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270824reading through these now, this one caught my eye because one of the only packages i have installed is UDP Broadcast Relay (in order to forward across the VLAN's I set up). Though i'll be honest, I really don't know much about UDP/TCP and not sure if this is the same ballpark as the Broadcast Relay. I can try the tcp-upstream: yes option after I feel confident the last iteration of changes didn't solve the issue.
-
@SteveITS said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
https://www.reddit.com/r/opnsense/comments/1cinuyn/unbound_dns_issues_freezes_randomly/
This is another interesting theory, but I searched my log and I don't see anything referencing a tracker. Though I did just notice my enphase solar controller also just got a bunch of servfails...
-
@SteveITS said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
https://community.ipfire.org/t/servfail-exceeded-the-maximum-number-of-sends/7645
just to dovetail all my thoughts on reading these links: this one seems very promising but doesn't have the total solution included. Though it also claims that DNSSEC was the culprit and I'm quite certain I was still experiencing the issue with DNSSEC disabled....
-
Just to stay on top of things: I've yet to experience an outage like i'm used to (DNS_PROBE_FINISHED_NXDOMAIN). I was out of the house so its atypical but not unheard of (the length of time that is). However last night I tried to open an ebay link and the app wasn't able to bring up the item. It was an odd error in the app, but the really odd thing was that I was able to load webpages in chrome. Just documenting my journey, i'll be sure to grab any NXDOMAIN issues i catch in the coming days...
-
@Gertjan just noticed an interesting thing: all 5 of my Nest Protects (Google Fire Alarms) are reporting not being connected to the internet (WiFi issue). It was last checked about 18 hours ago (a few hours after I made the changes, so maybe it checked a few times before it stopped retrying). The History showed that they have been connected and without error for as long as the History goes back so seems likely connected to these specific changes....But also no outages yet besides that weird ebay one that was not the same as the usual.
-
@RickyBaker if I'm not mistaken this is WAY more servfail's that i was previously experiencing:
-
@RickyBaker 3 instances of squeakydoor.nest.com servfail and 543 instances (probably a third as many fails) of time.google.com or some derivation of. I was not seeing this before the latest round of changes...
-
@RickyBaker The images above were lost due to the forum error...do you have forwarding enabled? This:
-
@RickyBaker said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
if I'm not mistaken this is WAY ...
Scrap all the lines that terminate with "localdomain"
a3ayogyffhzcp1.iot.us-east-1.amazonaws.com.localdomain
as that domain doesn't exist - so that's a fail for sure.
The DNS request, coming from a LAN device, was wrong.
It should have asked fora3ayogyffhzcp1.iot.us-east-1.amazonaws.com.
You saw the terminating dot ? That means that .com. is the TLD. Without the terminating dot unbound starts by adding it's own local domain first, which will fail.
But there are a bunch of IOT in place here, and these aren't known for their nice DNS requesting.
Coupled with the huge forest of DNS domain servers, domain name server, as we have to deal with "amazonaws" here .... (world's best organized DNS mess ever).
To make things even worse : the Time To Live (TTL) is set to 60 seconds. So a request A record has to be request again within 60 seconds.
Yeah, things can get messy quick. "Lets add another IOT" ^^Btw : this nicely looking host name actually resolves :
[24.03-RELEASE][root@pfSense.bhf.tld]/root: dig a3ayogyffhzcp1.iot.us-east-1.amazonaws.com +short 54.209.119.230 52.71.151.159 54.164.100.117 52.87.91.214 54.162.199.177 52.70.244.97 54.147.162.149 52.4.223.197
-
AFAIK : he doesn't use the Forwarding mode. He's resolving.
What have you set here :
?
And the main Resolver setting :
right ?
edit :
@RickyBaker Let's forget the GUI, you (we) are dealing with lots of info here, and the GUI is nice, but not fast enough.
Go console or SSH that's the access you need. (and this isn't an over statement).
Get in.
Console menu option 8.
Type
cd /var/log/
ls -al resol*.*and now you see the main reslver.log file, and the x rolled over log files.
tail -f resolver.log
and now your looking at the resolver log file, in real time.
This will 'stop' or 'hang' after moments, as the file '"resolver.log" already reached its maximum size, so it gets rotated by the 'syslog' daemon into (for example resolver.log will get rename din resolver.log.0 and resolver.log.0 will get renamed in resolver.log.1 etc etc. And a new empty resolver.log will get created ....but our tail command is tailing a file ... that doesn't exist anymore.
Hence the need of comfortable huge log files .... but keep in mind the disk space left, right ;)I don't see any "query:" or "reply:" lines ... ?!
<31>1 2024-05-22T18:16:54.835744+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: comm point start listening 17 (-1 msec) <31>1 2024-05-22T18:16:54.853298+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: answer cb <31>1 2024-05-22T18:16:54.853317+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: Incoming reply id = ee3f <31>1 2024-05-22T18:16:54.853335+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: Incoming reply addr = ip4 150.171.16.34 port 53 (len 16) <31>1 2024-05-22T18:16:54.853347+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: lookup size is 1 entries <31>1 2024-05-22T18:16:54.853359+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: received udp reply. <31>1 2024-05-22T18:16:54.853408+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: udp message[73:0] EE3F8430000100010000000103646E73084D5366546E43734903634F4D00001C0001C00C001C000100000E100010FD3E4F5A5B810000000000000000000100002904D0000000000000 <31>1 2024-05-22T18:16:54.853425+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: outnet handle udp reply <31>1 2024-05-22T18:16:54.853440+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: serviced query: EDNS works for ip4 150.171.16.34 port 53 (len 16) <31>1 2024-05-22T18:16:54.853455+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: measured roundtrip at 17 msec <31>1 2024-05-22T18:16:54.853483+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: svcd callbacks start <31>1 2024-05-22T18:16:54.853498+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: good 0x20-ID in reply qname <31>1 2024-05-22T18:16:54.853511+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: worker svcd callback for qstate 0x14da413cbb90 <31>1 2024-05-22T18:16:54.853525+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: start <31>1 2024-05-22T18:16:54.853538+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: iterator[module 2] operate: extstate:module_wait_reply event:module_event_reply <30>1 2024-05-22T18:16:54.853550+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: iterator operate: query dns.msftncsi.com. AAAA IN <31>1 2024-05-22T18:16:54.853562+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: process_response: new external response event <30>1 2024-05-22T18:16:54.853586+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: scrub for msftncsi.com. NS IN <31>1 2024-05-22T18:16:54.853605+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: sanitize: removing public name with private address <dns.msftncsi.com.> fd3e:4f5a:5b81::1#53 <30>1 2024-05-22T18:16:54.853621+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: response for dns.msftncsi.com. AAAA IN <30>1 2024-05-22T18:16:54.853635+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: reply from <msftncsi.com.> 150.171.16.34#53 <30>1 2024-05-22T18:16:54.853693+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0 ;; flags: qr aa ; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0 ;; QUESTION SECTION: dns.msftncsi.com. IN AAAA ;; ANSWER SECTION: ;; AUTHORITY SECTION: ;; ADDITIONAL SECTION: ;; MSG SIZE rcvd: 34 <31>1 2024-05-22T18:16:54.853726+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: iter_handle processing q with state QUERY RESPONSE STATE <30>1 2024-05-22T18:16:54.853741+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: query response was nodata ANSWER <31>1 2024-05-22T18:16:54.853766+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: iter_handle processing q with state FINISHED RESPONSE STATE <30>1 2024-05-22T18:16:54.853782+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: finishing processing for dns.msftncsi.com. AAAA IN <31>1 2024-05-22T18:16:54.853795+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: iterator module exit state is module_finished <31>1 2024-05-22T18:16:54.853807+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone <30>1 2024-05-22T18:16:54.853822+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: validator operate: query dns.msftncsi.com. AAAA IN <31>1 2024-05-22T18:16:54.853833+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: validator: nextmodule returned <31>1 2024-05-22T18:16:54.853846+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: val handle processing q with state VAL_INIT_STATE <31>1 2024-05-22T18:16:54.853858+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: validator classification nodata <30>1 2024-05-22T18:16:54.853877+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: no signer, using dns.msftncsi.com. TYPE0 CLASS0 <31>1 2024-05-22T18:16:54.853893+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: val handle processing q with state VAL_FINISHED_STATE <31>1 2024-05-22T18:16:54.853911+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: validator module exit state is module_finished <31>1 2024-05-22T18:16:54.857015+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: python module exit state is module_finished <30>1 2024-05-22T18:16:54.857219+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: send_udp over interface: 192.168.1.1 <31>1 2024-05-22T18:16:54.857292+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: query took 0.045000 sec <31>1 2024-05-22T18:16:54.857314+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: pythonmod: clear, id: 0, pq:0x14da4662c458 <30>1 2024-05-22T18:16:54.857335+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: mesh_run: end 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 33 recursion replies sent, 0 replies dropped, 0 states jostled out <30>1 2024-05-22T18:16:54.857348+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: average recursion processing time 0.095481 sec <30>1 2024-05-22T18:16:54.857360+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: histogram of recursion processing times <30>1 2024-05-22T18:16:54.857381+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: [25%]=0.0216503 median[50%]=0.0431942 [75%]=0.073728 <30>1 2024-05-22T18:16:54.857393+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: lower(secs) upper(secs) recursions <30>1 2024-05-22T18:16:54.857406+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.000000 0.000001 2 <30>1 2024-05-22T18:16:54.857418+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.004096 0.008192 2 <30>1 2024-05-22T18:16:54.857430+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.008192 0.016384 2 <30>1 2024-05-22T18:16:54.857442+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.016384 0.032768 7 <30>1 2024-05-22T18:16:54.857454+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.032768 0.065536 11 <30>1 2024-05-22T18:16:54.857467+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.065536 0.131072 6 <30>1 2024-05-22T18:16:54.857483+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.131072 0.262144 1 <30>1 2024-05-22T18:16:54.857495+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.262144 0.524288 1 <30>1 2024-05-22T18:16:54.857507+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 1.000000 2.000000 1 <31>1 2024-05-22T18:16:54.857527+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: cache memory msg=311360 rrset=648279 infra=254810 val=86940 <31>1 2024-05-22T18:16:54.857539+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: svcd callbacks end <31>1 2024-05-22T18:16:54.857556+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: close of port 10803 <31>1 2024-05-22T18:16:54.857573+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: close fd 17
This is a 100 ms (0,1 sec) snippet of my resolver.log using "Level 4" : pure madness.
And I'm doing mode full DNSSEC.
I left it running for an hours or so .... -
@Gertjan @SteveITS I'm still catching up on your posts (thanks for taking the time), but i'm unfortunately here to report that it happened again:
I realize it's neither here nor there, but I keep going back to WHAT caused this to start happening? I made no major (or minor that i know of!) change when it started and it's so consistent. It's all consuming