DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times
-
@Gertjan awesome, thanks for clarification. It's been added. I'll read up on all these threads while I wait for it to fail...
-
Just to link that other thread, which two of us linked above, to this one:
https://forum.netgate.com/topic/188297/sporadic-dns-issues-cryptic-error-in-logs/ -
@SteveITS said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
this one talks about not using UDP for DNS...?!
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=270824reading through these now, this one caught my eye because one of the only packages i have installed is UDP Broadcast Relay (in order to forward across the VLAN's I set up). Though i'll be honest, I really don't know much about UDP/TCP and not sure if this is the same ballpark as the Broadcast Relay. I can try the tcp-upstream: yes option after I feel confident the last iteration of changes didn't solve the issue.
-
@SteveITS said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
https://www.reddit.com/r/opnsense/comments/1cinuyn/unbound_dns_issues_freezes_randomly/
This is another interesting theory, but I searched my log and I don't see anything referencing a tracker. Though I did just notice my enphase solar controller also just got a bunch of servfails...
-
@SteveITS said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
https://community.ipfire.org/t/servfail-exceeded-the-maximum-number-of-sends/7645
just to dovetail all my thoughts on reading these links: this one seems very promising but doesn't have the total solution included. Though it also claims that DNSSEC was the culprit and I'm quite certain I was still experiencing the issue with DNSSEC disabled....
-
Just to stay on top of things: I've yet to experience an outage like i'm used to (DNS_PROBE_FINISHED_NXDOMAIN). I was out of the house so its atypical but not unheard of (the length of time that is). However last night I tried to open an ebay link and the app wasn't able to bring up the item. It was an odd error in the app, but the really odd thing was that I was able to load webpages in chrome. Just documenting my journey, i'll be sure to grab any NXDOMAIN issues i catch in the coming days...
-
@Gertjan just noticed an interesting thing: all 5 of my Nest Protects (Google Fire Alarms) are reporting not being connected to the internet (WiFi issue). It was last checked about 18 hours ago (a few hours after I made the changes, so maybe it checked a few times before it stopped retrying). The History showed that they have been connected and without error for as long as the History goes back so seems likely connected to these specific changes....But also no outages yet besides that weird ebay one that was not the same as the usual.
-
@RickyBaker if I'm not mistaken this is WAY more servfail's that i was previously experiencing:
-
@RickyBaker 3 instances of squeakydoor.nest.com servfail and 543 instances (probably a third as many fails) of time.google.com or some derivation of. I was not seeing this before the latest round of changes...
-
@RickyBaker The images above were lost due to the forum error...do you have forwarding enabled? This:
-
@RickyBaker said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
if I'm not mistaken this is WAY ...
Scrap all the lines that terminate with "localdomain"
a3ayogyffhzcp1.iot.us-east-1.amazonaws.com.localdomain
as that domain doesn't exist - so that's a fail for sure.
The DNS request, coming from a LAN device, was wrong.
It should have asked fora3ayogyffhzcp1.iot.us-east-1.amazonaws.com.
You saw the terminating dot ? That means that .com. is the TLD. Without the terminating dot unbound starts by adding it's own local domain first, which will fail.
But there are a bunch of IOT in place here, and these aren't known for their nice DNS requesting.
Coupled with the huge forest of DNS domain servers, domain name server, as we have to deal with "amazonaws" here .... (world's best organized DNS mess ever).
To make things even worse : the Time To Live (TTL) is set to 60 seconds. So a request A record has to be request again within 60 seconds.
Yeah, things can get messy quick. "Lets add another IOT" ^^Btw : this nicely looking host name actually resolves :
[24.03-RELEASE][root@pfSense.bhf.tld]/root: dig a3ayogyffhzcp1.iot.us-east-1.amazonaws.com +short 54.209.119.230 52.71.151.159 54.164.100.117 52.87.91.214 54.162.199.177 52.70.244.97 54.147.162.149 52.4.223.197
-
AFAIK : he doesn't use the Forwarding mode. He's resolving.
What have you set here :
?
And the main Resolver setting :
right ?
edit :
@RickyBaker Let's forget the GUI, you (we) are dealing with lots of info here, and the GUI is nice, but not fast enough.
Go console or SSH that's the access you need. (and this isn't an over statement).
Get in.
Console menu option 8.
Type
cd /var/log/
ls -al resol*.*and now you see the main reslver.log file, and the x rolled over log files.
tail -f resolver.log
and now your looking at the resolver log file, in real time.
This will 'stop' or 'hang' after moments, as the file '"resolver.log" already reached its maximum size, so it gets rotated by the 'syslog' daemon into (for example resolver.log will get rename din resolver.log.0 and resolver.log.0 will get renamed in resolver.log.1 etc etc. And a new empty resolver.log will get created ....but our tail command is tailing a file ... that doesn't exist anymore.
Hence the need of comfortable huge log files .... but keep in mind the disk space left, right ;)I don't see any "query:" or "reply:" lines ... ?!
<31>1 2024-05-22T18:16:54.835744+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: comm point start listening 17 (-1 msec) <31>1 2024-05-22T18:16:54.853298+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: answer cb <31>1 2024-05-22T18:16:54.853317+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: Incoming reply id = ee3f <31>1 2024-05-22T18:16:54.853335+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: Incoming reply addr = ip4 150.171.16.34 port 53 (len 16) <31>1 2024-05-22T18:16:54.853347+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: lookup size is 1 entries <31>1 2024-05-22T18:16:54.853359+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: received udp reply. <31>1 2024-05-22T18:16:54.853408+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: udp message[73:0] EE3F8430000100010000000103646E73084D5366546E43734903634F4D00001C0001C00C001C000100000E100010FD3E4F5A5B810000000000000000000100002904D0000000000000 <31>1 2024-05-22T18:16:54.853425+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: outnet handle udp reply <31>1 2024-05-22T18:16:54.853440+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: serviced query: EDNS works for ip4 150.171.16.34 port 53 (len 16) <31>1 2024-05-22T18:16:54.853455+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: measured roundtrip at 17 msec <31>1 2024-05-22T18:16:54.853483+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: svcd callbacks start <31>1 2024-05-22T18:16:54.853498+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: good 0x20-ID in reply qname <31>1 2024-05-22T18:16:54.853511+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: worker svcd callback for qstate 0x14da413cbb90 <31>1 2024-05-22T18:16:54.853525+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: start <31>1 2024-05-22T18:16:54.853538+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: iterator[module 2] operate: extstate:module_wait_reply event:module_event_reply <30>1 2024-05-22T18:16:54.853550+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: iterator operate: query dns.msftncsi.com. AAAA IN <31>1 2024-05-22T18:16:54.853562+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: process_response: new external response event <30>1 2024-05-22T18:16:54.853586+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: scrub for msftncsi.com. NS IN <31>1 2024-05-22T18:16:54.853605+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: sanitize: removing public name with private address <dns.msftncsi.com.> fd3e:4f5a:5b81::1#53 <30>1 2024-05-22T18:16:54.853621+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: response for dns.msftncsi.com. AAAA IN <30>1 2024-05-22T18:16:54.853635+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: reply from <msftncsi.com.> 150.171.16.34#53 <30>1 2024-05-22T18:16:54.853693+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0 ;; flags: qr aa ; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0 ;; QUESTION SECTION: dns.msftncsi.com. IN AAAA ;; ANSWER SECTION: ;; AUTHORITY SECTION: ;; ADDITIONAL SECTION: ;; MSG SIZE rcvd: 34 <31>1 2024-05-22T18:16:54.853726+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: iter_handle processing q with state QUERY RESPONSE STATE <30>1 2024-05-22T18:16:54.853741+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: query response was nodata ANSWER <31>1 2024-05-22T18:16:54.853766+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: iter_handle processing q with state FINISHED RESPONSE STATE <30>1 2024-05-22T18:16:54.853782+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: finishing processing for dns.msftncsi.com. AAAA IN <31>1 2024-05-22T18:16:54.853795+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: iterator module exit state is module_finished <31>1 2024-05-22T18:16:54.853807+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone <30>1 2024-05-22T18:16:54.853822+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: validator operate: query dns.msftncsi.com. AAAA IN <31>1 2024-05-22T18:16:54.853833+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: validator: nextmodule returned <31>1 2024-05-22T18:16:54.853846+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: val handle processing q with state VAL_INIT_STATE <31>1 2024-05-22T18:16:54.853858+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: validator classification nodata <30>1 2024-05-22T18:16:54.853877+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: no signer, using dns.msftncsi.com. TYPE0 CLASS0 <31>1 2024-05-22T18:16:54.853893+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: val handle processing q with state VAL_FINISHED_STATE <31>1 2024-05-22T18:16:54.853911+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: validator module exit state is module_finished <31>1 2024-05-22T18:16:54.857015+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: python module exit state is module_finished <30>1 2024-05-22T18:16:54.857219+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: send_udp over interface: 192.168.1.1 <31>1 2024-05-22T18:16:54.857292+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: query took 0.045000 sec <31>1 2024-05-22T18:16:54.857314+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: pythonmod: clear, id: 0, pq:0x14da4662c458 <30>1 2024-05-22T18:16:54.857335+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: mesh_run: end 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 33 recursion replies sent, 0 replies dropped, 0 states jostled out <30>1 2024-05-22T18:16:54.857348+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: average recursion processing time 0.095481 sec <30>1 2024-05-22T18:16:54.857360+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: histogram of recursion processing times <30>1 2024-05-22T18:16:54.857381+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: [25%]=0.0216503 median[50%]=0.0431942 [75%]=0.073728 <30>1 2024-05-22T18:16:54.857393+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: lower(secs) upper(secs) recursions <30>1 2024-05-22T18:16:54.857406+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.000000 0.000001 2 <30>1 2024-05-22T18:16:54.857418+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.004096 0.008192 2 <30>1 2024-05-22T18:16:54.857430+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.008192 0.016384 2 <30>1 2024-05-22T18:16:54.857442+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.016384 0.032768 7 <30>1 2024-05-22T18:16:54.857454+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.032768 0.065536 11 <30>1 2024-05-22T18:16:54.857467+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.065536 0.131072 6 <30>1 2024-05-22T18:16:54.857483+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.131072 0.262144 1 <30>1 2024-05-22T18:16:54.857495+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.262144 0.524288 1 <30>1 2024-05-22T18:16:54.857507+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 1.000000 2.000000 1 <31>1 2024-05-22T18:16:54.857527+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: cache memory msg=311360 rrset=648279 infra=254810 val=86940 <31>1 2024-05-22T18:16:54.857539+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: svcd callbacks end <31>1 2024-05-22T18:16:54.857556+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: close of port 10803 <31>1 2024-05-22T18:16:54.857573+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: close fd 17
This is a 100 ms (0,1 sec) snippet of my resolver.log using "Level 4" : pure madness.
And I'm doing mode full DNSSEC.
I left it running for an hours or so .... -
@Gertjan @SteveITS I'm still catching up on your posts (thanks for taking the time), but i'm unfortunately here to report that it happened again:
I realize it's neither here nor there, but I keep going back to WHAT caused this to start happening? I made no major (or minor that i know of!) change when it started and it's so consistent. It's all consuming
-
@SteveITS said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
The images above were lost due to the forum error
are they lost for everyone? I see them still on my side....
Can you see these?
I think the overall point is that i wasn't seeing time.google.com servfail before and I def wasn't seeing anywhere near 1,692 in this log which basically covers only 2.5 minutes
@SteveITS said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
do you have forwarding enabled?
![d223ad5d-b542-4b84-87de-65553f018f3a-image.png]i do not
-
@Gertjan said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
What have you set here :
set to localdomain, is this related to your previous message? Is there a way to fix the absence of that trailing dot so that it doesn't append localdomain? What might have caused that?@Gertjan said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
And the main Resolver setting :
Also set to Transparent -
@RickyBaker said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
are they lost for everyone? I see them still on my side....
Per a Netgate post there was a period where images were being uploaded to an incorrect location. When they fixed it, images uploaded during that period were lost. If you scroll up there are some empty images now, e.g.
https://forum.netgate.com/topic/187510/dns_probe_finished_nxdomain-sporadically-for-anywhere-from-30secs-to-10min-works-flawlessly-at-all-other-times/24
-> https://forum.netgate.com/assets/uploads/files/1713968978542-8a15c3c9-c8b8-4916-8326-e3a1cbbfba8a-image.pngYes we can see recent ones.
-
-
@Gertjan said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
I don't see any "query:" or "reply:" lines ... ?!
I think i added an option to custom commands via a @johnpoz recommendation. Here is the entirety of my Custom options box;
server: log-queries: yes log-replies: yes log-tag-queryreply: yes log-servfail: yes ede: yes qname-minimisation: no aggressive-nsec: no infra-keep-probing: yes infra-cache-max-rtt: 2000 infra-host-ttl: 0 outbound-msg-retry: 32 max-sent-count: 128
I saw something in that thread you posted earlier (that I'm still combing through) about someone saying the solution was to use another resolver, is it time I just abandon unbound? Is that even possible inside pfsense? advisable? I need to update to 2.7.2 which i guess i'm now going to have to super prioritize but it's just so hard to take the leap
-
@Gertjan said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
and now your looking at the resolver log file, in real time.
I'm doing this, but not sure what to be looking for. it's the same as the gui but faster moving. Lots of servfail and ".localdomain" and "exceeded maximum sends"
-
I'm losing my mind:
My wife is going to kill me, she can't use her computer. She asked if we should call our ISP shudder but i don't think it's an issue with the ISP, right? Once again I go back to how sudden this problem came about without changes and how prevalent it has become....
Is there any indication the pfsense hardware is failing? Anything?
-
My MyQ wifi garage door opener is offline now too, i'm rolling back those changes i guess...