DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times
-
@Gertjan I just noticed something interesting: all 5 of my Nest Protects (Google fire alarms) are reporting that they're not connected to the internet (Wi-Fi issue). They last checked in about 18 hours ago (a few hours after I made the changes, so maybe they checked a few times before they stopped retrying). The history shows they have been connected without error for as far back as it goes, so this seems likely related to these specific changes... But also no outages yet, besides that weird eBay one that wasn't the same as the usual ones.
-
@RickyBaker if I'm not mistaken, this is WAY more SERVFAILs than I was previously experiencing:
-
@RickyBaker 3 instances of squeakydoor.nest.com SERVFAIL and 543 instances (probably a third as many fails) of time.google.com or some variation of it. I was not seeing this before the latest round of changes...
-
@RickyBaker The images above were lost due to the forum error... Do you have forwarding enabled? This:
-
@RickyBaker said:
if I'm not mistaken this is WAY ...
Scrap all the lines that end in "localdomain", like
a3ayogyffhzcp1.iot.us-east-1.amazonaws.com.localdomain
as that domain doesn't exist, so that's a fail for sure.
The DNS request, coming from a LAN device, was wrong.
It should have asked for a3ayogyffhzcp1.iot.us-east-1.amazonaws.com.
See the terminating dot? It means .com. is the TLD and the name is complete. Without the terminating dot, the requesting device's resolver may first append its own local (search) domain, and that lookup will fail.
But there are a bunch of IoT devices in place here, and these aren't known for well-behaved DNS requests.
Add to that the huge forest of domain name servers we have to deal with for "amazonaws" here.... (the world's best-organized DNS mess ever).
To make things even worse: the Time To Live (TTL) is set to 60 seconds, so a requested A record expires after a minute and has to be requested again.
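You can check the TTL yourself. With `+noall +answer`, dig prints one answer record per line, with the TTL in the second column. Below is a minimal sketch that assumes that usual column layout. The `min_ttl` helper name is made up for illustration, and the helper uses sh function syntax, so run it under /bin/sh:

```sh
#!/bin/sh
# min_ttl (hypothetical helper): read `dig +noall +answer NAME A` output on
# stdin, print the smallest TTL among the A/AAAA records, and how many times
# per hour a client honouring that TTL would need to re-query (3600 / TTL).
# Assumes dig's usual "name TTL class type data" columns.
min_ttl() {
    awk '$4 == "A" || $4 == "AAAA" {
             if (min == "" || $2 + 0 < min) min = $2 + 0
         }
         END {
             if (min == "") { print "no A/AAAA records"; exit 1 }
             printf "min TTL %d s, up to %d lookups/hour\n", min, (min > 0 ? int(3600 / min) : 0)
         }'
}
```

Usage: `dig +noall +answer a3ayogyffhzcp1.iot.us-east-1.amazonaws.com A | min_ttl`. When you ask through unbound, the TTL shown is what's left in its cache, so it may be lower than the 60 seconds the zone hands out.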
Yeah, things can get messy quickly. "Let's add another IoT device" ^^
Btw: this nice-looking host name actually resolves:
[24.03-RELEASE][root@pfSense.bhf.tld]/root: dig a3ayogyffhzcp1.iot.us-east-1.amazonaws.com +short
54.209.119.230
52.71.151.159
54.164.100.117
52.87.91.214
54.162.199.177
52.70.244.97
54.147.162.149
52.4.223.197
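Here is the trailing-dot point made concrete. resolv.conf(5) describes the classic stub-resolver rule:

- A name with at least `ndots` dots (default 1) is tried as-is first, and only then with the search domain appended.
- A name with fewer dots gets the search domain first.

Below is a sketch of that ordering, assuming a single search domain. `search_candidates` is a hypothetical helper, and IoT firmware is free to behave differently:

```sh
#!/bin/sh
# search_candidates NAME DOMAIN [NDOTS] (hypothetical helper)
# Print, in order, the names a classic resolv.conf-style stub resolver would
# try for NAME with one search domain DOMAIN:
#  - a trailing dot means "absolute name, no search list";
#  - at least NDOTS dots (default 1): bare name first, then NAME.DOMAIN.;
#  - fewer dots: NAME.DOMAIN. first, then the bare name.
search_candidates() {
    name=$1 domain=$2 ndots=${3:-1}
    case $name in
        *.) echo "$name"; return 0 ;;
    esac
    # Count the dots in NAME by deleting every other character.
    dots=$(printf '%s' "$name" | tr -cd '.' | wc -c | tr -d ' ')
    if [ "$dots" -ge "$ndots" ]; then
        echo "$name."
        echo "$name.$domain."
    else
        echo "$name.$domain."
        echo "$name."
    fi
}
```

The Amazon IoT name above has four dots, so under this rule the bare name comes first. A `.localdomain` variant would typically only be tried after the bare lookup didn't succeed, or by a device that doesn't follow these rules. A short name such as `printer` gets `printer.localdomain.` first.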
-
AFAIK he doesn't use Forwarding mode. He's resolving.
What have you set here?
And the main Resolver setting, right?
Edit:
@RickyBaker Let's forget the GUI. You (we) are dealing with lots of info here, and the GUI is nice, but not fast enough.
Use the console or SSH: that's the access you need (and this isn't an overstatement).
Get in.
Console menu option 8 (Shell).
Type:
cd /var/log/
ls -al resol*.*
and now you see the main resolver.log file and the rolled-over log files.
tail -f resolver.log
and now you're looking at the resolver log file in real time.
This will 'stop' or 'hang' after a while. Once resolver.log reaches its maximum size, the syslog daemon rotates it: resolver.log is renamed to resolver.log.0, resolver.log.0 to resolver.log.1, etc., and a new, empty resolver.log is created. But our tail command is still reading the old file, which no longer exists under that name and no longer grows.
Hence the need for comfortably large log files... but keep an eye on the disk space left ;)
I don't see any "query:" or "reply:" lines ... ?!
<31>1 2024-05-22T18:16:54.835744+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: comm point start listening 17 (-1 msec)
<31>1 2024-05-22T18:16:54.853298+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: answer cb
<31>1 2024-05-22T18:16:54.853317+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: Incoming reply id = ee3f
<31>1 2024-05-22T18:16:54.853335+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: Incoming reply addr = ip4 150.171.16.34 port 53 (len 16)
<31>1 2024-05-22T18:16:54.853347+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: lookup size is 1 entries
<31>1 2024-05-22T18:16:54.853359+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: received udp reply.
<31>1 2024-05-22T18:16:54.853408+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: udp message[73:0] EE3F8430000100010000000103646E73084D5366546E43734903634F4D00001C0001C00C001C000100000E100010FD3E4F5A5B810000000000000000000100002904D0000000000000
<31>1 2024-05-22T18:16:54.853425+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: outnet handle udp reply
<31>1 2024-05-22T18:16:54.853440+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: serviced query: EDNS works for ip4 150.171.16.34 port 53 (len 16)
<31>1 2024-05-22T18:16:54.853455+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: measured roundtrip at 17 msec
<31>1 2024-05-22T18:16:54.853483+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: svcd callbacks start
<31>1 2024-05-22T18:16:54.853498+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: good 0x20-ID in reply qname
<31>1 2024-05-22T18:16:54.853511+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: worker svcd callback for qstate 0x14da413cbb90
<31>1 2024-05-22T18:16:54.853525+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: start
<31>1 2024-05-22T18:16:54.853538+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: iterator[module 2] operate: extstate:module_wait_reply event:module_event_reply
<30>1 2024-05-22T18:16:54.853550+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: iterator operate: query dns.msftncsi.com. AAAA IN
<31>1 2024-05-22T18:16:54.853562+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: process_response: new external response event
<30>1 2024-05-22T18:16:54.853586+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: scrub for msftncsi.com. NS IN
<31>1 2024-05-22T18:16:54.853605+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: sanitize: removing public name with private address <dns.msftncsi.com.> fd3e:4f5a:5b81::1#53
<30>1 2024-05-22T18:16:54.853621+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: response for dns.msftncsi.com. AAAA IN
<30>1 2024-05-22T18:16:54.853635+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: reply from <msftncsi.com.> 150.171.16.34#53
<30>1 2024-05-22T18:16:54.853693+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr aa ; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
dns.msftncsi.com. IN AAAA
;; ANSWER SECTION:
;; AUTHORITY SECTION:
;; ADDITIONAL SECTION:
;; MSG SIZE rcvd: 34
<31>1 2024-05-22T18:16:54.853726+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: iter_handle processing q with state QUERY RESPONSE STATE
<30>1 2024-05-22T18:16:54.853741+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: query response was nodata ANSWER
<31>1 2024-05-22T18:16:54.853766+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: iter_handle processing q with state FINISHED RESPONSE STATE
<30>1 2024-05-22T18:16:54.853782+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: finishing processing for dns.msftncsi.com. AAAA IN
<31>1 2024-05-22T18:16:54.853795+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: iterator module exit state is module_finished
<31>1 2024-05-22T18:16:54.853807+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone
<30>1 2024-05-22T18:16:54.853822+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: validator operate: query dns.msftncsi.com. AAAA IN
<31>1 2024-05-22T18:16:54.853833+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: validator: nextmodule returned
<31>1 2024-05-22T18:16:54.853846+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: val handle processing q with state VAL_INIT_STATE
<31>1 2024-05-22T18:16:54.853858+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: validator classification nodata
<30>1 2024-05-22T18:16:54.853877+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: no signer, using dns.msftncsi.com. TYPE0 CLASS0
<31>1 2024-05-22T18:16:54.853893+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: val handle processing q with state VAL_FINISHED_STATE
<31>1 2024-05-22T18:16:54.853911+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: validator module exit state is module_finished
<31>1 2024-05-22T18:16:54.857015+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: mesh_run: python module exit state is module_finished
<30>1 2024-05-22T18:16:54.857219+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: send_udp over interface: 192.168.1.1
<31>1 2024-05-22T18:16:54.857292+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: query took 0.045000 sec
<31>1 2024-05-22T18:16:54.857314+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: pythonmod: clear, id: 0, pq:0x14da4662c458
<30>1 2024-05-22T18:16:54.857335+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: mesh_run: end 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 33 recursion replies sent, 0 replies dropped, 0 states jostled out
<30>1 2024-05-22T18:16:54.857348+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: average recursion processing time 0.095481 sec
<30>1 2024-05-22T18:16:54.857360+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: histogram of recursion processing times
<30>1 2024-05-22T18:16:54.857381+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: [25%]=0.0216503 median[50%]=0.0431942 [75%]=0.073728
<30>1 2024-05-22T18:16:54.857393+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: lower(secs) upper(secs) recursions
<30>1 2024-05-22T18:16:54.857406+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.000000 0.000001 2
<30>1 2024-05-22T18:16:54.857418+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.004096 0.008192 2
<30>1 2024-05-22T18:16:54.857430+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.008192 0.016384 2
<30>1 2024-05-22T18:16:54.857442+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.016384 0.032768 7
<30>1 2024-05-22T18:16:54.857454+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.032768 0.065536 11
<30>1 2024-05-22T18:16:54.857467+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.065536 0.131072 6
<30>1 2024-05-22T18:16:54.857483+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.131072 0.262144 1
<30>1 2024-05-22T18:16:54.857495+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 0.262144 0.524288 1
<30>1 2024-05-22T18:16:54.857507+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] info: 1.000000 2.000000 1
<31>1 2024-05-22T18:16:54.857527+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: cache memory msg=311360 rrset=648279 infra=254810 val=86940
<31>1 2024-05-22T18:16:54.857539+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: svcd callbacks end
<31>1 2024-05-22T18:16:54.857556+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: close of port 10803
<31>1 2024-05-22T18:16:54.857573+02:00 pfSense.bhf.tld unbound 68437 - - [68437:1] debug: close fd 17
This snippet of my resolver.log covers well under 100 ms (0.1 sec) using "Level 4": pure madness.
And I'm running full DNSSEC mode.
I left it running for an hour or so....
-
@Gertjan @SteveITS I'm still catching up on your posts (thanks for taking the time), but I'm unfortunately here to report that it happened again:
I realize it's neither here nor there, but I keep going back to WHAT caused this to start happening. I made no major change (or minor one that I know of!) when it started, and it's so consistent. It's all-consuming.
-
@SteveITS said:
The images above were lost due to the forum error
Are they lost for everyone? I still see them on my side....
Can you see these?
I think the overall point is that I wasn't seeing time.google.com SERVFAILs before, and I definitely wasn't seeing anywhere near 1,692 of them in this log, which covers only about 2.5 minutes.
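A small filter over the log shows which names fail most. The sketch below assumes that `log-servfail: yes` writes lines containing `SERVFAIL <qname type class>: reason`. Check a few real lines and adjust the sed pattern if yours look different. `servfail_top` is a made-up name:

```sh
#!/bin/sh
# servfail_top [N] (hypothetical helper): read unbound log lines on stdin and
# print the N (default 10) query names that hit SERVFAIL most often, with counts.
# Assumes the "SERVFAIL <qname type class>: reason" text from log-servfail.
servfail_top() {
    # Extract the qname between "SERVFAIL <" and the next space.
    sed -n 's/.*SERVFAIL <\([^ >]*\) .*/\1/p' |
        sort | uniq -c | sort -rn | head -n "${1:-10}"
}
```

Usage, from /bin/sh: `servfail_top 20 < /var/log/resolver.log`. For scale, 1,692 SERVFAILs in about 2.5 minutes works out to roughly 11 per second.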
@SteveITS said:
do you have forwarding enabled?
I do not.
-
@Gertjan said:
What have you set here :
Set to localdomain. Is this related to your previous message? Is there a way to fix the missing trailing dot so that localdomain doesn't get appended? What might have caused that?
@Gertjan said:
And the main Resolver setting :
Also set to Transparent.
-
@RickyBaker said:
are they lost for everyone? I see them still on my side....
Per a Netgate post there was a period where images were being uploaded to an incorrect location. When they fixed it, images uploaded during that period were lost. If you scroll up there are some empty images now, e.g.
https://forum.netgate.com/topic/187510/dns_probe_finished_nxdomain-sporadically-for-anywhere-from-30secs-to-10min-works-flawlessly-at-all-other-times/24
-> https://forum.netgate.com/assets/uploads/files/1713968978542-8a15c3c9-c8b8-4916-8326-e3a1cbbfba8a-image.png
Yes, we can see recent ones.
-
-
@Gertjan said:
I don't see any "query:" or "reply:" lines ... ?!
I think I added an option to the custom options on a @johnpoz recommendation. Here is the entirety of my Custom options box:
server:
  log-queries: yes
  log-replies: yes
  log-tag-queryreply: yes
  log-servfail: yes
  ede: yes
  qname-minimisation: no
  aggressive-nsec: no
  infra-keep-probing: yes
  infra-cache-max-rtt: 2000
  infra-host-ttl: 0
  outbound-msg-retry: 32
  max-sent-count: 128
I saw something in that thread you posted earlier (which I'm still combing through) about someone saying the solution was to use another resolver. Is it time I just abandon unbound? Is that even possible inside pfSense? Advisable? I need to update to 2.7.2, which I guess I'm now going to have to make a top priority, but it's just so hard to take the leap.
-
@Gertjan said:
and now your looking at the resolver log file, in real time.
I'm doing this, but I'm not sure what to look for. It's the same as the GUI, just faster moving. Lots of servfail and ".localdomain" and "exceeded maximum sends".
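One way to cut the firehose down is to follow the log with `tail -F` instead of `tail -f`. On FreeBSD, `-F` notices when the file is renamed or rotated and reopens it, so it doesn't go quiet after a rotation. The filter below assumes these are the interesting lines:

- SERVFAILs
- "exceeded" limits
- `.localdomain` names
- the `query:`/`reply:` tags from `log-tag-queryreply`

This is a sketch, not an exhaustive list, and `dns_watch_filter` is a hypothetical name. Define it under /bin/sh:

```sh
#!/bin/sh
# dns_watch_filter (hypothetical helper): keep only resolver.log lines that
# usually matter when chasing failures. The pattern list is a starting point.
dns_watch_filter() {
    # --line-buffered: pass each match on immediately (useful behind tail -F).
    grep -E --line-buffered 'SERVFAIL|exceeded|\.localdomain|(query|reply):'
}
```

Usage: `tail -F /var/log/resolver.log | dns_watch_filter`.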
-
I'm losing my mind:
My wife is going to kill me; she can't use her computer. She asked if we should call our ISP (shudder), but I don't think it's an issue with the ISP, right? Once again I go back to how suddenly this problem came about without any changes, and how prevalent it has become...
Is there any indication the pfSense hardware is failing? Anything?
-
My MyQ Wi-Fi garage door opener is offline now too. I'm rolling back those changes, I guess...
-
Removed those Custom options and restarted the resolver, and the internet came back for my wife. Maybe I should've tried just restarting the resolver first?
Current Custom options:
server:
  log-queries: yes
  log-replies: yes
  log-tag-queryreply: yes
  log-servfail: yes
  ede: yes
  #qname-minimisation: no
  #aggressive-nsec: no
  #infra-keep-probing: yes
  #infra-cache-max-rtt: 2000
  #infra-host-ttl: 0
  #outbound-msg-retry: 32
  #max-sent-count: 128
Edit: she said it broke again a minute later. I restarted the resolver and it seems to be working again... for now. My Wi-Fi garage door opener and Nest Protects are still not online.
Edit 2: it took a few minutes, but my garage door opener is back online, as well as 3/5 Nest Protects...
-
Sorry for the slightly off-topic question. As @johnpoz suggested, I feel that updating to 2.7.2 is my next logical step, but I can't do it from the GUI or console. So I made a bootable USB stick with 2.7.2 and started the process... but then I realized I'd forgotten the options I selected when I first installed pfSense years ago. Is there a way to check in the GUI or console whether I selected ZFS or UFS? Or which drive I installed it to? I thought my Dell R210 II only had one drive in it, but I was presented with 2 during the installation (maybe a partition), and I didn't remember which one I used. I just want to make sure my installation is exactly the same as the original to avoid issues.
Edit: safe to assume this means ZFS on a single drive?
Also, I read online that there would be an upgrade option when booting the installer USB with pfSense already installed on the target media, but this option was not presented... I also read online that this means the installation media can't read the pfSense install... How much should I be reading into this?
-
@RickyBaker ZFS became the default with 2.6.0, so when did you install yours? Or check the dashboard Disk widget. In general you'd want ZFS.
Re upgrading, one can’t do that like you’re describing I think. However this is a thing: https://docs.netgate.com/pfsense/en/latest/backup/restore-during-install.html#recover-config-xml-from-existing-installation. Or just restore after.
The installer may show the usb stick? Can you tell by size?
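From the shell, `mount -p` prints the mounted filesystems in fstab format (device, mount point, type, options, dump, pass). The type mounted on `/` answers the ZFS-or-UFS question directly. Below is a sketch with a hypothetical helper name, to be defined under /bin/sh:

```sh
#!/bin/sh
# root_fs_type (hypothetical helper): read `mount -p` output on stdin and
# print the filesystem type mounted on "/" (e.g. zfs or ufs).
# Exits non-zero if no "/" entry is found.
root_fs_type() {
    awk '$2 == "/" { print $3; found = 1 } END { exit !found }'
}
```

`mount -p | root_fs_type` prints `zfs` or `ufs`. On ZFS, `zpool status` then shows which disk(s) the pool lives on.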
-
@RickyBaker said:
Also I read online there would be an upgrade option when I loaded up a flashable usb with psfense already installed on the target media but this was not a presented option....I also read online that this means the installation media can't read the pfsense install....how much should I be reading into this?
As you can see, "just download pfSense" isn't an option anymore. It's gone. From now on, you load an installer; I've never used that one myself.
This must be it: the online installer.
I also upvote the upgrade to 2.7.2.
You'll be using a far newer unbound version, probably "1.19.3".
About the install media: forget about CD, DVD, etc.
Use: Prepare a USB Memstick.
If the stick isn't broken, it works. I've tried this method several times on Netgate devices and on bare-bones hardware like yours.
@RickyBaker said:
So I made a bootable usb stick with 2.7.2 and started the process....but then i forgot the options I selected when I first installed pfsense years ago
Ah, ok, you have a 2.7.2 on USB ready.
When installing on a device, known or not, try to find out the FreeBSD name of the hard disk. There can be one or more drives. I mean the drive name, not the names of the partitions on that drive, as those will be lost.
If you can choose, go for the best file system: ZFS.
Also note down your NIC names: you can see them in the Interfaces > Interface Assignments menu.
And that's all there is to do.
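You can also note down the NIC names from the shell. `ifconfig -l` prints every interface on one line, pseudo-interfaces included. The sketch below keeps the likely physical ones. The skip list is an assumption based on common pfSense extras, so compare the result with Interfaces > Interface Assignments:

```sh
#!/bin/sh
# physical_nics (hypothetical helper): read `ifconfig -l` output (names
# separated by spaces) on stdin and print one name per line, skipping common
# pseudo-interfaces: loopback, IPsec enc, pflog, pfsync, OpenVPN, tunnels,
# bridges and WireGuard. The skip list is an assumption; extend as needed.
physical_nics() {
    tr ' ' '\n' |
        grep -Ev '^(lo|enc|pflog|pfsync|ovpn[cs]|gif|gre|bridge|tun|tap|wg)[0-9]+$' |
        grep -v '^$'
}
```

Usage, from /bin/sh: `ifconfig -l | physical_nics`.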
Go for a bare-bones default setup, a 192.168.1.1/24 LAN and a DHCP WAN, if you can.
Make a backup of your current config first.
When installing, I never 'recover' the config (if found). I test the system first with 'everything to default' settings. When that works out (LAN, WAN, and unbound ^^ OK), you can import your backed-up config later using the GUI.
If any issues come back at that moment, you know it's your config ;)
-
@Gertjan said:
I test the system first with the 'everything to default' settings.
Because the issue is so sporadic, it's going to be very difficult to test the default settings. Any suggestions? I don't want to reconfigure all the downstream devices for a new IP address et al.
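Since the failures are sporadic, one option is a small probe that logs every success and failure with a timestamp. Outage windows can then be matched against resolver.log, and a default-settings run can be judged unattended. The sketch assumes two things:

- `dig` is available where the probe runs. It is on pfSense; on a LAN client it may need installing.
- The resolver answers on 192.168.1.1. Override this with `RESOLVER=`.

All function names are hypothetical:

```sh
#!/bin/sh
# Minimal DNS availability logger (a sketch, not a monitoring tool).
RESOLVER=${RESOLVER:-192.168.1.1}

# only_ipv4: keep lines that look like dotted-quad IPv4 addresses, so CNAME
# targets and ";; connection timed out" comments don't count as answers.
only_ipv4() {
    grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}

# resolve NAME: print the A records returned (empty on failure or timeout).
resolve() {
    dig +short +time=2 +tries=1 "@$RESOLVER" "$1" A 2>/dev/null | only_ipv4
}

# check_name NAME: print one timestamped OK/FAIL line for NAME.
check_name() {
    if [ -n "$(resolve "$1")" ]; then
        status=OK
    else
        status=FAIL
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') $status $1"
}

# watch_names INTERVAL NAME...: check every NAME, sleep INTERVAL seconds, repeat.
watch_names() {
    interval=$1
    shift
    while :; do
        for n in "$@"; do
            check_name "$n"
        done
        sleep "$interval"
    done
}
```

Under /bin/sh, run it with `. ./dnswatch.sh; watch_names 15 www.google.com time.google.com >> dnswatch.log`. Running it from a LAN client rather than from pfSense itself exercises the same path your other devices use.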
And a slightly off-topic/dumb question: is there a way for me to view any and all drives pfSense sees, in the GUI or on the command line?
-
cat /etc/fstab
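On a ZFS install, /etc/fstab may list little more than swap, because the root filesystem comes from the pool rather than from an fstab entry. Two commands show the disks directly:

- `geom disk list` shows every disk the kernel sees, with name, size and description.
- `sysctl kern.disks` gives just the names.

USB sticks usually show up as `da` devices. The sketch below condenses `geom disk list` to one line per disk, assuming its usual `Geom name:` / `Mediasize:` / `descr:` fields (`disk_summary` is a made-up name):

```sh
#!/bin/sh
# disk_summary (hypothetical helper): read `geom disk list` output on stdin
# and print "name (size) description" for each disk.
disk_summary() {
    awk '
        function flush() { if (name != "") print name, size, descr; name = size = descr = "" }
        /^Geom name:/ { flush(); name = $3; next }
        $1 == "Mediasize:" { size = $3 }
        $1 == "descr:" { d = $0; sub(/^[ \t]*descr:[ \t]*/, "", d); descr = d }
        END { flush() }
    '
}
```

`size` is the human-readable value in parentheses, e.g. `(233G)`. Usage, from /bin/sh: `geom disk list | disk_summary`.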