DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times
-
I've enabled DNSSEC again since it didn't really help my issue having it disabled. Is there another log I should increase the logging on? What level should I have the DNS resolver on? I'm still experiencing the issue: yesterday, for about 20 minutes (longest yet), I couldn't open a webpage on my phone, even though I was concurrently streaming the Knicks game on YouTube TV.
I've also added all those "no vendor" MAC addresses I couldn't explain, the ones with randomly assigned DHCP leases, to the whitelist block list. They've yet to come back and I've yet to discover anything broken. Just an update.
Any advice on getting it updated to 2.7.2 without doing a full clean install and restore?
-
@RickyBaker said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
I've enabled DNSSEC again since it didn't really help my issue having it disabled
If you are forwarding, DNSSEC should be disabled. It might not seem to cause issues with a few queries, but I can tell you it's going to be problematic at some point. Even the Quad9 FAQ says to disable it when forwarding. It is pointless if you forward: where you forward to either does DNSSEC or it doesn't, and telling unbound to do it isn't going to do anything other than cause you issues at some point.
-
@johnpoz I'm no longer forwarding per your first post in this thread.
-
@RickyBaker Long thread and I haven't paid close attention, so I couldn't tell if you had switched back to forwarding or not. Yeah, if you're resolving then DNSSEC is good to have enabled.
-
@johnpoz great thanks for circling back
-
https://pastebin.com/SFR8BXb0
Woke up from a nap and experienced one of the longest internet outages of this whole saga. It was out at 3:14 when I tried to open Venmo and was out for over 20 minutes before it came back. The above is the DNS resolver log, but I think I have the log level dialed too high because 2000 entries didn't even go back 2 minutes. I've changed it back to Log Level 1, but could someone check it out and see if there are any clues in there (or what log level I should have it at)? Or is there another log that I should also be monitoring? Is it possible the problem is purely something with the wifi and Ubiquiti?
-
Same thoughts here: a high level of log detail actually hides the details you're looking for, as there is only 2 minutes' worth of info.
If you have some disk space left, you can make the log files bigger. If needed, you can make the log retention a bit smaller: I have "7", you can make it 5 or 4.
You can also make this one a bit bigger.
The actual goal is:
As soon as you find a situation where a device has no access anymore, you have to check:
Does access without using DNS work? For example, ping 8.8.8.8 from that device.
Also double check: does the device have a valid IP, gateway and DNS set at that moment?
Example: ipconfig /all
and check the duration of the lease, the gateway and the DNS (both should point to the IP of pfSense).
Check on the device if "DNS" works :
C:\Users\Gauche>nslookup www.google.com
Server:  pfSense.bhf.tld
Address:  2a01:cb19:907:bedf:92ec:77ff:fe29:392c
Non-authoritative answer:
Name:    www.google.com
Addresses:  2a00:1450:4007:81a::2004
          142.250.201.164
Take note: for me, both IPv6 and IPv4 work.
Then (also) check on pfSense if resolving works :
dig @127.0.0.1 www.google.com +short
and then
dig @192.168.1.1 www.google.com +short
where 192.168.1.1 is your LAN interface.
Check if unbound is up and running :
[24.03-RELEASE][root@pfSense.bhf.tld]/root: ps ax | grep 'unbound'
74113  -  Ss  4:32.60 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
....
....
and
[24.03-RELEASE][root@pfSense.bhf.tld]/root: sockstat | grep 'unbound'
unbound  unbound  74113  3  udp6  *:53           *:*
unbound  unbound  74113  4  tcp6  *:53           *:*
unbound  unbound  74113  5  udp4  *:53           *:*
unbound  unbound  74113  6  tcp4  *:53           *:*
unbound  unbound  74113  9  tcp4  127.0.0.1:953  *:*
... ... ... ...
With the unbound log level set to "1", the log will still contain the restarts (a controlled stop and then a start):
grep "stopped" /var/log/resolver.log
.....
<30>1 2024-05-06T00:15:24.852356+02:00 pfSense.bhf.tld unbound 12814 - - [12814:0] info: service stopped (unbound 1.19.3).
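To line restarts up against the outages, the same grep can be turned into a count plus the matching lines. This is only a minimal sketch: `unbound_restarts` is a hypothetical helper name, and it assumes every controlled stop logs a line containing "service stopped" in /var/log/resolver.log, as in the sample line above.

```shell
#!/bin/sh
# Hypothetical helper: summarise unbound restarts found in a resolver log.
# Assumption: each controlled stop logs a line containing "service stopped".
unbound_restarts() {
    logfile="${1:-/var/log/resolver.log}"
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # grep -c prints the number of matching lines (0 when there are none).
    count=$(grep -c 'service stopped' "$logfile")
    echo "restarts: $count"
    # Print the matching lines so their timestamps can be compared with outages.
    grep 'service stopped' "$logfile"
    return 0
}
```

Calling `unbound_restarts` with no argument reads the default log path; pass another file to check a rotated log you have already decompressed.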
Btw: my current unbound version is 1.19.3, as I'm using 24.03.
pfSense 2.8.0 will be coming out soon.
Not that the version really matters (imho), as I was also using 1.17.x for a long time and don't recall having any issues.
@RickyBaker said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
Is it possible the problem is purely something with the wifi and Ubiquiti?
For me, an AP should do what it is meant to do: be a radio-to-wire signal converter.
True, an AP can do a lot more, and can really break the connection for you.
When testing connectivity issues, add APs and other gadgets later on, when you know the wired connection works well.
The same thing goes for L3 'smart, VLAN based' switches: only use them when the bare-bones network works well.
-
@RickyBaker did you actually do some DNS queries while you were having the issue, to both unbound and, say, an external DNS?
You should log queries and replies as well if you want to troubleshoot DNS not working.
In your options box in unbound:
server:
log-queries: yes
log-replies: yes
What is the response: timeout talking to unbound, servfail, nx?
Fire up your favorite DNS tool (nslookup, dig, doggo, host, etc.) and actually validate what is failing. If you look up some FQDN, do you get a response? If so, what is the response: did it work, or did unbound return servfail or nxdomain?
Does unbound answer for local resources, like the pfSense FQDN? Does something that is cached work, with only new queries failing? You can view what is in your cache:
[23.09.1-RELEASE][admin@sg4860.home.arpa]/root: unbound-control -c /var/unbound/unbound.conf dump_cache | grep forum.netgate.com
forum.netgate.com. 2452 IN A 208.123.73.71
msg forum.netgate.com. IN A 32896 1 2452 3 1 1 3 6
forum.netgate.com. IN A 0
[23.09.1-RELEASE][admin@sg4860.home.arpa]/root:
If that fails, then do a query directed to some external NS like quad9 or google - do those work?
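The comparison between unbound and an external resolver can be scripted so it takes seconds during an outage. A rough sketch only, assuming dig is installed on the client; `dns_status` and `compare_resolvers` are hypothetical helper names, and the addresses in the usage line are placeholders (192.168.1.1 standing in for your pfSense LAN IP, 9.9.9.9 and 8.8.8.8 for Quad9 and Google).

```shell
#!/bin/sh
# Hypothetical helper: print the DNS status one server returns for a name,
# or "TIMEOUT" when dig gets no reply at all.
dns_status() {
    server="$1"; name="$2"
    # +time=2 +tries=1 keeps an unresponsive server from stalling the test.
    out=$(dig @"$server" "$name" +time=2 +tries=1 2>/dev/null) || true
    # dig's header line contains e.g. "status: NOERROR"; extract that word.
    status=$(printf '%s\n' "$out" | sed -n 's/.*status: \([A-Z]*\).*/\1/p' | head -n 1)
    echo "${status:-TIMEOUT}"
}

# Hypothetical helper: ask several servers for the same name, one line each.
compare_resolvers() {
    name="$1"; shift
    for server in "$@"; do
        echo "$server $(dns_status "$server" "$name")"
    done
}
```

Usage would look like `compare_resolvers www.netgate.com 192.168.1.1 9.9.9.9 8.8.8.8`. If pfSense shows TIMEOUT or SERVFAIL while the external servers show NOERROR, the problem is likely at unbound or on the path to it; if every server fails, look at connectivity rather than DNS.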
-
@Gertjan said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
As soon as you find a situation where a device has no access anymore, you have to check:
Does access without using DNS work? For example, ping 8.8.8.8 from that device.
Also double check: does the device have a valid IP, gateway and DNS set at that moment?
This is really helpful, thank you. I will try to screenshot this post and enact it as well as possible the minute I notice an outage. BTW, I tested a hard-wired PC when I had an outage and also observed DNS connectivity issues, fwiw. But all of this is a very good framework for continuing the troubleshooting.
-
@RickyBaker how do you know DNS is failing? Are you doing an actual query with a tool like dig or nslookup?
Or does your browser just not load? For all you know your browser is using DoH.
When you have the issue, can your client ping its gateway (pfSense)? Can you ping the internet via IP, 8.8.8.8 for example?
If you cannot ping pfSense, then you most likely have a local network issue. If you cannot ping the internet, maybe just your internet is out. If you can ping pfSense, can you do a query for the pfSense name? This should always work even if the internet is down. The only reason it wouldn't is that you can't actually talk to pfSense, or unbound is not running.
Doing some basic connectivity tests and dns queries should point to where your actual problem is.
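Those checks boil down to a short decision tree, sketched here as a shell function. `locate_fault` is a hypothetical name and the wording of the results is mine; the three inputs are the answers to the tests described above.

```shell
#!/bin/sh
# Hypothetical helper encoding the checks above. Each argument is "yes" or "no":
#   $1  client can ping pfSense (its gateway)
#   $2  client can ping an internet IP such as 8.8.8.8
#   $3  client gets an answer when querying pfSense for the pfSense hostname
locate_fault() {
    ping_gw="$1"; ping_inet="$2"; query_local="$3"
    if [ "$ping_gw" != "yes" ]; then
        echo "local network: client cannot reach pfSense"
    elif [ "$query_local" != "yes" ]; then
        echo "unbound: pfSense is reachable but not answering DNS"
    elif [ "$ping_inet" != "yes" ]; then
        echo "internet: WAN connectivity looks down"
    else
        echo "basics ok: test new (uncached) names and check for DoH in the browser"
    fi
}
```

For example, `locate_fault yes yes no` points at unbound, while `locate_fault no no no` points at the local network before DNS is even considered.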
-
@johnpoz said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
did you actually do some dns queries while you were having the issue, to both unbound and say external dns?
at one point when i finally realized i could use dig on the pfsense itself I ran the command you posted to 8.8.8.8 and it worked successfully but I need to test this more thoroughly (i.e. other linux devices not the pfsense) and try 8.8.8.8 as well as google.com. thanks for reminding.
@johnpoz said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
in your options box in unbound
I'm sorry, where would I find/set these options?
@johnpoz said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
what is the response, timeout talking to unbound, servfail, nx?
Again, very sorry, but how would I do this? I don't even KNOW what servfail, nx is. In fact, reading the rest of the suggestions I can tell this is an important framework for isolating the issue, but it's just far beyond my grasp of the tools at play. I will google each individual term in hopes of understanding better, but if there's something more specific you could include for me to enact and post, that would be very helpful.
-
@RickyBaker Windows has nslookup...run "nslookup netgate.com" and see what it returns. Do the same with "dig netgate.com" on the router, and/or use Diagnostics/DNS Lookup to test there.
Servfail is an error. NXDOMAIN means the host doesn't exist.
At the time of the outage also "ping 8.8.8.8" to ensure your Internet is working, even if DNS is not.
In other words try to narrow down your problem.
re: options, he means the settings in Services/DNS Resolver under "Display Custom Options" which is blank by default.
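For reference, dig reports these results in the "status:" field of its header. The sketch below maps the common values to a plain-language hint; `explain_status` is a hypothetical helper and the descriptions are simplified summaries, not official wording.

```shell
#!/bin/sh
# Hypothetical helper: turn the "status:" value from dig's header into a hint.
explain_status() {
    case "$1" in
        NOERROR)  echo "resolver answered without error" ;;
        NXDOMAIN) echo "resolver answered: this name does not exist" ;;
        SERVFAIL) echo "resolver answered but could not complete the lookup" ;;
        REFUSED)  echo "resolver declined to answer this client" ;;
        *)        echo "no usable reply (timeout or unexpected status: $1)" ;;
    esac
}
```

The status can be pulled out of a real query with something like `dig @192.168.1.1 netgate.com | sed -n 's/.*status: \([A-Z]*\).*/\1/p'`, where 192.168.1.1 is a placeholder for your pfSense address.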
-
@SteveITS said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
Do the same with "dig netgate.com" on the router, and/or use Diagnostics/DNS Lookup to test there.
great, thanks for that explanation
-
@RickyBaker also, when you dig, make sure it's pointed at the pfSense IP with an @. Linux likes to use 127.0.0.53, which doesn't really tell you who got asked. So put the @ipaddress in your query.
example
user@UC:~$ dig www.netgate.com
; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> www.netgate.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14863
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;www.netgate.com. IN A
;; ANSWER SECTION:
www.netgate.com. 30 IN CNAME 1826203.group3.sites.hubspot.net.
1826203.group3.sites.hubspot.net. 30 IN CNAME group3.sites.hscoscdn00.net.
group3.sites.hscoscdn00.net. 30 IN A 199.60.103.30
group3.sites.hscoscdn00.net. 30 IN A 199.60.103.226
;; Query time: 12 msec
;; SERVER: 127.0.0.53#53(127.0.0.53) (UDP)
;; WHEN: Mon May 06 15:21:41 CDT 2024
;; MSG SIZE rcvd: 160
user@UC:~$
You don't know which NS it actually asked, so do:
user@UC:~$ dig @192.168.2.253 www.netgate.com
; <<>> DiG 9.18.18-0ubuntu0.22.04.2-Ubuntu <<>> @192.168.2.253 www.netgate.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19274
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.netgate.com. IN A
;; ANSWER SECTION:
www.netgate.com. 3553 IN CNAME 1826203.group3.sites.hubspot.net.
1826203.group3.sites.hubspot.net. 3553 IN CNAME group3.sites.hscoscdn00.net.
group3.sites.hscoscdn00.net. 3553 IN A 199.60.103.30
group3.sites.hscoscdn00.net. 3553 IN A 199.60.103.226
;; Query time: 4 msec
;; SERVER: 192.168.2.253#53(192.168.2.253) (UDP)
;; WHEN: Mon May 06 15:22:28 CDT 2024
;; MSG SIZE rcvd: 160
user@UC:~$
Where 192.168.2.253 is the IP address of pfSense.
You can normally see where 127.0.0.53 is forwarding to with
user@UC:~$ resolvectl status
Global
       Protocols: -LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
Link 2 (ens3)
    Current Scopes: DNS
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 192.168.3.10
       DNS Servers: 192.168.3.10
        DNS Domain: home.arpa
user@UC:~$
-
@johnpoz thanks, this is very helpful. I will race to enact it during the next outage and repost here. Thanks for your patience.
-
@johnpoz couldn't get to a laptop fast enough yesterday to enact these troubleshooting steps but I was able to fire up a local session on juicessh on my phone:
-
and you stopped testing ...
That laptop moved that fast?
No more issues right after the ping google test?
You don't need to react 'fast', as the logs (for example unbound's /var/log/resolver.log and the system log, /var/log/system.log) are there for days.
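Because the logs stick around, the lines from the time of an outage can be pulled out afterwards. A small sketch under assumptions: `logs_at` is a hypothetical helper, and the timestamp prefix in the usage line assumes the raw log files use ISO-style timestamps like 2024-05-06T00:15:24; adjust the prefix if your logs are written in a different format.

```shell
#!/bin/sh
# Hypothetical helper: show what the given logs recorded at a given time.
# $1 is a timestamp prefix; the remaining arguments are log files.
logs_at() {
    stamp="$1"; shift
    for f in "$@"; do
        echo "== $f"
        # -F matches the prefix literally; "|| true" keeps the loop going
        # when a file has no matching line.
        grep -F -- "$stamp" "$f" || true
    done
}
```

For an outage noticed around 15:14, `logs_at '2024-05-06T15:1' /var/log/resolver.log /var/log/system.log` would print everything both logs recorded between 15:10 and 15:19.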
-
@Gertjan i know i know. here's the dns log: https://pastebin.com/FCRuijbe i can't seem to find the right degree of logging. If I set it to 1 I get almost no information, if I set it to 2 or above 2000 lines doesn't go back 2 minutes. I'll be setting at 2 though cause this is useless.
I'll be totally honest, it happened at the worst possible time. I was watching both my kids and making dinner (trying to look up a recipe). the closest computer was floors away and I just didn't have time. I very quickly ran the ping on my phone before i burned the garlic:). I'm "hoping" for a more opportune break down tonight.
-
@johnpoz said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
in your options box in unbound
This look right?
-
@RickyBaker said in DNS_PROBE_FINISHED_NXDOMAIN sporadically for anywhere from 30secs to 10min. works flawlessly at all other times:
If I set it to 1 I get almost no information
Leave it at 1.
This - most recent at the top :
May 6 14:05:33 unbound 41106 [41106:0] info: average recursion processing time 0.809539 sec
May 6 14:05:33 unbound 41106 [41106:0] info: server stats for thread 0: requestlist max 62 avg 2.13559 exceeded 0 jostled 0
May 6 14:05:33 unbound 41106 [41106:0] info: server stats for thread 0: 46864 queries, 34916 answers from cache, 11948 recursions, 0 prefetch, 0 rejected by ip ratelimiting
May 6 14:05:33 unbound 41106 [41106:0] info: service stopped (unbound 1.17.1).
May 6 13:04:22 unbound 41106 [41106:0] info: generate keytag query _ta-4f66. NULL IN
May 6 01:40:18 unbound 41106 [41106:0] info: generate keytag query _ta-4f66. NULL IN
May 5 14:44:33 unbound 41106 [41106:0] info: generate keytag query _ta-4f66. NULL IN
May 5 02:54:05 unbound 41106 [41106:0] info: generate keytag query _ta-4f66. NULL IN
May 4 15:34:00 newsyslog 97047 logfile turned over due to size>500K
May 4 15:34:00 newsyslog 97047 logfile turned over due to size>500K
May 4 15:33:17 unbound 41106 [41106:1] info: generate keytag query _ta-4f66. NULL IN
May 4 15:33:17 unbound 41106 [41106:0] info: start of service (unbound 1.17.1).
May 4 15:33:17 unbound 41106 [41106:0] notice: init module 1: iterator
is somewhat strange.
Up until May 4 15:34:17 you get hundreds of log lines per second. Not an issue, but this will flood the logs. To make the logs more useful, and if possible (disk size), make the log files way bigger than just "500K", for example 2 MB each.
Then there is the line from the system logger at 15:34 that says: time's up, file too big, rotating.
From that moment, no more unbound logs ....
Lines start to reappear the next day with the usual, hourly "info: generate keytag query _ta-4f66. NULL IN". It looks like unbound now logs at "level 1".
And then the log continues on May 6 ..... again a rather big gap in the logging.
I'm pretty sure unbound was working for you all this time.
Why it's not logging the classic hourly "level 1" "generate keytag..." lines is puzzling to me. Is the log system (syslog) failing? Something else?
I'm not sure if it was already asked, but:
What's the system you're running pfSense on, bare metal, VM ?
How much RAM ? Disk size ? pfSense 2.7.2, right ?