23.01 breaks DNS resolver and pFblocker
-
@johnpoz As a pfSense and networking newbie, I just wanted to thank you John for your in-depth explanations, with pictures even! Much appreciated!
-
@johnpoz said in 23.01 breaks DNS resolver and pFblocker:
@gertjan good settings, another one I might suggest is serve zero. I have been using it for years, and have never had a problem with it.
serve 0 allows unbound to serve up the last IP it had in cache for some fqdn even if the ttl has expired. The ttl it hands off to the client will be 0, so if for some reason that doesn't work the client will have to ask again for that fqdn, and by this time unbound will have looked it up.
An aside from the crux of this thread, but I saw that and I'd like to try it. Is this the actual entry? It looks like it, but I thought I'd ask:
Serve Expired [ ] Serve cache records even with TTL of 0
When enabled, allows unbound to serve one query even with a TTL of 0, if TTL is 0 then new record will be requested in the background when the cache is served to ensure cache is updated without latency on service of the DNS request.
-
@areckethennu yeah that is the serve 0 setting.. Check the box..
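For anyone wondering what that checkbox changes in practice, here is a minimal sketch of the behaviour described above: an expired record is still answered, with a TTL of 0, and a fresh lookup is queued in the background. This is a toy model, not unbound's actual code; the class name `ServeExpiredCache`, the `upstream_lookup` callback and the `example.test` / `192.0.2.x` values are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CacheEntry:
    address: str
    expires_at: float  # absolute time at which the record's TTL runs out


class ServeExpiredCache:
    """Toy model of a resolver cache with a serve-expired option (hypothetical)."""

    def __init__(self, upstream_lookup: Callable[[str], Tuple[str, int]],
                 serve_expired: bool = True) -> None:
        self.upstream_lookup = upstream_lookup  # returns (address, ttl_seconds)
        self.serve_expired = serve_expired
        self.entries: Dict[str, CacheEntry] = {}
        self.pending_refresh: List[str] = []  # names queued for a background lookup

    def resolve(self, name: str, now: float) -> Tuple[str, int]:
        entry = self.entries.get(name)
        if entry is not None and now < entry.expires_at:
            # Fresh hit: hand out the remaining TTL.
            return entry.address, int(entry.expires_at - now)
        if entry is not None and self.serve_expired:
            # Stale hit: answer immediately with TTL 0 and refresh later.
            if name not in self.pending_refresh:
                self.pending_refresh.append(name)
            return entry.address, 0
        # Miss, or serve_expired is off: the client waits for the upstream lookup.
        return self._fetch(name, now)

    def _fetch(self, name: str, now: float) -> Tuple[str, int]:
        address, ttl = self.upstream_lookup(name)
        self.entries[name] = CacheEntry(address, now + ttl)
        return address, ttl

    def run_background_refresh(self, now: float) -> None:
        # Stands in for the lookup the resolver does after serving stale data.
        while self.pending_refresh:
            self._fetch(self.pending_refresh.pop(0), now)


if __name__ == "__main__":
    answers = iter([("192.0.2.10", 60), ("192.0.2.20", 60)])
    cache = ServeExpiredCache(lambda name: next(answers))
    print(cache.resolve("example.test", now=0))    # first lookup goes upstream
    print(cache.resolve("example.test", now=100))  # expired: stale answer, TTL 0
    cache.run_background_refresh(now=100)
    print(cache.resolve("example.test", now=101))  # refreshed record
```

The point of the TTL 0 is that the client does not cache the stale answer itself, so if the old IP no longer works, the next query gets the refreshed record.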
-
Just upgraded to 23.01 myself and am experiencing DNS resolver issues. A quick restart of the service fixes it for clients. Wish I had saved the nslookup error on a domain I was testing. Can't recall the exact wording, but it was one I had not seen before about a server error.
I did enable "Serve Expired" in the advanced DNS resolver settings and will monitor to see if it happens again.
-
@johnpoz What I see when I log onto Flickr:
This site can’t be reached
www.flickr.com’s server IP address could not be found.
Try:
• Checking the connection
• Checking the proxy, firewall, and DNS configuration
ERR_NAME_NOT_RESOLVED
Then after 30+ seconds I see the Flickr homepage frame sans content, and things still appear to be loading after 2 minutes (but do not complete). Once I hit refresh (using Chrome), it loads right up, no problems.
When I open the Chrome debug window (note that I do not see the same thing you show in your screenshots, e.g. no TIMINGS menu) the timings on the initial page load don't seem too bad:
I am not seeing the 0 time for DNS; I suppose because it was not yet cached (I restarted Unbound and cleared my local DNS cache). But then, things get ugly when some fonts and scripts try to load:
The flickr.com load time is what is detailed above. The font failures are not, I suspect, fatal. I think the problem is the combo?yui:3.16.0/yui.../loader-hermes/... line. This references combo.static.flickr.com. When this load fails (seems to be jscript?), it torpedoes the rest of the website load (and accounts for the hang I see before refresh). My guess is this script does some init work and then loads other scripts.
When I hit REFRESH in the "hung" browser window, Flickr loads as noted above. The big difference is that the combo?yui?3.16.10…. call to that first Hermes URL (whatever that is -- scripts is my guess, as noted above) succeeds, and whatever that loads leads to a long list of more Hermes calls that also succeed. The website now loads properly.
I tried doing DIGs on flickr.com and combo.static.flickr.com (the URL for the Hermes URLs). They do not look alike, and I do not know enough about DIG to interpret that (happy to upload if you want a look).
I tried running these same tests on my older computer (slower), and I never run into the website load problems. My best guess is that my newer, faster computer is timing out. I did Malware and AV scans on my newer computer to ensure that was not causing issues. So the computer used to load the website makes a difference. I have replicated these problems using both Chrome and Firefox, and on multiple websites (see ironic note below). I've only drilled deeper on Flickr.
FWIW, when I run a Windows console app to do simple DNS queries (not DIG, just one call), I've seen these fail for google.com and other common sites too. This is why I suspected DNS issues. It now appears that it is not quite that simple, and is tied to machine speed somehow (or software config, or both or...?).
At this point I am well outside my toolset and knowledge (give me local code and a debugger and it's a different story), and have spent far more time than I have to spare on this issue. I can't have a production machine failing to load websites, and worse failing to do online backups or software updates, for the sake of finding this problem. If Netgate wants to get involved, then I'd be happy to set aside some time to work through this with them (I have the current 23.01 config on another USB stick). I remain hopeful that going back to 22.05 will have my production environment working again.
I do want to thank you for the time and effort you put into your responses. Without your screenshots and comments I would not have tried DIG or getting the timings included here.
On the side of irony, when I loaded the forum to enter this response, I hit the same This site can't be reached error. I think the difference between this and Flickr is that the forum isn't loading a script whose failure to load leaves the website non-functional a la Flickr. After a few seconds, the Netgate forum site just loads up and resolves normally. Given my tests with the DNS Query code I wrote (see google.com DNS query failure noted above), it is clearly related to DNS and timing issues; I just cannot pinpoint how.
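For reference, the kind of simple repeated DNS query test mentioned above could look roughly like the sketch below. It is not the console app I used; it is a minimal Python version that goes through the operating system's resolver via `socket.getaddrinfo`, and the function name `probe` and its parameters are made up for illustration.

```python
import socket
import time
from typing import Callable, List, Tuple


def probe(name: str,
          resolver: Callable = socket.getaddrinfo,
          attempts: int = 5,
          delay: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> List[Tuple[bool, float, str]]:
    """Resolve `name` repeatedly; return (ok, elapsed_seconds, detail) per attempt."""
    results = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            infos = resolver(name, None)
            # Each result is (family, type, proto, canonname, sockaddr).
            addresses = sorted({info[4][0] for info in infos})
            results.append((True, time.monotonic() - start, ", ".join(addresses)))
        except socket.gaierror as exc:
            # Name resolution failed (for example a SERVFAIL from the resolver).
            results.append((False, time.monotonic() - start, str(exc)))
        if i < attempts - 1:
            sleep(delay)
    return results
```

Calling `probe("forum.netgate.com")` and printing each tuple gives a quick record of which attempts failed and how long each one took. Note that the local OS cache can mask failures, so clearing it between runs (as I did above) matters.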
-
@draco Replying to my last post: I decided to try a reboot from the Console before re-applying 22.05. Someone mentioned rebooting on another thread, which I had not tried because pfSense reboots as part of the upgrade. But I tried it anyhow...
So far my SG-5100 has been up for almost an hour and I have not repro'd the Flickr problem. What would rebooting change that leaves things working all of a sudden? My primary PC was not rebooted or changed, just the SG-5100.
-
Just following up that I tried the Serve Expired setting and a simple reboot and unfortunately the problem still persists.
windows client nslookup:
forum.netgate.com
Server: firewall.blah.com
Address: 192.168.150.1
*** firewall.blah.com can't find forum.netgate.com: Server failed
restart DNS Resolver service
forum.netgate.com
Server: firewall.blah.com
Address: 192.168.150.1
Non-authoritative answer:
Name: forum.netgate.com
Addresses: 2610:160:11:18::199
208.123.73.199
I did have telegraf scraping stats from the resolver as well, but have since turned it off and will continue to monitor.
-
No errors in the resolver log when it's failing to resolve?
If you turn up the logging does it at least show the incoming requests?
-
I actually just caught it again. An entire page filled with these:
Mar 6 20:18:33 unbound 84264 [84264:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Mar 6 20:18:32 unbound 84264 [84264:1] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Mar 6 20:18:32 unbound 84264 [84264:3] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Mar 6 20:18:32 unbound 84264 [84264:3] info: generate keytag query _ta-4f66. NULL IN
Found this: https://forum.netgate.com/topic/152338/unbound-failed-to-prime-trust-anchor-could-not-fetch-dnskey-rrset-dnskey-in and am thinking it's DNSSEC related. I do in fact have forwarding and DNSSEC enabled, so I'm going to play with the settings for a bit. I might also mess with the dynamic DHCP client registration options. Apart from the upgrade, though, I haven't changed anything here in a very, very long time.
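To get a feel for how often this happens, the resolver log can be tallied with a short script. This is a rough sketch that assumes the line layout shown in the excerpt above (date, "unbound", PID, [PID:thread], level, message); the names `LINE_RE` and `summarize` are made up for illustration, and other log formats would need a different pattern.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed layout, taken from the log excerpt above:
# "Mar 6 20:18:33 unbound 84264 [84264:0] info: <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+unbound\s+\d+\s+"
    r"\[(?P<pid>\d+):(?P<thread>\d+)\]\s+(?P<level>\w+):\s+(?P<message>.*)$"
)


def summarize(lines: Iterable[str]) -> Counter:
    """Count trust anchor priming failures and keytag queries in unbound log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        message = match.group("message")
        if message.startswith("failed to prime trust anchor"):
            counts["prime_failed"] += 1
        elif message.startswith("generate keytag query"):
            counts["keytag_query"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Mar 6 20:18:33 unbound 84264 [84264:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN",
        "Mar 6 20:18:32 unbound 84264 [84264:3] info: generate keytag query _ta-4f66. NULL IN",
    ]
    print(summarize(sample))
```

Run against a saved copy of the log, this shows whether the priming failures come in bursts around the times clients stop resolving.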
-
Hmm. Yes, that's DNSSec. I would at least try disabling and see if that removes the issue.
That really shouldn't be a problem though...
-
@llebgrate said in 23.01 breaks DNS resolver and pFblocker:
I actually just caught it again. An entire page filled with these:
Mar 6 20:18:33 unbound 84264 [84264:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Before unbound is started, some housekeeping is done.
unbound is started with a single command that asks it to download a copy of the DNSSEC root key file. Here you can see that file, at the top.
One of the tasks is: prepare a known good copy of the root DNSKEY, id 20326 (for now, as it can change when needed).
The thing is, and this is probably your real issue:
It can't!!
This means your unbound isn't able to download a small, 1 kilobyte file (here it is) from the Internet.
That's not promising. Why would it have to try many times?
This smells like 'uplink issues'.
When you see:
info: generate keytag query _ta-4f66. NULL IN
you know the root key file has been downloaded successfully.
Because hex 4f66 is 20326 decimal, the key ID.
@llebgrate: good news: because you are forwarding, you have to trust the resolver you are forwarding to anyway, so you can disable DNSSEC.
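The hex-to-decimal step is easy to check yourself. A minimal sketch, assuming the `_ta-xxxx` label carries the key tag as hexadecimal, with several tags separated by hyphens (my understanding of the key tag signalling convention); the function name `key_tags_from_label` is made up for illustration:

```python
from typing import List


def key_tags_from_label(label: str) -> List[int]:
    """Return the decimal key tag(s) encoded in a `_ta-xxxx` query label."""
    name = label.rstrip(".").lower()
    if not name.startswith("_ta-"):
        raise ValueError("not a keytag query label: " + label)
    # Everything after "_ta-" is one or more hex key tags joined by hyphens.
    return [int(part, 16) for part in name[len("_ta-"):].split("-")]


if __name__ == "__main__":
    print(key_tags_from_label("_ta-4f66."))  # [20326]
```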
Still, it might be worthwhile to find out why unbound has issues getting 'stuff' from the Internet.
Something is impacting the traffic generated by unbound. That's your DNS traffic; it's not much, but it is very important.
-
@gertjan appreciate the detailed reply.
After some diagnostics on my end, it does not appear to be the DNSSEC setting (I've re-enabled it without issue) but rather the "Use SSL/TLS for outgoing DNS Queries to Forwarding Servers" option. I currently use Google DNS (8.8.8.8/8.8.4.4 > dns.google) and have not had any issues in many years with this enabled, so I'm not sure what happened since the upgrade. I have read that this setting is generally incompatible with DNSSEC, so I've unchecked both for now and everything is working just fine.
-
Generally you would not have DNSSEC enabled with DoT, but only because you will be in forwarding mode for DoT. You should be able to use them together, but it's likely far less tested because there's little point.