23.01 breaks DNS resolver and pFblocker
-
@johnpoz correct me please if I am wrong, but the dig you show, I believe, is just for the initial resolution of the Reddit DNS name/IP address. I’m getting to the initial website with a fairly short delay, but then things will, on Flickr for instance, hang for quite some time. I don’t have the tools to see what’s going on in that HTTPS connection, but if you look at the page source, there is a lot of “asset loading” from other URLs. I don’t think dig picks that up, and I think that is part of what is killing my performance (until the cache fills).
A similar problem happens with my backup program. If I rerun it in the foreground after it fails, it completes without issue.
I’m not sure why you think this is related to my Internet connection or DNS configuration, when none of that changed with my switch to 23.01?
Please don’t get me wrong. I truly appreciate your help! I just don’t see a way to capture timings for all of the DNS lookups as each program (or website) runs for the first time, or to emulate that. Perhaps clearing my DNS logs and turning on more detailed Unbound logging would capture that? I’m not sure.
Before I posted, I did look at the logs for Unbound (and all the pfBlockerNG logs, which looked fine), and Unbound was not restarting all the time. Only when pfBlockerNG ran its Cron job and there were changes to the IP block lists or DNS blacklists/whitelists.
The simplest way I can think of to test this is to go back to the 22.05 configuration. If the problems are gone, then it’s pretty certain the problem is tied to something that changed in 23.01. If that doesn’t fix it, I can quickly repave to 22.05 and continue testing.
-
@draco said in 23.01 breaks DNS resolver and pFblocker:
Flickr for instance, hang for quite some time
OK, how long does that take? In your browser's web tools or whatever, look to see what it is trying to resolve..
Again, if you're trying to troubleshoot something, you need specifics.. If flickr is the problem - what on it is the hold up..
That was even faster than reddit
; <<>> DiG 9.18.8 <<>> www.flickr.com +trace
;; global options: +cmd
. 76294 IN NS l.root-servers.net.
. 76294 IN NS a.root-servers.net.
. 76294 IN NS c.root-servers.net.
. 76294 IN NS j.root-servers.net.
. 76294 IN NS k.root-servers.net.
. 76294 IN NS h.root-servers.net.
. 76294 IN NS g.root-servers.net.
. 76294 IN NS b.root-servers.net.
. 76294 IN NS d.root-servers.net.
. 76294 IN NS f.root-servers.net.
. 76294 IN NS m.root-servers.net.
. 76294 IN NS e.root-servers.net.
. 76294 IN NS i.root-servers.net.
. 76294 IN RRSIG NS 8 0 518400 20230317050000 20230304040000 951 . kJaY9uAyEtbTnrjZ1qAQTsqHExUgSViSqmXstFQXUmBOgAbAHKQlp9Nj BCAb0pUbm3sDWOrGvOaqxN6QKFXd8331v6lxtsDKd3kIGE5Wo7kLwzw4 XzZeGRwfuRPnmwXtfYnJTo+X4tGgg2xK6c0uy5QdsFVzHEPwJNXURZVE rXQ/erzXJmKUXFuZim8sfm7UjTTJJsBwk8+P8uM+B9CKDtfE0CvxtyIS BbGi9pg4PDlJz0zB3V9VM/9+IcJuQ4NfnBDvw3pD9Q0LVx9qN2GzG1TK 06r6LMEBB9RRhO5wkZ7UwZuVzloYxntIpBMVL3zdTl2vVCIQFlzqSqJL 5OfW1A==
;; Received 525 bytes from 127.0.0.1#53(127.0.0.1) in 1 ms

com. 172800 IN NS e.gtld-servers.net.
com. 172800 IN NS f.gtld-servers.net.
com. 172800 IN NS c.gtld-servers.net.
com. 172800 IN NS l.gtld-servers.net.
com. 172800 IN NS b.gtld-servers.net.
com. 172800 IN NS a.gtld-servers.net.
com. 172800 IN NS k.gtld-servers.net.
com. 172800 IN NS d.gtld-servers.net.
com. 172800 IN NS h.gtld-servers.net.
com. 172800 IN NS j.gtld-servers.net.
com. 172800 IN NS i.gtld-servers.net.
com. 172800 IN NS g.gtld-servers.net.
com. 172800 IN NS m.gtld-servers.net.
com. 86400 IN DS 30909 8 2 E2D3C916F6DEEAC73294E8268FB5885044A833FC5459588F4A9184CF C41A5766
com. 86400 IN RRSIG DS 8 1 86400 20230317170000 20230304160000 951 . hxP6AHA8/MhX3JTy2BcSkd4CeviA+3lw1LFWfIPNDgpdka84SUuKYc50 8hhh7bcW5/MDeKZJ82JkxBlZkrWWaNpGncQKOLdjmlkesYB03WPoOo/I aJohqNzLawFsVK4+2c48yrCeX1uesJQCiJnvJEUHyJmd8KtrRYeUnqDn nBiIlzuEHm5r3TQodZTO8AiH+Dp722SzlP8E8JI8LPdsozvClNKTGcCp KZVPMq3yCeuZA8+T859Ah8HuJjyh4NAEIAQe2K4uuD9B2ZSCt9lEf5i1 qcBwMXtUf9Od86hnXK/cjI6uCNMCPBBeN6QJ7uIQK64zHBZhejcPq0EN PU5D9w==
;; Received 1205 bytes from 2001:500:12::d0d#53(g.root-servers.net) in 45 ms

flickr.com. 172800 IN NS ns-573.awsdns-07.net.
flickr.com. 172800 IN NS ns-421.awsdns-52.com.
flickr.com. 172800 IN NS ns-1683.awsdns-18.co.uk.
flickr.com. 172800 IN NS ns-1244.awsdns-27.org.
CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 86400 IN NSEC3 1 1 0 - CK0Q2D6NI4I7EQH8NA30NS61O48UL8G5 NS SOA RRSIG DNSKEY NSEC3PARAM
CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 86400 IN RRSIG NSEC3 8 2 86400 20230311052252 20230304041252 36739 com. exNISCQI4v/S0m9ksCZH3zghILb9b1aARin3TLpc3yxNweWFzrozuCSm GnYNeNNy8OjdvPFw3/uue0qCY6vux7LlhCALbK4pGq58BFz2p7JZz7Um dCN3AnraZXWMhkG80d0ovafSyqOLPwBMg6rGXJyQnvFDkA2Y46ClZhOz r6PU3UvTEmtsa1IDaG8UeDdySojtSmMjSqaEepy7US86Gg==
8AEGLV925R77BHJM7FFD4RKA8CGTNSFK.com. 86400 IN NSEC3 1 1 0 - 8AEGTREIKABQ6N53PE432PFN3BMU2HM1 NS DS RRSIG
8AEGLV925R77BHJM7FFD4RKA8CGTNSFK.com. 86400 IN RRSIG NSEC3 8 2 86400 20230309061449 20230302050449 36739 com. kejD8AEDs1s8jUO2xUTJ2IN6Bgh2A5ItECrYExvbbQYZzSSnlbPEzyL7 n6uDtlE6TrYpOU/uH4wM+0Pt/USS4EmSUty+07+RF4hoM512BfYkUjxj QpQoLeFTRh3oFtFUQfQgYPD5oVJOtFcGErUhJ3lz3J4y9yavaa9phYxu Web4Fx3MJlvsA67u7Kp9NlrfTiF0JXHfqXBLyhXDbWM+zQ==
;; Received 745 bytes from 192.55.83.30#53(m.gtld-servers.net) in 19 ms

www.flickr.com. 60 IN A 99.84.171.73
flickr.com. 300 IN NS ns-1244.awsdns-27.org.
flickr.com. 300 IN NS ns-1683.awsdns-18.co.uk.
flickr.com. 300 IN NS ns-421.awsdns-52.com.
flickr.com. 300 IN NS ns-573.awsdns-07.net.
;; Received 196 bytes from 205.251.196.220#53(ns-1244.awsdns-27.org) in 27 ms

[23.01-RELEASE][admin@sg4860.local.lan]/:
And yes, with dig that is a FULL resolve - this would be the slowest lookup of anything, because it's a full resolve, down from the roots.. Even once the ttl for www.flickr.com expires - and a 60 second ttl is just Fing insane - you would only have to go talk to the NS for flickr.com directly..
Going back to 22.05 doesn't really tell you what the problem is - keep in mind unbound changed from like 1.15 to 1.17.1
I would troubleshoot the exact problem vs rolling back all of pfsense.. Which really gives you nowhere to even start on what the actual problem is..
I can tell you right now there is nothing wrong with 23.01 or unbound 1.17.1 at least how I have mine configured.. Because again I have zero issues. maybe something specific with your hardware, your config, your connection, etc..
You know when you roll back? When you are in a limited change window for the update, something is not working and the change window is expiring.. You hit the rollback mark, and that is when you roll back ;)
There is not really a website on the planet anymore that loads just www.domain.tld -- they are all going to load sub domains or other resources off other domains, etc.. So call up your browser tools... How long does the page take to load?
I am not seeing any issues with www.flickr.com loading.. But then all I get is a page saying start for free.. What exactly are you loading..
If I call up the browser console - I see quite a few "errors" etc.. but nothing looks dns related - and the page popped pretty much instantly and the background keeps changing pictures instantly.
If I view what is going on in the network and how long stuff takes - I see my ad blocker is blocking some stuff
But don't see anything failing from dns, or time out, etc.
Click the timing button - what does it show for dns resolution.. Clear your browser cache, restart unbound so its cache is clear, and clear your OS cache as well (on Windows, ipconfig /flushdns), etc..
-
@johnpoz As a pfSense and networking newbie, I just wanted to thank you John for your in-depth explanations, with pictures even! Much appreciated!
-
@johnpoz said in 23.01 breaks DNS resolver and pFblocker:
@gertjan good settings, another one I might suggest is serve zero. I have been using it for years, and have never had a problem with it.
serve 0 allows unbound to serve up the last IP it had in cache for some fqdn even if the ttl has expired. The ttl it hands off to the client will be 0, so if for some reason that doesn't work the client will have to ask again for that fqdn, and by this time unbound will have looked it up.
An aside from the crux of this thread, but I saw that and I'd like to try it. Is this the actual entry? It looks like it, but I thought I'd ask:
Serve Expired
[ ] Serve cache records even with TTL of 0
When enabled, allows unbound to serve one query even with a TTL of 0, if TTL is 0 then new record will be requested in the background when the cache is served to ensure cache is updated without latency on service of the DNS request.
-
@areckethennu yeah that is the serve 0 setting.. Check the box..
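The "serve 0" behaviour described above can be illustrated with a toy cache. This is only a sketch of the idea, not how unbound is implemented: the class name ServeExpiredCache, the lookup callback and the run_refreshes method are all made up for illustration, and the background refresh is modelled as a simple queue that the caller drains.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Entry:
    address: str
    expires_at: float  # absolute time at which the record's TTL runs out


class ServeExpiredCache:
    """Toy model of a resolver cache that serves expired records with TTL 0."""

    def __init__(self, lookup: Callable[[str], Tuple[str, int]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.lookup = lookup  # hypothetical upstream lookup: name -> (address, ttl)
        self.clock = clock
        self.entries: Dict[str, Entry] = {}
        self.pending_refresh: List[str] = []

    def _store(self, name: str) -> Tuple[str, int]:
        address, ttl = self.lookup(name)
        self.entries[name] = Entry(address, self.clock() + ttl)
        return address, ttl

    def resolve(self, name: str) -> Tuple[str, int]:
        entry = self.entries.get(name)
        if entry is None:
            # Cache miss: the client has to wait for the full lookup.
            return self._store(name)
        remaining = entry.expires_at - self.clock()
        if remaining > 0:
            # Still fresh: answer from cache with the remaining TTL.
            return entry.address, int(remaining)
        # Expired: answer at once with the last known address and TTL 0,
        # and queue a refresh so the next client gets fresh data.
        if name not in self.pending_refresh:
            self.pending_refresh.append(name)
        return entry.address, 0

    def run_refreshes(self) -> None:
        """Stand-in for the background lookups a real resolver would start."""
        while self.pending_refresh:
            self._store(self.pending_refresh.pop(0))
```

The point of the TTL 0 in the stale answer is that the client will not hold on to it: if the old address no longer works, the client asks again, and by then the refreshed record should be in the cache.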
-
Just upgraded to 23.01 myself and am experiencing DNS resolver issues. A quick restart of the service fixes it for clients. Wish I had saved the nslookup error on a domain I was testing. Can't recall the exact wording, but it was one I had not seen before about a server error.
I did enable "Serve Expired" in the advanced DNS resolver settings and will monitor to see if it happens again.
-
@johnpoz What I see when I log onto Flickr:
This site can’t be reached
www.flickr.com’s server IP address could not be found.
Try:
• Checking the connection
• Checking the proxy, firewall, and DNS configuration
ERR_NAME_NOT_RESOLVED
Then after 30+ seconds I see the Flickr homepage frame sans content, and things still appear to be loading after 2 minutes (but do not complete). Once I hit refresh (using Chrome), it loads right up, no problems.
When I open the Chrome debug window (note that I do not see the same thing you show in your screenshots, e.g. no TIMINGS menu) the timings on the initial page load don't seem too bad:
I am not seeing the 0 time for DNS; I suppose because it was not yet cached (I restarted Unbound and cleared my local DNS cache). But then, things get ugly when some fonts and scripts try to load:
The flickr.com load time is what is detailed above. The font failures are not, I suspect, fatal. I think the problem is the combo?yui:3.16.0/yui.../loader-hermes/... line. This references combo.static.flickr.com. When this load fails (seems to be jscript?), it torpedoes the rest of the website load (and accounts for the hang I see before refresh). My guess is this script does some init work and then loads other scripts.
When I hit REFRESH in the "hung" browser window, Flickr loads as noted above. The big difference is that the combo?yui?3.16.10…. call to that first Hermes URL (whatever that is -- scripts is my guess, as noted above) succeeds, and whatever that loads leads to a long list of more Hermes calls that also succeed. The website now loads properly.
I tried doing DIGs on flickr.com and combo.static.flickr.com (the URL for the Hermes URLs). They do not look alike, and I do not know enough about DIG to interpret that (happy to upload if you want a look).
I tried running these same tests on my older computer (slower), and I never run into the website load problems. My best guess is that my newer, faster computer is timing out. I did Malware and AV scans on my newer computer to ensure that was not causing issues. So the computer used to load the website makes a difference. I have replicated these problems using both Chrome and Firefox, and on multiple websites (see ironic note below). I've only drilled deeper on Flickr.
FWIW, when I run a Windows console app to do simple DNS queries (not DIG, just one call), I've seen these fail for google.com and other common sites too. This is why I suspected DNS issues. It now appears that it is not quite that simple, and is tied to machine speed somehow (or software config, or both or...?).
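A rough way to capture the kind of per-lookup timings and intermittent failures described here is a small script that resolves a list of names several times and records how long each attempt takes and whether it fails. This is only a sketch under some assumptions: the function names time_lookup and survey are made up for illustration, and the lookups go through the operating system's resolver via socket.getaddrinfo, so a local OS cache may answer before a query ever reaches Unbound.

```python
import socket
import time
from typing import Iterable, List, Optional, Tuple


def time_lookup(hostname: str) -> Tuple[float, Optional[str]]:
    """Resolve hostname once; return (elapsed milliseconds, error text or None)."""
    start = time.perf_counter()
    error: Optional[str] = None
    try:
        socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # Resolution failures (e.g. name not found, server failure) land here.
        error = str(exc)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, error


def survey(hostnames: Iterable[str], rounds: int = 3,
           pause: float = 1.0) -> List[Tuple[str, int, float, Optional[str]]]:
    """Resolve every name `rounds` times; return (name, round, ms, error) rows."""
    names = list(hostnames)
    rows: List[Tuple[str, int, float, Optional[str]]] = []
    for round_no in range(1, rounds + 1):
        for name in names:
            elapsed_ms, error = time_lookup(name)
            rows.append((name, round_no, elapsed_ms, error))
        if round_no < rounds:
            time.sleep(pause)  # give caches and the resolver a moment between rounds
    return rows
```

Calling survey(["google.com", "www.flickr.com", "combo.static.flickr.com"]) and printing the rows would show whether the first round is much slower than later ones, or whether failures come and go. Comparing the output from the fast and the slow PC might help narrow down the timing difference mentioned above.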
At this point I am well outside my toolset and knowledge (give me local code and a debugger and it's a different story), and have spent far more time than I have to spare on this issue. I can't have a production machine failing to load websites, and worse failing to do online backups or software updates, for the sake of finding this problem. If Netgate wants to get involved, then I'd be happy to set aside some time to work through this with them (I have the current 23.01 config on another USB stick). I remain hopeful that going back to 22.05 will have my production environment working again.
I do want to thank you for the time and effort you put into your responses. Without your screenshots and comments I would not have tried DIG or getting the timings included here.
On the side of irony, when I loaded the forum to enter this response, I hit the same This site can't be reached error. I think the difference between this and Flickr is that the forum isn't loading a script whose failure to load leaves the website non-functional a la Flickr. After a few seconds, the Netgate forum site just loads up and resolves normally. Given my tests with the DNS Query code I wrote (see google.com DNS query failure noted above), it is clearly related to DNS and timing issues; I just cannot pinpoint how.
-
@draco Replying to my last post: I decided to try a reboot from the Console before re-applying 22.05. Someone mentioned rebooting on another thread, which I had not tried because pfSense reboots as part of the upgrade. But I tried it anyhow...
So far my SG-5100 has been up for almost an hour and I have not repro'd the Flickr problem. What would rebooting change that leaves things working all of a sudden? My primary PC was not rebooted or changed, just the SG-5100.
-
Just following up that I tried the Serve Expired setting and a simple reboot and unfortunately the problem still persists.
windows client nslookup:
forum.netgate.com
Server: firewall.blah.com
Address: 192.168.150.1
*** firewall.blah.com can't find forum.netgate.com: Server failed
restart DNS Resolver service
forum.netgate.com
Server: firewall.blah.com
Address: 192.168.150.1
Non-authoritative answer:
Name: forum.netgate.com
Addresses: 2610:160:11:18::199
208.123.73.199
I did have telegraf scraping stats from the resolver as well, but have since turned it off and will continue to monitor.
-
No errors in the resolver log when it's failing to resolve?
If you turn up the logging does it at least show the incoming requests?
-
I actually just caught it again. An entire page filled with these:
Mar 6 20:18:33 unbound 84264 [84264:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Mar 6 20:18:32 unbound 84264 [84264:1] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Mar 6 20:18:32 unbound 84264 [84264:3] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Mar 6 20:18:32 unbound 84264 [84264:3] info: generate keytag query _ta-4f66. NULL IN
Found this: https://forum.netgate.com/topic/152338/unbound-failed-to-prime-trust-anchor-could-not-fetch-dnskey-rrset-dnskey-in and am thinking it's DNSSEC related. I do in fact have forwarding and DNSSEC enabled, so I'm going to play with the settings for a bit. Might also mess with the dynamic DHCP client registration options. Apart from the upgrade, I haven't changed anything here in a very, very long time.
-
Hmm. Yes, that's DNSSec. I would at least try disabling and see if that removes the issue.
That really shouldn't be a problem though...
-
@llebgrate said in 23.01 breaks DNS resolver and pFblocker:
I actually just caught it again. An entire page filled with these:
Mar 6 20:18:33 unbound 84264 [84264:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
Before unbound is started, some house keeping is done.
unbound is started with a single command that asks it to download a copy of the DNSSEC root key file. Here you can see that file, at the top.
One of the tasks is : prepare a known-good copy of the root DNSKEY, id 20326 (for now, as it can change when needed).
The thing is, and this is probably your real issue :
It can't !!
This means your unbound isn't able to download a small, 1 kilobyte file (here it is) from the Internet.
That's not promising. Why would it have to try many times ?
This smells 'uplink issues'.
When you see :
info: generate keytag query _ta-4f66. NULL IN
you know the root key file has been downloaded successfully.
Because hex 4f66 is 20326 decimal, the key ID.
@llebgrate: good news : because you are forwarding, you have to trust the resolver you are forwarding to anyway, so you can disable DNSSEC.
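The key tag arithmetic above is easy to check. A minimal sketch, assuming the query name follows the RFC 8145 convention of "_ta-" followed by one or more hyphen-separated hexadecimal key tags; the function name parse_keytag_query is made up for illustration.

```python
from typing import List


def parse_keytag_query(qname: str) -> List[int]:
    """Return the decimal key tags encoded in a '_ta-xxxx' style query name."""
    # Drop the trailing root dot and keep only the first label.
    label = qname.rstrip(".").split(".")[0].lower()
    if not label.startswith("_ta-"):
        raise ValueError("not a key tag query name: " + qname)
    # Each hyphen-separated field is one key tag in hexadecimal.
    return [int(part, 16) for part in label[len("_ta-"):].split("-")]


print(parse_keytag_query("_ta-4f66."))
```

For the name in the log line quoted above this gives the single key tag 20326.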
Still, it might be worthwhile to find out why unbound has issues getting 'stuff' from the Internet.
Something is impacting the traffic generated by unbound. That's your DNS traffic: it's not much, but it is very important.
-
@gertjan appreciate the detailed reply.
After some diagnostics on my end, it does not appear to be DNSSEC settings (I've re-enabled it w/out issue) but rather the Use SSL/TLS for outgoing DNS Queries to Forwarding Servers. I currently use Google DNS (8.8.8.8/8.8.4.4 > dns.google) and have not had any issues in many years with this enabled so not sure what happened since the upgrade. I have read that this setting is generally incompatible with DNSSEC, so I've unchecked both for now and everything is working just fine.
-
Generally you would not have DNSSEC enabled with DoT, but only because you will be in forwarding mode for DoT. You should be able to use them together, but it's likely far less tested because there's little point.