Netgate Discussion Forum

    23.01 breaks DNS resolver and pFblocker

    General pfSense Questions
    23 Posts 9 Posters 4.1k Views
    • areckethennuA
      areckethennu @johnpoz
      last edited by areckethennu

      @johnpoz said in 23.01 breaks DNS resolver and pFblocker:

      @gertjan good settings, another one I might suggest is serve zero. I have been using it for years, and have never had a problem with it.

      serve 0 allows unbound to serve up the last IP it had in cache for some fqdn even if the ttl has expired. The ttl it hands off to the client will be 0, so if for some reason that doesn't work the client will have to ask again for that fqdn, and by this time unbound will have looked it up.

      An aside from the crux of this thread, but I saw that and I'd like to try it. Is this the actual entry? It looks like it, but I thought I'd ask:

      Serve Expired [] Serve cache records even with TTL of 0
      When enabled, allows unbound to serve one query even with a TTL of 0, if TTL is 0 then new record will be requested in the background when the cache is served to ensure cache is updated without latency on service of the DNS request.
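To make the behaviour described above concrete, here is a toy model in Python. It is only a sketch of the idea (answer from an expired entry with a TTL of 0 and queue a refresh), not unbound's actual implementation; the class `ToyCache` and its methods are made up for illustration.

```python
import time


class ToyCache:
    """Toy model of serve-expired behaviour; NOT unbound's real code."""

    def __init__(self, serve_expired=True):
        self.serve_expired = serve_expired
        self.records = {}        # name -> (address, expiry timestamp)
        self.refresh_queue = []  # names to look up again in the background

    def store(self, name, address, ttl, now=None):
        """Cache an address for `ttl` seconds."""
        now = time.time() if now is None else now
        self.records[name] = (address, now + ttl)

    def lookup(self, name, now=None):
        """Return (address, remaining_ttl), or None on a cache miss."""
        now = time.time() if now is None else now
        if name not in self.records:
            return None
        address, expiry = self.records[name]
        remaining = int(expiry - now)
        if remaining > 0:
            return address, remaining
        if self.serve_expired:
            # Expired: hand out the stale address with TTL 0 and
            # remember that this name needs a fresh lookup.
            self.refresh_queue.append(name)
            return address, 0
        return None
```

With `serve_expired=False` the same expired lookup returns `None`, which in this toy stands for "the client has to wait for a full lookup".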
      

      I'm just a home user with pfSense 23.09-RELEASE (amd64) on a Protecli VP2410

      • johnpozJ
        johnpoz LAYER 8 Global Moderator @areckethennu
        last edited by

        @areckethennu yeah that is the serve 0 setting.. Check the box..

        An intelligent man is sometimes forced to be drunk to spend time with his fools
        If you get confused: Listen to the Music Play
        Please don't Chat/PM me for help, unless mod related
        SG-4860 24.11 | Lab VMs 2.8, 24.11

        • L
          llebgrate
          last edited by

          Just upgraded to 23.01 myself and am experiencing DNS resolver issues. A quick restart of the service fixes it for clients. Wish I had saved the nslookup error on a domain I was testing. Can't recall the exact wording, but it was one I had not seen before about a server error.

          I did enable "Serve Expired" in the advanced DNS resolver settings and will monitor to see if it happens again.

          • D
            Draco @johnpoz
            last edited by

            @johnpoz What I see when I log onto Flickr:

            This site can’t be reached
            www.flickr.com’s server IP address could not be found.
            Try:
            • Checking the connection
            • Checking the proxy, firewall, and DNS configuration
            ERR_NAME_NOT_RESOLVED

            Then after 30+ seconds I see the Flickr homepage frame sans content, and things still appear to be loading after 2 minutes (but do not complete). Once I hit refresh (using Chrome), it loads right up, no problems.

            When I open the Chrome debug window (note that I do not see the same thing you show in your screenshots, e.g. no TIMINGS menu) the timings on the initial page load don't seem too bad:

            [Screenshot: netgate initial timings.png]

            I am not seeing the 0 time for DNS; I suppose because it was not yet cached (I restarted Unbound and cleared my local DNS cache). But then, things get ugly when some fonts and scripts try to load:
            [Screenshot: Netgate scripts loading.png]

            The flickr.com load time is what is detailed above. The font failures are not, I suspect, fatal. I think the problem is the combo?yui:3.16.0/yui.../loader-hermes/... line. This references combo.static.flickr.com. When this load fails (seems to be jscript?), it torpedoes the rest of the website load (and accounts for the hang I see before refresh). My guess is this script does some init work and then loads other scripts.

            When I hit REFRESH in the "hung" browser window, Flickr loads as noted above. The big difference is that the combo?yui?3.16.10…. call to that first Hermes URL (whatever that is -- scripts is my guess, as noted above) succeeds, and whatever it loads leads to a long list of further Hermes calls that also succeed. The website now loads properly.

            I tried doing DIGs on flickr.com and combo.static.flickr.com (the host behind the Hermes URLs). The results do not look alike, and I do not know enough about DIG to interpret them (happy to upload if you want a look).

            I tried running these same tests on my older computer (slower), and I never run into the website load problems. My best guess is that my newer, faster computer is timing out. I did Malware and AV scans on my newer computer to ensure that was not causing issues. So the computer used to load the website makes a difference. I have replicated these problems using both Chrome and Firefox, and on multiple websites (see ironic note below). I've only drilled deeper on Flickr.

            FWIW, when I run a Windows console app to do simple DNS queries (not DIG, just one call), I've seen these fail for google.com and other common sites too. This is why I suspected DNS issues. It now appears that it is not quite that simple, and is tied to machine speed somehow (or software config, or both or...?).

            At this point I am well outside my toolset and knowledge (give me local code and a debugger and it's a different story), and have spent far more time than I have to spare on this issue. I can't have a production machine failing to load websites, and worse failing to do online backups or software updates, for the sake of finding this problem. If Netgate wants to get involved, then I'd be happy to set aside some time to work through this with them (I have the current 23.01 config on another USB stick). I remain hopeful that going back to 22.05 will have my production environment working again.

            I do want to thank you for the time and effort you put into your responses. Without your screenshots and comments I would not have tried DIG or getting the timings included here.

            On the side of irony, when I loaded the forum to enter this response, I hit the same This site can't be reached error. I think the difference between this and Flickr is that the forum isn't loading a script whose failure to load leaves the website non-functional a la Flickr. After a few seconds, the Netgate forum site just loads up and resolves normally. Given my tests with the DNS Query code I wrote (see google.com DNS query failure noted above), it is clearly related to DNS and timing issues; I just cannot pinpoint how.

            • D
              Draco @Draco
              last edited by

              @draco Replying to my last post: I decided to try a reboot from the Console before re-applying 22.05. Someone mentioned rebooting on another thread, which I had not tried because pfSense reboots as part of the upgrade. But I tried it anyhow...

              So far my SG-5100 has been up for almost an hour and I have not repro'd the Flickr problem. What would rebooting change that leaves things working all of a sudden? My primary PC was not rebooted or changed, just the SG-5100.

              • L
                llebgrate
                last edited by

                Just following up that I tried the Serve Expired setting and a simple reboot and unfortunately the problem still persists.

                windows client nslookup:

                forum.netgate.com
                Server: firewall.blah.com
                Address: 192.168.150.1
                *** firewall.blah.com can't find forum.netgate.com: Server failed

                After restarting the DNS Resolver service:

                forum.netgate.com
                Server: firewall.blah.com
                Address: 192.168.150.1

                Non-authoritative answer:
                Name: forum.netgate.com
                Addresses: 2610:160:11:18::199
                208.123.73.199

                I did have telegraf scraping stats from the resolver as well, but have since turned it off and will continue to monitor.
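For anyone who wants to catch the failure without waiting for a browser error, the nslookup check above could be scripted. This is a rough sketch using only the Python standard library: it sends one A query over UDP and reports the DNS response code. RCODE 2 (SERVFAIL) is what nslookup shows as "Server failed". The function names are made up for illustration, and the sketch does not verify the transaction ID or retry over TCP.

```python
import socket
import struct

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, txid=0x1234):
    """Build a minimal DNS query for an A record with recursion desired."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def rcode_of(response):
    """Return the response code name from the low 4 bits of the flags field."""
    if len(response) < 4:
        return "MALFORMED"
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def check(name, server, timeout=3.0):
    """Send one query to server:53 and return the RCODE name, or TIMEOUT."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
    return rcode_of(response)
```

Calling `check("forum.netgate.com", "192.168.150.1")` (the resolver address from the output above) every few seconds and logging the result with a timestamp would show exactly when the resolver starts answering SERVFAIL.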

                • stephenw10S
                  stephenw10 Netgate Administrator
                  last edited by

                  No errors in the resolver log when it's failing to resolve?

                  If you turn up the logging does it at least show the incoming requests?

                  • L
                    llebgrate @stephenw10
                    last edited by llebgrate

                    @stephenw10

                    I actually just caught it again. An entire page filled with these:

                    Mar 6 20:18:33	unbound	84264	[84264:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
                    Mar 6 20:18:32	unbound	84264	[84264:1] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
                    Mar 6 20:18:32	unbound	84264	[84264:3] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN
                    Mar 6 20:18:32	unbound	84264	[84264:3] info: generate keytag query _ta-4f66. NULL IN
                    

                    Found this: https://forum.netgate.com/topic/152338/unbound-failed-to-prime-trust-anchor-could-not-fetch-dnskey-rrset-dnskey-in and am thinking it's DNSSEC related. I do in fact have forwarding and DNSSEC enabled, so I'm going to play with the settings for a bit. I might also mess with the dynamic DHCP client registration options. I haven't changed anything here in a very long time, though, other than the upgrade.
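When the log fills a whole page like this, a quick tally can show which messages dominate. A small Python sketch, assuming the lines are copied out of the resolver log in the format shown above; the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Matches the part of a log line after the date/process columns,
# e.g. "[84264:0] info: failed to prime trust anchor -- ..."
LINE = re.compile(r"\[(?P<pid>\d+):(?P<thread>\d+)\] (?P<level>\w+): (?P<message>.*)")


def summarize(lines):
    """Count log messages by their text before any ' -- ' detail."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match:
            counts[match.group("message").split(" -- ")[0]] += 1
    return counts


sample = [
    "Mar 6 20:18:33\tunbound\t84264\t[84264:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN",
    "Mar 6 20:18:32\tunbound\t84264\t[84264:3] info: generate keytag query _ta-4f66. NULL IN",
]
print(summarize(sample))
```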

                    • stephenw10S
                      stephenw10 Netgate Administrator
                      last edited by

                      Hmm. Yes, that's DNSSec. I would at least try disabling and see if that removes the issue.
                      That really shouldn't be a problem though...

                      • GertjanG
                        Gertjan @llebgrate
                        last edited by

                        @llebgrate said in 23.01 breaks DNS resolver and pFblocker:

                        I actually just caught it again. An entire page filled with these:

                        Mar 6 20:18:33 unbound 84264 [84264:0] info: failed to prime trust anchor -- DNSKEY rrset is not secure . DNSKEY IN

                        Before unbound is started, some housekeeping is done.
                        unbound is started with a single command that asks it to download a copy of the DNSSEC root key file. Here you can see that file, at the top.
                        One of the tasks is : prepare a known-good copy of the root DNSKEY, id 20326 (for now, as it can change when needed).

                        The thing is, and this is probably your real issue :
                        It can't !!
                        This means your unbound isn't able to download a small 1 kilobyte file (here it is) from the Internet.
                        That's not promising. Why would it have to try many times ?
                        This smells 'uplink issues'.

                        When you see :

                        info: generate keytag query _ta-4f66. NULL IN
                        

                        you know the root key file has been downloaded successfully.
                        Because hex 4f66 is 20326 decimal, the key ID.
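The hex-to-decimal step can be checked in a couple of lines of Python. The `_ta-` prefix is taken from the log line quoted above; the helper name `keytag_label` is made up for illustration.

```python
def keytag_label(key_id):
    """Format a key ID the way it appears in the log line: _ta-<4 hex digits>."""
    return f"_ta-{key_id:04x}"


# Hex 4f66 from the log line, converted to decimal.
key_id = int("4f66", 16)
print(key_id)                # 20326
print(keytag_label(key_id))  # _ta-4f66
```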

                        @llebgrate: good news : because you are forwarding, you have to trust the resolver you are forwarding to anyway, so you can disable DNSSEC.
                        Still, it might be worthwhile to find out why unbound has issues getting 'stuff' from the Internet.
                        Something is impacting the traffic generated by unbound. That's your DNS traffic; it's not much, but it's very important.

                        No "help me" PM's please. Use the forum, the community will thank you.
                        Edit : and where are the logs ??

                        • L
                          llebgrate @Gertjan
                          last edited by llebgrate

                          @gertjan appreciate the detailed reply.

                          After some diagnostics on my end, it does not appear to be DNSSEC settings (I've re-enabled it w/out issue) but rather the Use SSL/TLS for outgoing DNS Queries to Forwarding Servers. I currently use Google DNS (8.8.8.8/8.8.4.4 > dns.google) and have not had any issues in many years with this enabled so not sure what happened since the upgrade. I have read that this setting is generally incompatible with DNSSEC, so I've unchecked both for now and everything is working just fine.

                          • stephenw10S
                            stephenw10 Netgate Administrator
                            last edited by

                            Generally you would not have DNSSEC enabled with DoT, but only because you will be in forwarding mode for DoT. You should be able to use them together, but that combination is likely far less tested because there's little point.

                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.