Netgate Discussion Forum

    Unbound stops resolving when Domain Overrides DNS not answering

    DHCP and DNS
      iorx

      Windows client was used for tests.
      I cleared the local DNS cache before trying again, yes. Forgot to mention that.
      I ran ipconfig /flushdns and, to be very sure, also restarted the DNS Client service.

        johnpoz LAYER 8 Global Moderator

        So the one NS you're forwarding to becomes unresponsive, i.e. you cannot get to it, right? Then yeah, you're going to run into a timeout timer, which is probably something like 15 minutes.

        Here
        https://nlnetlabs.nl/documentation/unbound/info-timeout/

        Summary

        Unbound implements timeout management with exponential backoff and keeps track of average and variance of the ping times. If a server starts to become unresponsive, a probing scheme is applied in which a few queries are selected to probe the IP address. If that fails, the server is blocked for 15 minutes (infra-ttl) and re-probed with one query after that.

        Queries that failed to attain probe status, or if the server is blocked due to timeouts, get a reply with the SERVFAIL error. Also, if the available IP addresses for a domain have been probed for 5 times by a query it is also replied with SERVFAIL. New queries must come in to continue the probing.

        The status of an IP address can be looked up and flushed. The infra-cache is not flushed on a reload, so the list of blocked sites and ping times is not wiped. If you wish to remove it the flush_infra control command can be used.
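A minimal sketch of how the blocked state described above could be spotted from a script. It assumes the text printed by `unbound-control -c /var/unbound/unbound.conf dump_infra` has one server per line with an `rto <milliseconds>` field, and that 120000 ms is the value at which a server counts as blocked. The sample lines and the threshold are assumptions, not taken from this thread, so check them against real output on your own box.

```python
# Sketch: flag infra-cache entries whose retransmit timeout (rto) has
# reached the level at which Unbound is assumed to stop querying them.
# ASSUMPTION: lines look like "<ip> <zone> ttl N ping N var N rtt N rto N ...".
# ASSUMPTION: 120000 ms is the blocking threshold; pass another value if needed.

BLOCKED_RTO_MS = 120_000


def parse_infra_line(line):
    """Return (ip, zone, rto_ms), or None if the line has no usable rto field."""
    tokens = line.split()
    if len(tokens) < 4 or "rto" not in tokens:
        return None
    idx = tokens.index("rto")
    if idx + 1 >= len(tokens):
        return None
    try:
        rto = int(tokens[idx + 1])
    except ValueError:
        return None
    return tokens[0], tokens[1], rto


def blocked_servers(dump_text, threshold_ms=BLOCKED_RTO_MS):
    """Return every parsed entry whose rto is at or above the threshold."""
    blocked = []
    for line in dump_text.splitlines():
        parsed = parse_infra_line(line)
        if parsed and parsed[2] >= threshold_ms:
            blocked.append(parsed)
    return blocked


# Hypothetical sample input, made up for illustration only.
SAMPLE = """\
10.0.0.10 remotesite.local. ttl 612 ping 0 var 94 rtt 120000 rto 120000 tA 0 tAAAA 0 tother 0
192.0.2.53 example.com. ttl 850 ping 12 var 20 rtt 92 rto 92 tA 0 tAAAA 0 tother 0
"""

if __name__ == "__main__":
    for ip, zone, rto in blocked_servers(SAMPLE):
        print(f"{ip} ({zone}) looks blocked, rto={rto} ms")
```

If an entry shows up as blocked, the `flush_infra` control command mentioned above is the documented way to clear it.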

        edit:
        If you're running into a negative TTL thing, that default value is set on the SOA and could be, say, an hour. You can adjust that in the SOA record on the NS you're forwarding this domain to. I would suggest that when you run into this you actually look at the infra_cache for this domain, etc. You can do that with the unbound-control cmd.

        unbound and bind can handle this stuff differently for sure. I'm pretty sure they work the same for the negative cache and the min TTL from the SOA, but off the top of my head I'm not sure exactly what bind does with an NS that does not respond.

        Really going to need more info when this happens. Unbound could also maybe just be glitched when the interface goes down. This is a VPN interface, and you're having unbound bind directly to the interface? You can work around such issues by having unbound use its loopback for outbound queries vs actually using the interface directly, etc.

        More info is needed to help figure out exactly where the problem is.

        An intelligent man is sometimes forced to be drunk to spend time with his fools
        If you get confused: Listen to the Music Play
        Please don't Chat/PM me for help, unless mod related
        SG-4860 24.11 | Lab VMs 2.8, 24.11

          iorx

          Ah, could have read up on Unbound, my bad. But, thank you so much for taking the time to check from that angle.

          The scenario here comes from multiple installations I have with the same outcome. A central office terminates the remote offices' OpenVPN connections. Some central offices have multiple Windows DCs with DNS, and the domain overrides then look like this:
          domain.suffix pointing to a Windows DC#1 DNS
          domain.suffix pointing to a Windows DC#2 DNS
          Smaller installations, which have only one pfSense or a single DC DNS, have a domain override with only one DNS to ask.

          In one installation, as a test, I've now defined the DCs' FQDNs as host names in the remote office Unbound. This solves the problem when the OpenVPN glitches, the override stops answering, and the host name becomes unresolvable.

          But. I just tried reproducing the problem:
          (Using pfsense Diagnostics/NSLookup from the OpenVPN client side)

          Resolving the host name. Worked as it should, everything is connected.

          Disconnected the tunnel. Tried to resolve the host name again, but this time I got an answer that looked like the remote NS was answering, so obviously I got a cached answer.

          I restarted Unbound and tried resolving. Of course it could not resolve the name, as the cache had been cleared.

          Now to the interesting part: according to my initial description and tests, name resolution should not have worked when I connected the tunnel again. But connecting the tunnel and then trying to resolve the host name worked this time.
          So my diagnosis of the problem, and its reproducibility, is flawed, since it worked now when testing.

          I thank everyone involved for the time and energy you put into this post. I need to get a better picture of when and why domain override host names are not resolving.

          I'll reconnect here when I have more consistent data on how to report this issue.

          Brgs,

            johnpoz LAYER 8 Global Moderator

            Please come back when you have some more info, and make sure you check the infra_cache when something is not working for this forwarded domain. Also check what is currently cached for that domain, etc.

            You can almost always tell when something is returned from cache, because you will normally see a less than round number for the TTL on the returned info.

            If you dig for it and you get back, say, 3600, good bet it was freshly resolved. If you get back a TTL of 1481 or something, yeah, that more than likely was served from cache ;)

            With your domain override you're forwarding to the authoritative NS for that domain, so it will return the full TTL vs something from its cache, unlike when you forward to, say, some public resolver like googledns or quad9, etc.
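The TTL rule of thumb above can be written down as a small helper. This is only a sketch of the heuristic, not a reliable test: the function names and the 300-second "round number" step are assumptions made for illustration, and a cached record can happen to land on a round value.

```python
# Sketch of the "round TTL" rule of thumb for spotting cached answers.
# ASSUMPTION: zone TTLs are typically multiples of `step` seconds.


def likely_from_cache(observed_ttl, zone_ttl):
    """True when the TTL in the answer has counted down below the zone's TTL.

    Use this form when you know the TTL configured on the authoritative NS.
    """
    if observed_ttl < 0 or zone_ttl <= 0:
        raise ValueError("TTLs must be positive")
    return observed_ttl < zone_ttl


def looks_counted_down(observed_ttl, step=300):
    """True when the TTL is not a multiple of `step`, i.e. not a 'round' value.

    Use this form when the configured TTL is unknown.
    """
    if observed_ttl < 0 or step <= 0:
        raise ValueError("TTL and step must be positive")
    return observed_ttl % step != 0


if __name__ == "__main__":
    for ttl in (3600, 1481):
        verdict = "probably cached" if looks_counted_down(ttl) else "probably fresh"
        print(ttl, verdict)
```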

              iorx

              Will do. And thanks for the troubleshooting tips. They're valued, as I'm not that experienced on the subject.

              Just a thought. Is an interface down/up event different for Unbound/pfSense when bringing the tunnel down/up manually versus when the connection is lost (which causes an OpenVPN reconnect)?
              Trying to figure out why my test didn't show the result I was expecting.

                johnpoz LAYER 8 Global Moderator

                Yeah, an interface going down is going to be different than just loss of connection. Any way you can pull the plug on the wire or anything? Or simulate it from the other end by killing the openvpn server or something.

                I would change your outbound interface on unbound to the loopback; this should get around any sort of binding issues with interfaces like a VPN one, etc.

                  John41

                  I am running 2.4.4 and have what appears to be a similar problem. This is over an IPsec tunnel. It has been this way for many versions of pfSense. When the DNS server used for forwarding goes down (probably for longer than the timeout mentioned above), forwarding stops. I haven't worked through the debugging steps in this thread. However, if I add Localhost to Outgoing Network Interfaces in "DNS Resolver General Settings", name resolution for the forwarded domain does not happen at all.

                  Still investigating...

                    Derelict LAYER 8 Netgate

                    That is because sourcing traffic from the firewall can be problematic over VPNs. It can be done but you might have to make some changes. For instance, selecting an outgoing interface that makes the source traffic interesting to IPsec (i.e. it matches the traffic selector(s)) would probably fix your problem. This hack might also work:

                    https://docs.netgate.com/pfsense/en/latest/vpn/ipsec/accessing-firewall-services-over-ipsec-vpns.html

                    If vital infrastructure is necessary for that site to function it might be prudent to add redundancy and move it off the firewall. You could, for instance, run an authoritative slave DNS server (can you still say slave DNS server?) at that site that local users query. That way they could get work done even if the VPN was down for some reason.

                    Chattanooga, Tennessee, USA
                    A comprehensive network diagram is worth 10,000 words and 15 conference calls.
                    DO NOT set a source address/port in a port forward or firewall rule unless you KNOW you need it!
                    Do Not Chat For Help! NO_WAN_EGRESS(TM)

                      John41

                      I will take a look at those options.

                      As you propose I have been thinking of running a DNS server so I can be secondary for the zone I am currently forwarding to. This is not a critical application so in my case might not be worth the overhead.

                      Thanks,

                      John

                        iorx

                        Hi again!

                        Now I'm experiencing this with 2.4.5p1. Newly installed.
                        IPsec to main office.
                        The fix with LAN gateway and route.
                        Domain override in unbound.

                        If the connection is lost for a brief moment, making unbound time out, it stops resolving for the overridden domain.
                        I believe we came to the conclusion that unbound marks the server as unreachable or something and just doesn't bother to ask again.

                        Any new ideas on how to make pfSense/unbound not give up so easily? Or is it possible for a script to detect that unbound has "tombstoned" the entries?

                        Would switching back to DNS Forwarder be a solution, maybe?

                          iorx

                          No response? This is an issue, how to go about getting some attention for it?

                            bmeeks @iorx

                            @iorx said in Unbound stops resolving when Domain Overrides DNS not answering:

                            No response? This is an issue, how to go about getting some attention for it?

                            You can register and submit bug reports on the Redmine site here: https://redmine.pfsense.org/projects/pfsense.

                            Be prepared to fully describe in the report the actual bug and the steps required to reliably recreate the bug.

                              johnpoz LAYER 8 Global Moderator

                              To your other question you can ask unbound who it would ask for something

                              unbound-control -c /var/unbound/unbound.conf lookup www.example.com
                              

                              It should list your domain override NS, and then info about that NS..

                              You could use the flush_negative command with unbound-control as well, to flush all negative data.
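A rough sketch of a script that flushes only when resolution for the override has stopped. The `probe` callable is hypothetical and left to the reader (for example, a test lookup of a known host in the overridden domain). The only control commands used are `flush_infra` and `flush_negative`, both named in this thread. Whether flushing actually cures the problem is not established here, so treat this as an experiment rather than a fix.

```python
# Sketch: run unbound-control flush commands only when a probe lookup fails.
# ASSUMPTION: the config path is the pfSense one quoted in this thread.
import subprocess

UNBOUND_CONTROL = ["unbound-control", "-c", "/var/unbound/unbound.conf"]


def control_command(*args):
    """Build an unbound-control argument list for subprocess.run."""
    return UNBOUND_CONTROL + list(args)


def recovery_commands(target="all"):
    """Commands that clear blocked-server state and cached negative answers."""
    return [
        control_command("flush_infra", target),
        control_command("flush_negative"),
    ]


def recover_if_failing(probe, run=subprocess.run):
    """Call probe(); if it returns False, run the flush commands.

    Returns the list of commands that were executed (empty if probe passed).
    `run` is injectable so the logic can be exercised without Unbound present.
    """
    if probe():
        return []
    executed = []
    for cmd in recovery_commands():
        run(cmd, check=False)
        executed.append(cmd)
    return executed
```

Something like this could be run from cron, with `probe` returning `True` while the override still resolves.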

                                iorx

                                @johnpoz said in Unbound stops resolving when Domain Overrides DNS not answering:

                                unbound-control -c /var/unbound/unbound.conf lookup

                                Nice. I will see if I can find a way to trigger a flush when resolution stops for the overrides (to work around the problem until there is a better solution).

                                For the moment I'm testing DNS Forwarder instead, but I have experienced some weirdness there too. But the Forwarder is "dumb" isn't it? No caching? So maybe the last time it stopped working was related to IPsec; I need to check that further.

                                But I know unbound has this issue. I'll try to create a bug report with reproducible steps to trigger the problem.

                                  johnpoz LAYER 8 Global Moderator

                                  @iorx said in Unbound stops resolving when Domain Overrides DNS not answering:

                                  But the Forwarder is "dumb" isn't it? No caching?

                                  Not sure where you would have gotten that idea; it caches. It would really be pretty pointless if it didn't.

                                  Here I enabled dnsmasq on port 5353 (so I didn't have to turn off unbound), then asked it how big its cache is

                                  $ dig @192.168.9.253 -p 5353 +short chaos txt cachesize.bind
                                  "10000"
                                  

                                  A simple way to see if something is cached or not is to look at how fast it resolves. If you get an answer in 0 or a couple of ms vs how long it would take to forward to where you're forwarding and back, your answer was returned from cache.

                                  You can also ask, with a command like the one above, how many hits your cache has had.

                                  $ dig @192.168.9.253 -p 5353 +short chaos txt hits.bind
                                  "2"
                                  

                                  Do a query for something a few times, and then check it again - see the number go up..

                                  $ dig @192.168.9.253 -p 5353 +short chaos txt hits.bind
                                  "7"
                                  

                                  You can ask it how many misses its had

                                  $ dig @192.168.9.253 -p 5353 +short chaos txt misses.bind
                                  "1"
                                  

                                  Keep in mind I just enabled it 30 seconds ago and have only done a query for www.google.com; I'm not actually using it, etc.

                                  You can get info for cachesize.bind, insertions.bind, evictions.bind, misses.bind, hits.bind, auth.bind and servers.bind
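As a sketch, the counters listed above could be collected and turned into a hit ratio like this. The helper names are made up for illustration. The `dig` arguments mirror the commands shown above, and the parser assumes the quoted-number `+short` output seen in those examples, so it only applies to the numeric counters (the output of `servers.bind` is assumed not to be a single number).

```python
# Sketch: build dig commands for dnsmasq's CHAOS TXT statistics and parse them.
# ASSUMPTION: "+short" output is a single quoted integer such as "10000".

DNSMASQ_STATS = (
    "cachesize", "insertions", "evictions", "misses", "hits", "auth", "servers",
)


def dig_stat_command(server, port, stat):
    """Return the dig argument list for one statistic, e.g. hits.bind."""
    if stat not in DNSMASQ_STATS:
        raise ValueError(f"unknown statistic: {stat}")
    return ["dig", f"@{server}", "-p", str(port),
            "+short", "chaos", "txt", f"{stat}.bind"]


def parse_counter(short_output):
    """Turn dig +short output like '"10000"' into the integer 10000."""
    return int(short_output.strip().strip('"'))


def hit_ratio(hits, misses):
    """Fraction of lookups answered from cache; 0.0 when nothing was asked."""
    total = hits + misses
    return hits / total if total else 0.0


if __name__ == "__main__":
    print(" ".join(dig_stat_command("192.168.9.253", 5353, "hits")))
    print(hit_ratio(parse_counter('"7"'), parse_counter('"1"')))
```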

                                  There is a way you can get it to dump its cache to syslog too. You have to set it to log queries and then send it SIGUSR1, per the man page:

                                  -q, --log-queries
                                       Log the results of DNS queries handled by dnsmasq. Enable a full 
                                       cache dump on receipt of SIGUSR1.
                                  

                                  Unbound is a much more robust DNS option.

                                  Check out the dnsmasq man page for other info
                                  https://linux.die.net/man/8/dnsmasq

                                  BTW, that it caches is right in its description ;)

                                  Name
                                  dnsmasq - A lightweight DHCP and caching DNS server. 
                                  

                                    iorx

                                    I got the forwarder (dnsmasq) capabilities and function backwards, I understand. Didn't read up enough on that, my apologies.
                                    Many thanks for the awesome explanation!

                                    I'll go forth and try to build a reproducible lookup scenario. I'm going to try out both dnsmasq's and unbound's behavior on domain overrides.

                                      johnpoz LAYER 8 Global Moderator

                                      A simple test I would do when you feel you're not resolving something over your VPN connection, be it IPsec or OpenVPN, is to just do a direct query yourself via your fav lookup tool (dig, host, nslookup). Do you get a response?

                                      If not, then there is no possible way unbound or dnsmasq could either. If you do, then you need to figure out why unbound or dnsmasq does not. Did they lose their binding to the interface that would allow them to query down the VPN connection? Exactly what sort of response do you get: timeout, refused, servfail, nx?
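The comparison described above can be sketched as a small classifier over saved `dig` output: one capture from a direct query to the remote NS, one from a query through the resolver. It assumes the usual `status: ...` field in dig's header line and the phrase "timed out" in its failure message. The function names and the three result labels are made up for illustration.

```python
# Sketch: classify dig output and compare a direct query with a resolver query.
# ASSUMPTION: dig prints "status: <RCODE>" on success and a message
# containing "timed out" when no server answers.
import re

STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def classify_dig_output(text):
    """Return the lowercase rcode (noerror, servfail, nxdomain, refused...),
    'timeout' if dig reported a timeout, or 'unknown' otherwise."""
    match = STATUS_RE.search(text)
    if match:
        return match.group(1).lower()
    if "timed out" in text:
        return "timeout"
    return "unknown"


def diagnose(direct_output, resolver_output):
    """Compare a direct query to the remote NS with one through the resolver.

    'path'       -> the remote NS did not answer directly: check VPN/routing.
    'resolver'   -> the remote NS answers, but the resolver says otherwise.
    'consistent' -> both gave the same kind of answer.
    """
    direct = classify_dig_output(direct_output)
    via_resolver = classify_dig_output(resolver_output)
    if direct in ("timeout", "unknown"):
        return "path"
    if via_resolver != direct:
        return "resolver"
    return "consistent"
```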

                                      Was what you were looking for not cached? If it was cached you should have gotten a response whether or not you could talk to that other NS.

                                      I am not clear enough on how routing in pfSense works with IPsec, or what interface you're binding unbound to. But the least-likely-to-fail sort of setup is to set unbound to only use localhost as its outbound interface. It should then use routing to get to wherever you set up a domain override, or for normal resolving/forwarding. If it has a route that says the IP you set up in your domain override is reached over the VPN, it should use that.

                                      If it had some binding issue with its outbound interface, which failed for some reason (say, reconnection of the VPN without a restart of unbound), then sure, it could have problems, which using localhost as the outbound interface could remedy.

                                      Another option, when you're doing odd stuff with VPN connections that could reconnect and affect some application's binding to an interface/IP, is to move the NS off pfSense and put it on your network, so anything it tries to talk to would be routed normally, just like for any other client on your network.

                                        iorx @johnpoz

                                        @johnpoz

                                        (necroposting, sorry for that. but I felt the need to follow up)

                                        To begin with, I never thanked you for educating and helping me on the subject! Thanks!

                                        This has been brewing for a while, I've gone back and forth, tested stuff and given up.

                                        Short info/summary:
                                        "remotesite.local" points to a DNS on the other side of a VPN connection. An override in Unbound.
                                        "localsite.n23" is the local network where I am.
                                        Unbound stops resolving "remotesite.local" hosts after a while. It works for a while again after restarting Unbound and then stops resolving remotesite.local again.

                                        Today, using some extreme google-fu, I realized something: the only overrides that stop resolving are those ending with .local.

                                        What led me to this conclusion was this:

                                        As one can see (logs below), at 17:18 it was able to resolve hosts at the remote site. At 17:19 it couldn't anymore. Checking the Unbound logs, I found that it's not even trying to resolve anything on the .local domain.
                                        I googled around on the issue and found that someone had a similar problem with .local that just stopped responding:
                                        domain-overrides-stop-resolving-periodically-they-only-resume-after-the-service-has-been-restarted
                                        The solution there was to make a ".local" override pointing to a DNS server. I tested that with a "local" override that points to 127.0.0.1.

                                        This was a couple of hours ago and it looks like it's working.
                                        The reason .local was used for the remote domain is ancient: it's a Windows domain created when Microsoft "best practice" was to create local FQDNs with .local at the end.

                                        Unbound log:

                                        Mar 18 17:19:24 	unbound 	52338 	[52338:3] info: validation success host01.remotesite.local. AAAA IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:3] info: validator operate: query host01.remotesite.local. AAAA IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:3] info: finishing processing for host01.remotesite.local. AAAA IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:3] info: resolving host01.remotesite.local. AAAA IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:3] info: validator operate: query host01.remotesite.local. AAAA IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:2] info: validation success host01.remotesite.local.localsite.n23. AAAA IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:2] info: validator operate: query host01.remotesite.local.localsite.n23. AAAA IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:2] info: finishing processing for host01.remotesite.local.localsite.n23. AAAA IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:2] info: resolving host01.remotesite.local.localsite.n23. AAAA IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:2] info: validator operate: query host01.remotesite.local.localsite.n23. AAAA IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:0] info: validation success host01.remotesite.local.localsite.n23. A IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:0] info: validator operate: query host01.remotesite.local.localsite.n23. A IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:0] info: finishing processing for host01.remotesite.local.localsite.n23. A IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:0] info: resolving host01.remotesite.local.localsite.n23. A IN
                                        Mar 18 17:19:24 	unbound 	52338 	[52338:0] info: validator operate: query host01.remotesite.local.localsite.n23. A IN
                                        Mar 18 17:18:04 	unbound 	52338 	[52338:2] info: validation success host01.remotesite.local. A IN
                                        Mar 18 17:18:04 	unbound 	52338 	[52338:2] info: validator operate: query host01.remotesite.local. A IN
                                        Mar 18 17:18:04 	unbound 	52338 	[52338:2] info: finishing processing for host01.remotesite.local. A IN
                                        Mar 18 17:18:04 	unbound 	52338 	[52338:2] info: resolving host01.remotesite.local. A IN
                                        Mar 18 17:18:04 	unbound 	52338 	[52338:2] info: validator operate: query host01.remotesite.local. A IN 
                                        
                                            masupilamie

                                            Can confirm iorx's "workaround" works. It seems the tld needs to be added as a domain override pointing to itself when a subdomain of that tld is used for local resolution and another subdomain is used for remote resolution via domain override.

                                            In my case my local network uses main.lan and the remote site uses remote.lan
                                            Only adding remote.lan as domain override to the remote site's DNS server made it work for less than a minute after flushing unbound's cache. Adding "lan" as domain override pointing to 127.0.0.1 made DNS resolution to remote.lan stable.
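A small sketch that checks a list of domain overrides for the pattern reported in this thread: an override under a private-use TLD with no override for the TLD itself. It only encodes the workaround as observed by the posters, not confirmed Unbound behavior. The helper names are made up, and the default TLD list (`local`, `lan`) is simply the two cases mentioned here.

```python
# Sketch: find private-use TLDs that have a sub-domain override but no
# override of their own, per the workaround reported in this thread.
# ASSUMPTION: only the TLDs in `private_tlds` are worth flagging.


def tld_of(domain):
    """Return the last label of a domain name, lowercased."""
    return domain.strip(".").lower().rsplit(".", 1)[-1]


def missing_tld_overrides(overrides, private_tlds=("local", "lan")):
    """Return the sorted private TLDs that lack their own override entry."""
    names = {d.strip(".").lower() for d in overrides}
    needed = {tld_of(d) for d in names if "." in d}
    return sorted(t for t in needed if t in private_tlds and t not in names)


if __name__ == "__main__":
    print(missing_tld_overrides(["remotesite.local"]))
    print(missing_tld_overrides(["remote.lan", "lan"]))
```

The first call flags `local` and the second flags nothing, since a `lan` override is already present.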

                                            configured Domain Overrides
                                            Screenshot 2025-01-19 at 20.55.04.png

                                            pfsense version: 2.7.2

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.