Unbound stops resolving when Domain Overrides DNS not answering



  • Hi,

    Been suffering from this a while now.
    Hope I succeed in explaining the issue here.

    Scenario:
    Latest pfSense all over.
    Peer to peer OpenVPN connections
    Domain Overrides for remote domain lookup

    Problem:
    If the OpenVPN connection glitches (goes down for a moment) and Unbound tries to resolve something in that Domain Override during the outage, it looks like it marks that lookup as "not found", even after the connection comes back up again.
    Workaround / temp fix:
    A restart of the Unbound service is required to get resolving to work for that domain again.

    Note:
    Bind didn't show this behavior and worked better with retrying resolving names.

    Solution:
    Is there any setting/timeout that can be set to make Unbound retry harder, or to stop it marking host names as unavailable because it can't reach the Domain Override DNS?

    Testing it:
    Ping hostname on other side of OpenVPN tunnel. Works.
    Disconnect OpenVPN tunnel.
    Ping the same hostname again. Of course it's not resolved; host lookup fails.
    Connect OpenVPN tunnel
    Ping the same hostname again. Host lookup fails even though remote DNS (Unbound) is now available for answers.

    My thoughts:
    Unbound caches / marks a host name as not available without rechecking whether it can resolve it.
    So can Unbound's behavior here be changed?

    Brgs,



  • Hi again,

    I'll offer one bump on this subject.

    Feel free to comment if more information is needed to get a grasp of the situation or environment.

    Brgs,


  • LAYER 8 Moderator

    @iorx said in Unbound stops resolving when Domain Overrides DNS not answering:

    Ping the same hostname again. Host lookup fails even though remote DNS (Unbound) is now available for answers.

    You are taking into account that the "still not available" could simply be YOUR testmachine caching the DNS answer until (like specified by DNS rules) the negative caching TTL is reached? Something that unbound does, too?
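
For reference, the negative caching TTL mentioned above comes from RFC 2308: an NXDOMAIN/NODATA answer may be cached for the smaller of the SOA record's own TTL and the SOA MINIMUM field. A minimal sketch of that rule follows. The `resolver_cap` parameter is an assumption standing in for a resolver-side limit such as Unbound's `cache-max-negative-ttl`, and the function name is made up for illustration. Note that a timeout towards an unreachable server is a different case (SERVFAIL), so this only applies if the remote DNS actually answered "no such name".

```python
def negative_cache_ttl(soa_record_ttl: int, soa_minimum: int,
                       resolver_cap: int = 3600) -> int:
    """Seconds an NXDOMAIN/NODATA answer may be cached (RFC 2308 rule).

    resolver_cap is an assumed resolver-side upper limit, not part of the RFC rule.
    """
    return min(soa_record_ttl, soa_minimum, resolver_cap)


if __name__ == "__main__":
    # SOA record TTL of one hour, MINIMUM field of 15 minutes
    print(negative_cache_ttl(3600, 900))  # 900
```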



  • Windows client was used for tests.
    I cleared the local DNS cache before trying again, yes. Forgot to mention that.
    I ran ipconfig /flushdns and, to be very sure, also restarted the DNS Client service.


  • LAYER 8 Global Moderator

    So the 1 NS you're forwarding to becomes unresponsive, i.e. you can not get to it, right? Then yeah, you're going to run into a timeout timer, which is prob like 15 minutes.

    Here
    https://nlnetlabs.nl/documentation/unbound/info-timeout/

    Summary

    Unbound implements timeout management with exponential backoff and keeps track of average and variance of the ping times. If a server starts to become unresponsive, a probing scheme is applied in which a few queries are selected to probe the IP address. If that fails, the server is blocked for 15 minutes (infra-ttl) and re-probed with one query after that.

    Queries that failed to attain probe status, or if the server is blocked due to timeouts, get a reply with the SERVFAIL error. Also, if the available IP addresses for a domain have been probed for 5 times by a query it is also replied with SERVFAIL. New queries must come in to continue the probing.

    The status of an IP address can be looked up and flushed. The infra-cache is not flushed on a reload, so the list of blocked sites and ping times is not wiped. If you wish to remove it the flush_infra control command can be used.
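
To see why a lookup can keep failing after the tunnel is back, here is a deliberately simplified toy model of the blocking behaviour described in the summary. It is not Unbound's implementation: it skips the exponential backoff and the probing scheme and only keeps the "blocked for 15 minutes, then tried again" part. The class and names are made up for illustration.

```python
from dataclasses import dataclass

INFRA_TTL = 900  # seconds; the 15 minutes from the summary above


@dataclass
class UpstreamState:
    """Toy model of one forwarder's entry in the infra cache."""
    blocked_until: float = 0.0  # time until which the server is not tried

    def query(self, now: float, server_reachable: bool) -> str:
        if now < self.blocked_until:
            # Still blocked: the server is not contacted at all.
            return "SERVFAIL"
        if server_reachable:
            return "ANSWER"
        # The attempt failed, so block the server for another INFRA_TTL.
        self.blocked_until = now + INFRA_TTL
        return "SERVFAIL"


if __name__ == "__main__":
    state = UpstreamState()
    print(state.query(0, server_reachable=False))   # tunnel down -> SERVFAIL
    print(state.query(60, server_reachable=True))   # tunnel back, still blocked -> SERVFAIL
    print(state.query(901, server_reachable=True))  # block expired -> ANSWER
```

In this model, restarting the service corresponds to creating a fresh `UpstreamState`, which matches the observation that a restart makes resolving work again.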

    edit:
    If you're running into a neg TTL thing, that default value is set on the SOA, could be say an hour.. You can adjust that in the SOA record on the NS you're forwarding to for this domain. I would suggest when you run into this to actually look at the infra_cache for this domain, etc. You can do that with the unbound-control cmd.

    unbound and bind can handle this stuff differently for sure.. pretty sure they work the same for neg cache and the min ttl from the SOA... But off the top of my head I'm not sure exactly what bind does on a NS that does not respond.

    Really going to need more info when this happens. Unbound could also maybe just be glitched when the interface goes down.. This is a vpn interface, and you're having unbound bind directly to the interface? You can work around such issues by having unbound use its loopback as the outbound query interface vs actually using the interface directly, etc.

    More info is needed to help figure out exactly where the problem is.
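
The checks suggested above could be collected in a small helper like the following sketch. The subcommands `dump_infra`, `lookup` and `flush_infra` are standard `unbound-control` commands; the config path is an assumption (the usual pfSense location) and the hostname `dc1.domain.suffix` is a placeholder. If `unbound-control` is not installed, the script only prints the commands it would run.

```python
import shutil
import subprocess
from typing import List

# Assumption: the usual location of the resolver config on pfSense.
UNBOUND_CONF = "/var/unbound/unbound.conf"


def control_argv(*args: str) -> List[str]:
    """Build an unbound-control command line for the given subcommand."""
    return ["unbound-control", "-c", UNBOUND_CONF, *args]


def diagnostic_commands(hostname: str) -> List[List[str]]:
    """Read-only checks to run while the override is failing."""
    return [
        control_argv("dump_infra"),        # per-server timeout/blocking state
        control_argv("lookup", hostname),  # which servers would be asked for this name
    ]


def flush_command(server_ip: str) -> List[str]:
    """Clear the infra cache entry for one server instead of restarting Unbound."""
    return control_argv("flush_infra", server_ip)


def run_all(commands: List[List[str]]) -> None:
    """Run each command, or just print it when unbound-control is missing."""
    have_tool = shutil.which("unbound-control") is not None
    for argv in commands:
        print("$", " ".join(argv))
        if have_tool:
            result = subprocess.run(argv, capture_output=True, text=True)
            print(result.stdout or result.stderr)


if __name__ == "__main__":
    run_all(diagnostic_commands("dc1.domain.suffix"))  # placeholder hostname
```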



  • Ah, could have read up on Unbound, my bad. But, thank you so much for taking the time to check from that angle.

    The scenario here comes from multiple installations I have, all with the same outcome: a central office which terminates the remote offices' OpenVPN connections. Some central offices have multiple Windows DCs with DNS, and the domain overrides then look like this:
    domain.suffix pointing to a Windows DC#1 DNS
    domain.suffix pointing to a Windows DC#2 DNS
    In smaller installations, which only have one pfSense or a single DC DNS, the domain override only has one DNS to ask.

    In one installation, as a test, I've now defined the DCs' FQDNs as host names in the remote office Unbound. This solves the problem when the OpenVPN connection glitches, the override stops answering, and the hostname becomes unresolvable.

    But. I just tried reproducing the problem:
    (Using pfsense Diagnostics/NSLookup from the OpenVPN client side)

    Resolving the host name. Worked as it should, everything is connected.

    Disconnected the tunnel. Tried to resolve the host name again, but this time I got an answer which looked like the remote NS was answering, so obviously I got a cached answer.

    I restarted Unbound and tried resolving. Of course it could not resolve the name, as the cache had been cleared.

    Now to the interesting part: according to my initial description and tests, name resolving should not have worked once I connected the tunnel again. But this time, connecting the tunnel and then trying to resolve the host name worked.
    So my diagnosis, and the reproducibility of the problem, is flawed, since it worked now when testing.

    I thank everyone involved for the time and energy you put into this post. I need to get a better picture of when and why domain override host names are not resolving.

    I'll come back here when I have more consistent data on how to report this issue.

    Brgs,


  • LAYER 8 Global Moderator

    Please come back when you have some more info, and make sure you check the infra_cache when something is not working for this forwarded domain.. Also what is currently cached for that domain as well, etc.

    You can almost always tell when something is returned from cache, because you will normally see a less than round number for the ttl on the returned info..

    If you dig for it and you get back say 3600, good bet it was resolved - vs say if you get back ttl of 1481 or something - yeah that more than likely was served from cache ;)

    With your domain override you're forwarding to the authoritative ns for that domain, so it will return the full ttl vs something from its cache, etc. unlike when you forward to say some public resolver like googledns or quad9, etc.
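
The TTL rule of thumb above can be written down as a tiny check. This is only a heuristic sketch with a made-up function name: if the zone's configured TTL is known, anything lower was most likely counted down in a cache; if it is not known, a TTL that is not a whole number of minutes is treated as a hint. A cached answer can of course land on a round number by coincidence.

```python
from typing import Optional


def likely_from_cache(returned_ttl: int, zone_ttl: Optional[int] = None) -> bool:
    """Guess whether a DNS answer was served from a cache, based on its TTL."""
    if zone_ttl is not None:
        # The authoritative server always hands out the full configured TTL.
        return returned_ttl < zone_ttl
    # Fallback: configured TTLs are usually round numbers such as 300 or 3600.
    return returned_ttl % 60 != 0


if __name__ == "__main__":
    print(likely_from_cache(3600))                 # False
    print(likely_from_cache(1481))                 # True
    print(likely_from_cache(1800, zone_ttl=3600))  # True
```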



  • Will do. And thanks for the troubleshooting tips. Valued as I'm not that experienced on the subject.

    Just a thought. Is the interface down/up event different for Unbound/pfSense when bringing the tunnel down/up manually compared to when the connection is lost (which causes an OpenVPN reconnect)?
    Trying to figure out why my test didn't show the result I was expecting.


  • LAYER 8 Global Moderator

    Yeah, an interface down is going to be different than just loss of connection.. Any way you can pull the plug on the wire or anything? Or simulate it from the other end by killing the openvpn server or something..

    I would change your outbound interface on unbound to the loopback, this should get around any sort of binding issues with interfaces like a vpn one, etc.



    I am running 2.4.4 and have what appears to be a similar problem. This is over an IPsec tunnel. It has been this way for many versions of pfSense. When the DNS server used for forwarding goes down (probably beyond the timeout mentioned above), forwarding stops. I haven't worked through the debugging steps in this thread. However, in "DNS Resolver General Settings", if I add Localhost to Outgoing Network Interfaces, the forwarding name resolution does not happen at all.

    Still investigating...


  • LAYER 8 Netgate

    That is because sourcing traffic from the firewall can be problematic over VPNs. It can be done but you might have to make some changes. For instance, selecting an outgoing interface that makes the source traffic be interesting to IPsec (matches the traffic selector(s)) would probably fix your problem. This hack might also work:

    https://docs.netgate.com/pfsense/en/latest/vpn/ipsec/accessing-firewall-services-over-ipsec-vpns.html

    If vital infrastructure is necessary for that site to function it might be prudent to add redundancy and move it off the firewall. You could, for instance, run an authoritative slave DNS server (can you still say slave DNS server?) at that site that local users query. That way they could get work done even if the VPN was down for some reason.



  • I will take a look at those options.

    As you propose I have been thinking of running a DNS server so I can be secondary for the zone I am currently forwarding to. This is not a critical application so in my case might not be worth the overhead.

    Thanks,

    John

