Unbound stops resolving when Domain Overrides DNS not answering

johnpoz

So the one NS you're forwarding to becomes unresponsive, i.e. you cannot get to it, right? Then yeah, you're going to run into a timeout timer, which is probably about 15 minutes.

Here
https://nlnetlabs.nl/documentation/unbound/info-timeout/

Summary

Unbound implements timeout management with exponential backoff and keeps track of average and variance of the ping times. If a server starts to become unresponsive, a probing scheme is applied in which a few queries are selected to probe the IP address. If that fails, the server is blocked for 15 minutes (infra-ttl) and re-probed with one query after that.

Queries that failed to attain probe status, or if the server is blocked due to timeouts, get a reply with the SERVFAIL error. Also, if the available IP addresses for a domain have been probed for 5 times by a query it is also replied with SERVFAIL. New queries must come in to continue the probing.

The status of an IP address can be looked up and flushed. The infra-cache is not flushed on a reload, so the list of blocked sites and ping times is not wiped. If you wish to remove it the flush_infra control command can be used.
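A minimal sketch of the exponential backoff idea described in the summary above: the timeout doubles after every failed attempt until it reaches a cap, at which point the server is treated as blocked. The function name and the start/cap values are illustrative assumptions, not values read from Unbound, and the real probing scheme is more involved than this loop.

```shell
# backoff_steps START_MS CAP_MS
# Doubles the timeout after each (simulated) failure until the cap is reached.
# START_MS and CAP_MS are illustrative inputs, not Unbound's actual settings.
backoff_steps() {
    rto=$1      # current timeout in milliseconds
    cap=$2      # timeout at which the server is considered blocked
    steps=0
    while [ "$rto" -lt "$cap" ]; do
        echo "timeout after ${rto} ms, doubling"
        rto=$((rto * 2))
        steps=$((steps + 1))
    done
    echo "blocked after ${steps} consecutive timeouts"
}
```

For example, `backoff_steps 100 800` walks through 100, 200 and 400 ms before reporting the server as blocked. On a live system the real per-server state can be listed with `unbound-control -c /var/unbound/unbound.conf dump_infra` and cleared with `flush_infra all`.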

edit:
If you're running into a negative-TTL issue, that default value is set in the SOA and could be, say, an hour. You can adjust it in the SOA record on the NS you're forwarding this domain to. When you run into this, I would suggest actually looking at the infra cache for this domain, etc. You can do that with the unbound-control command.

Unbound and BIND can handle this stuff differently for sure. I'm pretty sure they work the same for the negative cache and the minimum TTL from the SOA, but off the top of my head I'm not sure exactly what BIND does with an NS that does not respond.

Really going to need more info on when this happens. Unbound could also maybe just be glitched when the interface goes down. This is a VPN interface, and you're having Unbound bind directly to the interface? You can work around such issues by having Unbound use its loopback for outbound queries instead of using the interface directly, etc.

More info is needed to help figure out exactly where the problem is.

iorx

Ah, could have read up on Unbound, my bad. But, thank you so much for taking the time to check from that angle.

The scenario comes from multiple installations I have, all with the same outcome. A central office terminates the remote offices' OpenVPN connections. Some central offices have multiple Windows DCs with DNS, and the domain overrides then look like this:
domain.suffix pointing to a Windows DC#1 DNS
domain.suffix pointing to a Windows DC#2 DNS
In smaller installations that have only one pfSense or a single DC DNS, the domain override has only one DNS server to ask.

In one installation, as a test, I've now defined the DCs' FQDNs as host names in the remote office's Unbound. This solves the problem when OpenVPN glitches, the override stops answering, and the host name becomes unresolvable.

But. I just tried reproducing the problem:
(Using pfsense Diagnostics/NSLookup from the OpenVPN client side)

Resolving the host name. Worked as it should, everything is connected.

Disconnected the tunnel. Tried to resolve the host name again, but this time I got an answer which looked like the remote NS was answering, so obviously I got a cached answer.

I restarted Unbound and tried resolving. Of course it could not resolve the name, as the cache had been cleared.

Now to the interesting part: according to my initial description and tests, name resolution should not work once I connected the tunnel again. But after reconnecting the tunnel, resolving the host name worked.
So my diagnosis of the problem and its reproducibility are flawed, as it worked now when testing.

I thank everyone involved for the time and energy you put into this post. I need to get a better picture of when and why domain override host names are not resolving.

I'll reconnect here when I have more consistent data on how to report this issue.

Brgs,

johnpoz

Please come back when you have some more info, and make sure you check the infra cache when something is not working for this forwarded domain. Also check what is currently cached for that domain, etc.

You can almost always tell when something is returned from cache, because you will normally see a less-than-round number for the TTL on the returned info.

If you dig for it and you get back, say, 3600, it's a good bet it was freshly resolved; if you get back a TTL of 1481 or something, yeah, that was more than likely served from cache ;)

With your domain override you're forwarding to the authoritative NS for that domain, so it will return the full TTL rather than a counted-down value from its cache, unlike when you forward to some public resolver like Google DNS or Quad9.
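The TTL heuristic above can be sketched as two small helpers. This assumes the usual answer-line layout printed by `dig +noall +answer` (name, TTL, class, type, data) and treats "round" as "a multiple of 60", which is only an illustrative rule; a cached TTL can land on a multiple of 60 by chance. The function names are hypothetical.

```shell
# answer_ttl: read dig answer lines on stdin, print the TTL of the first record.
answer_ttl() {
    awk '!/^;/ && NF >= 5 { print $2; exit }'
}

# ttl_hint TTL: guess whether a TTL looks freshly resolved or counted down.
# "Multiple of 60" is an assumed definition of a round number.
ttl_hint() {
    ttl=$1
    if [ $((ttl % 60)) -eq 0 ]; then
        echo "round (possibly fresh)"
    else
        echo "not round (likely cached)"
    fi
}
```

A possible use would be `ttl_hint "$(dig +noall +answer host01.remotesite.local | answer_ttl)"`, run twice a few seconds apart to watch the TTL count down.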

iorx

Will do. And thanks for the troubleshooting tips. Valued as I'm not that experienced on the subject.

Just a thought. Is an interface down/up event different for Unbound/pfSense when the tunnel is brought down/up manually compared to when the connection is lost (which causes an OpenVPN reconnect)?
I'm trying to figure out why my test didn't show the result I was expecting.

johnpoz

Yeah, an interface going down is going to be different from just a loss of connection. Is there any way you can pull the plug on the wire or something? Or simulate it from the other end by killing the OpenVPN server or something.

I would change your outbound interface on unbound to the loopback, this should get around any sort of binding issues with interfaces like a vpn one, etc.

John41

I am running 2.4.4 and have what appears to be a similar problem. This is over an IPsec tunnel. It has been this way for many versions of pfSense. When the DNS server used for forwarding goes down (probably for longer than the timeout mentioned above), forwarding stops. I haven't worked through the debugging steps in this thread. However, if I add Localhost to Outgoing Network Interfaces in "DNS Resolver General Settings", the forwarded name resolution does not happen at all.

Still investigating...

Derelict

That is because sourcing traffic from the firewall can be problematic over VPNs. It can be done but you might have to make some changes. For instance, selecting an outgoing interface that makes the source traffic be interesting to IPsec (matches the traffic selector(s)) would probably fix your problem. This hack might also work:

https://docs.netgate.com/pfsense/en/latest/vpn/ipsec/accessing-firewall-services-over-ipsec-vpns.html

If vital infrastructure is necessary for that site to function it might be prudent to add redundancy and move it off the firewall. You could, for instance, run an authoritative slave DNS server (can you still say slave DNS server?) at that site that local users query. That way they could get work done even if the VPN was down for some reason.

John41

I will take a look at those options.

As you propose I have been thinking of running a DNS server so I can be secondary for the zone I am currently forwarding to. This is not a critical application so in my case might not be worth the overhead.

Thanks,

John

iorx

Hi again!

Now I'm experiencing this with 2.4.5p1. Newly installed.
IPsec to main office.
The fix with LAN gateway and route.
Domain override in unbound.

If the connection is lost for a brief moment, long enough for Unbound to time out, it stops resolving the overridden domain.
I believe we came to the conclusion that Unbound marks the server as unreachable or something and just doesn't bother to ask again.

Any new ideas on how to make pfSense/Unbound not give up so easily? Or is it possible to detect in a script that Unbound has "tombstoned" the entries?

Would switching back to DNS Forwarder maybe be a solution?

iorx

No response? This is an issue, how to go about getting some attention for it?

bmeeks

@iorx said in Unbound stops resolving when Domain Overrides DNS not answering:

No response? This is an issue, how to go about getting some attention for it?

You can register and submit bug reports on the Redmine site here: https://redmine.pfsense.org/projects/pfsense.

Be prepared to fully describe in the report the actual bug and the steps required to reliably recreate the bug.

johnpoz

To your other question: you can ask Unbound who it would ask for something.

unbound-control -c /var/unbound/unbound.conf lookup www.example.com

It should list your domain override NS, and then info about that NS..

You could use the flush_negative command with that to flush all negative data
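A rough sketch of how a script could detect the "upstream answers, Unbound does not" state and flush in response. It assumes `dig` is available, that Unbound listens on 127.0.0.1, and that the firewall itself can reach the override's DNS server (over IPsec the source address may need to be set, e.g. with dig's `-b` option). The function names are hypothetical, and whether flushing actually clears the condition is untested here.

```shell
# decide_action DIRECT_RC LOCAL_RC
# DIRECT_RC: 0 if the override's DNS server answered a direct query
# LOCAL_RC:  0 if the local Unbound answered the same query
decide_action() {
    direct_rc=$1
    local_rc=$2
    if [ "$direct_rc" -ne 0 ]; then
        echo "upstream-unreachable"   # nothing Unbound could do either
    elif [ "$local_rc" -ne 0 ]; then
        echo "flush"                  # upstream is fine, Unbound is stuck
    else
        echo "ok"
    fi
}

# check_override NAME UPSTREAM_IP
# An answer counts only if dig prints a line that is not a ";;" comment.
check_override() {
    name=$1
    upstream=$2
    if dig +short +time=2 +tries=1 "@$upstream" "$name" | grep -qv '^;'; then
        direct_rc=0
    else
        direct_rc=1
    fi
    if dig +short +time=2 +tries=1 @127.0.0.1 "$name" | grep -qv '^;'; then
        local_rc=0
    else
        local_rc=1
    fi
    action=$(decide_action "$direct_rc" "$local_rc")
    echo "$name: $action"
    if [ "$action" = "flush" ]; then
        unbound-control -c /var/unbound/unbound.conf flush_infra all
        unbound-control -c /var/unbound/unbound.conf flush_negative
    fi
}
```

Something like `check_override host01.remotesite.local 192.0.2.53` (placeholder name and address) could then be run from cron.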

iorx

@johnpoz said in Unbound stops resolving when Domain Overrides DNS not answering:

unbound-control -c /var/unbound/unbound.conf lookup

Nice. I will see if I can find a way to trigger a flush when resolution stops for the overrides (to work around the problem until there is a better solution).

For the moment I'm testing DNS Forwarder instead, but I have experienced some weirdness there too. But the Forwarder is "dumb" isn't it? No caching? So maybe the last time it stopped working was related to IPsec; I need to check that further.

But Unbound, I know, has this issue. I'll try to create a bug report with reproducible steps to trigger the problem.

johnpoz

@iorx said in Unbound stops resolving when Domain Overrides DNS not answering:

But the Forwarder is "dumb" isn't it? No caching?

Not sure where you would have gotten that idea; it caches. It would really be pretty pointless if it didn't.

Here I enabled dnsmasq on port 5353 (so I didn't have to turn off unbound), then asked it how big its cache is

$ dig @192.168.9.253 -p 5353 +short chaos txt cachesize.bind
"10000"

A simple way to see if something is cached or not is to look at how fast it resolves. If you get an answer in 0 or a couple of ms, versus how long it would take to forward to where you're forwarding and back, it was cached and your answer was returned from cache.
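As a small sketch of that timing check, the helper below pulls the number out of the `;; Query time: N msec` line that dig prints in its full output. The 2 ms threshold is an assumed reading of "0 or a couple of ms", and both function names are hypothetical.

```shell
# query_time_ms: read full dig output on stdin, print the query time in ms.
query_time_ms() {
    awk '/^;; Query time:/ { print $4; exit }'
}

# speed_hint MS: classify a query time; the 2 ms cutoff is an assumption.
speed_hint() {
    if [ "$1" -le 2 ]; then
        echo "probably cached"
    else
        echo "probably forwarded"
    fi
}
```

For instance, `speed_hint "$(dig www.example.com | query_time_ms)"` run twice in a row should usually flip from the second answer to the first.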

You can also ask like the command above what is the hit rate on your cache.

$ dig @192.168.9.253 -p 5353 +short chaos txt hits.bind
"2"

Do a query for something a few times, and then check it again - see the number go up..

$ dig @192.168.9.253 -p 5353 +short chaos txt hits.bind
"7"

You can ask it how many misses its had

$ dig @192.168.9.253 -p 5353 +short chaos txt misses.bind
"1"

Keep in mind I just enabled it 30 seconds ago and have only done query for www.google.com, not actually using it, etc.

You can get info for cachesize.bind, insertions.bind, evictions.bind, misses.bind, hits.bind, auth.bind and servers.bind
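The counters listed above can be collected in one loop, and the hits and misses turned into a percentage. This is only a sketch: the function names are hypothetical, and the server and port are whatever dnsmasq is listening on (192.168.9.253 and 5353 in the example above).

```shell
# dnsmasq_stats SERVER PORT: print every counter named in the post above.
dnsmasq_stats() {
    server=$1
    port=$2
    for stat in cachesize insertions evictions misses hits auth servers; do
        printf '%s: ' "$stat"
        dig "@$server" -p "$port" +short chaos txt "$stat.bind"
    done
}

# hit_ratio HITS MISSES: integer percentage of queries answered from cache.
hit_ratio() {
    hits=$1
    misses=$2
    total=$((hits + misses))
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $((hits * 100 / total))
    fi
}
```

With the numbers shown earlier (7 hits, 1 miss), `hit_ratio 7 1` gives 87.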

There is a way you can get it to dump its cache to syslog too: you have to set it to log queries and then send it SIGUSR1.

-q, --log-queries
     Log the results of DNS queries handled by dnsmasq. Enable a full 
     cache dump on receipt of SIGUSR1.

Unbound is a much more robust DNS option.

Check out the dnsmasq man page for other info
https://linux.die.net/man/8/dnsmasq

BTW, that it caches is right in its description ;)

Name
dnsmasq - A lightweight DHCP and caching DNS server.

iorx

I understand now that I had the forwarder's (dnsmasq) capabilities and function backwards. I didn't read up enough on that, my apologies.
Many thanks for the awesome explanation!

I'll go forth and try to make a reproducible lookup scenario. I'm going to try out both dnsmasq's and Unbound's behavior on domain overrides.

johnpoz

A simple test I would do when you feel you're not resolving something over your VPN connection, be it IPsec or OpenVPN, is to just do a direct query yourself via your favorite lookup tool (dig, host, nslookup). Do you get a response?

If not, then there is no possible way Unbound or dnsmasq could either. If you do, then you need to figure out why Unbound or dnsmasq does not: did they lose their binding to the interface that would allow them to query down the VPN connection? What exact sort of response do you get: timeout, REFUSED, SERVFAIL, NXDOMAIN?
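To tell those outcomes apart in a script, the status can be read from the `->>HEADER<<-` line of dig's full output. A minimal sketch, assuming dig's usual header format; the function name is hypothetical, and a missing header is reported as `no-response`, which covers timeouts.

```shell
# dig_status: read full dig output on stdin and print the response status
# (NOERROR, NXDOMAIN, SERVFAIL, REFUSED, ...) or "no-response" if dig never
# received a reply and therefore printed no header line.
dig_status() {
    status=$(sed -n 's/^;; ->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
    echo "${status:-no-response}"
}
```

Comparing `dig @127.0.0.1 name | dig_status` with the same query sent directly to the override's DNS server shows at a glance whether the two disagree.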

Was what you were looking for not cached? If it was cached, you should have gotten a response whether or not you could talk to that other NS.

I am not clear enough on how routing in pfSense works with IPsec, or what interface you're binding Unbound to. But the setup least likely to fail is to set Unbound to use only localhost as its outbound interface. It should then use routing to get to wherever you set up a domain override, or for normal resolving/forwarding. If it has a route to the IP you set in your domain override that says go over the VPN, it should do that.

If it had some binding issue with its outbound interface that failed for some reason, say a reconnection of the VPN without a restart of Unbound, then sure, it could have problems, which using localhost as the outbound interface could remedy.

Another option when you're doing odd stuff with VPN connections that could reconnect and affect an application's binding to an interface/IP is to move the NS off pfSense and put it on your network, so anything it tries to talk to is routed normally, just like for any other client on your network.

iorx

@johnpoz

(necroposting, sorry for that. but I felt the need to follow up)

To begin with, I never thanked you for educating and helping me on the subject! Thanks!

This has been brewing for a while, I've gone back and forth, tested stuff and given up.

Short info/summary:
"remotesite.local" points to a DNS on the other side of a VPN connection. An override in Unbound.
"localsite.n23" is the local network where I am.
Unbound stops resolving "remotesite.local" hosts after a while. It works again for a while after restarting Unbound and then stops resolving remotesite.local.

Today, using some extreme google-fu, I realized something: the only overrides that stop resolving are those ending with .local.

What led me to this conclusion was this:

As one can see (logs below), at 17:18 it was able to resolve hosts at the remote site. At 17:19 it couldn't anymore. Checking the Unbound logs, I found that it's not even trying to resolve anything on the .local domain.
I googled around on the issue and found that someone had a similar problem with .local that just stopped responding.
domain-overrides-stop-resolving-periodically-they-only-resume-after-the-service-has-been-restarted
The solution there was to create an override for "local" that points to a DNS server. I tested that with a "local" override that points to 127.0.0.1.

This was a couple of hours ago and it looks like it's working.
The reason .local is used at the remote domain is ancient: it's a Windows domain created when Microsoft "best practice" was to create local FQDNs with .local at the end.

Unbound log:

Mar 18 17:19:24 	unbound 	52338 	[52338:3] info: validation success host01.remotesite.local. AAAA IN
Mar 18 17:19:24 	unbound 	52338 	[52338:3] info: validator operate: query host01.remotesite.local. AAAA IN
Mar 18 17:19:24 	unbound 	52338 	[52338:3] info: finishing processing for host01.remotesite.local. AAAA IN
Mar 18 17:19:24 	unbound 	52338 	[52338:3] info: resolving host01.remotesite.local. AAAA IN
Mar 18 17:19:24 	unbound 	52338 	[52338:3] info: validator operate: query host01.remotesite.local. AAAA IN
Mar 18 17:19:24 	unbound 	52338 	[52338:2] info: validation success host01.remotesite.local.localsite.n23. AAAA IN
Mar 18 17:19:24 	unbound 	52338 	[52338:2] info: validator operate: query host01.remotesite.local.localsite.n23. AAAA IN
Mar 18 17:19:24 	unbound 	52338 	[52338:2] info: finishing processing for host01.remotesite.local.localsite.n23. AAAA IN
Mar 18 17:19:24 	unbound 	52338 	[52338:2] info: resolving host01.remotesite.local.localsite.n23. AAAA IN
Mar 18 17:19:24 	unbound 	52338 	[52338:2] info: validator operate: query host01.remotesite.local.localsite.n23. AAAA IN
Mar 18 17:19:24 	unbound 	52338 	[52338:0] info: validation success host01.remotesite.local.localsite.n23. A IN
Mar 18 17:19:24 	unbound 	52338 	[52338:0] info: validator operate: query host01.remotesite.local.localsite.n23. A IN
Mar 18 17:19:24 	unbound 	52338 	[52338:0] info: finishing processing for host01.remotesite.local.localsite.n23. A IN
Mar 18 17:19:24 	unbound 	52338 	[52338:0] info: resolving host01.remotesite.local.localsite.n23. A IN
Mar 18 17:19:24 	unbound 	52338 	[52338:0] info: validator operate: query host01.remotesite.local.localsite.n23. A IN
Mar 18 17:18:04 	unbound 	52338 	[52338:2] info: validation success host01.remotesite.local. A IN
Mar 18 17:18:04 	unbound 	52338 	[52338:2] info: validator operate: query host01.remotesite.local. A IN
Mar 18 17:18:04 	unbound 	52338 	[52338:2] info: finishing processing for host01.remotesite.local. A IN
Mar 18 17:18:04 	unbound 	52338 	[52338:2] info: resolving host01.remotesite.local. A IN
Mar 18 17:18:04 	unbound 	52338 	[52338:2] info: validator operate: query host01.remotesite.local. A IN

masupilamie

Can confirm iorx's "workaround" works. It seems the tld needs to be added as a domain override pointing to itself when a subdomain of that tld is used for local resolution and another subdomain is used for remote resolution via domain override.

In my case my local network uses main.lan and the remote site uses remote.lan
Only adding remote.lan as domain override to the remote site's DNS server made it work for less than a minute after flushing unbound's cache. Adding "lan" as domain override pointing to 127.0.0.1 made DNS resolution to remote.lan stable.
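One way to confirm that both overrides are actually loaded is to filter the output of `unbound-control list_forwards`. The sketch below assumes pfSense renders domain overrides as forward zones and that each output line starts with the zone name followed by a trailing dot; if they are rendered as stub zones instead, `list_stubs` would be the command to feed in. The function name is hypothetical.

```shell
# has_forward_zone ZONE: read list_forwards-style lines on stdin and succeed
# if one of them starts with ZONE (a trailing dot is added if missing).
has_forward_zone() {
    zone=${1%.}.
    awk -v z="$zone" '$1 == z { found = 1 } END { exit !found }'
}
```

For example, `unbound-control -c /var/unbound/unbound.conf list_forwards | has_forward_zone lan && echo "lan override present"`.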

[Screenshot: configured Domain Overrides]

pfsense version: 2.7.2