Unbound stops resolving when Domain Overrides DNS not answering
-
I am running 2.4.4 and have what appears to be a similar problem. This is over an IPsec tunnel. It has been this way for many versions of pfSense. When the DNS server used for forwarding goes down (probably for longer than the timeout mentioned above), forwarding stops. I haven't worked through the debugging steps in this thread. However, in "DNS Resolver General Settings", if I add Localhost to Outgoing Network Interfaces, forwarded name resolution does not happen at all.
Still investigating...
-
That is because sourcing traffic from the firewall itself can be problematic over VPNs. It can be done, but you might have to make some changes. For instance, selecting an outgoing interface whose source address is interesting to IPsec (that is, it matches the traffic selectors) would probably fix your problem. This hack might also work:
https://docs.netgate.com/pfsense/en/latest/vpn/ipsec/accessing-firewall-services-over-ipsec-vpns.html
If vital infrastructure is necessary for that site to function it might be prudent to add redundancy and move it off the firewall. You could, for instance, run an authoritative slave DNS server (can you still say slave DNS server?) at that site that local users query. That way they could get work done even if the VPN was down for some reason.
-
I will take a look at those options.
As you propose I have been thinking of running a DNS server so I can be secondary for the zone I am currently forwarding to. This is not a critical application so in my case might not be worth the overhead.
Thanks,
John
-
Hi again!
Now I'm experiencing this with 2.4.5p1, newly installed.
IPsec to the main office.
Using the fix with the LAN gateway and route.
Domain override in unbound. If the connection is lost for a brief moment, long enough for unbound to time out, it stops resolving the overridden domain.
I believe we came to the conclusion that unbound marks the server as unreachable or something and just doesn't bother to ask again. Any new ideas on how to make pfSense/unbound not give up so easily? Or is it possible for a script to detect that unbound has "tombstoned" the entries?
Would switching back to the DNS Forwarder be a solution, maybe?
-
No response? This is an issue, how to go about getting some attention for it?
-
@iorx said in Unbound stops resolving when Domain Overrides DNS not answering:
No response? This is an issue, how to go about getting some attention for it?
You can register and submit bug reports on the Redmine site here: https://redmine.pfsense.org/projects/pfsense.
Be prepared to fully describe in the report the actual bug and the steps required to reliably recreate the bug.
-
To your other question: you can ask unbound which servers it would query for a given name:
unbound-control -c /var/unbound/unbound.conf lookup www.example.com
It should list your domain override NS, and then info about that NS.
You could also use the flush_negative command of unbound-control to flush all negative data from the cache.
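If you want to script that, here is a rough sketch of a watchdog, assuming Python is available on the box and that unbound answers on 127.0.0.1. The host name is a made-up placeholder for something in your overridden domain. The flush_infra call is my own addition and was not suggested above: it clears unbound's infrastructure cache, which is where it keeps track of servers that did not answer, so it may or may not be what actually helps here.

```python
import subprocess

UNBOUND_CONF = "/var/unbound/unbound.conf"  # path from the command above
TEST_NAME = "host01.remotesite.local"       # hypothetical host in the overridden domain
RESOLVER = "127.0.0.1"                      # assumption: unbound listens on localhost


def flush_commands(conf=UNBOUND_CONF):
    """Return the unbound-control invocations that clear cached failure state."""
    base = ["unbound-control", "-c", conf]
    # flush_negative is from the thread; flush_infra is an assumption of this sketch
    return [base + ["flush_negative"], base + ["flush_infra", "all"]]


def has_answer(dig_short_output):
    """True if `dig +short` printed at least one record (not a ';;' notice)."""
    return any(line.strip() and not line.startswith(";")
               for line in dig_short_output.splitlines())


def check_and_flush():
    """Query the resolver once; flush unbound's caches if nothing comes back."""
    result = subprocess.run(
        ["dig", "@" + RESOLVER, "+short", "+time=2", "+tries=1", TEST_NAME],
        capture_output=True, text=True)
    if has_answer(result.stdout):
        return False  # resolution works, nothing to do
    for cmd in flush_commands():
        subprocess.run(cmd, check=False)
    return True  # caches were flushed
```

Calling check_and_flush() from cron every few minutes would be one way to use it. It only works around the symptom, it does not explain why unbound gave up in the first place.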
-
@johnpoz said in Unbound stops resolving when Domain Overrides DNS not answering:
unbound-control -c /var/unbound/unbound.conf lookup
Nice. I will see if I can find a way to trigger a flush when resolution stops for the overrides (a workaround until there is a better solution).
For the moment I'm testing the DNS Forwarder instead, but have experienced some weirdness there too. But the Forwarder is "dumb" isn't it? No caching? So maybe the last time it stopped working was related to IPsec; I need to check that further.
But I know unbound has this issue. I'll try to create a bug report with reproducible steps to trigger the problem.
-
@iorx said in Unbound stops resolving when Domain Overrides DNS not answering:
But the Forwarder is "dumb" isn't it? No caching?
Not sure where you would have gotten that idea; it caches. It would really be pretty pointless if it didn't.
Here I enabled dnsmasq on port 5353 (so I didn't have to turn off unbound), then asked it how big its cache is:
$ dig @192.168.9.253 -p 5353 +short chaos txt cachesize.bind
"10000"
A simple way to see if something is cached or not is to look at how fast it resolves. If you get an answer in 0 or a couple of ms, versus how long it would take to forward to wherever you're forwarding and back, it was cached and your answer was returned from cache.
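A minimal sketch of that timing check, assuming you feed it the full dig output (not +short, which drops the ";; Query time" footer). The 5 ms threshold is an arbitrary guess on my part, so tune it for your own link.

```python
import re

# dig prints a footer line such as ";; Query time: 23 msec"
QUERY_TIME_RE = re.compile(r";; Query time:\s*(\d+)\s*msec")


def query_time_ms(dig_output):
    """Return the query time in milliseconds from full dig output, or None."""
    match = QUERY_TIME_RE.search(dig_output)
    return int(match.group(1)) if match else None


def looks_cached(dig_output, threshold_ms=5):
    """Guess whether an answer came from cache, based only on its latency."""
    elapsed = query_time_ms(dig_output)
    return elapsed is not None and elapsed <= threshold_ms
```

Latency is only a hint: a forwarder on the same LAN can answer in a millisecond or two without anything being cached.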
You can also ask it, with the same kind of chaos query, how many hits your cache has had.
$ dig @192.168.9.253 -p 5353 +short chaos txt hits.bind
"2"
Do a query for something a few times, and then check it again to see the number go up.
$ dig @192.168.9.253 -p 5353 +short chaos txt hits.bind
"7"
You can ask it how many misses it has had:
$ dig @192.168.9.253 -p 5353 +short chaos txt misses.bind
"1"
Keep in mind I just enabled it 30 seconds ago and have only queried www.google.com; I'm not actually using it, etc.
You can get info for cachesize.bind, insertions.bind, evictions.bind, misses.bind, hits.bind, auth.bind and servers.bind
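If you want an actual hit rate out of those counters, something like this should do. It is just arithmetic on the quoted TXT strings that dig +short prints, and it assumes hits.bind and misses.bind mean "answered locally" and "had to be forwarded".

```python
def parse_counter(dig_short_output):
    """Convert dig +short TXT output such as '"7"' to an int."""
    return int(dig_short_output.strip().strip('"'))


def hit_rate(hits, misses):
    """Fraction of lookups answered from cache; 0.0 if nothing was asked yet."""
    total = hits + misses
    return hits / total if total else 0.0
```

With the numbers from my test above (7 hits, 1 miss), hit_rate(7, 1) comes out at 0.875.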
There is a way you can get it to dump its cache to syslog too. You have to set it to log queries and then send it SIGUSR1:
-q, --log-queries Log the results of DNS queries handled by dnsmasq. Enable a full cache dump on receipt of SIGUSR1.
Unbound is a much more robust DNS option.
Check out the dnsmasq man page for other info:
https://linux.die.net/man/8/dnsmasq
BTW, that it caches is right in its description ;)
Name dnsmasq - A lightweight DHCP and caching DNS server.
-
I got the forwarder (dnsmasq) capabilities and function backwards, I understand. I didn't read up enough on that, my apologies.
Many thanks for the awesome explanation! I'll go forth and try to build a reproducible lookup scenario. I'm going to try out both dnsmasq's and unbound's behavior on domain overrides.
-
A simple test I would do when you feel you're not resolving something over your VPN connection, be it IPsec or OpenVPN, is to do a direct query yourself with your favorite lookup tool (dig, host, nslookup). Do you get a response?
If not, then there is no possible way unbound or dnsmasq could either. If you do, then you need to figure out why unbound or dnsmasq is not. Did they lose their binding to the interface that would allow them to query down the VPN connection? What exact sort of response do you get: timeout, refused, servfail, nx?
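To make that comparison less hand-wavy, here is a small sketch that classifies the full output of two dig runs, one sent straight to the remote NS and one sent to the resolver on pfSense. The function names and the wording of the verdicts are made up for illustration.

```python
import re

# dig's header line looks like ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4242"
STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def classify(dig_output):
    """Return the rcode (NOERROR, NXDOMAIN, SERVFAIL, REFUSED, ...) or TIMEOUT."""
    match = STATUS_RE.search(dig_output)
    if match:
        return match.group(1)
    if "timed out" in dig_output or "no servers could be reached" in dig_output:
        return "TIMEOUT"
    return "UNKNOWN"


def compare(direct_output, resolver_output):
    """Compare a direct query to the remote NS with the same query via the resolver."""
    direct, via = classify(direct_output), classify(resolver_output)
    if direct == "TIMEOUT":
        return "remote NS unreachable: look at the VPN, not the resolver"
    if direct == via:
        return "resolver agrees with remote NS (%s)" % direct
    return "resolver returns %s but remote NS returns %s" % (via, direct)
```

If the direct query gives NOERROR while the resolver gives SERVFAIL, the tunnel is fine and the problem is on the unbound or dnsmasq side.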
Was what you were looking for not cached? If it was cached, you should have gotten a response whether or not you could talk to that other NS.
I am not clear enough on how routing in pfSense works with IPsec, or what interface you're binding unbound to. But the setup least likely to fail is to set unbound to use only localhost as its outbound interface. It should then use routing to get to wherever you set up a domain override, or for normal resolving/forwarding. If it has a route to the IP you set in your domain override that says to go over the VPN, it should do that.
If it had some binding issue with its outbound interface, where the binding failed for some reason (say, a reconnection of the VPN without a restart of unbound), then sure, it could have problems, which using localhost as the outbound interface could remedy.
Another option when you're doing odd stuff with VPN connections that could reconnect, and affect some application's binding to an interface/IP, is to move the NS off pfSense and put it on your network, so anything it tries to talk to would be routed normally, just like for any other client on your network.
-
(Necroposting, sorry for that, but I felt the need to follow up.)
To begin with, I never thanked you for educating and helping me on the subject! Thanks!
This has been brewing for a while, I've gone back and forth, tested stuff and given up.
Short info/summary:
"remotesite.local" points to a DNS server on the other side of a VPN connection, via an override in Unbound.
"localsite.n23" is the local network where I am.
Unbound stops resolving "remotesite.local" hosts after a while. It works again for a while after restarting Unbound, and then stops resolving remotesite.local again.
Today, using some extreme google-fu, I realized something: the only overrides that stop resolving are those ending with .local.
What led me to this conclusion was this:
As one can see (logs below), at 17:18 it was able to resolve hosts at the remote site. At 17:19 it couldn't anymore. Checking the Unbound logs, I found that it's not even trying to resolve anything in the .local domain.
I googled around on the issue and found that someone had a similar problem with .local that just stopped responding.
domain-overrides-stop-resolving-periodically-they-only-resume-after-the-service-has-been-restarted
The solution there was to create an override for ".local" that points to a DNS server. I tested that: a "local" override that points to 127.0.0.1.
That was a couple of hours ago, and it looks like it's working.
The reason .local is used for the remote domain is ancient: it's a Windows domain created when Microsoft "best practice" was to create the local FQDN with .local at the end.
Unbound log:
Mar 18 17:19:24 unbound 52338 [52338:3] info: validation success host01.remotesite.local. AAAA IN
Mar 18 17:19:24 unbound 52338 [52338:3] info: validator operate: query host01.remotesite.local. AAAA IN
Mar 18 17:19:24 unbound 52338 [52338:3] info: finishing processing for host01.remotesite.local. AAAA IN
Mar 18 17:19:24 unbound 52338 [52338:3] info: resolving host01.remotesite.local. AAAA IN
Mar 18 17:19:24 unbound 52338 [52338:3] info: validator operate: query host01.remotesite.local. AAAA IN
Mar 18 17:19:24 unbound 52338 [52338:2] info: validation success host01.remotesite.local.localsite.n23. AAAA IN
Mar 18 17:19:24 unbound 52338 [52338:2] info: validator operate: query host01.remotesite.local.localsite.n23. AAAA IN
Mar 18 17:19:24 unbound 52338 [52338:2] info: finishing processing for host01.remotesite.local.localsite.n23. AAAA IN
Mar 18 17:19:24 unbound 52338 [52338:2] info: resolving host01.remotesite.local.localsite.n23. AAAA IN
Mar 18 17:19:24 unbound 52338 [52338:2] info: validator operate: query host01.remotesite.local.localsite.n23. AAAA IN
Mar 18 17:19:24 unbound 52338 [52338:0] info: validation success host01.remotesite.local.localsite.n23. A IN
Mar 18 17:19:24 unbound 52338 [52338:0] info: validator operate: query host01.remotesite.local.localsite.n23. A IN
Mar 18 17:19:24 unbound 52338 [52338:0] info: finishing processing for host01.remotesite.local.localsite.n23. A IN
Mar 18 17:19:24 unbound 52338 [52338:0] info: resolving host01.remotesite.local.localsite.n23. A IN
Mar 18 17:19:24 unbound 52338 [52338:0] info: validator operate: query host01.remotesite.local.localsite.n23. A IN
Mar 18 17:18:04 unbound 52338 [52338:2] info: validation success host01.remotesite.local. A IN
Mar 18 17:18:04 unbound 52338 [52338:2] info: validator operate: query host01.remotesite.local. A IN
Mar 18 17:18:04 unbound 52338 [52338:2] info: finishing processing for host01.remotesite.local. A IN
Mar 18 17:18:04 unbound 52338 [52338:2] info: resolving host01.remotesite.local. A IN
Mar 18 17:18:04 unbound 52338 [52338:2] info: validator operate: query host01.remotesite.local. A IN
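For anyone who wants to dig through a log like this without eyeballing it, here is a rough sketch that pulls out only the "resolving" entries and groups them per name. It assumes one log entry per line in the format pasted above. The function name is mine.

```python
import re
from collections import defaultdict

# Matches e.g. "Mar 18 17:19:24 unbound 52338 [52338:3] info: resolving host01.remotesite.local. AAAA IN"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+[\d:]{8}) unbound \d+ \[\d+:\d+\] info: resolving (\S+) (\S+) IN")


def resolving_attempts(log_lines):
    """Map each queried name to a list of (timestamp, record type) it was resolved for."""
    attempts = defaultdict(list)
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match:
            timestamp, name, rtype = match.groups()
            attempts[name].append((timestamp, rtype))
    return dict(attempts)
```

Names that only ever show up with the local search suffix appended (like host01.remotesite.local.localsite.n23.) stand out quickly in the result.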