Unbound Resolver starts returning SERVFAIL after resolving certain hostnames

Derelict

Coming from here: https://forum.pfsense.org/index.php?topic=87491.msg488407#msg488407 Credit and apologies to those over there for isolating this way to reproduce.

New thread since this looks to me like a different issue from whatever's going on with DNS server hijacking.

I am running Unbound in Resolver mode with DNSSEC enabled. I can routinely tickle this by asking unbound to resolve:


ns3.csof.net
and/or
api-nyc01.exip.org

Note that that exip.org hostname has csof name servers.

ns3.csof.net.		600	IN	A	195.22.26.199
api-nyc01.exip.org.	 10	IN	A	195.22.26.248

Note that both of those are in a known hostile netblock.

Anyway, my resolver has been running fine for days. No problems until I asked it to resolve those two hostnames. After doing so, apparently random domains start being returned as SERVFAIL.

$ dig forum.pfsense.org

; <<>> DiG 9.8.3-P1 <<>> forum.pfsense.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 30471
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;forum.pfsense.org. IN A

;; Query time: 1781 msec
;; SERVER: 192.168.223.1#53(192.168.223.1)
;; WHEN: Mon Feb 9 17:46:41 2015
;; MSG SIZE rcvd: 35

There's one example. This happens until unbound is restarted. I did this a couple of times, the last one with unbound at log level 5. I haven't really looked at the logs yet.
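
If anyone wants to script this check instead of eyeballing dig output, here is a minimal Python sketch that pulls the status field out of a dig header like the one above. The function name `dig_status` is made up for illustration, and it only parses text that has already been captured; it does not run dig itself.

```python
import re

def dig_status(dig_output):
    """Return the status string (e.g. 'SERVFAIL') from dig's header line, or None."""
    match = re.search(r"->>HEADER<<-.*?status:\s*([A-Z]+)", dig_output)
    return match.group(1) if match else None

# Header lines copied from the dig output above.
sample = """;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 30471
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
"""
print(dig_status(sample))
```

Feeding it the output of a handful of unrelated lookups before and after querying the two hostnames would be one way to see when the SERVFAILs begin.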

Derelict

Without DNSSEC enabled, all I had to do was query these two domain names, and then I got this:

gridbug:etc cjl$ dig www.pfsense.org

; <<>> DiG 9.8.3-P1 <<>> www.pfsense.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51593
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 4, ADDITIONAL: 0

;; QUESTION SECTION:
;www.pfsense.org. IN A

;; ANSWER SECTION:
www.pfsense.org. 10 IN A 195.22.26.248

;; AUTHORITY SECTION:
org. 172779 IN NS ns1.csof.net.
org. 172779 IN NS ns2.csof.net.
org. 172779 IN NS ns3.csof.net.
org. 172779 IN NS ns4.csof.net.

;; Query time: 159 msec
;; SERVER: 192.168.223.1#53(192.168.223.1)
;; WHEN: Mon Feb 9 18:48:39 2015
;; MSG SIZE rcvd: 129

This looks bad.
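
A rough Python sketch of how records like the ones above could be flagged automatically. The /24 below is an assumption (the thread only says the addresses sit in a known hostile netblock), and the function name `suspicious_records` is hypothetical. It is a crude filter for spotting this particular poisoning, not a general detector.

```python
import ipaddress

# Assumption: the exact size of the hostile netblock is not given in the
# thread; a /24 around the two observed addresses is used for illustration.
SUSPECT_NET = ipaddress.ip_network("195.22.26.0/24")
SUSPECT_NS_SUFFIX = "csof.net."

def suspicious_records(dig_output):
    """Return dig record lines whose A address or NS target looks hijacked."""
    hits = []
    for line in dig_output.splitlines():
        fields = line.split()
        # Record lines look like: name TTL IN TYPE rdata
        if line.startswith(";") or len(fields) != 5 or fields[2] != "IN":
            continue
        name, _ttl, _cls, rtype, rdata = fields
        if rtype == "A" and ipaddress.ip_address(rdata) in SUSPECT_NET:
            hits.append(line)
        elif (rtype == "NS"
              and rdata.lower().endswith(SUSPECT_NS_SUFFIX)
              and not name.lower().endswith(SUSPECT_NS_SUFFIX)):
            # A zone outside csof.net delegated to csof.net name servers.
            hits.append(line)
    return hits

# Two lines taken from the poisoned answer above.
poisoned = """www.pfsense.org. 10 IN A 195.22.26.248
org. 172779 IN NS ns1.csof.net.
"""
for record in suspicious_records(poisoned):
    print(record)
```

Note that this would also flag exip.org, which has csof name servers to begin with, so hits are only a prompt to look closer.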

cmb

Do you have "harden glue" enabled on the Advanced tab of Unbound? If not, is it still replicable with that enabled?

agreenfield1

@cmb:

Do you have "harden glue" enabled on the Advanced tab of Unbound? If not, is it still replicable with that enabled?

I had experienced the same issue as Derelict, and was able to replicate it in the same way. I did not have 'harden glue' enabled. After doing so, I have not been able to replicate the issue!

Should harden-glue be enabled by default? The documentation for unbound suggests yes (https://www.unbound.net/documentation/unbound.conf.html), but it was definitely not enabled by default on my system.

kejianshi

My settings include…

In Services: DNS Resolver: Advanced

Harden Glue

Harden DNSSEC data

Unwanted Reply Threshold (10 million)

Prefetch Support

Prefetch DNS Key Support

All of those are on. I had asked about 10 times whether those might be recommended, without getting an answer. After trying them for a couple of weeks, I'd say "Yes", definitely.

DNSSEC is on also and it's not in forwarder mode. Anyway, I'd recommend trying with these settings.

Be sure to reboot everything and clear DNS Cache on all clients after.
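
For checking a generated unbound.conf against the settings listed above, a small Python sketch follows. The option names come from unbound.conf(5); mapping them onto the GUI labels (for example "Harden DNSSEC data" to `harden-dnssec-stripped`) is an assumption, and `missing_hardening` is a hypothetical helper, not anything pfSense ships.

```python
# Assumption: these unbound.conf option names correspond to the GUI
# settings listed above; the threshold is the "10 million" value.
RECOMMENDED = {
    "harden-glue": "yes",
    "harden-dnssec-stripped": "yes",
    "unwanted-reply-threshold": "10000000",
    "prefetch": "yes",
    "prefetch-key": "yes",
}

def missing_hardening(conf_text):
    """Return {option: current value or None} for options that differ from RECOMMENDED."""
    seen = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        seen[key.strip()] = value.strip()
    return {opt: seen.get(opt)
            for opt, want in RECOMMENDED.items()
            if seen.get(opt) != want}

# Made-up config text for illustration only.
example_conf = """server:
  harden-glue: no   # the setting discussed in this thread
  prefetch: yes
"""
print(missing_hardening(example_conf))
```

An empty result would mean every listed option is already set as recommended.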

Derelict

Harden Glue appears to correct this, but that's pretty anecdotal.

cmb

Judging by the DNS traffic I captured when replicating that, harden glue should fix it. I changed the default in new configs to enabled, and we'll add config upgrade code so anyone who doesn't already have it enabled will have that changed upon upgrade to 2.2.1.

kejianshi

Not so much anecdotal.

People are poisoning your cache either with malicious DNS records or with man-on-the-side attacks or both.

Those settings are there to prevent such things. Although, IMHO the DNS protocol is a broken piece of crap and needs to be replaced with something that both encrypts and authenticates.

I'm sure that would introduce some latency, but my god… It's ridiculous. Current DNS is about as secure as FTP and equally in need of being phased out.

doktornotor

@Derelict:

Harden Glue appears to correct this, but that's pretty anecdotal.

Never could reproduce this local issue… I have harden-glue: yes set everywhere. So, sounds like a pretty good guess I'd say.

@cmb: Can we get harden-referral-path exposed in the GUI as well? (Probably not default on, but visible.) Also, harden-below-nxdomain.