Unbound periodically only retrieving IPv6 results for hostname instead of both IPv4 AND IPv6

Alex Atkin UK · Jul 26, 2018, 3:23 AM

I have a slightly unusual setup in that my ISP supports IPv6 but I only have it enabled on pfSense itself, disabled on the LAN as a whole.

This is a benefit as it means pfSense can use IPv6 for DNS resolution, pings, connectivity testing in general but I do not get the drawback of IPv6 on the clients (Android seems to be very iffy with it, trying to figure out how to make it secure over a VPN, etc).

The problem is that, completely at random, Unbound starts returning DNS results that list ONLY IPv6 addresses, so connectivity on the LAN fails. When this happens, if I manually do a lookup against Cloudflare's IPv6 DNS servers, the result comes back normal with both IPv4 and IPv6, suggesting the problem is within Unbound itself. Restarting Unbound then fixes the problem for days or even weeks, until it randomly happens again.

DNS on my LAN is all forced through pfSense with a firewall rule, and all queries are forwarded to Cloudflare DNS over both IPv4 and IPv6, e.g.:

forward-zone:
    name: "."
    forward-addr: 2606:4700:4700::1111@853
    forward-addr: 2606:4700:4700::1001@853
    forward-addr: 1.1.1.1@853
    forward-addr: 1.0.0.1@853
    forward-ssl-upstream: yes

I do not believe there is anything inherently wrong with my configuration as it works 99% of the time and I can see IPv6 lookups in the logs routinely bringing back correct DNS results. I'm just confused why and how this is happening?

johnpoz · Jul 28, 2018, 10:34 AM

Unbound is not going to forward anything it didn't get a query for. If your client is asking for both A and AAAA, which is the norm, and you don't get an answer back for A, that is not an Unbound problem.

Are you saying that unbound didn't query A when client did and unbound didn't have anything cached? Are you saying you got a response for your A query from upstream but unbound did not provide it to the client?

I would suggest you sniff this traffic for a while... And validate exactly what is happening. Maybe whoever you forwarded to had the AAAA cached but had to forward/resolve to find the A, and that took much longer than the AAAA response, which is why you have no A for the client that requested it.

Why are you not just resolving might I ask?

Alex Atkin UK · Jul 29, 2018, 12:56 AM

@johnpoz Correct, it seems like Unbound is not retrieving the A record when upstream clearly is working. This causes all web browsing to fail due to no longer getting A records. Restart Unbound and it works again, suggesting it's not an issue with the client.

I'm not resolving for extra security. While I'm fairly sure my ISP doesn't sniff DNS traffic, there are rumblings in UK government that they want to further clamp down on what websites we can and cannot access, I'm merely pre-empting that by using a party which claims to be neutral.

I'm not sure how I can sniff this, seeing as it's happening over a secure connection and does not appear to be a client issue.

johnpoz · Jul 29, 2018, 11:32 AM

So you see the client ask for A? You validated it actually sent the query?

But you're saying Unbound doesn't forward it? How do you know? Since your "extra" security seems to prevent you from even knowing what your own devices are doing ;)

It's quite possible that it just takes longer for your forwarding DNS to resolve/forward the A.. Or maybe the TTL is really short and the authoritative server they resolve it from is having issues.

Hard to tell when you're so worried about the gov tracking that you looked up pornhub or something that you cannot even troubleshoot your own traffic ;)

Why do you not just resolve through the VPN connection it's clear you have.. This way, vs sending ALL your queries to DNS server X, you would actually be asking all the different authoritative NS across the globe for what you want - the records they are the owners of. And the gov couldn't see your DNS queries.

I would suggest you turn on debug logging in unbound for a start - but for testing I would turn off the ssl forward so you can sniff. Unless you want to MITM yourself ;) Which is also possible - but more complicated setup so you can view what is being sent.

Can you give an example FQDN that you're seeing this issue on? Then I can actually look to see what the authoritative NS say about the record(s) you're trying to query. Some host.domain.tld that returns AAAA but not A.

Alex Atkin UK · Jul 30, 2018, 5:07 AM

As you point out, I don't know. But as the first thing I notice is web browsing stops working (and host/nslookup confirm no A records coming from Unbound) and restarting Unbound fixes it, it seems unlikely that the web browser is NOT requesting A records.

The first reason I don't have DNS over the VPN is simply that I set this up BEFORE I got a VPN.

The second reason is that only one client is routed over the VPN and it does incur packet loss under heavy use, so I preferred DNS to keep going down the more reliable WAN route. Plus I don't want every single client to fail if the VPN goes down.

DNS also has a higher priority in QoS than other traffic. I suspect this may not make much difference with the VPN client being on pfSense (plus I doubt QoS does much for UDP in the first place), but originally the VPN was running ON the client rather than pfSense. So it's all a learning experience in finding the optimal configuration here.

johnpoz · Jul 30, 2018, 10:46 AM

And again, could you please post some sample FQDN that you're having this so-called issue with..

You restarting Unbound could be a complete red herring.. Since that would flush the cache - maybe you got back an NX and are having to wait for the negative TTL to expire?

Restarting the service without knowing what the problem actually is, is not valid troubleshooting..

Alex Atkin UK · Jul 31, 2018, 1:56 AM

Good timing, it just happened again. (unfortunately I haven't gotten around to increasing the Unbound log level yet)

I'm sorry if I'm not being as helpful as needed, but as this is a router with four of us using it, it's not really practical to leave it broken for long enough to extensively troubleshoot. I greatly appreciate your continued efforts.

[2.4.3-RELEASE][root@Router.lan]/root: host yourlabournec.co.uk
yourlabournec.co.uk mail is handled by 10 mx0.123-reg.co.uk.
yourlabournec.co.uk mail is handled by 20 mx1.123-reg.co.uk.

[2.4.3-RELEASE][root@Router.lan]/root: host yourlabournec.co.uk 1.1.1.1
Using domain server:
Name: 1.1.1.1
Address: 1.1.1.1#53
Aliases:

yourlabournec.co.uk has address 35.241.57.179
yourlabournec.co.uk mail is handled by 10 mx0.123-reg.co.uk.
yourlabournec.co.uk mail is handled by 20 mx1.123-reg.co.uk.

[2.4.3-RELEASE][root@Router.lan]/root: host yourlabournec.co.uk 1.0.0.1
Using domain server:
Name: 1.0.0.1
Address: 1.0.0.1#53
Aliases:

yourlabournec.co.uk has address 35.241.57.179
yourlabournec.co.uk mail is handled by 10 mx0.123-reg.co.uk.
yourlabournec.co.uk mail is handled by 20 mx1.123-reg.co.uk.

[2.4.3-RELEASE][root@Router.lan]/root: host yourlabournec.co.uk 2606:4700:4700::1111
Using domain server:
Name: 2606:4700:4700::1111
Address: 2606:4700:4700::1111#53
Aliases:

yourlabournec.co.uk has address 35.241.57.179
yourlabournec.co.uk mail is handled by 10 mx0.123-reg.co.uk.
yourlabournec.co.uk mail is handled by 20 mx1.123-reg.co.uk.

[2.4.3-RELEASE][root@Router.lan]/root: host yourlabournec.co.uk 2606:4700:4700::1001
Using domain server:
Name: 2606:4700:4700::1001
Address: 2606:4700:4700::1001#53
Aliases:

yourlabournec.co.uk has address 35.241.57.179
yourlabournec.co.uk mail is handled by 10 mx0.123-reg.co.uk.
yourlabournec.co.uk mail is handled by 20 mx1.123-reg.co.uk.

Last visited this site 4 hours before.
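
For anyone wanting to automate the comparison above, here is a minimal sketch in Python. It only parses text already produced by `host` and reports which record types the upstream answer contains that the local answer lacks. The function and variable names are made up for illustration, and the line patterns are assumptions based on the output pasted in this post.

```python
import re

# Line formats printed by `host`, as seen in the pasted output (assumed patterns).
PATTERNS = {
    "A": re.compile(r"^\S+ has address (\S+)$"),
    "AAAA": re.compile(r"^\S+ has IPv6 address (\S+)$"),
    "MX": re.compile(r"^\S+ mail is handled by \d+ (\S+)$"),
}


def classify_host_output(text):
    """Group the values found in `host` output by record type."""
    found = {rtype: [] for rtype in PATTERNS}
    for line in text.splitlines():
        line = line.strip()
        for rtype, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match:
                found[rtype].append(match.group(1))
                break
    return found


# Answer from the local resolver (Unbound) during the failure.
via_unbound = """\
yourlabournec.co.uk mail is handled by 10 mx0.123-reg.co.uk.
yourlabournec.co.uk mail is handled by 20 mx1.123-reg.co.uk.
"""

# Answer when querying 1.1.1.1 directly.
via_cloudflare = """\
Using domain server:
Name: 1.1.1.1
Address: 1.1.1.1#53
Aliases:

yourlabournec.co.uk has address 35.241.57.179
yourlabournec.co.uk mail is handled by 10 mx0.123-reg.co.uk.
yourlabournec.co.uk mail is handled by 20 mx1.123-reg.co.uk.
"""

local = classify_host_output(via_unbound)
upstream = classify_host_output(via_cloudflare)

# Record types the upstream returned that the local resolver did not.
missing = [rtype for rtype in PATTERNS if upstream[rtype] and not local[rtype]]
print("missing locally:", missing)
```

Note that neither answer contains an AAAA record for this name, so for this particular lookup the difference is only the missing A record.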

johnpoz · Jul 31, 2018, 9:51 AM

So these are the NS

;; QUESTION SECTION:
;yourlabournec.co.uk. IN NS

;; ANSWER SECTION:
yourlabournec.co.uk. 86400 IN NS buck.ns.cloudflare.com.
yourlabournec.co.uk. 86400 IN NS elsa.ns.cloudflare.com.

I just tested and buck.ns.cloudflare.com which is also the SOA, sometimes does not answer..

$ dig @buck.ns.couldflare.com yourlabournec.co.uk

; <<>> DiG 9.12.2 <<>> @buck.ns.couldflare.com yourlabournec.co.uk
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

$ dig @72.52.4.119 yourlabournec.co.uk

; <<>> DiG 9.12.2 <<>> @72.52.4.119 yourlabournec.co.uk
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

Also - while it's supposed to be the SOA - they do not present a constant TTL for the record.. It's counting down, like it is not the authoritative NS...

;; QUESTION SECTION:
;yourlabournec.co.uk. IN A

;; ANSWER SECTION:
yourlabournec.co.uk. 218 IN A 35.241.57.179

;; Query time: 34 msec
;; SERVER: 173.245.58.111#53(173.245.58.111)
;; WHEN: Tue Jul 31 04:42:28 CDT 2018
;; MSG SIZE rcvd: 64

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;yourlabournec.co.uk. IN A

;; ANSWER SECTION:
yourlabournec.co.uk. 214 IN A 35.241.57.179

;; Query time: 33 msec
;; SERVER: 173.245.58.111#53(173.245.58.111)
;; WHEN: Tue Jul 31 04:42:32 CDT 2018
;; MSG SIZE rcvd: 64

If it is actually the authoritative NS for the domain - the TTL should always be what was set for it..
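
A rough sketch of that TTL check in Python, assuming the answer lines follow the `name TTL IN type data` layout shown in the dig output above. The helper names are hypothetical, and a strictly decreasing TTL is only a hint that the answer comes from a cache, not proof.

```python
import re

# Matches a dig answer line: owner name, TTL, class IN, record type, data.
ANSWER_RE = re.compile(r"^(\S+)\s+(\d+)\s+IN\s+(\S+)\s+(\S+)$")


def answer_ttls(dig_output, rtype="A"):
    """Collect the TTLs of all answer lines of the given type, in order."""
    ttls = []
    for line in dig_output.splitlines():
        line = line.strip()
        if line.startswith(";"):
            # Skip dig comments, including the echoed question section.
            continue
        match = ANSWER_RE.match(line)
        if match and match.group(3) == rtype:
            ttls.append(int(match.group(2)))
    return ttls


def looks_cached(ttls):
    """True if the TTLs strictly decrease from one query to the next."""
    return len(ttls) > 1 and all(b < a for a, b in zip(ttls, ttls[1:]))


# The two answers from the queries above, taken four seconds apart.
two_queries = """\
;yourlabournec.co.uk. IN A
yourlabournec.co.uk. 218 IN A 35.241.57.179
;yourlabournec.co.uk. IN A
yourlabournec.co.uk. 214 IN A 35.241.57.179
"""

ttls = answer_ttls(two_queries)
print(ttls, "counting down:", looks_cached(ttls))
```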

Next time it happens.. Look to see what unbound has cached for it..

[2.4.3-RELEASE][root@sg4860.local.lan]/root: unbound-control -c /var/unbound/unbound.conf dump_cache | grep yourlabournec.co.uk

Or whatever domain it is you're having an issue with... So when this happened for that site, did all other sites fail? So there is something odd with this site.. You restart Unbound, and maybe the issue goes away... See where I could not query the SOA buck.. Well now it answers.

dig @buck.ns.cloudflare.com yourlabournec.co.uk

; <<>> DiG 9.12.2 <<>> @buck.ns.cloudflare.com yourlabournec.co.uk
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12191
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;yourlabournec.co.uk. IN A

;; ANSWER SECTION:
yourlabournec.co.uk. 281 IN A 35.241.57.179

;; Query time: 36 msec
;; SERVER: 173.245.59.78#53(173.245.59.78)
;; WHEN: Tue Jul 31 04:49:21 Central Daylight Time 2018
;; MSG SIZE rcvd: 64

They are hosting their NS off of Cloudflare - and the TTL is not constant like it should be from an actual authoritative NS..

Not seeing any AAAA records for this.. So your AAAA comment is clearly a red herring, etc.

Unless you actually troubleshoot your problem - you're going to be just stuck in a loop restarting it... So you're saying nothing was resolving when this happened? Or just this 1 FQDN??

Alex Atkin UK · Jul 31, 2018, 10:52 AM

@johnpoz This was just the site that happened to be loading at the time of the failure; when it happens ALL DNS seems to fail. I believe even host overrides fail to resolve, something that I only just remembered, and I feel stupid for not double-checking now, as surely that would eliminate Cloudflare as the culprit if true?

I have made a note of the cache dump command, so when it fails again I will check to see what it says for any domains I recently accessed. Then will fire off a host command for that domain, see if maybe the cache has changed between the two?
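
That before/after comparison could be sketched roughly like this in Python. It assumes the RRset lines in the dump use the usual `name TTL class type data` zone-file layout, and it ignores everything else. The function name and both sample dumps are hypothetical, made up for illustration rather than taken from a real failure.

```python
def rrsets_for(dump_text, name):
    """Return {record_type: [data, ...]} for `name` from cache-dump-style text."""
    wanted = name.rstrip(".").lower() + "."
    records = {}
    for line in dump_text.splitlines():
        fields = line.split()
        # Expect: owner TTL class type data...; skip comments, markers, msg lines.
        if len(fields) < 5 or fields[0].lower() != wanted:
            continue
        if not fields[1].isdigit() or fields[2] != "IN":
            continue
        records.setdefault(fields[3], []).append(" ".join(fields[4:]))
    return records


# Hypothetical dump taken while lookups are failing (no A record cached).
before = """\
START_RRSET_CACHE
;rrset 200 2 0 8 0
yourlabournec.co.uk. 200 IN MX 10 mx0.123-reg.co.uk.
yourlabournec.co.uk. 200 IN MX 20 mx1.123-reg.co.uk.
END_RRSET_CACHE
"""

# Hypothetical dump taken after running `host` against the same name again.
after = before.replace(
    "END_RRSET_CACHE",
    ";rrset 300 1 0 8 0\n"
    "yourlabournec.co.uk. 300 IN A 35.241.57.179\n"
    "END_RRSET_CACHE",
)

before_rr = rrsets_for(before, "yourlabournec.co.uk")
after_rr = rrsets_for(after, "yourlabournec.co.uk")

# Record types that appeared in the cache between the two dumps.
gained = sorted(set(after_rr) - set(before_rr))
print("types gained:", gained)
```

If the set of cached types does not change between the two dumps, that would point away from a stale cache entry and towards Unbound itself not answering.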

I certainly would suspect Cloudflare-hosted DNS could be a little odd, as their service allows you to switch between normal DNS and their site proxying service, which would imply their TTL has to be low or those changes would take days to propagate. My own domains use plain DNS hosted by Cloudflare, as they don't get nearly enough hits to trigger Cloudflare's cache, so they load painfully slowly if I use the proxying. I usually check those first, as I figure less can go wrong if the DNS records I'm looking for are already hosted by Cloudflare, but honestly who knows with their convoluted service.

johnpoz · Jul 31, 2018, 12:15 PM

@alex-atkin-uk said in Unbound periodically only retrieving IPv6 results for hostname instead of both IPv4 AND IPv6:

I believe even host overrides fail to resolve

That would be a sign that Unbound has failed altogether... Could it be restarting? There are known issues when Unbound restarts. Does the normal log show it restarting?

Have you updated the log level as of yet? Do you have unbound registering dhcp leases? This in the past has been an issue with it restarting. Are you running anything else - maybe pfblocker? I think there is something that reloads unbound in that - but would have to double check if that is true or not. Maybe on its download of new list it "could" be restarting unbound?

Local hosts/overrides not resolving points to either a connectivity problem to pfSense for resolving - or an issue with Unbound itself.. Like it being in the middle of a restart when you query it, or even that after a reload its cache is flushed and it then has to go ask for what you're looking for again, etc.

Once we gather enough info I'm sure we will figure out what is really going on.

Alex Atkin UK · Aug 1, 2018, 11:41 AM

I have updated the log level but it has not failed yet since doing so. It's so inconsistent: sometimes it will fail every day, sometimes it won't fail for weeks.

DHCP Registration is disabled (as I had read about this issue before); it's only set to register Static DHCP leases. Register OpenVPN Clients is also disabled, as is monitoring actions on my OpenVPN clients, as that did cause instability if the client started bouncing.

I am using pfblocker but not the DNS blocklist function.

johnpoz · Aug 1, 2018, 12:34 PM

Well, let us know when it fails again - and when it does, please validate whether local hosts resolve - i.e. your host overrides or any static registered devices.. pfSense's own name, for example.

Now just need to wait I guess.