Cannot resolve cdn.jsdelivr.net from LAN but fine from pfSense box itself

Alex Atkin UK

I have a really weird issue I can't wrap my head around.

Suddenly I can't resolve cdn.jsdelivr.net or linuxunplugged.com from the LAN; I get 2(SERVFAIL). But if I test from pfSense itself it resolves fine, including from 127.0.0.1, suggesting Unbound IS resolving this okay via TLS forwarding.

Surely even if it's somehow ended up in one of my pfBlockerNG lists, it should still resolve the DNS? I don't have DNS blocklists enabled, just firewall rules to prevent connections to IPs in the lists.

What on earth could be causing it to fail from the LAN?
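
For anyone wanting to reproduce the LAN-side check without dig, here is a minimal Python sketch that sends one plain UDP query and reports the response code. The function names (`build_query`, `rcode_of`, `lookup_rcode`) are made up for illustration, and it only speaks plain DNS on port 53 to whatever server you point it at, not DNS over TLS.

```python
import socket
import struct

# Standard DNS response codes (RFC 1035); 2 is the SERVFAIL seen from the LAN.
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, txid=0x1234):
    """Build a minimal DNS query: RD flag set, one question, class IN."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def rcode_of(response):
    """Return the RCODE name held in the low 4 bits of the header flags."""
    (flags,) = struct.unpack("!H", response[2:4])
    code = flags & 0x000F
    return RCODES.get(code, str(code))


def lookup_rcode(server, name, timeout=3.0):
    """Send one UDP query to `server` on port 53 and return the RCODE name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        response, _ = sock.recvfrom(4096)
    return rcode_of(response)
```

Running `lookup_rcode("192.168.1.254", "cdn.jsdelivr.net")` from a LAN client (substitute the address of your own pfSense box) should report `SERVFAIL` while the problem is present and `NOERROR` once it resolves.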

djanke

Hi @alex-atkin-uk ,

I believe this might be a possible bug in DNS resolver.

I am having the same issues. I have just set up DNS over TLS. Both the old manual config and the current checkbox one result in the same issue.
Trying to resolve any of techsnap.systems, coder.show, or linuxunplugged.com results in a SERVFAIL.
Using the DNS lookup tool on the admin UI will resolve them OK. Packet capture suggests that both are using DNS over TLS.

However the DNS resolver log is different. It will show the following when resolution was possible:

Sep 27 20:46:55	unbound	45665:0	info: 127.0.0.1 techsnap.systems. A IN
Sep 27 20:46:57	unbound	45665:0	info: 127.0.0.1 techsnap.systems. A IN
Sep 27 20:46:57	unbound	45665:0	info: 127.0.0.1 techsnap.systems. AAAA IN
Sep 27 20:46:57	unbound	45665:0	info: 127.0.0.1 techsnap.systems. CNAME IN

And the following when it fails:

Sep 27 20:37:51	unbound	45665:0	info: 192.168.19.96 techsnap.systems. A IN
Sep 27 20:37:51	unbound	45665:0	info: 192.168.19.96 techsnap.systems. A IN
Sep 27 20:37:51	unbound	45665:0	info: 192.168.19.96 techsnap.systems. A IN
Sep 27 20:37:51	unbound	45665:0	info: 192.168.19.96 techsnap.systems. A IN
Sep 27 20:37:52	unbound	45665:0	info: 192.168.19.96 techsnap.systems. A IN
Sep 27 20:37:52	unbound	45665:0	info: 192.168.19.96 techsnap.systems. A IN
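
The repeated entries in the failing case are easier to see when counted. A rough sketch, assuming the log lines look exactly like the ones above (the regex and `count_queries` are illustrative, not part of Unbound):

```python
import re
from collections import Counter

# Matches query-log lines shaped like:
# "Sep 27 20:37:51 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN"
LINE_RE = re.compile(
    r"unbound\s+\S+\s+info:\s+(?P<client>\S+)\s+(?P<name>\S+)\s+(?P<qtype>[A-Z0-9]+)\s+IN\s*$"
)


def count_queries(lines):
    """Count how often each (client, name, qtype) combination was logged."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[(m["client"], m["name"], m["qtype"])] += 1
    return counts
```

Fed the failing excerpt, this gives a single key, `("192.168.19.96", "techsnap.systems.", "A")`, with a count of 6 inside two seconds, which would be consistent with the client retrying because it never got a usable answer.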

Alex Atkin UK

You are right, linuxunplugged.com fails for me too. That's very interesting; this problem could have coincided with the previous update to pfSense before 2.4.4 (2.4.3?), which would explain a lot.

I wonder why the DNS lookup tool on the admin UI works though, and why it's only specific domains which are failing?

Also, do you mean that when it looks up via 1.1.1.1 directly it's still using TLS? I kinda assumed that would only happen when using 127.0.0.1, thus via Unbound, so I reverted back to the manual configuration to ensure all DNS is via TLS (the system will only use 127.0.0.1 for resolution).

djanke

I have to say I might have been wrong about the DNS lookup on the admin UI. There are a lot of packets to 1.1.1.1:853, but those are preceded by packets to port 53. I missed those before; there was too much IP noise.

So I guess there are two bugs. One is that these names fail to resolve when using DoT (everything is fine when that is turned off). Plus the admin UI DNS lookup tool does not use DoT for some reason.

Alex Atkin UK

@djanke The admin UI thing makes sense: if the DNS servers are in the main list then I'd expect it to use them in turn. Thus some resolution would go to 127.0.0.1, but some would also go direct, via normal resolution, to the other servers.

This does seem a flawed implementation though when the whole point of using TLS is to have ALL DNS resolution encrypted.

It would make more sense to have the DNS server list in the Unbound configuration itself, so that the main list still only contains localhost. It would also decrease the chance of people just ticking the TLS box and everything failing because the default DNS servers don't support TLS. Yes, they do warn about this next to the option, but when ticking it doesn't work as intended, what's the point?

djanke

I hope a third person can confirm as well, then we can file the resolution part as a bug.

Alex Atkin UK

Ah here we go, if I switch back to manual configuration so ONLY Unbound is used, I get "Host "linuxunplugged.com" could not be resolved." from the WebUI.

So it seems the WebUI also has a bug, where if even one server resolves okay then it implies they ALL worked, when in fact they didn't.

This does bring up another possibility though: is it Cloudflare's DNS over TLS that is actually failing here, rather than Unbound itself?

It seems the thread title should be renamed to "some domains fail to resolve using Cloudflare DNS over TLS".

djanke

@alex-atkin-uk

For me, if I use the config suggested on https://www.netgate.com/blog/dns-over-tls-with-pfsense.html the web UI resolution behaves the same: it will not fail, but it uses the unsecured port.

However, if I change to Quad9's servers it will resolve. Very slowly, though. So I added Quad9 to the list to at least be able to resolve.

Now I'm wondering where in the flow between Unbound and Cloudflare the issue is.

Alex Atkin UK

@djanke I removed all DNS servers from the rest of pfSense to avoid any insecure DNS lookups. No point in having DNS over TLS if it's only "sometimes" being used; not sure why they would suggest that.

I will add Quad9 servers though, as slow resolution is better than nothing.

johnpoz

That first one is a CNAME
;; ANSWER SECTION:
cdn.jsdelivr.net. 171 IN CNAME cdn.jsdelivr.net.cdn.cloudflare.net.

And so is the 2nd one
;; ANSWER SECTION:
linuxunplugged.com. 900 IN CNAME hosted.fireside.fm.

So your problems are most likely with resolving the CNAME. Or their setup, to be honest..

jsdelivr.net/CNAME (NXDOMAIN): The server returned CNAME for jsdelivr.net, but records of other types exist at that name.

Your other one has all kinds of issues as well
linuxunplugged.com/CNAME (NXDOMAIN): The server responded with no OPT record, rather than with RCODE FORMERR. (64.98.148.13, 216.40.47.26, UDP_0_EDNS0_32768_4096, UDP_0_EDNS0_32768_512)
linuxunplugged.com/CNAME (NXDOMAIN): The server returned CNAME for linuxunplugged.com, but records of other types exist at that name.
linuxunplugged.com/CNAME: The server responded with no OPT record, rather than with RCODE FORMERR. (64.98.148.13, 216.40.47.26, UDP_0_EDNS0_32768_4096)
linuxunplugged.com/CNAME: The server returned CNAME for linuxunplugged.com, but records of other types exist at that name.
linuxunplugged.com/NS: The server responded with no OPT record, rather than with RCODE FORMERR. (64.98.148.13, 216.40.47.26, UDP_0_EDNS0_32768_4096)

They have an EDNS problem.. So yeah, that can cause problems..

https://ednscomp.isc.org/ednscomp/90be0b592b

These resolve fine even with those issues... But when you forward, you're at the mercy of what they have cached, good or bad, or what they resolve, which could be good or bad, etc..
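
The "returned CNAME ... but records of other types exist at that name" complaint refers to the rule that a name holding a CNAME should not hold other record types (DNSSEC records such as RRSIG and NSEC aside). A small sketch of that check, using made-up example.com records rather than the real zone data; `cname_conflicts` is a hypothetical helper, not something from a DNS library:

```python
# Record types that may legitimately sit next to a CNAME (DNSSEC exceptions).
ALLOWED_WITH_CNAME = {"CNAME", "RRSIG", "NSEC"}


def cname_conflicts(records):
    """Given (owner_name, rtype) pairs, return {name: [conflicting types]}
    for every name that has a CNAME alongside other record types."""
    by_name = {}
    for name, rtype in records:
        by_name.setdefault(name.lower().rstrip("."), set()).add(rtype.upper())
    return {
        name: sorted(types - ALLOWED_WITH_CNAME)
        for name, types in by_name.items()
        if "CNAME" in types and types - ALLOWED_WITH_CNAME
    }


if __name__ == "__main__":
    # Illustrative data only: a zone apex always has NS and SOA records,
    # so putting a CNAME there triggers the conflict.
    example = [
        ("example.com.", "CNAME"),
        ("example.com.", "NS"),
        ("example.com.", "SOA"),
        ("www.example.com.", "CNAME"),
    ]
    print(cname_conflicts(example))
```

Because the apex of a zone always carries NS and SOA records, a CNAME at the bare domain runs into this rule, while a CNAME on a `www` subdomain does not.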

Alex Atkin UK

This is much deeper down the DNS rabbit hole than I have ever been, well over my head here. :/

Either way it seems bizarre that Cloudflare DNS servers give a completely different result to their DNS over TLS servers. You'd kinda expect them to have the same behaviour, but that's a question for Cloudflare.

I do still think the pfSense UI should make it clearer when doing a DNS lookup if one of the servers failed to give a result. It led me astray thinking that it WAS working, when it was not.

johnpoz

Which server failed to give what result? You're forwarding!! Once you forward you really are just throwing your question to them and hoping for a response.. You are at the mercy of what they answer or don't answer, it's that simple..

That is why anyone that really wants to know would just resolve - which is what pfSense does out of the box.. Then you are always getting your info straight from the horse's mouth.. And not just asking some other guy for what he knows, or what he might resolve.. And you have no idea if he is actually resolving or asking some other forwarder up the stream, which asks another one up the stream.. Etc.. etc..

And if you're going to forward - you sure as F would not forward to NSes that answer differently on purpose. Quad9 filters.. 1.1.1.1 is not supposed to be filtering.. So for sure you could get different results there.. Then again, what is 1.1.1.1 actually doing - you have no idea.. They could manipulate the info as well.. Same with Google - they state they do not alter, other than for a major security issue..

Also when you forward you really break any sort of actual geolocation help.. All of those services resolve from their locations - not yours.. Which could be really really far away from you.. Which could be sending you to the wrong IP for your location for whatever it is you're looking for, etc.. So while your DNS query might shave a few ms off the query time of talking to the authoritative NS for it.. You might be talking to the wrong area of the world for where you actually want to go.. Maybe you're talking to the West Coast DC for that resource vs the East Coast DC.. When you're in Atlanta, for example - you have no idea when you forward.. And you really have no idea where that forwarder sits and/or where it actually resolves from for geoIP stuff.

pfSense out of the box resolves - unless you ACTUALLY understand all of the implications of changing that - I would suggest you just resolve..

Alex Atkin UK

@johnpoz Oh I DO understand the implications of putting all your trust in an external resolver, bugs, potential for them to compromise the DNS results, etc.

We were merely trying to figure out if this was a bug in how Unbound was making the upstream request or an upstream problem, now we know.

You got me thinking though as I THOUGHT Geo was largely moot in the UK but this is eye opening:

Doing a Google lookup via TLS forwarding vs the resolver, then a traceroute to each result:

; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> www.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6759
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.google.com.                        IN      A

;; ANSWER SECTION:
www.google.com.         15      IN      A       172.217.19.68

;; Query time: 0 msec
;; SERVER: 192.168.1.254#53(192.168.1.254)
;; WHEN: Sun Sep 30 11:25:55 BST 2018
;; MSG SIZE  rcvd: 59

traceroute to 172.217.19.68 (172.217.19.68), 30 hops max, 60 byte packets
 1  losubs.subs.bng2.th-lon.zen.net.uk (62.3.80.21)  14.807 ms  14.774 ms  14.745 ms
 2  ae1-183.cr2.th-lon.zen.net.uk (62.3.86.82)  14.717 ms  15.507 ms  15.427 ms
 3  72.14.223.28 (72.14.223.28)  14.601 ms  14.590 ms  14.563 ms
 4  108.170.246.144 (108.170.246.144)  14.516 ms 74.125.242.115 (74.125.242.115)  15.927 ms 74.125.242.83 (74.125.242.83)  15.843 ms
 5  209.85.250.185 (209.85.250.185)  16.568 ms 209.85.143.67 (209.85.143.67)  15.792 ms 216.239.57.227 (216.239.57.227)  15.506 ms
 6  209.85.142.166 (209.85.142.166)  20.734 ms 108.170.234.118 (108.170.234.118)  20.546 ms 209.85.142.166 (209.85.142.166)  20.640 ms
 7  74.125.37.130 (74.125.37.130)  24.096 ms 74.125.37.170 (74.125.37.170)  24.590 ms 72.14.237.98 (72.14.237.98)  24.534 ms
 8  172.253.51.199 (172.253.51.199)  28.307 ms 172.253.50.215 (172.253.50.215)  29.027 ms  28.717 ms
 9  216.239.54.170 (216.239.54.170)  29.473 ms 108.170.225.179 (108.170.225.179)  28.647 ms 108.170.226.48 (108.170.226.48)  28.430 ms
10  108.170.253.49 (108.170.253.49)  28.607 ms 108.170.253.33 (108.170.253.33)  28.850 ms 108.170.253.49 (108.170.253.49)  29.527 ms
11  209.85.245.203 (209.85.245.203)  28.231 ms  28.700 ms 209.85.244.219 (209.85.244.219)  29.057 ms
12  mil02s05-in-f68.1e100.net (172.217.19.68)  27.799 ms  27.905 ms  28.672 ms

; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> www.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48003
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.google.com.                        IN      A

;; ANSWER SECTION:
www.google.com.         300     IN      A       216.58.206.36

;; Query time: 38 msec
;; SERVER: 192.168.1.254#53(192.168.1.254)
;; WHEN: Sun Sep 30 11:47:24 BST 2018
;; MSG SIZE  rcvd: 59

bash-4.4$ traceroute 216.58.206.36
traceroute to 216.58.206.36 (216.58.206.36), 30 hops max, 60 byte packets
 1  losubs.subs.bng2.th-lon.zen.net.uk (62.3.80.21)  14.025 ms  14.008 ms  14.064 ms
 2  ae1-183.cr2.th-lon.zen.net.uk (62.3.86.82)  14.198 ms  14.199 ms  14.173 ms
 3  72.14.223.28 (72.14.223.28)  14.127 ms  14.181 ms  14.285 ms
 4  74.125.242.97 (74.125.242.97)  16.523 ms  16.518 ms 74.125.242.65 (74.125.242.65)  15.222 ms
 5  216.239.63.137 (216.239.63.137)  15.165 ms  15.185 ms 216.239.63.127 (216.239.63.127)  15.148 ms
 6  lhr35s10-in-f4.1e100.net (216.58.206.36)  15.112 ms  14.862 ms  14.787 ms

Maybe it's a coincidence that it took more hops for the other IP, as it still seems to be going to a similar destination, but it does make you think.

johnpoz

Dude, look at your response time ;) They are not in the same area. You're 1/2 of the distance away.

Why would you think geo is moot? I would suggest you take a big look at how CDNs work ;) You don't serve up the planet and not worry about where users are going based upon where they are coming from...

You do understand you're forwarding to an anycast address as well, right - so where are they exactly? ;) There are situations where you would need to forward, sure.. But to shave a couple of milliseconds off your lookup is pretty much always going to be counterproductive.. Unless where you forward to is in your same area, and they actually resolve from the area you're in as well..

Shoot, you could even be in the same area and where your query comes from could give you a different IP based upon the source IP being a peering partner or not.. Resolving is the WAY to go ;) If you're worried about getting an answer for www.domain.tld 3 ms faster vs the whole process of how it works, then you're doing it WRONG ;)

Alex Atkin UK

Actually it's not that simple; the major difference is simply the overhead of TLS, I think, as:

 1  62.3.80.21  14.815 ms  13.960 ms  14.228 ms
 2  62.3.86.82  15.227 ms  15.224 ms  14.233 ms
 3  5.57.81.75  13.976 ms  14.478 ms  13.983 ms
 4  1.1.1.1  14.604 ms  14.227 ms  14.980 ms

 1  62.3.80.21  13.727 ms  13.950 ms  14.246 ms
 2  62.3.86.82  14.218 ms  13.987 ms  13.951 ms
 3  5.57.80.238  14.978 ms  15.214 ms  15.479 ms
 4  9.9.9.9  13.982 ms !Z  13.968 ms !Z  13.740 ms !Z

I have indeed gone back to resolving as at this point DNS over TLS just seems far less reliable.

The different IPs for Google and their routes are a weird anomaly, as the longer route still passes through the shorter one's routers but with more hops between, which to me suggests a routing table issue.

johnpoz

No, my point was the response time to your final endpoint.. You had 28ms vs 14ms - that has nothing to do with the lookup.. Which is my point.. No matter how many hops you took to get there.. But yeah, in the big picture fewer hops normally means faster..

You shouldn't be worried about the few ms you might save in looking up www.xyz.tld - because it has it cached already. You should be worried that you got the correct info from the actual NS authoritative for said records. And that you asked him from where you're actually at, so he will be sure to send you the proper IP for your location and/or ISP, etc..
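
The 28ms vs 14ms comparison can be pulled straight out of the two traceroutes posted earlier. A quick sketch, assuming standard Linux `traceroute` text output; `summarize_trace` and the two regexes are illustrative names only:

```python
import re

RTT_RE = re.compile(r"(\d+(?:\.\d+)?) ms")   # each "14.807 ms" sample
HOP_RE = re.compile(r"^\s*(\d+)\s")          # leading hop number


def summarize_trace(trace_text):
    """Return (last hop number, mean RTT in ms of that last hop)."""
    hops = []
    for line in trace_text.splitlines():
        m = HOP_RE.match(line)
        if not m:
            continue  # header line or shell prompt
        rtts = [float(x) for x in RTT_RE.findall(line)]
        if rtts:  # skip hops that only printed "* * *"
            hops.append((int(m.group(1)), rtts))
    if not hops:
        raise ValueError("no hop lines with round-trip times found")
    hop_number, rtts = hops[-1]
    return hop_number, sum(rtts) / len(rtts)
```

On the two traces in this thread it reports hop 12 at roughly 28.1 ms for 172.217.19.68 and hop 6 at roughly 14.9 ms for 216.58.206.36, so the final endpoint is about twice as far away in round-trip terms regardless of how long the lookup itself took.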

Alex Atkin UK

@johnpoz But like I said, the route is odd, as visually it looks like both routes "should" have been the same, because it's bouncing around different routers to get to the same ones used in the quicker trace.

Granted, it's likely this would not always be the case, as Geo could "theoretically" make a difference, but it's unlikely due to how UK ISPs almost always only hit the Internet in London, regardless of where you are geographically located. They just don't bother with the cost of taking the quickest route from your location to their network, and all the major peering and CDNs are in London anyway.

I have a reasonable amount of experience looking into this as my old ISP was in my city and DID have their PoP within the city, using their own network. But even ISPs that did that before have fallen back onto leasing the telco virtual backhaul which, again, aggregates everyone in London. It's a bit of a drag as I had a single-digit route to the Internet, but it is what it is.