Cannot resolve cdn.jsdelivr.net from LAN but fine from pfSense box itself
I have a really weird issue I can't wrap my head around.
Suddenly I can't resolve cdn.jsdelivr.net or linuxunplugged.com from the LAN; I get 2(SERVFAIL). But if I test from pfSense itself it resolves fine, including from 127.0.0.1, suggesting Unbound IS resolving this okay via TLS forwarding.
Surely even if it's somehow ended up in one of my pfBlockerNG lists it should still resolve the DNS? I don't have DNS blocklists enabled, just firewall rules to prevent connections to IPs in the lists.
What on earth could be causing it to fail from the LAN?
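For anyone unsure what the "2" in 2(SERVFAIL) refers to: it is the RCODE, the low four bits of the flags word in the DNS response header. A minimal Python sketch of reading it from a raw response follows; the function name and the hand-built sample header are made up for illustration and are not taken from any capture in this thread.

```python
import struct

# Standard DNS response codes (the low 4 bits of the header flags word).
RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}

def rcode_of(message: bytes) -> str:
    """Return the RCODE of a raw DNS message in the 'N(NAME)' style."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    # Header starts with a 16-bit ID followed by the 16-bit flags word.
    _msg_id, flags = struct.unpack("!HH", message[:4])
    code = flags & 0x000F
    return f"{code}({RCODES.get(code, 'UNKNOWN')})"

# Hypothetical header: ID 0x1234, flags 0x8182 (QR=1, RD=1, RA=1, RCODE=2),
# one question, no answer/authority/additional records.
sample_header = struct.pack("!HHHHHH", 0x1234, 0x8182, 1, 0, 0, 0)
print(rcode_of(sample_header))
```

SERVFAIL only says the server you asked could not complete the lookup; it does not say which step upstream went wrong.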
Hi @alex-atkin-uk ,
I believe this might be a possible bug in DNS resolver.
I am having the same issue. I have just set up DNS over TLS. Both the old manual config and the current checkbox one result in the same problem.
Trying to resolve any of techsnap.systems , coder.show , linuxunplugged.com results in a SERVFAIL.
Using the DNS lookup tool on the admin UI will resolve them OK. Packet capture suggests that both are using DNS over TLS.
However the DNS resolver log is different. It will show the following when resolution was possible:
Sep 27 20:46:55 unbound 45665:0 info: 127.0.0.1 techsnap.systems. A IN
Sep 27 20:46:57 unbound 45665:0 info: 127.0.0.1 techsnap.systems. A IN
Sep 27 20:46:57 unbound 45665:0 info: 127.0.0.1 techsnap.systems. AAAA IN
Sep 27 20:46:57 unbound 45665:0 info: 127.0.0.1 techsnap.systems. CNAME IN
And the following when it fails:
Sep 27 20:37:51 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN
Sep 27 20:37:51 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN
Sep 27 20:37:51 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN
Sep 27 20:37:51 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN
Sep 27 20:37:52 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN
Sep 27 20:37:52 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN
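The pattern in the failing excerpt (the same A query repeated by the LAN client, with no AAAA or CNAME follow-ups) is easier to spot by counting queries per client. Here is a rough Python sketch; the regular expression only assumes the line shape shown in the excerpts above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines shaped like the excerpts in this post, e.g.
# "Sep 27 20:37:51 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN"
LINE_RE = re.compile(
    r"unbound \d+:\d+ info: (?P<client>\S+) (?P<name>\S+) (?P<qtype>[A-Z0-9]+) IN\s*$"
)

def count_queries(lines):
    """Count queries per (client, name, qtype); lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match["client"], match["name"], match["qtype"])] += 1
    return counts

# The six failing lines from the excerpt above.
failing = [
    "Sep 27 20:37:51 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN",
    "Sep 27 20:37:51 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN",
    "Sep 27 20:37:51 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN",
    "Sep 27 20:37:51 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN",
    "Sep 27 20:37:52 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN",
    "Sep 27 20:37:52 unbound 45665:0 info: 192.168.19.96 techsnap.systems. A IN",
]

for key, total in count_queries(failing).items():
    print(key, total)
```

A client asking for the same record several times within a second or two usually means it is retrying because it never got a usable answer.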
You are right, linuxunplugged.com fails for me too. That's very interesting, this problem could have coincided with the previous update to pfSense before 2.4.4 (2.4.3?), which would explain a lot.
I wonder why the DNS lookup tool on the admin UI works though, and why it's only specific domains which are failing?
Also, do you mean that when it looks up via 220.127.116.11 directly it's still using TLS? I kinda assumed that would only happen when using 127.0.0.1 thus via Unbound, so I reverted back to the manual configuration to ensure all DNS is via TLS (the system will only use 127.0.0.1 for resolution).
I have to say I might have been wrong with the DNS lookup on the admin UI. There are a lot of packets to 18.104.22.168:853 but those get preceded by packets to port 53. I missed those before, there was too much IP noise.
So I guess there are two bugs. One is that these names fail to resolve when using DoT (everything is fine when that is turned off). Plus the admin UI DNS lookup tool does not use DoT for some reason.
@djanke The admin UI thing makes sense, if the DNS servers are in the main list then I'd expect it to use them in turn. Thus some resolution would happen to 127.0.0.1 but some would also go direct, via normal resolution, to the other servers.
This does seem a flawed implementation though when the whole point of using TLS is to have ALL DNS resolution encrypted.
It would make more sense to have the DNS server list in the Unbound configuration itself, so that the main list still only contains localhost. It also decreases the chance of people just ticking the TLS box and everything failing, because the default DNS servers don't support TLS. Yes, they do warn about this next to the option, but if ticking it doesn't work as intended then what's the point?
I hope a third person can confirm as well, then we can file the resolution part as a bug.
Ah here we go, if I switch back to manual configuration so ONLY Unbound is used, I get "Host "linuxunplugged.com" could not be resolved." from the WebUI.
So it seems the WebUI also has a bug, where if even one server resolves okay then it implies they ALL worked, when in fact they didn't.
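To make the complaint concrete, this is roughly the kind of per-server reporting I would expect instead. It is only a Python sketch of the idea, not how the pfSense lookup page is implemented; the function name is made up and the sample addresses are documentation-range placeholders, not real resolvers.

```python
def summarize_lookup(results):
    """Summarize a lookup that was sent to several DNS servers.

    `results` maps each server address to the list of answers it returned;
    an empty list means that server failed to resolve the name.
    """
    ok = sorted(server for server, answers in results.items() if answers)
    failed = sorted(server for server, answers in results.items() if not answers)
    return {
        "resolved": bool(ok),        # at least one server answered
        "all_servers_ok": not failed,  # True only if every server answered
        "ok": ok,
        "failed": failed,
    }

# Hypothetical result set: the local resolver fails, one upstream answers.
example = {
    "127.0.0.1": [],
    "192.0.2.1": ["198.51.100.10"],
}
report = summarize_lookup(example)
print(report)
```

With the failed list shown next to the result, a lookup that only worked via one of the extra servers would not look like a clean success.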
This does bring up another possibility though: is it Cloudflare's DNS over TLS that is actually failing here, rather than Unbound itself?
It seems the thread title should be renamed to "some domains fail to resolve using Cloudflare DNS over TLS".
For me, if I use the config suggested on https://www.netgate.com/blog/dns-over-tls-with-pfsense.html the web UI resolution behaves the same: it does not fail, but it uses the unsecured port.
However if I change to Quad9's servers it will resolve. Very slowly though. So I added Quad9 to the list to at least be able to resolve.
Now I'm wondering where in the flow between Unbound and Cloudflare the issue is.
@djanke I removed all DNS servers from the rest of pfSense to avoid any insecure DNS lookups. No point in having DNS over TLS if it's only "sometimes" being used, not sure why they would suggest that.
I will add Quad9 servers though, as slow resolution is better than nothing.
So your problems are most likely with resolving the CNAME. Or their setup to be honest..
Your other one has all kinds of issues as well
linuxunplugged.com/CNAME (NXDOMAIN): The server responded with no OPT record, rather than with RCODE FORMERR. (22.214.171.124, 126.96.36.199, UDP_0_EDNS0_32768_4096, UDP_0_EDNS0_32768_512)
linuxunplugged.com/CNAME (NXDOMAIN): The server returned CNAME for linuxunplugged.com, but records of other types exist at that name.
linuxunplugged.com/CNAME: The server responded with no OPT record, rather than with RCODE FORMERR. (188.8.131.52, 184.108.40.206, UDP_0_EDNS0_32768_4096)
linuxunplugged.com/CNAME: The server returned CNAME for linuxunplugged.com, but records of other types exist at that name.
linuxunplugged.com/NS: The server responded with no OPT record, rather than with RCODE FORMERR. (220.127.116.11, 18.104.22.168, UDP_0_EDNS0_32768_4096)
They have an EDNS problem.. So yeah that can cause problems..
These resolve fine even with those issues... But when you forward you're at the mercy of what they have cached, good or bad, or what they resolve, which could be good or bad etc..
This is much deeper down the DNS rabbit hole than I have ever been, well over my head here. :/
Either way it seems bizarre that Cloudflare DNS servers give a completely different result to their DNS over TLS servers. You'd kinda expect them to have the same behaviour, but that's a question for Cloudflare.
I do still think the pfSense UI should make it clearer when doing a DNS lookup if one of the servers failed to give a result. It led me astray thinking that it WAS working, when it was not.
Which server failed to give what result? You're forwarding!! Once you forward you really are just throwing your question to them and hoping for a response.. You are at the mercy of what they answer or don't answer, it's that simple..
That is why anyone that really wants to know would just resolve - which is what pfSense does out of the box.. Then you are always getting your info straight from the horse's mouth.. And not just asking some other guy for what he knows, or what he might resolve.. And you have no idea if he is actually resolving or asking some other forwarder up the stream, which asks another one up the stream.. Etc.. etc..
And if you're going to forward - you sure as F would not forward to NS that answer differently on purpose. Quad9 filters.. 22.214.171.124 is not supposed to be filtering.. So for sure you could get different results there.. Then again what is 126.96.36.199 actually doing - you have no idea.. They could manipulate the info as well.. Same with Google - they state they do not alter, other than if there is a major security issue..
Also when you forward you really break any sort of actual geo location help.. All of those services resolve from their locations - not yours.. Which could be really really far away from you.. Which could be sending you to the wrong IP for your geo location for whatever it is you're looking to access, etc.. So while your DNS query might shave a few ms off the query time of talking to the authoritative NS for it.. You might be talking to the wrong area of the world for where you actually want to go.. Maybe you're talking to the West Coast DC for that resource vs the East Coast DC.. When you're in Atlanta for example - you have no idea when you forward.. And you really have no idea where that forwarder sits and or where it actually resolves from for geoIP stuff.
pfSense out of the box resolves - unless you ACTUALLY understand all of the implications of changing that - I would suggest you just resolve..
@johnpoz Oh I DO understand the implications of putting all your trust in an external resolver, bugs, potential for them to compromise the DNS results, etc.
We were merely trying to figure out if this was a bug in how Unbound was making the upstream request or an upstream problem, now we know.
You got me thinking though as I THOUGHT Geo was largely moot in the UK but this is eye opening:
Doing a Google lookup on TLS vs Resolver then traceroute the results:
; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> www.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6759
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.google.com. IN A
;; ANSWER SECTION:
www.google.com. 15 IN A 188.8.131.52
;; Query time: 0 msec
;; SERVER: 192.168.1.254#53(192.168.1.254)
;; WHEN: Sun Sep 30 11:25:55 BST 2018
;; MSG SIZE rcvd: 59
traceroute to 184.108.40.206 (220.127.116.11), 30 hops max, 60 byte packets
1 losubs.subs.bng2.th-lon.zen.net.uk (18.104.22.168) 14.807 ms 14.774 ms 14.745 ms
2 ae1-183.cr2.th-lon.zen.net.uk (22.214.171.124) 14.717 ms 15.507 ms 15.427 ms
3 126.96.36.199 (188.8.131.52) 14.601 ms 14.590 ms 14.563 ms
4 184.108.40.206 (220.127.116.11) 14.516 ms 18.104.22.168 (22.214.171.124) 15.927 ms 126.96.36.199 (188.8.131.52) 15.843 ms
5 184.108.40.206 (220.127.116.11) 16.568 ms 18.104.22.168 (22.214.171.124) 15.792 ms 126.96.36.199 (188.8.131.52) 15.506 ms
6 184.108.40.206 (220.127.116.11) 20.734 ms 18.104.22.168 (22.214.171.124) 20.546 ms 126.96.36.199 (188.8.131.52) 20.640 ms
7 184.108.40.206 (220.127.116.11) 24.096 ms 18.104.22.168 (22.214.171.124) 24.590 ms 126.96.36.199 (188.8.131.52) 24.534 ms
8 184.108.40.206 (220.127.116.11) 28.307 ms 18.104.22.168 (22.214.171.124) 29.027 ms 28.717 ms
9 126.96.36.199 (188.8.131.52) 29.473 ms 184.108.40.206 (220.127.116.11) 28.647 ms 18.104.22.168 (22.214.171.124) 28.430 ms
10 126.96.36.199 (188.8.131.52) 28.607 ms 184.108.40.206 (220.127.116.11) 28.850 ms 18.104.22.168 (22.214.171.124) 29.527 ms
11 126.96.36.199 (188.8.131.52) 28.231 ms 28.700 ms 184.108.40.206 (220.127.116.11) 29.057 ms
12 mil02s05-in-f68.1e100.net (18.104.22.168) 27.799 ms 27.905 ms 28.672 ms

; <<>> DiG 9.11.4-P1-RedHat-9.11.4-5.P1.fc28 <<>> www.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48003
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.google.com. IN A
;; ANSWER SECTION:
www.google.com. 300 IN A 22.214.171.124
;; Query time: 38 msec
;; SERVER: 192.168.1.254#53(192.168.1.254)
;; WHEN: Sun Sep 30 11:47:24 BST 2018
;; MSG SIZE rcvd: 59
bash-4.4$ traceroute 126.96.36.199
traceroute to 188.8.131.52 (184.108.40.206), 30 hops max, 60 byte packets
1 losubs.subs.bng2.th-lon.zen.net.uk (220.127.116.11) 14.025 ms 14.008 ms 14.064 ms
2 ae1-183.cr2.th-lon.zen.net.uk (18.104.22.168) 14.198 ms 14.199 ms 14.173 ms
3 22.214.171.124 (126.96.36.199) 14.127 ms 14.181 ms 14.285 ms
4 188.8.131.52 (184.108.40.206) 16.523 ms 16.518 ms 220.127.116.11 (18.104.22.168) 15.222 ms
5 22.214.171.124 (126.96.36.199) 15.165 ms 15.185 ms 188.8.131.52 (184.108.40.206) 15.148 ms
6 lhr35s10-in-f4.1e100.net (220.127.116.11) 15.112 ms 14.862 ms 14.787 ms
Maybe it's a coincidence that it took more hops for the other IP, as it still seems to be going to a similar destination, but it does make you think.
Dude look at your response time ;) They are not in the same area. You're half the distance away.
Why would you think geo is moot? I would suggest you take a big look at how CDNs work ;) You don't serve up the planet and not worry about where users are going based upon where they are coming from...
You do understand you're forwarding to anycast addresses as well, right - so where are they exactly? ;) There are situations where you would need to forward, sure.. But doing it to shave a couple of milliseconds off your lookup is pretty much always going to be counterproductive.. Unless where you forward to is in your same area, and they actually resolve from the area you're in as well..
Shoot, you could even be in the same area and where your query comes from could give you a different IP based upon the source IP being a peering partner or not.. Resolving is the WAY to go ;) If you're worried about getting an answer for www.domain.tld 3 ms faster vs the whole process of how it works then you're doing it WRONG ;)
Actually it's not that simple; I think the major difference is simply the overhead of TLS, as:
1 18.104.22.168 14.815 ms 13.960 ms 14.228 ms
2 22.214.171.124 15.227 ms 15.224 ms 14.233 ms
3 126.96.36.199 13.976 ms 14.478 ms 13.983 ms
4 188.8.131.52 14.604 ms 14.227 ms 14.980 ms

1 184.108.40.206 13.727 ms 13.950 ms 14.246 ms
2 220.127.116.11 14.218 ms 13.987 ms 13.951 ms
3 18.104.22.168 14.978 ms 15.214 ms 15.479 ms
4 22.214.171.124 13.982 ms !Z 13.968 ms !Z 13.740 ms !Z
I have indeed gone back to resolving as at this point DNS over TLS just seems far less reliable.
The different IPs for Google and their routes are a weird anomaly, as the longer route still passes through the shorter one's routers but with more hops in between, which to me suggests a routing table issue.
No, my point was the response time to your final endpoint.. You had 28ms vs 14ms - that has nothing to do with the lookup.. Which is my point.. No matter how many hops you took to get there.. But yeah in the big picture fewer hops normally means faster..
You shouldn't be worried about the few ms you might save in looking up www.xyz.tld - because it has it cached already. You should be worried that you got the correct info from the actual NS authoritative for said records. And that you asked him from where you're actually at, so he will be sure to send you the proper IP for your location and or ISP, etc..
@johnpoz But like I said, the route is odd as visually it looks like both routes "should" have been the same, because it's bouncing around different routers to get to the same ones used in the quicker trace.
Granted, it's likely this would not always be the case as Geo could "theoretically" make a difference, but it's unlikely due to how UK ISPs almost always only hit the Internet in London, regardless of where you are geographically located. They just don't bother with the cost of taking the quickest route from your location to their network, and all the major peering and CDNs are in London anyway.
I have a reasonable amount of experience looking into this as my old ISP was in my city and DID have their PoP within the city, using their own network. But even ISPs that did that before have fallen back onto leasing the telco virtual backhaul which, again, aggregates everyone in London. It's a bit of a drag as I had a single-digit route to the Internet, but it is what it is.