Unbound not advertising logincdn.msauth.net correctly to clients

General Platypus

Hi guys, so I've been noticing an issue with ms authentication over the past few days. At first I thought it might have to do with 2.5.0, so I downgraded to 2.4.5-p1 ... factory reset, default config, no custom packages... but I'm still seeing the issue.

[Screenshot: console]

It might reply once, but any subsequent queries all fail.

The issue persists across multiple OSes (Windows, macOS, and Android), leading me to suspect that it is an Unbound-related issue.

The odd part is, DNS Lookup on pfSense itself works perfectly fine every single time:

[Screenshot: pfSense response]

Which itself may or may not be correct. This is what I'm pulling via dnslookup.online:

[Screenshot: dnslookup results]

Unfortunately, this is beyond my level of expertise. I've switched DNS providers, but I'm seeing the same issue with 8.8.8.8 as well. That leads me to suspect the issue lies in how Unbound processes the request and passes the answer on to a client.

I should also mention that I'm consistently seeing this issue with logincdn.msauth.net only. Other domains, like msn.com, respond 100% of the time.

Is anyone else able to replicate this problem?

Gertjan

@general-platypus said in Unbound not advertising logincdn.msauth.net correctly to clients:

Is anyone else able to replicate this problem?

Nope.

C:\Users\Gauche>nslookup logincdn.msauth.net
Server:  pfsense.local.net
Address:  2001:470:dead:beef:2::1

Non-authoritative answer:
Name:    cs1227.wpc.alphacdn.net
Address:  192.229.221.185
Aliases:  logincdn.msauth.net
          lgincdn.trafficmanager.net
          lgincdnvzeuno.azureedge.net
          lgincdnvzeuno.ec.azureedge.net

The ping test itself is rather useless.
There is no law that says that a host has to reply to ping.

Your Unbound doesn't do much: it just forwards the requests to 1.1.1.1 and 1.0.0.1.

General Platypus

@gertjan Correct, the server doesn't have to reply to ping, but if you see the error message, it says ping cannot find the host, which is a lookup issue.

nslookup works the first time, but fails on any subsequent lookups. I understand that Unbound forwards requests to 1.1.1.1, but in this case, while pfSense itself is able to nslookup the domain 100% of the time, clients are not able to get it from Unbound, as indicated by "Server failed" error - an unusual error message from the DNS server which usually means something went horribly wrong.

Considering this particular dns entry has a boat-load of cnames, I suspect the issue lies there. The Unbound pipeline to the client is choking and failing. Hopefully it's not something silly like a 256 byte buffer overflow?? The combined CNAME character count just happens to be 284.

I'm genuinely surprised more people aren't seeing this, unless I am the only pfSense user out there who happens to use Outlook and Firefox. Chrome masks the issue because cache survives browser restarts.

Here is a "hack" that's currently working for me:

[Screenshot]

Now that we've stripped out the cnames, nslookup logincdn.msauth.net works 100% of the time for clients.

That being said, I believe this should be properly investigated.

bmeeks

You may be seeing this issue in unbound that someone else here recently posted a link to: https://github.com/NLnetLabs/unbound/issues/132 (or perhaps an artifact of it).

General Platypus

I'm seeing this in the log:

[Screenshot]

Particularly:

[712:0] debug: return error response SERVFAIL
[712:0] debug: request has exceeded the maximum number of query restarts with 9

When I perform a lookup on this domain.

@bmeeks said in Unbound not advertising logincdn.msauth.net correctly to clients:

You may be seeing this issue in unbound that someone else here recently posted a link to: https://github.com/NLnetLabs/unbound/issues/132 (or perhaps an artifact of it).

I think you're onto something. Looks like it is indeed being caused by the cname chasing. Might be related to this: https://www.mail-archive.com/debian-bugs-dist@lists.debian.org/msg1608638.html
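For anyone trying to picture what "exceeded the maximum number of query restarts" means, here is a toy model of CNAME chasing with a restart cap. It is only a sketch and not Unbound's actual code: the `resolve` and `make_chain` helpers, the `records` layout, and the cap of 8 are all assumptions made for illustration.

```python
# Toy model of CNAME chasing with a restart cap; not Unbound's actual code.
MAX_RESTARTS = 8  # assumed cap, chosen for illustration only


def resolve(name, records, max_restarts=MAX_RESTARTS):
    """Follow CNAMEs from `name` until an A record is found.

    `records` maps a name to ("CNAME", target) or ("A", [addresses]).
    Returns ("NOERROR", addresses) or ("SERVFAIL", []).
    """
    restarts = 0
    current = name
    while True:
        rtype, data = records.get(current, (None, None))
        if rtype == "A":
            return "NOERROR", data
        if rtype != "CNAME":
            return "SERVFAIL", []  # simplification: a dead end counts as a failure
        restarts += 1  # each CNAME hop restarts the query at the new target
        if restarts > max_restarts:
            return "SERVFAIL", []
        current = data


def make_chain(length):
    """Build a synthetic chain of `length` CNAMEs ending in one A record."""
    records = {
        f"hop{i}.example.": ("CNAME", f"hop{i + 1}.example.")
        for i in range(length)
    }
    records[f"hop{length}.example."] = ("A", ["192.0.2.1"])
    return records


if __name__ == "__main__":
    for length in (4, 8, 9):
        rcode, addrs = resolve("hop0.example.", make_chain(length))
        print(length, rcode, addrs)
```

In this model a chain of 8 CNAMEs still resolves, while a 9th hop tips it over into SERVFAIL, which is the kind of behaviour the log lines above suggest.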

follysuperscript

I'm having a hard time getting my setup to produce a log entry as useful as @General-Platypus's, but I believe I've got the same issue.

It started a couple of days ago.
nslookup logincdn.msauth.net results in "can't find logincdn.msauth.net: Server failed" at the client. Using the "Diagnostics > DNS Lookup" tool, it will resolve, though "Diagnostics > Ping" also fails.

I've tried doing a domain override to 8.8.8.8 but I can't get that to work either.

Any ideas on how to implement an Unbound-based fix? Even if it's a temporary hack. I just want to log in to Outlook again.

Gertjan

@follysuperscript said in Unbound not advertising logincdn.msauth.net correctly to clients:

but I believe I've got the same issue.

Then stop forwarding to 1.1.1.1 (or 8.8.8.8, or some other forwarder).
Just use Unbound as it was meant to be used: as a resolver.
And it works well:

C:\Users\Gauche>nslookup logincdn.msauth.net
Server:  pfsense.local.net
Address:  2001:470:beef:5c0:2::1

Non-authoritative answer:
Name:    cs1227.wpc.alphacdn.net
Address:  192.229.221.185
Aliases:  logincdn.msauth.net
          lgincdn.trafficmanager.net
          lgincdnvzeuno.azureedge.net
          lgincdnvzeuno.ec.azureedge.net

Remember: if the forwarder says "dunno", then Unbound can't make it better.

lukasz.s

Hi guys

I have encountered the same problem as you.
The DNS Resolver, used as a resolver and not as a forwarder, has a problem advertising logincdn.msauth.net to clients, with the result that clients can't open the login.live.com page correctly.

If I ask

dig @1.1.1.1 logincdn.msauth.net

; QUESTION SECTION:
;logincdn.msauth.net.		IN	A
;; ANSWER SECTION:
logincdn.msauth.net.	285	IN	CNAME	lgincdn.trafficmanager.net.
lgincdn.trafficmanager.net. 15	IN	CNAME	lgincdnvzeuno.azureedge.net.
lgincdnvzeuno.azureedge.net. 1785 IN	CNAME	lgincdnvzeuno.ec.azureedge.net.
lgincdnvzeuno.ec.azureedge.net.	3585 IN	CNAME	cs1227.wpc.alphacdn.net.
cs1227.wpc.alphacdn.net. 3585	IN	A	192.229.221.185

;; Query time: 53 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Thu Jun 09 14:44:29 CEST 2022
;; MSG SIZE  rcvd: 204

I get the correct answer, but when I ask

dig @my_pfsense_dns_resolver_ip logincdn.msauth.net

;; QUESTION SECTION:
;logincdn.msauth.net.		IN	A

;; Query time: 323 msec
;; SERVER: 192.168.0.10#53(192.168.0.10)
;; WHEN: Thu Jun 09 14:44:56 CEST 2022
;; MSG SIZE  rcvd: 48

I get an empty response.
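A 48-byte reply is consistent with a response that carries only the header, the question, and an EDNS OPT record (12 + 25 + 11 bytes) with no answers, so the interesting part is the response code in the header. Below is a minimal sketch, using only the Python standard library, of how one could read that code from each server. The helper names (`build_query`, `summarize`, `ask`) are made up for this example, and the query is sent without EDNS, unlike dig.

```python
import socket
import struct

# A few common response codes from the DNS header's RCODE field.
RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query (no EDNS) for `name`; qtype 1 is A."""
    # Flags 0x0100 sets only the "recursion desired" bit; one question follows.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class 1 is IN


def summarize(response):
    """Return (rcode name, answer count) from a raw DNS response."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    rcode = flags & 0x000F  # low four bits of the flags word
    return RCODES.get(rcode, str(rcode)), ancount


def ask(server, name, timeout=3.0):
    """Send one UDP query to `server` on port 53 and summarize the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        response, _ = sock.recvfrom(4096)
    return summarize(response)
```

Calling `ask("1.1.1.1", "logincdn.msauth.net")` and `ask("192.168.0.10", "logincdn.msauth.net")` side by side would show whether the local resolver is returning SERVFAIL or a NOERROR with zero answers.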

It's strange, because other queries work OK.
I have double-checked blockers, IDS, firewall and so on...

Is this a known bug or something?

pfSense version 22.01-RELEASE (amd64)
Netgate 7100

Regards

johnpoz

@lukasz-s said in Unbound not advertising logincdn.msauth.net correctly to clients:

logincdn.msauth.net

wow 8 freaking cnames - who and the F does their dns???

;logincdn.msauth.net.           IN      A

;; ANSWER SECTION:
logincdn.msauth.net.    3600    IN      CNAME   lgincdn.trafficmanager.net.
lgincdn.trafficmanager.net. 3600 IN     CNAME   lgincdnmsftuswe2.azureedge.net.
lgincdnmsftuswe2.azureedge.net. 3600 IN CNAME   lgincdnmsftuswe2.afd.azureedge.net.
lgincdnmsftuswe2.afd.azureedge.net. 3600 IN CNAME firstparty-azurefd-prod.trafficmanager.net.
firstparty-azurefd-prod.trafficmanager.net. 3600 IN CNAME dual.part-0023.t-0009.t-msedge.net.
dual.part-0023.t-0009.t-msedge.net. 3600 IN CNAME global-entry-afdthirdparty-fallback.trafficmanager.net.
global-entry-afdthirdparty-fallback.trafficmanager.net. 3600 IN CNAME dual.part-0023.t-0009.fbs1-t-msedge.net.
dual.part-0023.t-0009.fbs1-t-msedge.net. 3600 IN CNAME part-0023.t-0009.fbs1-t-msedge.net.
part-0023.t-0009.fbs1-t-msedge.net. 3600 IN A   13.107.219.51
part-0023.t-0009.fbs1-t-msedge.net. 3600 IN A   13.107.227.51

;; Query time: 390 msec

I haven't chased the whole chain, but the first CNAME in the chain has a 5 min TTL as well - wow, that is going to cause some unnecessary queries, that is for sure.. I have unbound set to a min TTL of 3600 (1 hour).. because so many places are using unrealistic TTL values that are so freaking low.

logincdn.msauth.net.    300     IN      CNAME   lgincdn.trafficmanager.net.

2nd cname - 30 seconds - jfc people! no wonder it's problematic!

lgincdn.trafficmanager.net. 30  IN      CNAME   lgincdnmsftuswe2.azureedge.net.
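To put rough numbers on that, here is a small sketch of how a minimum-TTL clamp changes how long a cached CNAME chain stays fresh. It is a simplified model and not how Unbound is implemented: the function names are invented for this example, and the 86400 second ceiling is an assumed value.

```python
def clamp_ttl(ttl, min_ttl=0, max_ttl=86400):
    """Clamp a record TTL into [min_ttl, max_ttl], as a caching resolver might."""
    return max(min_ttl, min(ttl, max_ttl))


def chain_lifetime(ttls, min_ttl=0):
    """A cached chain is only fully fresh until its shortest TTL runs out."""
    return min(clamp_ttl(t, min_ttl) for t in ttls)


def refreshes_per_day(lifetime_seconds):
    """Upper bound on upstream refreshes per day if clients ask constantly."""
    return 86400 // lifetime_seconds


# TTLs of the first two CNAMEs quoted above: 300 s and 30 s.
observed = [300, 30]

if __name__ == "__main__":
    for min_ttl in (0, 3600):
        lifetime = chain_lifetime(observed, min_ttl)
        print(min_ttl, lifetime, refreshes_per_day(lifetime))
```

With no minimum, the 30 second record limits the whole chain and allows up to 2880 refreshes a day in this model; with a 3600 second minimum that drops to 24.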

lukasz.s

@johnpoz so do you suggest setting "Minimum TTL for RRsets and Messages" to more than the default of 0?

".... I have unbound set to min ttl of 3600 (1 hour).."

btw. today this domain resolves ok

johnpoz

@lukasz-s here is the thing - back in the day, you really should never have messed with changing something's TTL.. But that was back when they used realistic TTLs, and the only time you would lower them to something very short was when you were getting ready for a change..

You would lower the ttl the closer you got to the change, you would then change the IP of the record. After you were sure everything was working, and new IP was good you would then raise the ttl back up to something normal.

These days they love to set them to shit like 30 freaking seconds.. Or 5 minutes - why? They like to drive up the number of queries and do something with tracking, if you ask me..

I set my min to 1 hour, and I also serve 0.. Have not run into anything there that has caused me any issues in accessing anything..

In a sane world no I wouldn't suggest messing with the ttl - but these places are insane - 30 second freaking ttl..