Unbound: DNS request timed out for two requests, then returns Non-authoritative answer

Paint

@gertjan 20H2, 1903, and one other version in between. They all have the same issue.

Paint

@johnpoz I'll see if removing the LAGG fixes the issue. Thank you!

Yes, I ran the captures at two different times - I originally configured the capture from my pfSense machine wrong.

johnpoz

Well, it worked to show the problem at least. But yeah, when troubleshooting stuff like this it is best to do the sniffs at the same time, so that if intermittent packet loss is the problem you can see specifically what happened to a specific packet. In a normal TCP conversation you could use the seq/ack numbers to track which are which.

But with UDP, the source port (different for each query) and the transaction ID can help line up which queries and responses go with each other.
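
As a rough sketch of that matching, assuming you have already pulled the UDP payload and the client's IP and port out of each captured packet: the transaction ID is the first 16 bits of the DNS header, and the response code is the low 4 bits of the flags word that follows it. The function names and sample values below are made up for illustration and are not from the captures in this thread.

```python
import struct
from typing import NamedTuple


class DnsHeader(NamedTuple):
    txid: int          # 16-bit transaction ID chosen by the client
    is_response: bool  # QR bit: False for a query, True for a response
    rcode: int         # 0 = NOERROR, 3 = NXDOMAIN


def parse_dns_header(payload: bytes) -> DnsHeader:
    """Read the transaction ID and flags from the start of a DNS message."""
    if len(payload) < 12:
        raise ValueError("a DNS header is 12 bytes; payload is too short")
    txid, flags = struct.unpack("!HH", payload[:4])
    return DnsHeader(txid, bool(flags & 0x8000), flags & 0x000F)


def match_key(client_ip: str, client_port: int, payload: bytes) -> tuple:
    """Key that is identical for a query and the response that answers it."""
    return (client_ip, client_port, parse_dns_header(payload).txid)


# Hypothetical header-only payloads (section counts only, no question body).
query = struct.pack("!HHHHHH", 0x1A2B, 0x0100, 1, 0, 0, 0)     # RD set
nx_reply = struct.pack("!HHHHHH", 0x1A2B, 0x8183, 1, 0, 0, 0)  # QR, RD, RA, rcode 3

same = match_key("192.168.1.50", 53211, query) == match_key("192.168.1.50", 53211, nx_reply)
print(same, parse_dns_header(nx_reply))
```

For a query the client port is the source port, and for a response it is the destination port, so the same key works in both directions.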

It's still odd to me why NX is only being sent once, while normal responses are being sent twice. Maybe that has something to do with the LAGG? Very strange. I do not recall ever seeing such a thing before in troubleshooting DNS. No response, sure; lost traffic, sure. But in 20-some years of troubleshooting networking, DNS, etc., I do not recall seeing dupes like that.

The closest thing that comes to mind is a bug we had on a Cisco switch that was dropping DNS inside a VLAN. The bug turned out to trigger if there was no SVI set for that VLAN. When you sniffed the VLAN on the switch you could see packets being dropped. You should always see 2 copies of the packet: one as it enters the switch and one when it leaves the switch. With the bug, sometimes you would see the packet enter the switch but not leave it.

That one took a bit to track down ;) There were multiple switches in the path, and we could see the packets leaving the source and being returned by the server. But the client was not getting the response, same as you're seeing. So we had to follow the path of the traffic through multiple switches in the datacenter. Some switches did not support sniffing right on the switch, so we had to set up SPAN ports with a laptop where the packets were being dropped. Once we figured out where the packets were being lost, it was simple enough to track down the actual bug report. Adding an SVI to the VLAN on that switch, even though it was just doing layer 2, was a workaround until they fixed the bug in a firmware update on the switch.

Paint

Thank you, @johnpoz, for your help thus far!

I'm not using any VLANs or tagging on my Brocade ICX6450 switch or in my LAN.

I'll investigate whether I have any settings wrong on the managed switch and then remove the LAGG.

johnpoz

No, I didn't mean to suggest it was the same sort of bug. That was just the closest thing I could remember to such an issue in like 30 years of doing this sort of thing ;)

It's similar in that we see the server sending the response but the client not getting it, so the packet is being lost somewhere.

You'll also notice in your sniffs that 2 packets are sent for the responses you do get, but your client is only seeing 1 of them.
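
With sniffs taken at the same time on both ends, that comparison could be done mechanically. Here is a minimal sketch, assuming each capture has already been reduced to a list of (client port, transaction ID) pairs for the responses it contains. The function name and the sample data are hypothetical, made up to mirror the pattern described here.

```python
from collections import Counter


def compare_captures(server_side, client_side):
    """Return {key: (copies at server, copies at client)} where the counts differ."""
    sent = Counter(server_side)
    seen = Counter(client_side)
    return {key: (sent[key], seen[key]) for key in sent if sent[key] != seen[key]}


# Hypothetical responses, keyed by (client port, transaction ID).
server_capture = [(53211, 0x0002), (53211, 0x0002), (53212, 0x0003)]
client_capture = [(53211, 0x0002)]

for key, (at_server, at_client) in compare_captures(server_capture, client_capture).items():
    print(key, "server saw", at_server, "client saw", at_client)
```

A key that shows 2 at the server and 1 at the client is a response that was duplicated with one copy lost; a key that shows 1 and 0 is a response that never arrived.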

I have to say it has to be related to your LAGG. But I do not recall ever seeing a server send 2 responses.

Here

This is the initial PTR the client does for the name of the NS. You sent 2 of those, but only 1 was seen by your client. It's harder to know for sure which one you got, because your sniffs were not done at the same time.

But 2 responses were put on the wire, and your client should have seen both of those. They were sent 0.4 ms apart.

The odd thing for me is why only some responses are being sent twice. The NX responses are only sent once, and those you do not get. Strange for sure.

edit:
It would be interesting to see if the 2nd packet is the one you get and the first is always lost, sort of thing. That would explain why you're not getting the NX, which is only sent once.

But I am not spotting any difference in the packets that could explain why they might be filtered vs. the ones sent twice. They look the same: same MACs, same transaction IDs, same ports. They are just retrans. I have to assume you're getting the retrans. And I guess it's possible that Unbound itself is not sending it, but that your switch is doing something and resending the packets? That would make more sense really, since I'm not sure why Unbound would send out retrans for normal responses but not NX. And why would the retrans be sent so fast? Maybe your switch is doing it? All the sniff tells us is that they were seen on the wire.

Which is why I guess it has something to do with the LAGG.

It would be interesting to see what happens on your Linux boxes, where you say you're not seeing the problem, when you do a query for something that is NX. And sniff where we see normal responses and NX responses: are the NX only being seen once, while normals are actually seen twice?
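
One way to answer that from a capture, sketched under some assumptions: each response has been reduced to a timestamp, client port, transaction ID and rcode, and two copies count as duplicates when they share the same key and arrive within a short gap. The 5 ms default window is an arbitrary choice (the dupes in this thread were about 0.4 ms apart), and the names and sample records are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Response(NamedTuple):
    time: float       # capture timestamp in seconds
    client_port: int
    txid: int
    rcode: int        # 0 = NOERROR, 3 = NXDOMAIN


def duplicate_summary(responses, max_gap=0.005):
    """Per rcode, count responses that were seen once vs. seen with a fast duplicate."""
    times_by_key = defaultdict(list)
    for r in responses:
        times_by_key[(r.client_port, r.txid, r.rcode)].append(r.time)

    summary = defaultdict(lambda: {"single": 0, "duplicated": 0})
    for (_, _, rcode), times in times_by_key.items():
        times.sort()
        has_dup = any(b - a <= max_gap for a, b in zip(times, times[1:]))
        summary[rcode]["duplicated" if has_dup else "single"] += 1
    return dict(summary)


# Hypothetical records: one normal response seen twice, one NXDOMAIN seen once.
sample = [
    Response(0.0000, 53211, 0x0002, 0),
    Response(0.0004, 53211, 0x0002, 0),
    Response(1.0000, 53212, 0x0003, 3),
]
print(duplicate_summary(sample))
```

If the Linux-side capture shows the same split (rcode 0 duplicated, rcode 3 single), the duplication is not specific to the Windows clients.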

Paint

@johnpoz said in Unbound: DNS request timed out for two requests, then returns Non-authoritative answer:

really, since I'm not sure why Unbound would send out retrans for normal responses but not NX. And why would the retrans be sent so fast? Maybe your switch is doing it? All the sniff tells us is that they were seen on the wire.

I'll do some more simultaneous sniffs and share them with you prior to changing anything.

I don't see any dropped or resent packets on the switch for any of the 48 ports. This is a strange issue for sure. I agree that the next course of action would be removing the LAGG, though it will be fun backing out the LAGG configuration for pfSense and the switch.

johnpoz

There is no wireless involved in these sniffs, right? Clients are all wired?

You mention other switches. Could you lay out the physical connections? Are clients connected directly to the managed switch, or are there some dumb switches involved?

I'd really like to see if Linux clients show the same duplicate packets from the server response, where the Linux clients are all connected to the same switch(es) as the Windows ones.

Possibly something is doing something odd with DNS? But I would assume that would have to be something only a managed switch might do, or wireless.

Paint

No, I have wireless devices on the network. The issue happens on both wired and wireless clients, as long as they are running Windows 10. The previously sent sniffs are from all wired clients, however.

I can map out the network architecture too. Yes, there are also two unmanaged/dumb switches in use.