Slowness of overcomplicated DNS setup
-
Hi,
I'm experimenting with a complex DNS setup. I have two sites - both of which use pfsense as a firewall.
Site A and Site B are connected using an OpenVPN pre-shared key site-to-site connection, and both LAN networks are routed to each other. Everything works fine here.
Site B has a server with yet another pfsense inside the LAN. This pfsense (let's call it 'LABSERVER') has numerous VLANs, all of which are routed via static routes from outside and are reachable with ~50ms pings from Site B (local site) as well as Site A (through the VPN).
Each of the VLANs contains a single project with numerous VMs (which need to be isolated but accessible from Site A). Each VLAN's DHCP server has a custom domain configured: project1.lab for the VLAN with project1, project2.lab for the VLAN with project2, and so on. Unbound on LABSERVER has registration of both static mappings and DHCP leases in DNS enabled in its options. I configured the DNS resolver on LABSERVER so that I can query it from Site A/Site B and it returns proper records for my hosts inside the VLANs. So far so good.
Now I wanted Site A's unbound to be able to resolve those names without manually adding each project. That is why I added the 'lab' domain to 'domain overrides', pointing to the LABSERVER DNS IP. This actually worked flawlessly - all hosts resolve properly.
Both resolvers are configured in forwarding mode.
I solved the issue I'm about to describe, but I want to know the reason it happened. The issue is: DNS queries for *.lab in Site A are extremely slow. Like 2-4s slow, measured with dig. Invoking dig @LABSERVER from Site A is blazingly fast. This causes issues with RDP, which encounters frequent freezes. Also, DNS Resolver stats in Site A show very high RTO for LABSERVER DNS (2000-30000).
I fixed the issue by selecting 'Serve Expired' in Site A. Now Site A pfsense seems to cache the records and, apart from the first query for a given name, which is really slow, all is fine.
I initially thought maybe the TTL of names served by LABSERVER is 0 or very low, but dig (from my laptop in Site A) shows:
dig a box.proj1.lab
;; ANSWER SECTION:
box.proj1.lab 2880 IN A 10.40.50.5
Apparently the TTL is 48 minutes, so that shouldn't be the cause.
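To double-check that arithmetic, here is a minimal Python sketch that parses the answer line above and converts the TTL from seconds to minutes. The names `Answer`, `parse_answer_line` and `ttl_minutes` are made up for illustration, and the parser assumes the usual "name ttl class type value" layout of a dig answer line.

```python
from typing import NamedTuple


class Answer(NamedTuple):
    name: str
    ttl_seconds: int
    rtype: str
    value: str


def parse_answer_line(line: str) -> Answer:
    """Parse one 'name ttl class type value' line from dig's ANSWER SECTION."""
    name, ttl, _rclass, rtype, value = line.split(None, 4)
    return Answer(name, int(ttl), rtype, value.strip())


def ttl_minutes(answer: Answer) -> float:
    """Convert the TTL from seconds to minutes."""
    return answer.ttl_seconds / 60


record = parse_answer_line("box.proj1.lab 2880 IN A 10.40.50.5")
print(f"{record.name}: TTL {record.ttl_seconds}s = {ttl_minutes(record):g} minutes")
```

One caveat when reading such output: a caching resolver returns the remaining TTL, so repeating the same dig and watching the number count down is a quick way to tell whether an answer was served from cache.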
I'm positive that DNS slowness is the culprit, since all issues disappear when I use IP addresses.
My working hypothesis is: perhaps allowing it to add DHCP leases to LABSERVER dns invalidates the cache constantly? The problem with that theory is that right now all active hosts have static leases.
I'd be grateful for some pointers on how to debug this further :)
-
@beefer said in Slowness of overcomplicated DNS setup:
add DHCP leases to LABSERVER dns invalidates the cache constantly?
yes restarting unbound would clear the cache.
-
@beefer said in Slowness of overcomplicated DNS setup:
RTO for LABSERVER DNS (2000-30000).
You have something wrong with your connection, there should be no reason for such an issue unless you have a horrible connection between the sites.
-
@johnpoz yeah, figured the same, but connection is pretty stable - I can stream youtube via RDP from one of remote machines without issues etc. Maybe I should look for a particular connection parameter issue? Like latency? RTT?
-
@beefer maybe you're filling up the connection - and that is what then kills DNS?
-
@johnpoz so I ran a test to see how much I can saturate my connection. I started apache2 on one of the LABSERVER hosts and began downloading a 200mb file to a laptop in Site A - I got a stable 1mb/s. Meanwhile, the primary gateway for the VPN did not report any RTT, packet loss or latency issues.
But! Since I mentioned it :) I also have a multi WAN setup on Site A. VPN is using a gateway group with a fallback (different tiers) and primary connection is pretty stable. LAN network is also using same gateway group - only selected machines from other interface have policy routing pointing to a round robin GW group.
Using traceroute, I tested the route from a Site A laptop to LABSERVER, and it went directly via the VPN. Perhaps I should look at the way the packets come back from Site B? Any tips on how to troubleshoot this?
-
@beefer I made a test. When I run traceroute from Site A pfsense with 'source address' set to any + icmp I get a flawless route. But when I select LAN as 'source address' route shows only the first hop for VPN network and then gets stuck. This isn't normal, right?
-
@beefer dude, I have no idea what sort of nonsense setup you have. You have given no details. What I can tell you is that a timeout for a simple DNS query is not normal. Even if you were on the other side of the planet, you should see 300ms tops.
So you seeing that sort of problem with a simple query points to an ISSUE!!
If I had a site-to-site setup, then IPs on either site talking to each other would go over the site-to-site connection. You have a problem if you cannot do simple DNS queries over that, with a response time of roughly whatever the RTT is between your sites.
for VPN network and then gets stuck.
So you're trying to route your s2s through some other VPN connection - yeah, no wonder you're having issues..
-
@johnpoz said in Slowness of overcomplicated DNS setup:
So you're trying to route your s2s through some other VPN connection - yeah, no wonder you're having issues..
No - I have a single site-to-site OpenVPN.
LAN Site A: 10.30.0.0/24
VPN: 192.168.60.0 (I don't know the mask, but there are only two hosts there)
LAN Site B: 10.32.0.0/24
LABSERVER: 10.40.x.x
When I traceroute from the 'any' interface I get hops from all sites. When I select the 'LAN' interface on pfsense in Site A and do the traceroute, I get only 192.168.60.1 (the other VPN side) and then stars for further hops, so it's not routed by... hmmm, Site B firewall? Hmmmmmmmmmmm
-
@beefer So same thing happens to ping - when I use LAN from site A ping never comes back from LABSERVER. I did record the traffic in Site B and I can see that it reached the LABSERVER and it responded, but somehow it was not passed back to LAN in site A. I'm thinking maybe unbound is trying to run those queries from a LAN interface? It gets a timeout and then tries another interface? LAN is not selected as outbound interface in DNS resolver though.
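The ping result above suggests that replies to LAN-sourced packets never make it back to Site A. One way to check whether DNS behaves the same is to time a lookup from each candidate source address, similar in spirit to `dig -b`. Below is a minimal sketch, assuming plain UDP DNS on port 53; `build_query` and `timed_query` are hypothetical helper names, not part of pfSense or unbound.

```python
import socket
import struct
import time
from typing import Optional


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def timed_query(server: str, name: str, source_ip: Optional[str] = None,
                timeout: float = 3.0) -> Optional[float]:
    """Return the reply time in seconds, or None if no matching reply arrived."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            if source_ip is not None:
                # Force the source address; the OS picks the source port.
                sock.bind((source_ip, 0))
            start = time.monotonic()
            sock.sendto(query, (server, 53))
            reply, _addr = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as bind/send errors.
            return None
        if reply[:2] != query[:2]:
            return None  # reply ID does not match our query
        return time.monotonic() - start
```

Run on the Site A firewall, calling `timed_query(server, "box.proj1.lab", source_ip=...)` once per local address, with `server` set to the LABSERVER DNS IP, should show which source addresses get an answer and which ones return `None`. The addresses are left as parameters because the thread does not give them.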
-
@beefer so I think I somewhat solved the issue. My Site A DNS Resolver was configured with only selected interfaces as 'Outgoing network interfaces'. When I changed it back to 'all', all of a sudden all queries are blazingly fast - even the RTOs look fine.
The only thing I don't understand is why it helped. First - I'm in forwarder mode for unbound - shouldn't this setting affect only root dns queries? Also why it was slow is still a mystery to me - perhaps it was doing round robin over those interfaces and got stuck on waiting for answers?
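The "got stuck waiting for answers" guess can at least be sanity-checked with a toy model. As far as I understand the unbound.conf documentation for `outgoing-interface`, when several outgoing interfaces are configured each query goes out via a randomly chosen one. Nothing in this thread confirms that this is what happened here, so the sketch below rests on assumptions: some of the selected interfaces have no working return path, every failed attempt costs a fixed retry timeout, and retries continue until a working interface is picked. The function name and the 2 second timeout are invented for illustration; the 50ms RTT is the ping time mentioned in the first post.

```python
def expected_query_time(total_ifaces: int, broken_ifaces: int,
                        rtt: float, retry_timeout: float) -> float:
    """Expected seconds until an answer arrives under the toy model."""
    if not 0 <= broken_ifaces < total_ifaces:
        raise ValueError("need at least one working interface")
    # Probability that a single attempt picks an interface with no return path.
    p_fail = broken_ifaces / total_ifaces
    # Mean number of failed attempts before the first success (geometric).
    expected_failures = p_fail / (1 - p_fail)
    return rtt + retry_timeout * expected_failures


# Hypothetical numbers: 4 selected interfaces, 2 of them without a return path.
slow = expected_query_time(4, 2, rtt=0.05, retry_timeout=2.0)
fast = expected_query_time(4, 0, rtt=0.05, retry_timeout=2.0)
print(f"with broken interfaces: {slow:.2f}s, all working: {fast:.2f}s")
```

With those made-up numbers the average lands in the same range as the 2-4s measured with dig, so the hypothesis is at least plausible, but it is a model, not a diagnosis.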