Slowness of overcomplicated DNS setup

johnpoz

@beefer said in Slowness of overcomplicated DNS setup:

add DHCP leases to LABSERVER dns invalidates the cache constantly?

Yes, restarting unbound clears the cache.

johnpoz

@beefer said in Slowness of overcomplicated DNS setup:

RTO for LABSERVER DNS (2000-30000).

Something is wrong with your connection. There should be no reason for such an issue unless you have a horrible connection between the sites.

beefer

@johnpoz yeah, figured the same, but the connection is pretty stable - I can stream YouTube via RDP from one of the remote machines without issues, etc. Maybe I should look for an issue with a particular connection parameter? Like latency? RTT?

johnpoz

@beefer maybe you're filling up the connection - and that is what then kills DNS?

beefer

@johnpoz so I ran a test to see how much I can saturate my connection. I started apache2 on one of the LABSERVER hosts and downloaded a 200mb file to a laptop in Site A - I got a stable 1mb/s - and meanwhile the primary gateway for the VPN did not report any RTT, packet loss or latency issues.

But! Since I mentioned it :) I also have a multi-WAN setup on Site A. The VPN uses a gateway group with a fallback (different tiers), and the primary connection is pretty stable. The LAN network uses the same gateway group - only selected machines on another interface have policy routing pointing to a round-robin GW group.

I used traceroute to check the route from a Site A laptop to LABSERVER, and it went directly via the VPN. Perhaps I should look at the way the packets come back from Site B? Any tips on how to troubleshoot this?

beefer

@beefer I ran a test. When I run traceroute from the Site A pfSense with 'source address' set to any + ICMP, I get a flawless route. But when I select LAN as the 'source address', the route shows only the first hop for the VPN network and then gets stuck. This isn't normal, right?

johnpoz

@beefer dude, I have no idea what sort of nonsense setup you have. You have given no details. What I can tell you is that a timeout for a simple DNS query is not normal. Even if you were on the other side of the planet, you should see 300ms tops.

So seeing that sort of problem with a simple query points to an ISSUE!!

If I had a site-to-site setup, then IPs on either site talking to each other would go over the site-to-site connection. You have a problem if you cannot do simple DNS queries over that, with a response time of roughly whatever the RTT between your sites is.
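One way to put a number on "a simple DNS query" is to time one directly against the remote server. The following is a minimal sketch using only the Python standard library; the function names `build_query` and `time_query` are made up for this illustration, and the optional `source_ip` argument is an assumption added so the same query can be repeated from different local addresses. It does not validate the reply, it only measures how long any answer takes to arrive.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(server, name, source_ip=None, timeout=3.0):
    """Return the round-trip time in ms, or None on timeout/socket error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        if source_ip is not None:
            # Force the query to leave from a specific local address.
            sock.bind((source_ip, 0))
        start = time.monotonic()
        try:
            sock.sendto(build_query(name), (server, 53))
            sock.recvfrom(512)  # any reply counts; it is not parsed here
        except OSError:  # socket.timeout is a subclass of OSError
            return None
        return (time.monotonic() - start) * 1000.0
```

Calling `time_query` with the LABSERVER DNS address as `server`, once without `source_ip` and once with the firewall's LAN address as `source_ip`, should show whether the delay depends on the source address. A healthy link should return a value close to the RTT between the sites.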

for VPN network and then gets stuck.

So you're trying to route your s2s through some other VPN connection - yeah, no wonder you're having issues..

beefer

@johnpoz said in Slowness of overcomplicated DNS setup:

So you're trying to route your s2s through some other VPN connection - yeah, no wonder you're having issues..

No - I have a single site-to-site OpenVPN.
LAN Site A: 10.30.0.0/24
VPN: 192.168.60.0 (don't know the mask, but only two hosts there)
LAN Site B: 10.32.0.0/24
LABSERVER: 10.40.x.x

When I traceroute from the 'any' interface I get hops from all sites. When I select the 'LAN' interface on the Site A pfSense and do the traceroute, I get only 192.168.60.1 (the other VPN side) and then stars for further hops, so it's not routed by... hmmm, the Site B firewall? Hmmmmmmmmmmm

beefer

@beefer So the same thing happens with ping - when I use LAN as the source from Site A, the ping never comes back from LABSERVER. I recorded the traffic in Site B and I can see that it reached LABSERVER and that LABSERVER responded, but somehow the reply was not passed back to the LAN in Site A. I'm thinking maybe unbound is trying to run those queries from the LAN interface? It gets a timeout and then tries another interface? LAN is not selected as an outgoing interface in the DNS Resolver though.
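When a request arrives but the reply never makes it back, one thing worth checking is which route the far side would pick for the reply's destination, i.e. the original source address. The sketch below does a longest-prefix match over a routing table. The table `SITE_B_ROUTES` is entirely hypothetical (the real Site B routes were never posted), the /24 mask on the VPN network is an assumption, and `pick_route` is a name invented for this example.

```python
import ipaddress


def pick_route(routes, destination):
    """Longest-prefix match: return (network, next_hop) for destination, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(net), hop)
        for net, hop in routes
        if dest in ipaddress.ip_network(net)
    ]
    if not matches:
        return None
    # The most specific (longest) prefix wins.
    return max(matches, key=lambda match: match[0].prefixlen)


# Hypothetical Site B routing table, NOT taken from the thread.
SITE_B_ROUTES = [
    ("0.0.0.0/0", "wan"),        # default route
    ("10.32.0.0/24", "lan"),     # Site B LAN
    ("192.168.60.0/24", "vpn"),  # tunnel network, mask assumed
]

# A reply to a tunnel address goes back over the VPN...
print(pick_route(SITE_B_ROUTES, "192.168.60.2"))
# ...but a reply to a Site A LAN address falls through to the default route.
print(pick_route(SITE_B_ROUTES, "10.30.0.10"))
```

In this made-up table, replies to 10.30.0.0/24 leave via the default route because nothing more specific exists, which would match the symptom of traffic reaching LABSERVER but never returning. Whether that is what happened here cannot be told from the thread.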

beefer

@beefer so I think I somewhat solved the issue. My Site A DNS Resolver was configured with selected interfaces as 'Outgoing network interfaces'. When I changed it back to 'all', all of a sudden all queries are blazingly fast - even the RTOs.

The only thing I don't understand is why it helped. First - I'm in forwarder mode for unbound - shouldn't this setting affect only root DNS queries? Also, why it was slow is still a mystery to me - perhaps it was doing round robin over those interfaces and got stuck waiting for answers?
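The guess about getting stuck waiting for answers can be put into rough numbers with a toy model. Assume (and this is only an assumption, not a description of unbound's actual retry logic) that each attempt picks one of the selected outgoing interfaces at random, that an attempt from a "broken" interface costs one full timeout, and that an attempt from a "working" one costs one RTT. The function name and all the example figures below are made up for illustration.

```python
def expected_latency_ms(n_working, n_broken, rtt_ms, timeout_ms):
    """Expected time to the first answer under the toy random-retry model.

    Each attempt succeeds with probability n_working / (n_working + n_broken),
    so the expected number of failed attempts before the first success is
    n_broken / n_working (mean of a geometric distribution).
    """
    if n_working <= 0:
        raise ValueError("at least one working interface is required")
    expected_failures = n_broken / n_working
    return rtt_ms + timeout_ms * expected_failures


# One usable interface out of three, 50 ms RTT, 2000 ms per timeout.
print(expected_latency_ms(1, 2, 50, 2000))  # 4050.0
# All interfaces usable: just the RTT.
print(expected_latency_ms(3, 0, 50, 2000))  # 50.0
```

Under these assumptions, even one or two outgoing interfaces without a working return path would push the average lookup time into the seconds, which is at least consistent with the slowness going away once the interface selection was changed.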