Automatically fill Unbound DNS cache with top hits list?
-
Is there a way that one could automatically (shell commands via cron?) have Unbound resolve a list of top hits in order to keep its cache full of useful DNS entries?
i.e., keep a list of hits on a network, have a Cron job go resolve those hits so that local DNS is faster for the websites actually used on the network.
Unbound resolution is noticeably slower when it has to resolve a name recursively instead of answering from its cache.
-
Something like
nslookup < /tmp/list_of_urls
or
dig -f /tmp/list_of_urls
These are pretty slow for running through lists; is there a fast way to do a lot of hosts in a short period of time?
EDIT: Is there some way to run a whole bunch of these processes simultaneously in order to process 100+ DNS requests/sec?
Doing it this way, it looks like it averages ~4 lookups a second with dig -f. So if I could get 25+ instances working on looking up the file entries simultaneously, it would be great. I ran dig -f for maybe 20 minutes and my Unbound cache output went from <5k lines to >180k lines, and DNS resolution feels much snappier.
unbound-control -c /var/unbound/unbound.conf dump_cache > /tmp/dnsdump
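Incidentally, dump_cache has a load_cache counterpart in unbound-control, so a dump taken while the cache is warm can be pushed back in later. A crontab sketch (the schedule and dump path are made up, and @reboot depends on your cron supporting that extension):

```
# m h dom mon dow  command
55 3 * * *  unbound-control -c /var/unbound/unbound.conf dump_cache > /var/db/unbound_dump
@reboot     unbound-control -c /var/unbound/unbound.conf load_cache < /var/db/unbound_dump
```

Note that entries whose TTL has expired since the dump won't be served from the reloaded cache.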
-
First of all, why would you want to run all of them at once? As long as each one is re-queried before its TTL expires, that should suffice. Running them all at once will create utilization spikes, potentially impacting network performance during those times.
If I were going to do something like this I'd get the TTL of each and re-query for them at maybe 10% remaining. This would probably reduce the need for them to be fast.
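That re-query-at-10%-remaining idea could be sketched in shell like this. The dig call is stubbed out with a canned answer line so the parsing logic is visible, and the TTL numbers are made up — dig +noall +answer really does print lines shaped like "name TTL class type rdata":

```shell
#!/bin/sh
# Stand-in for: dig +noall +answer forum.pfsense.org A
answer='forum.pfsense.org. 300 IN A 208.123.73.18'

# Second field of the answer line is the remaining TTL in seconds
ttl=$(printf '%s\n' "$answer" | awk '{print $2}')

orig_ttl=3600   # TTL the record was originally served with (assumed here)

# Re-query when 10% or less of the original TTL remains
if [ "$ttl" -le $((orig_ttl / 10)) ]; then
  echo "time to re-query"
else
  echo "still fresh"
fi
```

With the numbers above (300 remaining out of 3600), the sketch prints "time to re-query".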
A possible query method might be to use curl (PHP or shell). There may be some other PHP get host type functions available too. You might look at the get host functions pfSense uses for some ideas.
-
Well, my thought was that if I set it up on cron to run early in the morning, before the network is in use, the biggest surge would come when no one would notice anyway. Subsequent re-queries would only go outside of the network for the entries with shorter TTLs.
The system is also way overkill for what it needs, so I wouldn't think that even pulling in a lot of DNS requests at once would be noticeable.
I tried using host before I tried nslookup or dig, but I couldn't get it to use a file. If you could tell me what I'm messing up there I'd really appreciate it!
Finally, are there any tools that can natively run multiple processes simultaneously? Like can I run multiple instances of dig, nslookup, host, or anything that can pull in DNS requests?
I've no idea how to use cURL or PHP. I don't know if they are simple enough that I could hack something together given a good tutorial, or if that would be a bridge too far.
-
host < /tmp/list_of_url
Just returns the instructions for usage of host
What am I doing wrong there?
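The problem is that host(1) takes a single name as an argument and doesn't read from stdin, which is why redirecting a file into it just prints the usage text. A read loop can feed it one name per line instead (sketch — the sample file contents are made up, and the echo stands in for the real host call):

```shell
#!/bin/sh
# Sample list; in practice this would be the existing /tmp/list_of_urls
printf 'example.com\nexample.org\n' > /tmp/list_of_urls

# Feed host one hostname per line; drop the echo to actually run host
while IFS= read -r name; do
  echo "would run: host $name"
done < /tmp/list_of_urls
```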
-
I tried running multiple instances of dig and nslookup in batch mode and it always fails. The shell just hangs after a while, and when I Ctrl+C it tells me one or more of the instances exited with "Exit 9", while the rest are either "Done" or "+ Done". I don't know what the difference between "Done" and "+ Done" is, what "Exit 9" means, or why it's happening.
The way I'm running multiple instances is splitting the file containing the list of URLs into many smaller files, then running
dig -f /tmp/list00 & dig -f /tmp/list01 & dig -f /tmp/list02
This always fails after running for a bit. However, if I open up multiple shells and run a single instance in each, they all run concurrently until they finish with no problems. I've tried this with 8-10 shells all running dig at the same time:
dig -f /tmp/BIGlist
I've tried this with nslookup as well with the same results, and I still can't get host to work from a file; I don't think it's possible.
It seems like when I run many parallel dig (or nslookup) instances in the background, the first one to finish halts everything else. I don't have any evidence of this; it just seems that way. Any input from anyone would be greatly appreciated!
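For what it's worth, if memory serves, dig's man page documents exit status 9 as "no reply from server", so the "Exit 9" jobs likely hit query timeouts rather than a shell problem. Either way, xargs can fan the list out without hand-splitting files: -P keeps up to N processes running at once and -n 1 hands each one a single name. A sketch (sample list made up; the echo is a stand-in — remove it to really run dig):

```shell
#!/bin/sh
# Sample list; in practice this would be the full /tmp/list_of_urls
printf 'example.com\nexample.org\nexample.net\n' > /tmp/list_of_urls

# Up to 25 concurrent lookups, one hostname per invocation.
# Drop the `echo` to actually execute dig.
xargs -n 1 -P 25 echo dig +short +time=3 +tries=1 < /tmp/list_of_urls
```

Note that with -P the output lines can arrive in any order, since the processes finish whenever they finish.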
-
Doesn't the prefetch checkbox do this for you?
My understanding is that after a cached item has been handed to the requester, if the remaining TTL on the item is less than 10% of its original TTL, then it will automatically be refreshed.
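For reference, that GUI checkbox corresponds to Unbound's prefetch option, which in unbound.conf terms is just:

```
server:
  prefetch: yes
```

Unbound's documentation describes it the same way: cache entries that get hit while less than 10% of their TTL remains are refreshed before they expire.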
But to be honest, if you're having issues with using a resolver because items take too long to resolve.. then maybe you should just use a forwarder.. There should really be no noticeable issues with lookups from clients once your resolver has been running for a while.
That being said, if you're in a part of the world where the NS of most of the sites you visit are on the other side of the planet from you ;) then yeah, you could have some added latency issues with DNS queries. Or if you're on some sort of sat internet connection with HIGH latency..
If you're going to go to the trouble of prefilling your cache.. might as well just let a forwarder do that for you ;)
-
Yeah, indeed - why don't you just let Unbound do its job as suggested above (plus tuning the cache via the GUI if it doesn't fit your needs - TTL/number of hosts)?
-
I'm not sure what the parameters are for prefetch, but I've always had it turned on.
When I pull in a bunch of DNS entries this way, the network is noticeably snappier.
I didn't think tweaking TTLs on my end was a good idea? I thought if I set it too long I could end up causing problems?
I've used the resolver because I like the idea of cutting out the middle man and letting my own machine do the work. Also, I use DNSBL which requires Unbound. I guess I could use Unbound in forwarding mode and it might work fine with DNSBL?
Honestly at this point I'm really just curious to make it work by pulling in requests.
If I can figure out an efficient way of doing it, my network is really snappy and I'm still resolving my own requests. Seems like a win-win to me!
-
What exactly is taking a long time to resolve?? I have not seen any sort of added delay in resolving anything.. And I have been running Unbound as the resolver since it was first possible to do so. You are talking a few ms.. This is not going to be noticeable.. If you are saying you can tell the difference between a cached query that takes <3 ms and one that takes 30 ms, I would say you're crazy.. Let's say it took a whole 100 ms - you're talking 0.1 of a second.. Come on - there is no way this is going to be an issue.
With most domains, the NS records are going to be cached for extended periods because of the longer TTLs of the NS for a domain and down the tree.. So the only thing you are talking about having to look up is the actual record of the thing in question, say www.domain.com -- what is the TTL of said record and their NS?
If you're noticing issues with specific domains - maybe it's just that their DNS sucks and they have a really short TTL with crappy NS that take a long time to respond, etc.
If you're having issues with resolving something specific.. I would suggest you troubleshoot that specific issue: what are the TTLs involved in resolving said record, what sort of response time are you getting from their authoritative NS, etc.
Preloading your Unbound cache just seems pointless to shave a few ms off some initial resolve of an fqdn.
-
I would say most sites take ~2-3 seconds to load up unless they've been recently visited.
If I go fetch a lot of the top URL DNS entries from a published list (Alexa, Majestic Million, etc.), then everything seems to load just about instantly.
My Unbound cache also increases dramatically in size, obviously. I don't think it's any issue with Unbound itself; like you stated before, my latency is higher, which is why (I assume) the network is notably slower without DNS cached locally.
It's really not a huge deal, but I wanted to poke around and see if I could improve performance while still resolving. I found a way, I'm just having trouble making it automatic and reliable.
I think I might have figured it out.
Fetch a list, trim it down to a much smaller number of entries, split the big list into many smaller lists, add an entry before each URL for dig (so "google.com" becomes "dig google.com"), chmod +x each of the split files, then run all of the split files in parallel (/tmp/f0 & /tmp/f1 & /tmp/f2…). This was successful with two files running in parallel. We'll see if it scales up reliably.
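The recipe above can be sketched like this, with a toy three-name list standing in for the real one (all paths and names here are made up; chunk sizes are shrunk so the mechanics fit on screen):

```shell
#!/bin/sh
# Work in a scratch directory so the chunk files are easy to find
mkdir -p /tmp/prefetch && cd /tmp/prefetch

# Toy list; the real thing would be thousands of names
printf 'google.com\nexample.com\nwikipedia.org\n' > biglist

# "google.com" -> "dig +short google.com"
sed 's/^/dig +short /' biglist > digs

# One command per chunk here (100 per chunk in the real run):
# produces chunk.aa, chunk.ab, chunk.ac
split -l 1 digs chunk.
chmod +x chunk.*

cat chunk.aa    # -> dig +short google.com
```

The chunks can then be run concurrently with something like `sh /tmp/prefetch/chunk.aa & sh /tmp/prefetch/chunk.ab & wait`.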
-
"I would say most sites take ~2-3 seconds to load up unless they've been recently visited. "
How does that relate to DNS, and not more to do with your browser caching all the stuff it downloads from the site?? If you're saying it takes 2-3 seconds for your DNS to resolve.. then yeah, you have something going on with resolving.
You do understand there are multiple DNS caches involved here: your OS is going to cache, your browser is going to cache, and then you have the resolver that will cache. Delays in browser loading more often have to do with issues with your browser cache and less to do with how long it takes to query the fqdns of the stuff used on the page.
You sure you're not just clearing your browser cache when you close it? And having to download everything vs pulling it from cache..
Let's take forum.pfsense.org for example.. It takes all of 36 ms to do that query direct to their authoritative servers.
;; QUESTION SECTION:
;forum.pfsense.org.		IN	A

;; ANSWER SECTION:
forum.pfsense.org.	300	IN	A	208.123.73.18

;; AUTHORITY SECTION:
pfsense.org.		300	IN	NS	ns1.netgate.com.
pfsense.org.		300	IN	NS	ns2.netgate.com.

;; ADDITIONAL SECTION:
ns1.netgate.com.	3600	IN	A	192.207.126.6
ns2.netgate.com.	3600	IN	A	162.208.119.38

;; Query time: 36 msec
;; SERVER: 162.208.119.38#53(162.208.119.38)
;; WHEN: Tue Apr 18 13:27:55 CDT 2017
;; MSG SIZE rcvd: 141

How does that equate to your 2 to 3 second delay in page loads?? Even if you multiply that by 10 you're talking 360 ms.. That is .36 of a second.. How is that going to be an issue??
If you feel your resolving is taking too long - then pick some domains that are of concern and look into how long they actually take to resolve.. Let's say you shave that .036 seconds it takes to resolve forum.pfsense.org off your load time of 3 seconds. You're talking 1% savings in time out of 3 seconds.. Come on - how is that an issue?
edit: Let's say you want to preload that.. So you're going to run your script to look up forum.pfsense.org every 5 minutes? Since that is the TTL of that record.. For your preloading to do anything, it would have to query forum.pfsense.org every 5 minutes, or you are going to run into the situation where your browser asks for forum.pfsense.org and the resolver has to resolve it.
So you're going to run a script every 5 minutes to query forum.pfsense.org so your cache stays loaded, to save .036 seconds of load time in your browser?? How much more CPU does that use up, and so electricity.. How much more bandwidth? Let's call it 140 bytes from my above query, times every 5 minutes, 24 hours a day, just for the 1 record.. That is about 40 Kbytes extra per record per day, to save .036 seconds when you happen to load up the pfSense forums.
-
Haha, I don't know what you want to hear, man. All I can tell you is that if I sat you down at a computer on my network, had you browse under normal conditions, then pulled in a bunch of DNS queries at once, you would be able to tell the difference.
It isn't anything crazy, but it does make it feel noticeably faster.
I am definitely not worried about the few bucks more a year I'll have to pay for the CPU cycles or the kBs of data that it will take to do it.
-
Well what you really should do is identify the issue. So it can be resolved specifically and properly.
I'd start with httpWatch. The basic edition is free.
https://www.httpwatch.com/
Pretty sure Chrome has similar capabilities that can show where time is being consumed.
-
That's very cool, I'll check it out!
I don't think that there is a real problem per se; I think I just have a high-latency connection and that's life. For example, when I run dig on a lot of URLs, at a glance it looks like most resolve in ~300 ms, but some queries are in the 2-3 second range; as John pointed out, he has a normal-latency connection and is getting returns in the 30 ms range. Again, everything works fine and I wouldn't call it a significant problem at all (after all, I am still resolving). It's just something I thought I'd see if I could tweak.
You definitely do tell a difference when you go from 2-3 s in some cases to 0 ms. While 2-3 seconds is abysmal for DNS, it's still only 2-3 seconds, and those times are the exception, not the norm. This really is something I'm just curious about, especially since it's actually a noticeable improvement on my network. I'm not suggesting this is something others should do either. For me it's educational, messing around with the commands and trying to hack something together that does what I want.
But I will check that program out, maybe there is something wrong. Thank you.
-
Hopefully the basic edition will show the DNS lookup timing. I have access to the pro edition.
-
Very cool, getting that now!
-
Alright, I got it working. It's pretty hacked together but it works (so far).
The first cron job retrieves the Majestic Million top URL list every day. It pulls out only the column listing the URLs and cuts it down to the first 5000 lines. Then it prepends dig commands, limiting the timeout to 3 s (I think the default is 5 s) and only one try (default = 3?). It splits the file up into 50 files of 100 lines each, makes them executable, and deletes all of the unnecessary files.
30 04 * * * root
fetch -o /tmp/mm.csv http://downloads.majestic.com/majestic_million.csv && cut -d , -f 3 /tmp/mm.csv | cat >> /tmp/mmf && rm /tmp/mm.csv && sed -I -e '1d;5002,$d' /tmp/mmf && rm /tmp/mmf-e && sed -I -e 's/^/dig +short +time=3 +tries=1 +ttlid /' /tmp/mmf && rm /tmp/mmf-e && split -d -l 100 /tmp/mmf /tmp/mmf && rm /tmp/mmf && chmod +x /tmp/mmf*
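For what it's worth, the same processing can be written a little more readably as a pipeline. This is a sketch: the fetch step is commented out and a canned two-line CSV stands in for the download, and the Majestic CSV layout (header row, domain in column 3) is assumed from the command above:

```shell
#!/bin/sh
# Stand-in for the real download:
# fetch -o /tmp/mm.csv http://downloads.majestic.com/majestic_million.csv
printf 'GlobalRank,TldRank,Domain\n1,1,google.com\n2,2,facebook.com\n' > /tmp/mm.csv

cut -d , -f 3 /tmp/mm.csv |                        # keep only the domain column
  tail -n +2 |                                     # drop the CSV header row
  head -n 5000 |                                   # cap the list at 5000 names
  sed 's/^/dig +short +time=3 +tries=1 +ttlid /' \
  > /tmp/mmf

split -l 100 /tmp/mmf /tmp/mmf.                    # 100 lookups per chunk: mmf.aa, mmf.ab, ...
chmod +x /tmp/mmf.*
rm /tmp/mm.csv /tmp/mmf
```

This avoids the in-place sed edits (and their stray backup files) by streaming everything through one pipeline.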
The second cron job runs the DNS query every 5 minutes.
0,5,10,15,20,25,30,35,40,45,50,55 * * * * root
/tmp/mmf00 & /tmp/mmf01 & /tmp/mmf02 & /tmp/mmf03 & /tmp/mmf04 & /tmp/mmf05 & /tmp/mmf06 & /tmp/mmf07 & /tmp/mmf08 & /tmp/mmf09 & /tmp/mmf10 & /tmp/mmf11 & /tmp/mmf12 & /tmp/mmf13 & /tmp/mmf14 & /tmp/mmf15 & /tmp/mmf16 & /tmp/mmf17 & /tmp/mmf18 & /tmp/mmf19 & /tmp/mmf20 & /tmp/mmf21 & /tmp/mmf22 & /tmp/mmf23 & /tmp/mmf24 & /tmp/mmf25 & /tmp/mmf26 & /tmp/mmf27 & /tmp/mmf28 & /tmp/mmf29 & /tmp/mmf30 & /tmp/mmf31 & /tmp/mmf32 & /tmp/mmf33 & /tmp/mmf34 & /tmp/mmf35 & /tmp/mmf36 & /tmp/mmf37 & /tmp/mmf38 & /tmp/mmf39 & /tmp/mmf40 & /tmp/mmf41 & /tmp/mmf42 & /tmp/mmf43 & /tmp/mmf44 & /tmp/mmf45 & /tmp/mmf46 & /tmp/mmf47 & /tmp/mmf48 & /tmp/mmf49 &
The initial run takes ~34s right after a flushed cache. This pretty much lines up with my ~300ms/request average (cutting out the really high time outliers).
Subsequent runs take ~13s.
CPU usage during the run is ~40%. This is significant considering my router runs an i5-2400.
Bandwidth usage is ~40kbps during the initial run.
Cache size increases by a factor of ~165 compared to its size a minute or so after a flush under normal usage.
It works for me and makes my network noticeably snappier. But it also sucks down a lot of CPU for a router. That doesn't bother me on my network but this is obviously not a useful thing to do for most.
-
So it takes you 300 ms to query for forum.pfsense.org from their NS??
Where are you in the world? What is your internet connection? If your latency is that bad, that is going to affect all downloads, not just DNS queries.. So I again do not see how trimming .3 of a second is going to make a freaking difference in your performance..
Let's see your httpWatch traces, etc.
-
What you're doing will cut down on the initial seeding time of the cache, but you can't cheat on the TTLs set on the records; they have to be refetched when they expire, and that's going to be equally costly compared to starting with an empty cache.
-
So it takes you 300 ms to query for forum.pfsense.org from their NS??
Where are you in the world? What is your internet connection? If your latency is that bad, that is going to affect all downloads, not just DNS queries.. So I again do not see how trimming .3 of a second is going to make a freaking difference in your performance..
Let's see your httpWatch traces, etc.
Not sure who that question is directed at - probably the OP. But just shy of 300 ms is what it takes to get the IP address using the DNS resolver. This can be seen in that httpWatch screen capture. It's not simply a query to the authoritative NS; it has to walk the chain, so it adds up. Three tenths of a second is humanly perceptible, though not by much.
For me, going to the pfSense home page, the total DNS time was about 400 ms for about 4 lookups, two of which were to Google services. But by then most of the page is probably already rendered.
The OP hasn't really given many details about the situation other than that it's slow, but snappier if DNS is pre-cached - nothing about the service, its latency, bandwidth, etc. Some httpWatch traces could potentially reveal some relevant info.
-
"It's not simply just a query to the authoritative NS. Has to walk the chain."
It only has to walk the whole chain if none of the chain is cached.. But part of the chain should already be cached.. Many NS have much longer TTLs than just the records, etc. Once you ask the roots for the NS of the TLDs.. those have a TTL of:
;; QUESTION SECTION:
;org.			IN	NS

;; ANSWER SECTION:
org.	86400	IN	NS	a0.org.afilias-nst.info.
org.	86400	IN	NS	b2.org.afilias-nst.org.
org.	86400	IN	NS	b0.org.afilias-nst.org.
org.	86400	IN	NS	c0.org.afilias-nst.info.
org.	86400	IN	NS	a2.org.afilias-nst.info.
org.	86400	IN	NS	d0.org.afilias-nst.org.

So you sure do not have to walk the chain to get those unless the TTL has expired.
Now, a problem that you might have with pfsense.org is that they have their NS with a very low 300-second TTL, which doesn't make a lot of sense unless they were about to change their NS..
Looks like someone forgot to update the TTL on those records.. since I show the actual ns1 and ns2 having a TTL of 3600:
;; AUTHORITY SECTION:
netgate.com.	3600	IN	NS	ns1.netgate.com.
netgate.com.	3600	IN	NS	ns2.netgate.com.

So normally when there is a low TTL on a record, you would only have to query the authoritative NS directly when it expires, not walk the whole chain again.
-
Suggest you Wireshark the DNS of an actual http://pfSense.org/ browsing session after the TTL has expired.
The attached Wireshark screen capture is of browsing to http://pfSense.org/ (in a new browser session with the site's cache and cookies cleared; not that that should matter) after having been there several times already within the past hour and the DNS TTL had expired.
Up the chain it goes to:
Name: a0.org.afilias-nst.info
Address: 199.19.56.1
![pfSense.org DNS.jpg](/public/imported_attachments/1/pfSense.org DNS.jpg)