Unbound stops working periodically
Synopsis: Unbound periodically stops answering lookups correctly (it doesn't report an error, it just returns bad data) until I restart it.
I recently upgraded pfSense to 2.1 and switched to Unbound as the DNS resolver because I needed to resolve directly instead of forwarding, due to query overloading with the mail RBL services. I had no problem getting Unbound to work initially, but after a day I started seeing a lot of malformed MX lookups on my mail server. When I queried the records myself I saw many null MX records, while the same lookup against an external DNS service showed normal MX records. I disabled DNSSEC, thinking it was related, and the problem seemed to go away. However, today the same problem started again, and restarting the Unbound service resolved it. When the problem happens, Unbound returns bad data for the lookup. Below is a lookup for the navyfederal.org MX; notice that it returns a null MX:
>> dig @192.168.100.1 -t mx navyfederal.org.
; <<>> DiG 9.9.5-3-Ubuntu <<>> @192.168.100.1 -t mx navyfederal.org.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17827
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;navyfederal.org. IN MX
;; ANSWER SECTION:
navyfederal.org. 261 IN MX 0 .
;; AUTHORITY SECTION:
org. 22284 IN NS ns.buydomains.com.
org. 22284 IN NS this-domain-for-sale.com.
;; Query time: 0 msec
;; SERVER: 192.168.100.1#53(192.168.100.1)
;; WHEN: Wed Sep 24 12:29:47 EDT 2014
;; MSG SIZE rcvd: 125
Restarting Unbound and repeating now gives:
>> dig @192.168.100.1 -t mx navyfederal.org.
; <<>> DiG 9.9.5-3-Ubuntu <<>> @192.168.100.1 -t mx navyfederal.org.
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14040
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 2
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;navyfederal.org. IN MX
;; ANSWER SECTION:
navyfederal.org. 300 IN MX 10 navyfederal-org.mail.protection.outlook.com.
;; AUTHORITY SECTION:
navyfederal.org. 500 IN NS ns1.navyfedcu.org.
navyfederal.org. 500 IN NS ns.navyfedcu.org.
navyfederal.org. 500 IN NS ns1.navyfederal.org.
;; ADDITIONAL SECTION:
ns1.navyfederal.org. 500 IN A 126.96.36.199
;; Query time: 41 msec
;; SERVER: 192.168.100.1#53(192.168.100.1)
;; WHEN: Wed Sep 24 12:35:48 EDT 2014
;; MSG SIZE rcvd: 182
I'm not seeing anything obvious in the Unbound logs, so any help on how to troubleshoot this is greatly appreciated.
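For anyone trying to catch this state as it happens, here is a rough sketch of how the check could be scripted: compare the MX answer from the local resolver with the answer from a reference resolver, and flag a null MX ("0 ."). The function names are made up for illustration, 192.168.100.1 is the resolver from the output above, and 8.8.8.8 is only an example of an external resolver. The dig calls need network access, so they are left as comments.

```shell
#!/bin/sh
# Sketch only: helpers for comparing `dig +short` MX answers.
# All function names here are hypothetical, not part of dig or Unbound.

# normalize_answer: read `dig +short` output on stdin, lowercase and sort it
# so that record order and letter case do not cause false mismatches.
normalize_answer() {
    tr '[:upper:]' '[:lower:]' | sort
}

# answers_match LOCAL_ANSWER REFERENCE_ANSWER: succeed if both answers
# contain the same records after normalization.
answers_match() {
    [ "$(printf '%s\n' "$1" | normalize_answer)" = "$(printf '%s\n' "$2" | normalize_answer)" ]
}

# is_null_mx ANSWER: succeed if the answer is exactly the null MX "0 ."
is_null_mx() {
    [ "$1" = "0 ." ]
}

# Usage (needs network access, so commented out):
#   local_ans=$(dig +short @192.168.100.1 -t mx navyfederal.org.)
#   ref_ans=$(dig +short @8.8.8.8 -t mx navyfederal.org.)
#   is_null_mx "$local_ans" && echo "local resolver returned a null MX"
#   answers_match "$local_ans" "$ref_ans" || echo "MISMATCH: local='$local_ans' reference='$ref_ans'"
```

Run from cron against a handful of domains, something like this would at least show when the bad answers start and which domains are affected.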
You've got more going on than some malformed MX records:
org. 22284 IN NS ns.buydomains.com.
org. 22284 IN NS this-domain-for-sale.com.
Those are clearly not the name servers for .org or navyfederal.org.
I agree, that info was really concerning. Something is clearly going very wrong when it gets into this state. I'm just not sure if it's an internal bug, something about the root hints used in the pfSense package, or even some compromised information.
So when this happens, does it affect everything that is cached, or just specific domains?
Definitely not everything, but many things. I haven't isolated whether it only affects cached items or also new items, but I found a post suggesting that it may be a type of DNS cache poisoning, because buydomains.com apparently advertises itself as authoritative for all of .com, .org, etc. But I don't understand why Unbound (supposedly more secure than most) would be susceptible to such a common issue. Unbound is recommended as the DNS server for non-forwarding lookups, and I'm using the standard configuration provided by the pfSense package. So I'm struggling to understand how I'm running into such a basic issue that would presumably affect everyone using it. What is different about my configuration? In pfSense 2.2 they will apparently make Unbound the default, so I'm missing something somewhere…
I believe I've solved the problem. The package of unbound in pfSense 2.1 does not include any root hints. Once I added an appropriate root hints file, the problem seems to have gone away. To add the root hints:
1. Fetch latest root hints file:
wget ftp://FTP.INTERNIC.NET/domain/named.cache -O root.hints
2. Copy the file 'root.hints' to the pfSense host, into the /usr/pbi/unbound-i386/etc/unbound folder. You can overwrite the existing empty root.hints file.
3. Add a custom configuration entry to include the root hints file. In the pfSense admin GUI, from the menu select Services –> Unbound DNS. Select the Unbound DNS Advanced Settings tab. Scroll down to the Custom Options box and enter:
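The option text itself seems to be missing here. Presumably it was Unbound's root-hints option pointing at the file from step 2, something like the line below. Treat this as a reconstruction rather than the original text: the path is assumed from step 2, and it is not stated whether the Custom Options box also needs a leading "server:" line (in a plain unbound.conf, root-hints belongs in the server: clause).

```
root-hints: "/usr/pbi/unbound-i386/etc/unbound/root.hints"
```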
Restart the Unbound service (toggle the Unbound Enabled checkbox in the first tab, saving in between) and you should now have up to date root hints.
Note that it's recommended you update the root.hints file every 6 months or so, so the package fix should schedule this as a cron job. Hopefully when pfSense 2.2 makes Unbound the default, they will fix this.
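Until the package handles it, the refresh could be scripted along these lines. This is only a sketch under some assumptions: wget is available where the script runs (as in step 1), the function names and the script path in the crontab example are made up, and the sanity check simply looks for a root NS record pointing at one of the a to m root-servers.net names. The download is validated in a temp file first so that a failed or empty fetch never replaces a working root.hints.

```shell
#!/bin/sh
# Sketch only: refresh root.hints if the downloaded file looks sane.
HINTS_URL="ftp://FTP.INTERNIC.NET/domain/named.cache"   # URL from step 1

# looks_like_root_hints FILE: succeed if FILE is non-empty and contains at
# least one NS record for the root zone pointing at [a-m].root-servers.net.
looks_like_root_hints() {
    [ -s "$1" ] &&
    grep -Eiq '^\.[[:space:]]+[0-9]+[[:space:]]+(IN[[:space:]]+)?NS[[:space:]]+[a-m]\.root-servers\.net\.' "$1"
}

# refresh_root_hints DEST: download to a temp file, validate it, and only
# then replace DEST. On any failure DEST is left untouched.
refresh_root_hints() {
    dest=$1
    tmp=$(mktemp) || return 1
    if wget -q "$HINTS_URL" -O "$tmp" && looks_like_root_hints "$tmp"; then
        chmod 644 "$tmp"      # mktemp creates the file as 0600
        mv "$tmp" "$dest"
    else
        rm -f "$tmp"
        echo "root hints download failed validation; keeping $dest" >&2
        return 1
    fi
}

# Example (hypothetical crontab entry, twice a year; the script path is made up):
#   0 3 1 1,7 * /root/refresh-root-hints.sh
# with the script ending in:
#   refresh_root_hints /usr/pbi/unbound-i386/etc/unbound/root.hints
```

Unbound still has to be restarted afterwards to pick up the new file.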
How would it have worked at all without root hints? I can see that once it has things cached it could still work if it had the NS records for the TLDs, but how did it get those on the first query with an empty cache, or right after you restarted it, if it did not know where to go to find the name servers for the TLDs?
According to the Unbound documentation, if no root hints file is provided, it will use some "reasonable defaults". I'm not sure what those are; they're enough to get it working, but apparently not very clean or reliable.