Unbound DNS intermittent failure

Liath.WW

It would seem that DNS is failing intermittently, and it has really started to impact my day to day operation.

I'm using an old 2nd gen i5. The board is fine, but the built-in NIC only runs at 400 Mbps before bottlenecking, so I added an Intel D33682 2-port NIC. It's a genuine one, Intel logo and all, because I know there were some cheap Chinese clones with crap capacitors and such.

The machine has had this setup since, well, 2nd gen i5s were new. I haven't had much trouble with pfSense until this latest build, with the new interface and the loss of RRD graphs. DNS has been a bit of an issue since that upgrade. Lately it's so bad I'm pulling aggro from my family because 'the internet is broke'.

Only real hints I can think of are that I have an AT&T modem with IP-passthrough turned on, modem has all filtering off.
The logs will occasionally spam llinfo ARP resolution issues with the modem's IP, even though the link is up and passing traffic.
I also see in logs>system>DNS resolver that every 5 minutes, like clockwork, it is evaluating and dropping some aliases:


.....lots of similar entries like
Feb 2 13:01:08	filterdns		adding entry 54.239.172.202 to pf table Eve for host launcher.eveonline.com
Feb 2 12:56:08	filterdns		IP address 52.84.128.4 already present on table Eve as address of hostname launcher.eveonline.com
...lots more of the same
Feb 2 12:56:08	filterdns		adding entry 52.84.27.39 to pf table Eve for host resources.eveonline.com
Feb 2 12:51:08	filterdns		clearing entry 52.84.133.166 from pf table Eve on host binaries.eveonline.com
....
Feb 2 12:51:08	filterdns		adding entry 52.84.128.4 to pf table Eve for host launcher.eveonline.com
Feb 2 12:46:09	filterdns		clearing entry 54.192.7.112 from pf table Eve on host resources.eveonline.com

In Status > System Logs > DHCP:


Feb 2 13:15:33	dhclient		Creating resolv.conf
Feb 2 13:15:33	dhclient		RENEW
Feb 2 13:10:33	dhclient		Creating resolv.conf
Feb 2 13:10:33	dhclient		RENEW
Feb 2 13:05:33	dhclient		Creating resolv.conf
Feb 2 13:05:33	dhclient		RENEW
Feb 2 13:00:33	dhclient		Creating resolv.conf
Feb 2 13:00:33	dhclient		RENEW
Feb 2 12:55:33	dhclient		Creating resolv.conf
Feb 2 12:55:33	dhclient		RENEW

System/Gateways


Feb 2 02:59:51	dpinger		WAN_DHCP6 2001:4860:4860::8888: Clear latency 10158us stddev 1982us loss 16%
Feb 2 02:59:34	dpinger		WAN_DHCP6 2001:4860:4860::8888: Alarm latency 9857us stddev 1487us loss 21%

I was up late last night trying to figure this out while family was asleep. In my tiredness I cleared logs for a fresh view since I was testing new cables, re-tipped even the factory tipped ones, etc. etc. Wishing now I'd not done so.

When the DNS is on the fritz, connections that were already made continue passing traffic as normal. Streams keep streaming, SIP calls keep working, etc. That rules out the connection dropping as the issue. Only DNS seems to fail, so new connections can't be made.

Any clue what's going on and how to fix it?

Info that might be useful:
packages:
bandwidthd
darkstat
iperf
mtr-nox11
openvpn-client-export
Service_Watchdog << Added to try to resolve the DNS issues; I thought maybe the service was dying. Possibly related to the 5-minute interval with filterdns? I believe I added it because, before I did, unbound just died and stayed dead.

toluun

Does a restart of unbound solve the issue? I have been having major issues with DNSSEC on unbound causing DNS failures. The same thing would happen to me: streams would continue, WAN gateways were shown as still up, etc… Only new DNS lookups would fail. Once I restarted unbound, everything would go back to normal for a short period of time, then BOOM, DNS failures. I am still trying to solve my issue (see a couple of posts down), but I did find that disabling DNSSEC stopped the DNS failures. Not sure if this helps, but your problems seemed very similar to mine, so I thought I would comment with my temporary fix.

Liath.WW

I am going to say "yes" to this one. I'd been having issues with it dying before, and installed the watchdog package to automatically restart it.

From last night until shortly before I made this thread, the internet was generally unbrowsable due to constant DNS issues. I rebooted the pfSense box a few hours ago and have had no more issues since. However, this is a recurring issue that seems to get worse until I get tired of it and reboot the entire network.

It really concerns me because I have business clients who I really want to migrate from SonicWall to pfSense, but if I replace them and DNS is going to act like this in a business production environment, I'll be looking for new clients.

toluun

Well, at least yours sounds a lot less frequent than mine. My DNS would go down every 10 - 30 min. Do you have DNSSEC enabled on unbound?

Gertjan

@Liath.WW : filterdns : Take a look at "binaries.eveonline.com" :

[code]root@ns311465:~# host binaries.eveonline.com
binaries.eveonline.com is an alias for d17ueqc3zm9j8o.cloudfront.net.
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.137
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.7
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.11
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.177
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.52
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.156
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.186
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.181
[/code]

A couple of seconds later, the list changes!

[code]root@ns311465:~# host binaries.eveonline.com
binaries.eveonline.com is an alias for d17ueqc3zm9j8o.cloudfront.net.
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.11
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.137
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.52
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.181
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.7
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.177
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.156
d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.186
[/code]

So it's normal that filterdns is very busy every 5 minutes, removing IPs and adding new ones.
filterdns is paid to do so.
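To see how much an alias really churns, it can help to compare lookups as sets rather than as lists. The sketch below is only an illustration: `diff_addresses` and `resolve` are made-up helper names, and the first two lists are the addresses from the two `host` runs above. In those two runs the eight addresses are the same and only the order differs; the filterdns log in the first post, on the other hand, shows addresses from different ranges being added and cleared, which is real churn.

```python
import socket


def diff_addresses(before, after):
    """Return (added, removed) as sorted lists; pure ordering changes are ignored."""
    old, new = set(before), set(after)
    return sorted(new - old), sorted(old - new)


def resolve(hostname):
    """Return the IPv4 addresses the system resolver currently gives for hostname."""
    return socket.gethostbyname_ex(hostname)[2]


# Addresses from the two `host binaries.eveonline.com` runs in this post.
FIRST = ["13.32.153.137", "13.32.153.7", "13.32.153.11", "13.32.153.177",
         "13.32.153.52", "13.32.153.156", "13.32.153.186", "13.32.153.181"]
SECOND = ["13.32.153.11", "13.32.153.137", "13.32.153.52", "13.32.153.181",
          "13.32.153.7", "13.32.153.177", "13.32.153.156", "13.32.153.186"]

# Same set, different order: nothing added, nothing removed.
print(diff_addresses(FIRST, SECOND))

# Illustrative only: addresses borrowed from the filterdns log in the first
# post, not an actual pair of lookups for one hostname.
print(diff_addresses(["52.84.133.166", "52.84.128.4"],
                     ["52.84.128.4", "54.239.172.202"]))
```

Running `diff_addresses(resolve(name), resolve(name))` a few minutes apart for each hostname in the alias would show which entries actually change between filterdns runs.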

Up to you to remove "binaries.eveonline.com" from your alias list, or to complain to them ;)

DNS: Are you using the DHCP client to obtain your WAN IP? Something is going very wrong with that. When I see it recreating "resolv.conf", I wouldn't be surprised if your local DNS server (unbound) is restarting every 5 minutes. Yep, you're right, consider your DNS to be in a very bad state. But this is not its fault.

Find out why your DHCP client is (forced?!) to renew every 5 minutes, the same interval at which filterdns runs… Strange. It's time to describe your setup completely.
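A quick way to confirm the renewal cadence is to compute the gaps between the RENEW entries. This is a minimal sketch, not a pfSense tool: `renew_intervals` is a hypothetical helper, the sample lines are copied from the DHCP log in the first post, and the year is an arbitrary placeholder because the log lines do not carry one.

```python
from datetime import datetime

# Lines copied from the DHCP log earlier in this thread (newest first).
SAMPLE_LOG = """\
Feb 2 13:15:33 dhclient RENEW
Feb 2 13:10:33 dhclient RENEW
Feb 2 13:05:33 dhclient RENEW
Feb 2 13:00:33 dhclient RENEW
Feb 2 12:55:33 dhclient RENEW
"""


def renew_intervals(log_text, year=2018):
    """Return the seconds between consecutive dhclient RENEW entries, oldest first.

    The year is an assumption: the log format shown in the thread omits it.
    """
    stamps = []
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[3] == "dhclient" and parts[4] == "RENEW":
            stamp = " ".join(parts[:3])
            stamps.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    stamps.sort()
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]


print(renew_intervals(SAMPLE_LOG))  # [300, 300, 300, 300]
```

A perfectly regular 300-second gap like this looks more like a lease timer than a flapping link, which would normally produce irregular intervals.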

Btw: unbound resolves directly against the root DNS servers, and is ROCK solid as a DNS server.
Your issue is not DNSSEC related. DNSSEC activated for unbound works for thousands if not tens of thousands of pfSense installs, and for all the other servers that use unbound.

Liath.WW

FilterDNS runs after Unbound kicks the bucket and restarts.


Feb 2 16:30:10	filterdns		adding entry 54.239.172.212 to pf table Eve for host binaries.eveonline.com
Feb 2 16:26:51	unbound	53607:0	info: start of service (unbound 1.6.6).
....
Feb 2 16:26:47	unbound	88185:0	info: service stopped (unbound 1.6.6).
Feb 2 16:25:09	filterdns		clearing entry 52.84.133.127 from pf table Eve on host binaries.eveonline.com
...
Feb 2 16:25:09	filterdns		adding entry 54.192.7.236 to pf table Eve for host launcher.eveonline.com
Feb 2 16:22:46	unbound	88185:0	info: start of service (unbound 1.6.6).
Feb 2 16:22:46	unbound	88185:0	notice: init module 1: iterator
Feb 2 16:22:46	unbound	88185:0	notice: init module 0: validator
...
Feb 2 16:22:32	unbound	67005:0	info: server stats for thread 0: 139 queries, 61 answers from cache, 78 recursions, 0 prefetch, 0 rejected by ip ratelimiting
Feb 2 16:22:32	unbound	67005:0	info: service stopped (unbound 1.6.6).
Feb 2 16:21:43	unbound	67005:0	info: start of service (unbound 1.6.6).
Feb 2 16:21:43	unbound	67005:0	notice: init module 0: iterator
...
Feb 2 16:21:40	unbound	59480:0	info: server stats for thread 0: 566 queries, 183 answers from cache, 383 recursions, 6 prefetch, 0 rejected by ip ratelimiting
Feb 2 16:21:40	unbound	59480:0	info: service stopped (unbound 1.6.6).
Feb 2 16:20:12	filterdns		adding entry 52.84.133.127 to pf table Eve for host binaries.eveonline.com
...
Feb 2 16:17:25	unbound	59480:0	info: start of service (unbound 1.6.6).
Feb 2 16:17:25	unbound	59480:0	notice: init module 1: iterator
Feb 2 16:17:25	unbound	59480:0	notice: init module 0: validator
Feb 2 16:17:22	unbound	18317:0	info: 4096.000000 8192.000000 1
...

If I understand you correctly, there is something happening that is causing unbound to restart. How can I find the root cause?
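One starting point is to measure how long each unbound process actually lived, so the restarts can be lined up against other log entries. The sketch below assumes the log format shown above; `unbound_uptimes` is a hypothetical helper, the sample lines are copied from this post, and the year is a placeholder because the log does not include one.

```python
from datetime import datetime

# Start/stop lines copied from the resolver log above (newest first).
SAMPLE_LOG = """\
Feb 2 16:26:51 unbound 53607:0 info: start of service (unbound 1.6.6).
Feb 2 16:26:47 unbound 88185:0 info: service stopped (unbound 1.6.6).
Feb 2 16:22:46 unbound 88185:0 info: start of service (unbound 1.6.6).
Feb 2 16:22:32 unbound 67005:0 info: service stopped (unbound 1.6.6).
Feb 2 16:21:43 unbound 67005:0 info: start of service (unbound 1.6.6).
Feb 2 16:21:40 unbound 59480:0 info: service stopped (unbound 1.6.6).
Feb 2 16:17:25 unbound 59480:0 info: start of service (unbound 1.6.6).
"""


def unbound_uptimes(log_text, year=2018):
    """Map each unbound PID to its lifetime in seconds (start to stop).

    PIDs without both a start and a stop line are left out. The year is an
    assumption: the log format shown in the thread omits it.
    """
    starts, stops = {}, {}
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) < 6 or parts[3] != "unbound":
            continue
        when = datetime.strptime(f"{year} {' '.join(parts[:3])}", "%Y %b %d %H:%M:%S")
        pid = parts[4].split(":")[0]
        message = " ".join(parts[5:])
        if "start of service" in message:
            starts[pid] = when
        elif "service stopped" in message:
            stops[pid] = when
    return {pid: int((stops[pid] - starts[pid]).total_seconds())
            for pid in starts if pid in stops}


print(unbound_uptimes(SAMPLE_LOG))
# {'88185': 241, '67005': 49, '59480': 255}
```

In this sample, none of the three completed processes lasted five minutes, so whatever is restarting unbound fires at least that often.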

One rabbit hole I fell down was because of the arp llinfo messages, but I don't have an example right now. They do point to the IP of my ISP-provided modem, which I cannot get rid of. (I'm on fiber; they said the system won't allow me to go straight from the "ONT?" (fiber<>eth bridge) to my router, but I admit I haven't tried to bypass it.)

The passthrough on the modem is weird. The device first hands out an address in the 192.168.1.x range, then once pass-through is handled it hands out the public facing IP.

I do see a bunch of this in the DHCP log, but I'm not 100% sure it's applicable:


Feb 2 15:44:44	dhclient		Creating resolv.conf
Feb 2 15:44:44	dhclient		RENEW
Feb 2 15:39:44	dhclient		Creating resolv.conf
Feb 2 15:39:44	dhclient		RENEW

Liath.WW

I think I may have stumbled upon something in the ISP modem config that could be causing this, though the times are different from the pfSense 5-minute issues.
On the IP-passthrough page, there is a Passthrough DHCP Lease setting. The default value is 10 minutes. I changed it to 1 day; hopefully this is the root cause and the change will fix things.
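On the timing: a DHCP client does not wait for the lease to expire before renewing. By the RFC 2131 defaults it starts renewing at T1 = 0.5 × lease and rebinding at T2 = 0.875 × lease, so a 10-minute lease would give a renewal every 5 minutes, which matches the RENEW entries. The arithmetic below is only a sketch under that assumption; I have not confirmed whether the BGW210 sends its own T1/T2 values, and `renewal_times` is a made-up helper name.

```python
def renewal_times(lease_seconds, t1_factor=0.5, t2_factor=0.875):
    """Return (T1, T2) in seconds using the RFC 2131 default factors.

    Assumption: the server does not override T1/T2 with explicit options.
    """
    return lease_seconds * t1_factor, lease_seconds * t2_factor


print(renewal_times(10 * 60))        # (300.0, 525.0)      -> renew every 5 minutes
print(renewal_times(24 * 60 * 60))   # (43200.0, 75600.0)  -> renew every 12 hours
```

If that assumption holds, the 1-day lease should move the RENEW entries from every 5 minutes to roughly every 12 hours.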

FYI, the modem is this one:

Manufacturer ARRIS
Model Number BGW210-700

Liath.WW

Haven't seen many more log entries about DNS/DHCP dying since I updated the thread last night.
Computers seem to be going well enough. Phones still aren't too happy, though they're phones, no idea if there's something goofy going on with them.

Liath.WW

Forgot to update this because it was late and I was tired. Had the services die again last night, with unbound restarting itself. Switched to just using the DNS forwarder and haven't had a peep from anything since.

Despite people saying that it isn't unbound DNS, that is the service with the symptom. If there are logs or configs that someone would like to have that might be able to help identify the issue, I'll be happy to provide them. I understand that some other service failure may be causing unbound to die and restart, but thus far all of the information I've seen and read doesn't solve the issue for me, and I've not seen any useful requests that yield results.

Unfortunately this means I can't pitch pfSense with dnssec as a selling point. The rest of it works great, and I've been using pfSense as a whole for years.

I might be able to put it in production without unbound, but if I can't get a home setup stable, it makes me wonder if the underlying cause of unbound dying would end up impacting customers.

johnpoz

What aliases are you using?

Also unbound can restart when you have it set to register dhcp.

Liath.WW

Which type of aliases would you want to know about? I have a few that have FQDN in them, I have some that are IPs and some that are ports. Be happy to share if you think there may be something with them that is causing the issues, however I'm not sure I want to 'lift my skirt' in public so to speak :P

Also, I didn't have the option to register DHCP leases enabled in the DNS resolver config, so while I wish it were that simple, it's not. Although it does raise the question of why such an option would even be available if it causes instability.

romainp

Hi,
I also have a really strange DNS issue…
From time to time, DNS resolution does not work at all or takes a very long time. It can happen several times per hour, and all systems connected to pfSense are affected. It is not caused by my ISP or by the DSL router I am connected to, because if I do a nslookup google.com 8.8.8.8 it works perfectly, but if I use the internal pfSense DNS it fails.

It first happened some weeks ago. At that time I thought the root cause could be that I had done several pfSense upgrades without a real clean installation. So I made a backup, installed from scratch and restored my config, and everything was fine until today. The only thing I changed yesterday was to install the traffic totals package.
I don't see any obvious reason why I have this issue, but I will try to investigate more.
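To catch the failures as they happen, the nslookup comparison can be scripted. The following is a rough sketch using only the Python standard library: it builds a minimal DNS query for an A record by hand and times the reply from a given server over UDP. `build_query` and `probe` are hypothetical helper names, and the sketch does not validate the answer beyond checking that the reply echoes the query ID.

```python
import socket
import struct
import time


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet for an A record, recursion desired."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def probe(server, name="google.com", timeout=3.0):
    """Return the response time in seconds, or None on timeout or error."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        started = time.monotonic()
        try:
            sock.sendto(packet, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return None
        elapsed = time.monotonic() - started
    # Only accept a reply that echoes our query ID.
    if reply[:2] != packet[:2]:
        return None
    return elapsed
```

Calling `probe("8.8.8.8")` and `probe()` with the pfSense LAN address once a minute, and logging the results with a timestamp, should show whether the internal resolver fails while the external one keeps answering, and whether the outages line up with anything in the pfSense logs.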

Thanks.
R.

johnpoz

How many entries? FQDN have to be looked up every so often - if you have hundreds of fqdn and they all return lots of IPs then sure could be a contributing factor..

Not sure if it is still an issue, but registering DHCP restarts unbound. So if you have hundreds of DHCP clients and/or very short lease times, you could have unbound restarting every few minutes, which would for sure cause a problem with clients actually being able to look up anything ;)

Also if filterdns is having to lookup 1000's of fqdn every few minutes that have very short ttl's etc.. This also could be a problem depending…

romainp

Hi,
It's just a home setup with at most 30 fixed DNS entries. I also use pfBlockerNG, but even if I stop it I still have this strange behaviour. I understand that unbound could be restarted when a DHCP client registers itself with the DNS, but it should not take 30 sec for the DNS to work again…

The problem is that I don't see obvious reasons in the logs that could explain this...

johnpoz

Is it restarting or not? I have been running unbound on a home setup in resolver mode in pfsense since before it was included and was a package. Have never had any such issues other than the dhcp restart thing.

I really see no point in registering DHCP in a home setup. All the devices I care about have reservations, so I know what IP they are on, and yes, the static entries are registered. Devices that are just going to get some random IP out of the pool are going to be guest sort of devices and don't give 2 shits what their name is or IP is, etc. They are only going to be on the network temporarily… If they were always going to be on the network and I wanted to resolve them, they would have reservations for an IP, etc.

Grimson

@romainp:

I use pfblockerng also but even if I stop it I still have this strange behaviour. I understand that unbound could be restarted when a dhcp client register itself to the dns but it should not take 30 sec to the dns to work again…

Are you using the TLD feature of pfBlockerNG? If yes, did you read the infoblock? Especially this:

The 'Unbound Resolver Reloads' can take several seconds or more to complete and may temporarily interrupt DNS Resolution until the Resolver has been fully Reloaded with the updated Domain changes.

Liath.WW

Myself, I have 3 aliases with domains in them.
The biggest one is the eve online one, the other two point to voice servers and only resolve to one place.

Also, since switching off unbound and using the forwarder only, I've not had a single peep with browsing issues, and my family is off my butt.

This further points to unbound being part of the problem. Not sure how or why, but if unbound is the only thing that fails then that kinda points to unbound being at fault either itself, or by failing due to some other process and its inability to not choke on it.

However, I would like to use unbound, as DNSSEC is something that I believe in and that my clients would require. If only we could get to the bottom of the issue and put me in a place of confidence in the product again, I'd start pitching it. Heck, I have one client that lately requests daily rule changes that consume time by requiring a login on each SonicWall individually over 18 sites… with differing firmware to make life more interesting. If I could run all of the sites on small appliances running pfSense, it would cut down at least 12 hours a week of unproductive time.

romainp

Thanks for the infos.

I use pfBlockerNG and so need unbound, but even if I stop pfBlockerNG I still have the issue. I will try to set the debug level higher and have the stats and logs managed by telegraf (I saw a plugin for unbound but am not sure it will work) or use collectd (I saw an article on how to use collectd on pfSense).
If I can output those logs and the stats to an ELK stack, I can at least look for a pattern, because I do not see any error messages in the logs…

R.

Liath.WW

If you can work out why it's crashing on your end, I'd love to hear about it. I wonder if it is something hardware related, or some obscure setting that we've both used.

I just can't figure it out.

romainp

Hi,
I do not have proof of it, but it seems related to the fact that when a DHCP client requests a new IP, the DHCP server sends a signal to the DNS server (which is correct, since somewhere in the config I asked the DNS resolver to accept that). But when the SIGHUP occurs, the DNS does not process any requests for 20-30 secs.

I will try to have some logs/detail info about that but I am pretty sure of this.

R.