DNS issues after upgrading from 2.4.3 to 2.4.4

luas

Hi,
our firewall is practically unusable since upgrading to pfsense 2.4.4.
DNS requests are answered with a massive delay (10'000+ ms) or time out entirely.
We have four WAN connections, and this happens only on the Default Gateway.

Choosing a different interface as default gateway will only relocate the problem to the new default gateway.

An additional headache is that pfsense seems to wait until all configured DNS servers have answered. So even if the DNS servers on the remaining WAN lines have answered quickly, a client will not get a reply before the one on the default gateway has answered. This is not exactly what we expect from a failover configuration.

We are using pfblocker and dnsbl, which I suspected to be the root cause. But disabling both of them doesn't bring any improvements.

Another indication is that RTT/RTTsd times will build up to 2'000ms and more on the default gateway. Same thing here: if I choose a different interface as default gateway, the same problem will start happening there.

For the time being, we're running a spare pfsense with 2.4.3 which doesn't have those issues. Greatly appreciating any support.

Edit: I discovered one more thing.
We have three internal networks, whereof only one is configured to use pfsense as a DNS server. As failing DNS resolution kept clients in that network from accessing the internet, I changed the DHCP options, so that clients would directly use a public DNS server. Mysteriously, this solved the two issues described above (DNS delay and high RTT time).
We still need to fix this, as we have some DNS overrides configured, but it seemed worth mentioning...

stephenw10

What DNS settings are you using? Resolver in resolving mode? Sounds like you might be using forwarding mode if it's checking multiple configured servers.
What DNS settings are you using in System > General?

Steve

Raffi_

Do you by any chance have the "DHCP Registration" option enabled under Services => DNS Resolver? That can cause Unbound to restart almost every time a new DHCP request is sent out. Ask me how I know.

luas

Thanks for your replies!
In the end, I was too impatient, tried a couple of things and grudgingly uninstalled pfblocker. Pfsense seems to run smoothly now (even acting as an internal DNS server again), although I'm still unaware of the root cause.

Yes, DNS resolver is enabled, DNS forwarder disabled. Listening and outgoing interfaces in resolver are both set to "all".
"DNS Query Forwarding" is off (which doesn't exactly make sense to me, now that I'm looking at it).
"DHCP Registration" and "Static DHCP" are enabled; I remember noticing the frequent service restarts and could disable this.
Otherwise a couple of host overrides and one domain override that has been working smoothly since long.

In General Setup, we have four external DNS servers configured, one per WAN connection. Putting alternative DNS servers here didn't bring any improvement. "DNS Server Override" and "Disable DNS Forwarder" are unchecked here.

luas

After running smoothly for a day, the problem just reappeared.
High latency on WAN1 (default gateway), although nearly no traffic is visible. One of the four external DNS servers shows as "no response" in Diagnostics > DNS Lookup, and in effect there is no internet connection for anyone.
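
For anyone who wants to check each configured DNS server independently of the Diagnostics page, here is a minimal sketch that times a single A-record query over UDP using only the Python standard library. The function names `build_query` and `time_query`, the test name `example.com` and the 2 second timeout are all assumptions made for illustration; nothing here is taken from pfsense itself.

```python
import socket
import struct
import time


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet for an A record (hypothetical helper)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed with its length, ended by a zero byte.
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def time_query(server: str, name: str = "example.com", timeout: float = 2.0):
    """Return the round-trip time in milliseconds, or None if no reply arrived."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(packet, (server, 53))
            sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as unreachable hosts or networks.
            return None
        return (time.monotonic() - start) * 1000.0
```

Calling `time_query` once per server IP from General Setup gives a quick per-server comparison. Note that the query leaves through whatever route the operating system picks, so this does not by itself pin a query to a particular WAN.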

stephenw10

Is it only the gateway IP that shows high latency? If you change the monitoring IP to something else for that WAN does it still show that issue?

Steve

Raffi_

Disable DHCP registration. I'm pretty sure that's going to solve the issue. That's what's causing the resolver to restart. Of course Unbound can't resolve any DNS queries while the service is restarting. That's why sometimes resolution is very delayed or fails completely. I had this exact same issue. Disabling DHCP registration fixed it. There are many threads on this issue.
https://forum.netgate.com/topic/120838/unbound-appears-to-restart-frequently-and-fails-to-resolve-domains-sometimes/9

https://forum.netgate.com/topic/80517/unbound-seems-to-be-restarting-frequently

luas

The problem reoccurs roughly once a day for a couple of minutes. Then pfsense will work smoothly again without further intervention.

Right now, it turned up again.
@stephenw10 I tried pinging WAN1's monitoring IP via a different WAN line - RTT is very acceptable there. I tried putting a different monitoring IP, but the problem persists.
@Raffi_ I disabled DHCP registration right now. No improvement up to here, but I'll keep an eye on it.

Oh, and there was one piece of misinformation up there: on the machine we're currently using, pfblocker is still installed, but disabled. Not sure if this is of interest.

luas

It hasn't happened again in the last three days. So was it really just disabling DHCP registration?
Well, thanks a bunch for the time being!

Raffi_

I hope that was it. Keep an eye on it and let us know if not.

stephenw10

Hmm, curious. What's odd is that most people who have that enabled don't hit it. I've never seen any problems with it in testing. It could be simply a scaling issue; it works fine with a few test clients but gets overwhelmed on a large network. But we have many large customer networks not hitting it either.
When you look into what that is doing it's not hard to see why it causes some disruption. It's almost harder to see how it does not! Yet in the majority of cases it doesn't.

Thanks for the update anyway.

Steve

Raffi_

@stephenw10 that's interesting. In my scenario it's a relatively small network of about 35 clients, about half of which have DHCP static mappings. I could not get reliable behavior from Unbound without disabling DHCP registration. Although it would be nice to have that feature enabled, having DNS queries fail is not an option.

jimp

It mostly depends on how quickly unbound restarts. For most people it's very fast, but the more you have in it (say, DNSBL lists), the slower it gets. Also depends on hardware speed.

Another case for offloading DNS to a proper DNS server in some way. Either offload DHCP registration to a proper BIND setup that can handle dynamic registration from dhcpd, or offload the filtering aspects to something like PiHole.

I suppose you could also do something convoluted like setup the DNS Forwarder on an alternate port, with DHCP registration enabled there, and then setup a domain override in the DNS Resolver to send queries for hosts it doesn't know in the domain there. That seems like a bad idea, though. :-)

Raffi_

@jimp that would explain it. In my case it's a nearly 10 year old desktop and even when it was new it wasn't top of the line. First gen i5 with 8 GB RAM and 120 GB SSD. I do use DNSBL with over 100k IP's/URLs. The idea of Bind in conjunction with Unbound is interesting.

stephenw10

How many DHCP clients? What is the lease time?

I imagine a large DNSBL being added would slow down the Unbound restart time. Did you ever test it with that disabled?

Steve

Raffi_

@stephenw10 The setup has about 15 DHCP clients with the default 7200 second lease time. I don't remember if I tried enabling DHCP registration before setting up DNSBL. I'm currently running the setup in a production environment, so testing that is unfortunately not something I can do.

With DHCP registration enabled, it seems that each time a DHCP request is made, Unbound is restarting. With my lease time set to 2 hours, it makes sense that I was having a lot of trouble with Unbound restarting. I assume increasing the lease time to a day would dramatically reduce the number of times I see the problem in my case.

Raffi_

My mistake, almost all 35 or so clients are DHCP. About 16 out of that 35 are also DHCP but not statically mapped. I believe DHCP requests are sent out regardless of static mapping.

johnpoz

If you have it set to register dhcp clients in dns - then yes I believe unbound restarts.. So if you have lots of clients via dhcp and short lease times you prob have a lot of restarts of unbound.

I do not recall if they ever worked out where unbound doesn't have to restart to add new dhcp client in the dns listing?

I personally don't see the point of registering dhcp clients.. Static makes sense since if you're taking the time to reserve an IP for a client then you prob have need of resolving it via name..

Why do you need to resolve these dhcp clients by name? If you do why not setup a reservation for them ;)

Also if you're on the same network as the clients and dns does not resolve - windows will broadcast for the name ;)

7200 second lease time

That is a really LOW lease time... So every 3600 seconds or so you're going to see a request for renewal... Why would you have the lease so low... Are these clients that are very transient and you have too many clients for the available IPs? I set all my leases to 4 days...

They should prob change that default - seems pretty freaking low.. 2 hours.. a default of 24 hours would prob be a better choice if you ask me.
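
To put rough numbers on the lease-time point, here is a small sketch of the arithmetic. It assumes clients renew at half the lease time (the standard DHCP T1 default) and that every renewal triggers one Unbound restart, which is what this thread suspects but has not confirmed. The function name `renewals_per_day` is made up for illustration, and the 35 clients figure is the one quoted earlier in the thread.

```python
SECONDS_PER_DAY = 86400


def renewals_per_day(clients: int, lease_seconds: int) -> float:
    """Estimate DHCP renewals per day, assuming renewal at half the lease time."""
    renew_interval = lease_seconds / 2
    return clients * SECONDS_PER_DAY / renew_interval


# Compare the 2 hour default with 24 hour and 4 day leases for 35 clients.
for label, lease in [("2 hours", 7200), ("24 hours", 86400), ("4 days", 345600)]:
    print(f"{label:>8}: {renewals_per_day(35, lease):6.1f} renewals per day")
```

Under those assumptions, 35 clients on a 7200 second lease produce 840 renewals a day, against 70 with a 24 hour lease and 17.5 with a 4 day lease.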

stephenw10

It's still not a huge number of clients though. I have to assume Unbound is slow to start on your system.

Try restarting it manually, check the logs, how long does it actually take?

Steve

Raffi_

@johnpoz I agree on all points. I personally don't have a need to resolve DHCP clients. It would be more of a nice to have thing. That's why I never put any effort into finding an alternative solution to having DHCP registration in Unbound or Bind or elsewhere.

Yes, the default 7200 second lease time is much too short. I probably should change that :0
I never realized it was so short until I looked it up to answer Steve. I agree with you that making the default lease time 24 hours sounds more reasonable.

@stephenw10 Is this it? I have this in my log, since Unbound restarts every day due to DNSBL cron updates.
Nov 5 00:00:12 unbound 10476:0 notice: Restart of unbound 1.7.3.
Nov 5 00:00:13 unbound 10476:0 notice: init module 0: validator
Nov 5 00:00:13 unbound 10476:0 notice: init module 1: iterator
Nov 5 00:00:13 unbound 10476:0 info: start of service (unbound 1.7.3).

1 second doesn't seem bad, but I think it could be a combination of short lease times, multiple DHCP requests, Unbound restarting or in the process of restarting and all of that creating the perfect storm for delayed/failed DNS queries.
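
For measuring restart time over a longer stretch of log, here is a minimal sketch that pairs each "Restart of unbound" line with the next "start of service" line and reports the gap. It assumes the log format shown above. The function name `restart_durations` is hypothetical, and the `year` parameter is an arbitrary placeholder needed only because these syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Sample input: the four log lines quoted above.
LOG = """\
Nov 5 00:00:12 unbound 10476:0 notice: Restart of unbound 1.7.3.
Nov 5 00:00:13 unbound 10476:0 notice: init module 0: validator
Nov 5 00:00:13 unbound 10476:0 notice: init module 1: iterator
Nov 5 00:00:13 unbound 10476:0 info: start of service (unbound 1.7.3).
"""

# Timestamp, then "unbound", then pid/thread and level, then the message.
LINE_RE = re.compile(r"(\w{3}\s+\d+ \d\d:\d\d:\d\d) unbound .*?: (.*)")


def restart_durations(lines, year=2000):
    """Return seconds between each restart notice and the following service start."""
    durations = []
    restart_at = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        message = match.group(2)
        if message.startswith("Restart of unbound"):
            restart_at = stamp
        elif message.startswith("start of service") and restart_at is not None:
            durations.append((stamp - restart_at).total_seconds())
            restart_at = None
    return durations


print(restart_durations(LOG.splitlines()))  # prints [1.0]
```

The timestamps only have one second resolution, so a reported 1.0 could in reality be anywhere from just above zero to just under two seconds.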