DNS issues after upgrading from 2.4.3 to 2.4.4



  • Hi,
    our firewall is practically unusable since upgrading to pfsense 2.4.4.
    DNS requests are answered with a massive delay (10'000+ ms) or timeouts.
    We have four WAN connections, and this happens only on the Default Gateway.

    Choosing a different interface as default gateway will only relocate the problem to the new default gateway.

    An additional headache is that pfsense seems to wait until all configured DNS servers have answered. So even if the DNS servers on the remaining WAN lines answer quickly, a client will not get a reply before the one on the default gateway has answered. This is not exactly what we expect from a failover configuration.

    We are using pfblocker and dnsbl, which I suspected to be the root cause. But disabling both of them doesn't bring any improvements.

    Another indication is that RTT/RTTsd times will build up to 2'000ms and more on the default gateway. Same thing here: if I choose a different interface as default gateway, the same problem will start happening there.

    For the time being, we're running a spare pfsense with 2.4.3 which doesn't have those issues. Any help would be greatly appreciated.

    Edit: I discovered one more thing.
    We have three internal networks, only one of which is configured to use pfsense as a DNS server. As failing DNS resolution kept clients in that network from accessing the internet, I changed the DHCP options so that clients would use a public DNS server directly. Mysteriously, this solved the two issues described above (DNS delay and high RTT time).
    We still need to fix this, as we have some DNS overrides configured, but it seemed worth mentioning...


  • Netgate Administrator

    What DNS settings are you using? Resolver in resolving mode? Sounds like you might be using forwarding mode if it's checking multiple configured servers.
    What DNS settings are you using in System > General?

    Steve



  • Do you by any chance have the "DHCP Registration" option enabled under Services => DNS Resolver? That can cause Unbound to restart almost every time a new DHCP request is sent out. Ask me how I know.



  • Thanks for your replies!
    In the end, I was too impatient, tried a couple of things and grudgingly uninstalled pfblocker. Pfsense seems to run smoothly now (even acting as an internal DNS server again), although I'm still unaware of the root cause.

    Yes, DNS resolver is enabled, DNS forwarder disabled. Listening and outgoing interfaces in resolver are both set to "all".
    "DNS Query Forwarding" is off (which doesn't exactly make sense to me, now that I'm looking at it).
    "DHCP Registration" and "Static DHCP" are enabled; I remember noticing the frequent service restarts and could disable this.
    Otherwise there are a couple of host overrides and one domain override that has been working smoothly for a long time.

    In General Setup, we have four external DNS servers configured, one per WAN connection. Putting alternative DNS servers here didn't bring any improvement. "DNS Server Override" and "Disable DNS Forwarder" are unchecked here.



  • After running smoothly for a day, the problem just reappeared.
    High latency on WAN1 (default gateway), although nearly no traffic is visible. One of the four external DNS servers shows "no response" in Diagnostics > DNS Lookup, and in effect there is no internet connection for anyone.


  • Netgate Administrator

    Is it only the gateway IP that shows high latency? If you change the monitoring IP to something else for that WAN does it still show that issue?

    Steve



  • Disable DHCP registration. I'm pretty sure that's going to solve the issue. That's what's causing the resolver to restart. Of course Unbound can't resolve any DNS queries while the service is restarting. That's why sometimes resolution is very delayed or fails completely. I had this exact same issue. Disabling DHCP registration fixed it. There are many threads on this issue.
    https://forum.netgate.com/topic/120838/unbound-appears-to-restart-frequently-and-fails-to-resolve-domains-sometimes/9

    https://forum.netgate.com/topic/80517/unbound-seems-to-be-restarting-frequently



  • The problem reoccurs roughly once a day for a couple of minutes. Then pfsense will work smoothly again without further intervention.

    Right now, it turned up again.
    @stephenw10 I tried pinging WAN1's monitoring IP via a different WAN line - RTT is very acceptable there. I tried putting a different monitoring IP, but the problem persists.
    @Raffi_ I disabled DHCP registration right now. No improvement up to here, but I'll keep an eye on it.

    Oh, and there was one piece of misinformation up there: on the machine we're currently using, pfblocker is still installed, but disabled. Not sure if this is of interest.



  • It hasn't happened again in the last three days. So was it really just disabling DHCP registration?
    Well, thanks a bunch for the time being!



  • I hope that was it. Keep an eye on it and let us know if not.


  • Netgate Administrator

    Hmm, curious. What's odd is that most people who have that enabled don't hit it. I've never seen any problems with it in testing. It could be simply a scaling issue; it works fine with a few test clients but gets overwhelmed on a large network. But we have many large customer networks not hitting it either.
    When you look into what that is doing it's not hard to see why it causes some disruption. It's almost harder to see how it does not! Yet in the majority of cases it doesn't.

    Thanks for the update anyway.

    Steve



  • @stephenw10 that's interesting. In my scenario it's a relatively small network of about 35 clients, about half of which have DHCP static mappings. I could not get reliable behavior from Unbound without disabling DHCP registration. Although it would be nice to have that feature enabled, having DNS queries fail is not an option.


  • Rebel Alliance Developer Netgate

    It mostly depends on how quickly unbound restarts. For most people it's very fast, but the more you have in it (say, DNSBL lists), the slower it gets. Also depends on hardware speed.

    Another case for offloading DNS to a proper DNS server in some way. Either offload DHCP registration to a proper BIND setup that can handle dynamic registration from dhcpd, or offload the filtering aspects to something like PiHole.

    I suppose you could also do something convoluted like set up the DNS Forwarder on an alternate port, with DHCP registration enabled there, and then set up a domain override in the DNS Resolver to send queries for hosts it doesn't know in the domain there. That seems like a bad idea, though. :-)



  • @jimp that would explain it. In my case it's a nearly 10 year old desktop and even when it was new it wasn't top of the line. First gen i5 with 8 GB RAM and 120 GB SSD. I do use DNSBL with over 100k IPs/URLs. The idea of Bind in conjunction with Unbound is interesting.


  • Netgate Administrator

    How many DHCP clients? What is the lease time?

    I imagine a large DNSBL being added would slow down the Unbound restart time. Did you ever test it with that disabled?

    Steve



  • @stephenw10 The setup has about 15 DHCP clients with the default 7200 second lease time. I don't remember if I tried enabling DHCP registration before setting up DNSBL. I'm currently running the setup in a production environment, so testing that is unfortunately not something I can do.

    With DHCP registration enabled, it seems that each time a DHCP request is made, Unbound is restarting. With my lease time set to 2 hours, it makes sense that I was having a lot of trouble with Unbound restarting. I assume increasing the lease time to a day would dramatically reduce the number of times I see the problem in my case.



  • My mistake, almost all 35 or so clients are DHCP. About 16 of those 35 are DHCP without a static mapping. I believe DHCP requests are sent out regardless of static mapping.


  • Rebel Alliance Global Moderator

    If you have it set to register dhcp clients in dns - then yes I believe unbound restarts.. So if you have lots of clients via dhcp and short lease times you prob have a lot of restarts of unbound.

    I do not recall if they ever worked out where unbound doesn't have to restart to add new dhcp client in the dns listing?

    I personally don't see the point of registering dhcp clients.. Static makes sense since if you're taking the time to reserve an IP for a client then you prob have need of resolving it via name..

    Why do you need to resolve these dhcp clients by name? If you do why not setup a reservation for them ;)

    Also if you're on the same network as the clients and dns does not resolve - windows will broadcast for the name ;)

    "7200 second lease time"

    That is a really LOW lease time... So every 3600 seconds or so you're going to see a request for renewal... Why would you have the lease so low... Are these clients that are very transient and you have too many clients for the available IPs? I set all my leases to 4 days...

    They should prob change that default - seems pretty freaking low.. 2 hours.. default of 24 hours would prob better choice if you ask me.
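
    The lease-time arithmetic discussed above can be sketched in a few lines of Python. This is only a rough estimate under two assumptions that are not confirmed in this thread: every client renews at half its lease time (the usual DHCP T1 default), and every renewal triggers one Unbound restart. The function name is made up for illustration, and the result is best read as an upper bound.

```python
# Rough upper bound on DHCP-triggered Unbound restarts per day.
# Assumptions (not confirmed in the thread): clients renew at lease/2,
# and each renewal causes exactly one restart.

SECONDS_PER_DAY = 24 * 60 * 60


def restarts_per_day(clients: int, lease_seconds: int) -> float:
    """Renewals per day across all clients, assuming renewal at lease/2."""
    renew_interval = lease_seconds / 2
    return clients * SECONDS_PER_DAY / renew_interval


if __name__ == "__main__":
    # 35 clients is the network size mentioned earlier in the thread.
    for label, lease in [("2 hours", 7200), ("24 hours", 86400)]:
        count = restarts_per_day(35, lease)
        print(f"{label} lease, 35 clients: about {count:.0f} renewals/day")
```

    With 35 clients this gives about 840 renewals per day at the 7200 second default and about 70 per day with a 24 hour lease, which is why a longer lease should make the problem show up far less often.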


  • Netgate Administrator

    It's still not a huge number of clients though. I have to assume Unbound is slow to start on your system.

    Try restarting it manually, check the logs, how long does it actually take?

    Steve



  • @johnpoz I agree on all points. I personally don't have a need to resolve DHCP clients. It would be more of a nice to have thing. That's why I never put any effort into finding an alternative solution to having DHCP registration in Unbound or Bind or elsewhere.

    Yes, the default 7200 second lease time is much too short. I probably should change that :0
    I never realized it was so short until I looked it up to answer Steve. I agree with you that making the default lease time 24 hours sounds more reasonable.

    @stephenw10 Is this it? I had this in my log since Unbound restarts every day due to DNSBL cron updates.
    Nov 5 00:00:12 unbound 10476:0 notice: Restart of unbound 1.7.3.
    Nov 5 00:00:13 unbound 10476:0 notice: init module 0: validator
    Nov 5 00:00:13 unbound 10476:0 notice: init module 1: iterator
    Nov 5 00:00:13 unbound 10476:0 info: start of service (unbound 1.7.3).

    1 second doesn't seem bad, but I think it could be a combination of short lease times, multiple DHCP requests, Unbound restarting or in the process of restarting and all of that creating the perfect storm for delayed/failed DNS queries.


  • Netgate Administrator

    There should be some time before that also, I expect the restart process to be between:
    Nov 5 21:41:24 unbound 51810:0 info: service stopped (unbound 1.7.3).
    and
    Nov 5 21:41:25 unbound 51810:0 info: start of service (unbound 1.7.3).
    For example.

    Though that's still within 1s and I have DNSBL enabled. But far fewer entries than you.
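
    For anyone who wants to measure this across a whole log rather than by eye, here is a minimal Python sketch. It assumes the resolver log lines look exactly like the ones quoted in this thread (month, day, time, `unbound`, pid, level, message). The function name and the `year` parameter are made up for illustration; the year is arbitrary because these log lines do not carry one.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Nov 5 21:41:24 unbound 51810:0 info: service stopped (unbound 1.7.3).
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+unbound\s+\S+\s+\w+:\s+(?P<msg>.*)$"
)


def restart_durations(lines, year=2000):
    """Seconds between each 'service stopped' and the next 'start of service'."""
    durations = []
    stopped_at = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an unbound line in the expected format
        ts = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
        msg = match["msg"]
        if msg.startswith("service stopped"):
            stopped_at = ts
        elif msg.startswith("start of service") and stopped_at is not None:
            durations.append((ts - stopped_at).total_seconds())
            stopped_at = None
    return durations


SAMPLE = [
    "Nov 5 21:41:24 unbound 51810:0 info: service stopped (unbound 1.7.3).",
    "Nov 5 21:41:25 unbound 51810:0 info: start of service (unbound 1.7.3).",
]

if __name__ == "__main__":
    print(restart_durations(SAMPLE))
```

    The length of the returned list is also the number of restarts in the log, so the same function shows how often Unbound restarts as well as how long each restart takes. Timestamps only have one-second resolution, so anything below a second shows up as 0 or 1.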

    Steve



  • I have had issues with DNS after upgrading to 2.4.4 as well. It was so intense that I would lose internet connectivity every 3-5 mins. Pinging an external IP would work, but pinging anything with a name (www.google.com) wouldn't.

    I also have a VPN client running and initially I thought it might be the VPN causing issues. I was going back and forth with my VPN provider to see what I could do. After a lot of reading I went against their tutorials and stopped forwarding DNS queries and started using DNS resolver. Also did a few other things at the time.

    Overall, I ended up installing pfSense 4 times and setting things up over and over again. Finally, for me it turned out to be the excess blocking that I had enabled in pfBlocker. I had subscribed to too many lists, I guess. I ran pfSense without pfBlocker for a week and had no issues. Finally I enabled pfBlocker again, but only subscribed to 1 EasyList. This is a lot less than what I was subscribed to with 2.4.3, where I had no issues even with the larger set of lists.

    I do get a few more ads on my pages than I would like, but at least my wife isn't on my case every 5 mins. :)

    I will keep watching this thread to see if there are other pointers that I can tweak in order to block as many ads, junk sites etc. without losing my mind over dropped DNS requests/unbound restarts.

    EDIT: I just checked and it seems that I do have the
    DHCP Registration, Static DHCP Registration & OpenVPN Clients Registration all checked.

    I can't remember, however, whether those were all checked when I was having issues or whether I enabled them after I got a stable network.


  • Netgate Administrator

    Any idea how many DNSBL entries?

    I have 20480 here currently and have never seen any issues.

    Steve



  • @stephenw10 Yup, the service stopped entry in my log had the same time stamp as the restart entry so I left it out since it was negligible.



  • @inxsible Disable DHCP registration if you're having issues with unbound restarts. It's a feature you probably don't need anyway, so any minor benefit you get from it is not worth the cost of having unbound restarts triggered.

    Also, below is a great video on getting things going with pfblocker. You don't have to use all lists and recommendations, but this is where I started and I don't have many false positives.
    https://www.youtube.com/watch?v=QwFpMwXEK5w&list=LLKjPM3pDxt_EiYOfJgxsvQQ&t=305s&index=5



  • @raffi_ Thanks. I will check the video out and see what I can tweak. This is the tutorial I followed for the pfBlocker setup:

    https://www.linuxincluded.com/block-ads-malvertising-on-pfsense-using-pfblockerng-dnsbl-old/

    The first couple of times that I set up pfBlocker, I used all the lists that he mentioned in that blog -- except the TLD blocking. Currently, I only use 1 EasyList.



  • We've been on our backup machine since my last post and have had zero problems since disabling DHCP registration.
    Feeling confident, I tried to switch back to our production machine, and the problems started again. This time they didn't stop after a couple of minutes as observed before, but just continued.

    One new observation I can share:
    pfsense opens an unreasonable number of DNS requests - I see around 9000 entries in the state table, most of them on port 53 (DNS). A normal number for our environment at this time of day would be around 2500.
    The interesting thing is that the other pfsense (which shares the same WAN router in front of pfsense) shows the same symptoms (high RTT time and no DNS), even though it does not receive any requests from the LAN. This leads me to assume that either the router in front of pfsense (a Fritz!Box 7362) gives up under the flood of DNS requests, or the DNS servers themselves throttle us because of the massive number of requests they receive from us.
    After switching back to the backup pfsense, problems are instantly gone.

    Does anyone have an idea why pfsense would want to start such a mass of DNS requests?
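
    To put a number on "most of them are port 53", a state dump can be counted with a small Python sketch like the one below. It assumes the text comes from `pfctl -ss` and that each state line contains IPv4 endpoints written as `address:port`. The sample lines are made up (documentation addresses) and the exact column layout on a given pfSense version may differ, so treat this as an illustration rather than a finished tool.

```python
def dns_state_share(pfctl_output: str, port: int = 53):
    """Return (states touching `port`, total states) for a state-table dump.

    A state counts as a match if any endpoint token ends in ":<port>".
    Parentheses around NAT-translated endpoints are stripped first.
    """
    suffix = f":{port}"
    matching = 0
    total = 0
    for line in pfctl_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        total += 1
        tokens = [token.strip("()") for token in line.split()]
        if any(token.endswith(suffix) for token in tokens):
            matching += 1
    return matching, total


# Made-up sample in the assumed `pfctl -ss` layout.
SAMPLE = """\
all udp 192.0.2.10:40001 -> 198.51.100.53:53       MULTIPLE:SINGLE
all udp 192.0.2.10:40002 -> 198.51.100.53:53       MULTIPLE:SINGLE
all tcp 192.0.2.10:51000 -> 203.0.113.7:443       ESTABLISHED:ESTABLISHED
"""

if __name__ == "__main__":
    dns, total = dns_state_share(SAMPLE)
    print(f"{dns} of {total} states involve port 53")
```

    Comparing this count between the production and the backup machine at the same time of day would show whether the extra states really are DNS.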


  • Netgate Administrator

    Do you have a large number of aliases with FQDNs in them? Those will all be resolved when the ruleset is generated.

    Steve



  • You mean at Firewall>Aliases?
    Only ten IP Aliases. That shouldn't be the problem, I guess.


  • Netgate Administrator

    But do they have a lot of FQDNs in them? They are all resolved when the ruleset is generated which can make a lot of connections in a short time.

    Steve



  • No, only IP addresses, and not more than 20 each.


  • Netgate Administrator

    Hmm, well you could turn up the logging in Unbound to see what is being resolved. You might need to expand the log size if it's a lot of things.

    Steve



  • "Turning up logging" means switching to "Raw display"?

    Well, I made another attempt this morning and it seems to work for the time being.
    It seems that, during the last attempt, DHCP registration had accidentally been active. I disabled it immediately upon noticing it, but it was probably too late.


  • Netgate Administrator

    You can set Unbound's log level in the advanced tab. If you set it to level 3 or higher you can see the queries made against it so you would see whatever it is resolving initially.

    Steve