Netgate Discussion Forum

    Unbound DNS intermittent failure

    DHCP and DNS
    • L
      Liath.WW
      last edited by

      It would seem that DNS is failing intermittently, and it has really started to impact my day to day operation.

      I'm using an old 2nd-gen i5. The board is fine, but the built-in NIC tops out at 400MBps before bottlenecking, so I added an Intel d33682 2-port NIC. It's a genuine one, Intel logo and all, because I know there were some cheap Chinese clones with bad capacitors and such.

      The machine has had this setup since, well, 2nd-gen i5s were new. I haven't had much trouble with pfSense until this latest build, with the new interface and loss of RRD graphs.  DNS has been a bit of an issue since that upgrade.  Lately it's so bad I'm pulling aggro from my family because 'the internet is broke'.

      Only real hints I can think of are that I have an AT&T modem with IP-passthrough turned on, modem has all filtering off.
      The logs will occasionally spam llinfo ARP resolution issues with the modem's IP even though the link is up and passing traffic.
      I also see in logs>system>DNS resolver that every 5 minutes, like clockwork, it is evaluating and dropping some aliases:

      
      .....lots of similar entries like
      Feb 2 13:01:08	filterdns		adding entry 54.239.172.202 to pf table Eve for host launcher.eveonline.com
      Feb 2 12:56:08	filterdns		IP address 52.84.128.4 already present on table Eve as address of hostname launcher.eveonline.com
      ...lots more of the same
      Feb 2 12:56:08	filterdns		adding entry 52.84.27.39 to pf table Eve for host resources.eveonline.com
      Feb 2 12:51:08	filterdns		clearing entry 52.84.133.166 from pf table Eve on host binaries.eveonline.com
      ....
      Feb 2 12:51:08	filterdns		adding entry 52.84.128.4 to pf table Eve for host launcher.eveonline.com
      Feb 2 12:46:09	filterdns		clearing entry 54.192.7.112 from pf table Eve on host resources.eveonline.com
      
      

      In Status > System Logs > DHCP

      
      Feb 2 13:15:33	dhclient		Creating resolv.conf
      Feb 2 13:15:33	dhclient		RENEW
      Feb 2 13:10:33	dhclient		Creating resolv.conf
      Feb 2 13:10:33	dhclient		RENEW
      Feb 2 13:05:33	dhclient		Creating resolv.conf
      Feb 2 13:05:33	dhclient		RENEW
      Feb 2 13:00:33	dhclient		Creating resolv.conf
      Feb 2 13:00:33	dhclient		RENEW
      Feb 2 12:55:33	dhclient		Creating resolv.conf
      Feb 2 12:55:33	dhclient		RENEW
      
      

      System/Gateways

      
      Feb 2 02:59:51	dpinger		WAN_DHCP6 2001:4860:4860::8888: Clear latency 10158us stddev 1982us loss 16%
      Feb 2 02:59:34	dpinger		WAN_DHCP6 2001:4860:4860::8888: Alarm latency 9857us stddev 1487us loss 21%
      
      

      I was up late last night trying to figure this out while family was asleep. In my tiredness I cleared logs for a fresh view since I was testing new cables, re-tipped even the factory tipped ones, etc. etc.  Wishing now I'd not done so.

      When the DNS is on the fritz, connections that were already made continue passing traffic as normal. Streams keep streaming, SIP calls keep working, etc.  That rules out the connection dropping as the issue.  Only DNS seems to fail, so new connections can't be made.

      Any clue what's going on and how to fix it?

      Info that might be useful:
      packages:
      bandwidthd
      darkstat
      iperf
      mtr-nox11
      openvpn-client-export
      Service_Watchdog      << Added to try to resolve the DNS issues, since I thought the service might be dying. Possibly related to the 5-minute interval with filterdns? I believe I added it because before I did, unbound just died and stayed dead.

      • T
        toluun
        last edited by

        Does a restart of unbound solve the issue? I have been having major issues with DNSSEC on unbound causing DNS failures. The same thing would happen to me: streams would continue, WAN gateways were shown as still open, etc…  Only new DNS lookups would fail.  Once I restarted unbound everything would go back to normal for a short period of time, then BOOM, DNS failures. I am still trying to solve my issue (see a couple of posts down), but I did find that disabling DNSSEC stopped the DNS failures.  Not sure if this helps, but your problems seemed very similar to mine, so I thought I would comment with my temporary fix.

        • L
          Liath.WW
          last edited by

          I am going to say "yes" to this one. I'd been having issues with it dying before, and installed the watchdog package to automatically restart it.

          From last night until shortly before I made this thread, the internet was generally unbrowsable due to constant DNS issues.  I rebooted the pfSense box a few hours ago and have had no more issues since. However, this is a recurring issue that seems to get worse until I get tired of it and reboot the entire network.

          It really concerns me because I have business clients who I really want to migrate from SonicWall to pfSense, but if I replace them and DNS is going to act like this in a business production environment, I'll be looking for new clients.

          • T
            toluun
            last edited by

            Well, at least yours sounds a lot less frequent than mine.  My DNS would go down every 10 - 30 min. Do you have DNSSEC enabled on unbound?

            • GertjanG
              Gertjan
              last edited by

              @Liath.WW : filterdns : Take a look at "binaries.eveonline.com" :

              root@ns311465:~# host binaries.eveonline.com
              binaries.eveonline.com is an alias for d17ueqc3zm9j8o.cloudfront.net.
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.137
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.7
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.11
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.177
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.52
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.156
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.186
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.181
              

              A couple of seconds later, the list changes ! :

              root@ns311465:~# host binaries.eveonline.com
              binaries.eveonline.com is an alias for d17ueqc3zm9j8o.cloudfront.net.
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.11
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.137
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.52
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.181
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.7
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.177
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.156
              d17ueqc3zm9j8o.cloudfront.net has address 13.32.153.186
              

              So it's normal that filterdns is very busy every 5 minutes removing IPs and adding new ones.
              filterdns is paid to do so.
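The two `host` listings above can be compared mechanically. A minimal Python sketch follows, assuming the output format shown in this thread; the helper names `addresses` and `compare` are made up for illustration. It separates a pure reordering (round-robin rotation of the same records) from addresses actually being added or removed, which is what makes filterdns touch the pf table.

```python
import re

# Matches lines such as "example.net has address 192.0.2.1" from `host` output.
ADDRESS_LINE = re.compile(r"has address (\d{1,3}(?:\.\d{1,3}){3})\s*$")


def addresses(host_output):
    """Return the IPv4 addresses in the order `host` printed them."""
    found = []
    for line in host_output.splitlines():
        match = ADDRESS_LINE.search(line)
        if match:
            found.append(match.group(1))
    return found


def compare(before, after):
    """Report which addresses were added or removed, and whether only the order changed."""
    old, new = addresses(before), addresses(after)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "reordered_only": set(old) == set(new) and old != new,
    }


# The two lookups quoted in this thread, a couple of seconds apart.
NAME = "d17ueqc3zm9j8o.cloudfront.net has address 13.32.153."
FIRST = "\n".join(NAME + last for last in
                  ["137", "7", "11", "177", "52", "156", "186", "181"])
SECOND = "\n".join(NAME + last for last in
                   ["11", "137", "52", "181", "7", "177", "156", "186"])

print(compare(FIRST, SECOND))
```

For the two listings quoted above it reports no additions or removals, only a different order. The filterdns log lines earlier in the thread, taken five minutes apart, do show addresses being added and cleared, so the set itself also changes over longer intervals.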

              Up to you to remove "binaries.eveonline.com" from your alias list, or complain to them ;)

              DNS: Are you using the DHCP client to obtain your WAN IP? Something is going very wrong with that. When I see it recreating "resolv.conf", I wouldn't be surprised if your local DNS server (unbound) is restarting every 5 minutes. Yep, you're right to consider your DNS to be in a very bad state. But this is not its fault.

              Find out why your DHCP client is (forced?!) to renew every 5 minutes, just like when filterdns is running… Strange. It's time to describe your setup completely.

              Btw : unbound resolves up against the root DNS servers, and is ROCK solid as a DNS server.
              Your issue is not DNSSEC related. DNSSEC activated for unbound works for thousands if not tens of thousands of pfSense installs, and all other servers that use unbound.

              No "help me" PM's please. Use the forum, the community will thank you.
              Edit : and where are the logs ??

              • L
                Liath.WW
                last edited by

                FilterDNS runs after Unbound kicks the bucket and restarts.

                
                Feb 2 16:30:10	filterdns		adding entry 54.239.172.212 to pf table Eve for host binaries.eveonline.com
                Feb 2 16:26:51	unbound	53607:0	info: start of service (unbound 1.6.6).
                ....
                Feb 2 16:26:47	unbound	88185:0	info: service stopped (unbound 1.6.6).
                Feb 2 16:25:09	filterdns		clearing entry 52.84.133.127 from pf table Eve on host binaries.eveonline.com
                ...
                Feb 2 16:25:09	filterdns		adding entry 54.192.7.236 to pf table Eve for host launcher.eveonline.com
                Feb 2 16:22:46	unbound	88185:0	info: start of service (unbound 1.6.6).
                Feb 2 16:22:46	unbound	88185:0	notice: init module 1: iterator
                Feb 2 16:22:46	unbound	88185:0	notice: init module 0: validator
                ...
                Feb 2 16:22:32	unbound	67005:0	info: server stats for thread 0: 139 queries, 61 answers from cache, 78 recursions, 0 prefetch, 0 rejected by ip ratelimiting
                Feb 2 16:22:32	unbound	67005:0	info: service stopped (unbound 1.6.6).
                Feb 2 16:21:43	unbound	67005:0	info: start of service (unbound 1.6.6).
                Feb 2 16:21:43	unbound	67005:0	notice: init module 0: iterator
                ...
                Feb 2 16:21:40	unbound	59480:0	info: server stats for thread 0: 566 queries, 183 answers from cache, 383 recursions, 6 prefetch, 0 rejected by ip ratelimiting
                Feb 2 16:21:40	unbound	59480:0	info: service stopped (unbound 1.6.6).
                Feb 2 16:20:12	filterdns		adding entry 52.84.133.127 to pf table Eve for host binaries.eveonline.com
                ...
                Feb 2 16:17:25	unbound	59480:0	info: start of service (unbound 1.6.6).
                Feb 2 16:17:25	unbound	59480:0	notice: init module 1: iterator
                Feb 2 16:17:25	unbound	59480:0	notice: init module 0: validator
                Feb 2 16:17:22	unbound	18317:0	info: 4096.000000 8192.000000 1
                ...
                
                

                If I understand you correctly, there is something happening that is causing unbound to restart.  How can I find the root cause?
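One way to start on the root-cause question is to measure how long each unbound instance actually stays up. The Python sketch below is an illustration, not a pfSense tool: the function name `unbound_lifetimes` and the placeholder year are assumptions (syslog lines carry no year), and it pairs start and stop messages by the PID field shown in the log excerpt above. If a PID were reused within one log, the pairing would be wrong.

```python
import re
from datetime import datetime

# Matches the unbound start/stop lines as they appear in the excerpt above.
EVENT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+unbound\s+(?P<pid>\d+):\d+\s+"
    r"info: (?P<kind>start of service|service stopped)"
)


def unbound_lifetimes(log_text, year=2000):
    """Return (pid, started, stopped, seconds_up) for every PID that has
    both a start and a stop line. `year` is only a placeholder for parsing."""
    started, stopped = {}, {}
    for line in log_text.splitlines():
        match = EVENT.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S")
        target = started if match["kind"] == "start of service" else stopped
        target[match["pid"]] = when
    rows = []
    for pid, start in started.items():
        if pid in stopped:
            rows.append((pid, start, stopped[pid],
                         (stopped[pid] - start).total_seconds()))
    return sorted(rows, key=lambda row: row[1])  # oldest instance first


# Start/stop lines copied from the log excerpt above (newest first).
SAMPLE = """\
Feb 2 16:26:51 unbound 53607:0 info: start of service (unbound 1.6.6).
Feb 2 16:26:47 unbound 88185:0 info: service stopped (unbound 1.6.6).
Feb 2 16:22:46 unbound 88185:0 info: start of service (unbound 1.6.6).
Feb 2 16:22:32 unbound 67005:0 info: service stopped (unbound 1.6.6).
Feb 2 16:21:43 unbound 67005:0 info: start of service (unbound 1.6.6).
Feb 2 16:21:40 unbound 59480:0 info: service stopped (unbound 1.6.6).
Feb 2 16:17:25 unbound 59480:0 info: start of service (unbound 1.6.6).
"""

for pid, start, stop, seconds in unbound_lifetimes(SAMPLE):
    print(f"pid {pid}: up {seconds:.0f}s ({start:%H:%M:%S} to {stop:%H:%M:%S})")
```

On that excerpt it gives lifetimes of 255, 49 and 241 seconds for PIDs 59480, 67005 and 88185. Lining the stop times up against the other logs (dhclient, filterdns, Service Watchdog) may show what triggers each restart.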

                One rabbit hole I fell down was the arp llinfo messages, but I don't have an example right now.  They do point to the IP of my ISP-provided modem, which I cannot get rid of. (I'm on fiber, and they said the system won't allow me to go straight from the "ONT?" (fiber<>eth bridge) to my router, but I admit I haven't tried to bypass it.)

                The passthrough on the modem is weird. The device first hands out an address in the 192.168.1.x range, then once pass-through is handled it hands out the public facing IP.

                I do see a bunch of this in the DHCP log, but I'm not 100% sure it's applicable:

                
                Feb 2 15:44:44	dhclient		Creating resolv.conf
                Feb 2 15:44:44	dhclient		RENEW
                Feb 2 15:39:44	dhclient		Creating resolv.conf
                Feb 2 15:39:44	dhclient		RENEW
                
                
                • L
                  Liath.WW
                  last edited by

                  I think I may have stumbled upon something in the ISP modem config that could be causing this, though the times are different than the pfSense 5 minute issues.
                  In the IP-passthrough page, there is a Passthrough DHCP Lease. Default value is 10 minutes.  I changed to 1 day, hopefully this is the root cause and will fix things.

                  FYI, the modem is this one:

                  Manufacturer ARRIS
                  Model Number BGW210-700
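The 10-minute lease may well explain the 5-minute pattern. Under RFC 2131, unless the server sends explicit values (DHCP options 58 and 59), a client renews at T1 = 0.5 × lease and rebinds at T2 = 0.875 × lease. The Python sketch below only does that arithmetic; whether the BGW210 overrides the defaults is not known from this thread, so treat the defaults as an assumption.

```python
from datetime import timedelta


def dhcp_timers(lease):
    """Default renewal (T1) and rebinding (T2) times per RFC 2131 section 4.4.5.
    Assumes the server does not override them with options 58/59."""
    return {"T1_renew": lease * 0.5, "T2_rebind": lease * 0.875}


for label, lease in [("10 minutes", timedelta(minutes=10)),
                     ("1 day", timedelta(days=1))]:
    timers = dhcp_timers(lease)
    print(f"lease {label}: renew after {timers['T1_renew']}, "
          f"rebind after {timers['T2_rebind']}")
# lease 10 minutes: renew after 0:05:00, rebind after 0:08:45
# lease 1 day: renew after 12:00:00, rebind after 21:00:00
```

A 10-minute lease gives a renewal every 5 minutes, which matches the spacing of the dhclient RENEW lines in the logs above. With a 1-day lease the same client would renew roughly every 12 hours.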

                  • L
                    Liath.WW
                    last edited by

                    Haven't seen many more log entries about DNS/DHCP dying since I updated the thread last night.
                    Computers seem to be going well enough.  Phones still aren't too happy, though they're phones, no idea if there's something goofy going on with them.

                    • L
                      Liath.WW
                      last edited by

                      Forgot to update this because it was late and I was tired.  Had the services die again last night, unbound restarting itself.  Switched to just using dns forwarder and haven't had a peep from anything since.

                      Despite people saying that it isn't unbound DNS, that is the service with the symptom. If there are logs or configs that someone would like to have that might be able to help identify the issue, I'll be happy to provide them. I understand that some other service failure may be causing unbound to die and restart, but thus far all of the information I've seen and read doesn't solve the issue for me, and I've not seen any useful requests that yield results.

                      Unfortunately this means I can't pitch pfSense with dnssec as a selling point.  The rest of it works great, and I've been using pfSense as a whole for years.

                      I might be able to put it in production without unbound, but if I can't get a home setup stable, it makes me wonder if the underlying cause of unbound dying would end up impacting customers.

                      • johnpozJ
                        johnpoz LAYER 8 Global Moderator
                        last edited by

                        What aliases are you using?

                        Also unbound can restart when you have it set to register dhcp.

                        An intelligent man is sometimes forced to be drunk to spend time with his fools
                        If you get confused: Listen to the Music Play
                        Please don't Chat/PM me for help, unless mod related
                        SG-4860 24.11 | Lab VMs 2.7.2, 24.11

                        • L
                          Liath.WW
                          last edited by

                          Which type of aliases would you want to know about? I have a few that have FQDN in them, I have some that are IPs and some that are ports. Be happy to share if you think there may be something with them that is causing the issues, however I'm not sure I want to 'lift my skirt' in public so to speak :P

                          Also, I didn't have the option to register DHCP leases enabled in the DNS resolver config, so while I wish it was that simple, it's not.  Although it does beg the question: why would such an option even be available if it causes instability?

                          • R
                            romainp
                            last edited by

                            Hi,
                            I've also got a really strange DNS issue…
                            From time to time, DNS resolution does not work at all or takes a very long time. It can happen several times per hour, and all systems connected to pfSense are affected. It is not coming from my ISP or the DSL router I am connected to, because if I do an nslookup google.com 8.8.8.8 it works perfectly, but if I use the internal pfSense DNS, it fails.
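The same side-by-side check can be repeated on a timer to catch the moments when only the internal resolver fails. Below is a minimal Python sketch using only the standard library: it sends one A-record query over UDP and times the reply. The names `build_query` and `probe` are illustrative, and `192.168.1.1` is a hypothetical LAN address for the pfSense box; `8.8.8.8` is the public resolver from the post.

```python
import socket
import struct
import time


def build_query(hostname, txid=0x1234):
    """Build a minimal DNS query packet: one question, type A, class IN,
    with the recursion-desired flag set."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(bytes([len(label)]) + label.encode("ascii")
                     for label in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)


def probe(server, hostname, timeout=2.0):
    """Return seconds until `server` answers without error, or None on
    timeout, socket error, mismatched reply, or a non-zero response code."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes socket timeouts
            return None
    elapsed = time.monotonic() - start
    if len(reply) < 4 or reply[:2] != query[:2] or reply[3] & 0x0F != 0:
        return None
    return elapsed


# Example use (run it in a loop, e.g. every 10 seconds, and log the results):
#   for server in ("192.168.1.1", "8.8.8.8"):   # first address is hypothetical
#       print(server, probe(server, "google.com"))
```

If the public resolver keeps answering while the pfSense address returns `None`, the timestamps of those gaps can be matched against the resolver log to see whether unbound was restarting at that moment.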

                            It first happened some weeks ago. At that time I thought the root cause could be that I had done several pfSense upgrades without a really clean installation. So I made a backup, installed from scratch, and restored my config, and everything was fine until today. The only thing I changed yesterday was installing the traffic total package.
                            I don't see any obvious reason for this issue, but I will try to investigate more.

                            Thanks.
                            R.

                            • johnpozJ
                              johnpoz LAYER 8 Global Moderator
                              last edited by

                              How many entries?  FQDNs have to be looked up every so often. If you have hundreds of FQDNs and they all return lots of IPs, then sure, that could be a contributing factor.

                              Not sure if it's still an issue, but register DHCP restarts unbound. So if you have hundreds of DHCP clients and/or very short lease times, you could have unbound restarting every few minutes, which would for sure cause a problem with clients actually being able to look up anything ;)

                              Also, if filterdns has to look up 1000s of FQDNs every few minutes that have very short TTLs etc., this could also be a problem depending…

                              • R
                                romainp
                                last edited by

                                Hi,
                                It's just a home setup with max 30 fixed DNS entries. I also use pfBlockerNG, but even if I stop it I still have this strange behaviour. I understand that unbound could be restarted when a DHCP client registers itself in DNS, but it should not take 30 sec for DNS to work again…

                                The problem is that I don't see any obvious reason in the logs that could explain this...

                                • johnpozJ
                                  johnpoz LAYER 8 Global Moderator
                                  last edited by

                                  Is it restarting or not?  I have been running unbound on a home setup in resolver mode in pfSense since before it was included, back when it was a package.  I have never had any such issues other than the DHCP restart thing.

                                  I really see no point in registering DHCP in a home setup.  All the devices I care about have reservations, so I know what IP they are on, and yes, the static entries are registered.  Devices that are just going to get some random IP out of the pool are going to be guest sort of devices and don't give 2 shits what their name or IP is, etc.  They are only going to be on the network temporarily… If they were always going to be on the network and I wanted to resolve them, they would have reservations for an IP, etc.

                                  • GrimsonG
                                    Grimson Banned
                                    last edited by

                                    @romainp:

                                    I use pfblockerng also but even if I stop it I still have this strange behaviour. I understand that unbound could be restarted when a dhcp client register itself to the dns but it should not take 30 sec to the dns to work again…

                                    Are you using the TLD feature of pfBlockerNG? If yes, did you read the infoblock? Especially this:

                                    The 'Unbound Resolver Reloads' can take several seconds or more to complete and may temporarily interrupt DNS Resolution until the Resolver has been fully Reloaded with the updated Domain changes.

                                    • L
                                      Liath.WW
                                      last edited by

                                      Myself, I have 3 aliases with domains in them.
                                      The biggest one is the eve online one, the other two point to voice servers and only resolve to one place.

                                      Also, since switching off unbound and using the forwarder only, I've not had a single peep with browsing issues, and my family is off my butt.

                                      This further points to unbound being part of the problem. Not sure how or why, but if unbound is the only thing that fails then that kinda points to unbound being at fault either itself, or by failing due to some other process and its inability to not choke on it.

                                      However, I would like to use unbound DNS, as DNSSEC is something that I believe in and my clients would require.  If only we could get to the bottom of the issue and restore my confidence in the product, I'd start pitching it.  Heck, I have one client that lately requests daily rule changes, which consume time by requiring a login on each SonicWall individually across 18 sites… with differing firmware to make life more interesting.  If I could run all of the sites on small appliances running pfSense, it would cut at least 12 hours a week of unproductive time.

                                      • R
                                        romainp
                                        last edited by

                                        Thanks for the info.

                                        I use pfBlockerNG and need unbound, but even if I stop pfBlockerNG I still have the issue. I will try to set the debug level higher and have the stats and logs managed by telegraf (I saw a plugin for unbound but am not sure it can work) or use collectd (I saw an article on how to use collectd on pfSense).
                                        If I can output those logs and stats to an ELK stack, I can at least look for a pattern, because I do not see any error messages in the logs…

                                        R.

                                        • L
                                          Liath.WW
                                          last edited by

                                          If you can work out why it's crashing on your end, I'd love to hear about it.  I wonder if it is something hardware related, or some obscure setting that we've both used.

                                          I just can't figure it out.

                                          • R
                                            romainp
                                            last edited by

                                            Hi,
                                            I do not have proof of it, but it seems related to the fact that when some DHCP client requests a new IP, the DHCP server sends a signal to the DNS server (which is correct, since I asked the DNS resolver to accept that somewhere in the config), but when the SIGHUP occurs, DNS does not process any requests for 20-30 secs.

                                            I will try to get some logs/detailed info about that, but I am pretty sure of this.

                                            R.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.