DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0
-
@daddygo
I do not use it, so I delete it. Let's see help this or not. -
@jabacrack said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
so I delete it.
Go ahead, though good stuff
-
It didn't help.
-
@jabacrack
I have the same problem : upgraded from 2.5.2 to 2.6.0, and unbound randomly stops responding.
I must restart the service to get DNS resolutions back.(I remember I had the same problem when upgrading from 2.4.x to 2.5...)
-
would you look in the logs to see what happens at the same time with the other services when Unbound stops...
(is there a trigger that could cause this?) -
@daddygo
Here is what I have in system log (general) filtered on "unbound"
(restarted the service at 10:11:40)2022-03-12 09:28:31.886087+00:00 php-fpm 10571 /services_unbound.php: Configuration Change: admin@10.21.11.10 (Local Database Fallback): DNS Resolver configured. 2022-03-12 08:58:51.561997+00:00 php 337 rc.bootup: sync unbound done. 2022-03-12 08:58:50.780246+00:00 php 337 rc.bootup: Unbound /var/unbound/root.key file is corrupt, removing and recreating. Mar 12 08:56:59 php-fpm 367 /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1647075419] unbound[82383:0] error: can't bind socket: Can't assign requested address for 10.21.1.254 port 53 [1647075419] unbound[82383:0] fatal error: could not open ports' Mar 12 08:56:22 php-fpm 367 /rc.newwanip: Unbound /var/unbound/root.key file is corrupt, removing and recreating. Mar 12 08:56:08 php-fpm 367 /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1647075368] unbound[49925:0] error: can't bind socket: Can't assign requested address for 10.21.1.254 port 53 [1647075368] unbound[49925:0] fatal error: could not open ports' Mar 12 08:56:03 php-fpm 9076 /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1647075363] unbound[72574:0] error: can't bind socket: Can't assign requested address for 10.21.1.254 port 53 [1647075363] unbound[72574:0] fatal error: could not open ports' Mar 12 08:55:31 php-fpm 367 /rc.newwanip: Unbound /var/unbound/root.key file is corrupt, removing and recreating. Mar 12 08:55:27 php-fpm 9076 /rc.newwanip: Unbound /var/unbound/root.key file is corrupt, removing and recreating. Mar 12 08:50:29 php-fpm 9076 /rc.linkup: Unbound /var/unbound/root.key file is corrupt, removing and recreating. Mar 12 08:49:49 php-fpm 367 /rc.linkup: Unbound /var/unbound/root.key file is corrupt, removing and recreating. Mar 12 08:48:45 php 428 rc.bootup: sync unbound done. Mar 12 08:47:51 php 428 rc.bootup: Unbound start waiting on dhcp6c. Mar 12 08:47:50 php 428 rc.bootup: Unbound start waiting on dhcp6c. Mar 12 08:47:49 php 428 rc.bootup: Unbound start waiting on dhcp6c. Mar 12 08:47:48 php 428 rc.bootup: Unbound start waiting on dhcp6c. Mar 12 08:47:47 php 428 rc.bootup: Unbound start waiting on dhcp6c. Mar 12 08:47:46 php 428 rc.bootup: Unbound start waiting on dhcp6c. Mar 12 08:47:44 php 428 rc.bootup: Unbound start waiting on dhcp6c. Mar 12 08:47:43 php 428 rc.bootup: Unbound start waiting on dhcp6c. Mar 12 08:47:42 php 428 rc.bootup: Unbound start waiting on dhcp6c. Mar 12 08:47:41 php 428 rc.bootup: Unbound start waiting on dhcp6c.
and DNS Resolver log :
2022-03-12 10:11:41.165877+01:00 unbound 89573 [89573:0] info: generate keytag query _ta-4f66. NULL IN 2022-03-12 10:11:40.290016+01:00 unbound 89573 [89573:0] info: start of service (unbound 1.13.2). 2022-03-12 10:11:40.276727+01:00 unbound 89573 [89573:0] notice: init module 1: iterator 2022-03-12 10:11:40.276345+01:00 unbound 89573 [89573:0] notice: init module 0: validator 2022-03-12 09:58:58.856897+01:00 unbound 58559 [58559:2] info: generate keytag query _ta-4f66. NULL IN 2022-03-12 09:58:51.577322+01:00 unbound 58559 [58559:0] info: start of service (unbound 1.13.2). 2022-03-12 09:58:51.562657+01:00 unbound 58559 [58559:0] notice: init module 1: iterator 2022-03-12 09:58:51.562307+01:00 unbound 58559 [58559:0] notice: init module 0: validator Mar 12 08:55:25 unbound 55268 [55268:0] info: 64.000000 128.000000 153
I switched to BIND package, to try.
-
@jabacrack please install service watch guard on pfsense package and add dns resolver service in watchguard .
Thanks
-
@daddygo
Some new information:- After I delete pfBlockerNG dns resolver fail on next day, I restart it
- It do not fail on next day and I think "I guess I had to restart it after removing pfBlocker". But it fail on third day.
- I restart router and it work one week, until today.
I don't find anything interesting in logs.
Latest DNS resolver entries. Looks like it restart but cannot start correctly, but maybe I'm wrong.
Mar 20 13:43:18 unbound 29389 [29389:0] info: [25%]=0.125611 median[50%]=0.32768 [75%]=0.643216
Mar 20 13:43:18 unbound 29389 [29389:0] info: lower(secs) upper(secs) recursions
Mar 20 13:43:18 unbound 29389 [29389:0] info: 0.008192 0.016384 1
Mar 20 13:43:18 unbound 29389 [29389:0] info: 0.032768 0.065536 1
Mar 20 13:43:18 unbound 29389 [29389:0] info: 0.065536 0.131072 6
Mar 20 13:43:18 unbound 29389 [29389:0] info: 0.131072 0.262144 5
Mar 20 13:43:18 unbound 29389 [29389:0] info: 0.262144 0.524288 8
Mar 20 13:43:18 unbound 29389 [29389:0] info: 0.524288 1.000000 6
Mar 20 13:43:18 unbound 29389 [29389:0] info: 1.000000 2.000000 2
Mar 20 13:43:18 unbound 29389 [29389:0] info: 2.000000 4.000000 1
Mar 20 13:43:18 unbound 29389 [29389:0] notice: Restart of unbound 1.13.2.
Mar 20 13:43:18 unbound 29389 [29389:0] notice: init module 0: validator
Mar 20 13:43:18 unbound 29389 [29389:0] notice: init module 1: iterator
Mar 20 13:43:18 unbound 29389 [29389:0] info: start of service (unbound 1.13.2).
Mar 20 13:43:20 unbound 29389 [29389:3] info: generate keytag query _ta-4f66. NULL IN
Mar 20 13:43:20 unbound 29389 [29389:2] info: generate keytag query _ta-4f66. NULL INDHCP Log when DNS failed
Mar 20 13:51:49 dhcpd 36656 DHCPREQUEST for 192.168.0.139 from da:2e:07:e3:c7:6d (M2004J19C-Redmi9) via igb1
Mar 20 13:51:49 dhcpd 36656 Wrote 0 class decls to leases file.
Mar 20 13:51:49 dhcpd 36656 Wrote 0 deleted host decls to leases file.
Mar 20 13:51:49 dhcpd 36656 Wrote 0 new dynamic host decls to leases file.
Mar 20 13:51:49 dhcpd 36656 Wrote 37 leases to leases file.
Mar 20 13:51:49 dhcpd 36656 DHCPACK on 192.168.0.139 to da:2e:07:e3:c7:6d (M2004J19C-Redmi9) via igb1
Mar 20 13:51:49 dhcpleases 78677 Sending HUP signal to dns daemon(29389)
Mar 20 13:51:49 dhcpleases 78677 Could not deliver signal HUP to process 29389: No such process.@ahsunh
thank you, that will help to avoid this problem.
But does not cancel the fact that pfsense or unbound has error, that was not in the previous version of pfsense. And I think better to fix it than avoid it. -
@jabacrack said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
And I think better to fix it than avoid it.
Exactly, but unfortunately there seem to be quite a few people here who don't think that way.
-
This :
@jabacrack said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
Mar 20 13:43:18 unbound 29389 [29389:0] notice: Restart of unbound 1.13.2.
is the issue.
unbound was told to restart.
It was running before, using process ID n° 29389.
It stops, and at the very end, it restarts itself. New processes will have a different process ID, this is valid for all Microsoft, macOS, Linux and FreeBSD systems.
Your logs keep showing process ID 29389.
This tells me that you should check if you haven't a second (or more ?) zombie unboud process running on your system.Open a console or SSH shell, and when you see the menu, use option 8.
Then :
[2.6.0-RELEASE][admin@pfsense.local.net]/root: ps ax | grep unbound31176 - Ss 8:58.04 /usr/local/sbin/unbound -c /var/unbound/unbound.conf 32713 - S 0:03.13 /usr/local/sbin/lighttpd_pfb -f /var/unbound/pfb_dnsbl_lighty.conf 97333 0 S+ 0:00.00 grep unbound
Line 2 and 3 are not relevant, but line 1 show that I have one unbound process.
Note that line one shows the process ID - 32713 in mt case.Check unbound the process ID file :
cat /var/run/unbound.pid
It should show the same ID : 32713.
If you find more then one unbound instance, kill them all.
Use the kill command followed with process ID number, likekill 31176 kill xxxxxx etc
and wipe the process ID file, if it isn't already gone :
rm /var/run/unbound.pid
Now you can start unbound in the GUI.
@jabacrack said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
Mar 20 13:51:49 dhcpd 36656 DHCPACK on 192.168.0.139 to da:2e:07:e3:c7:6d (M2004J19C-Redmi9) via igb1
Mar 20 13:51:49 dhcpleases 78677 Sending HUP signal to dns daemon(29389)
Mar 20 13:51:49 dhcpleases 78677 Could not deliver signal HUP to process 29389: No such process.This shows that the process dhcpleases wants to to send a message "HUP' ( it's a restart message) to the process 28389.
dhcpleases wants to restart unbound and it knows what process ID it has : it reads /var/run/unbound.pid to find out.
Guess what : process "29389" might be what's left of unbound, but it isn't responding any more ( == zombie).For every DHCPREQUEST / DHCPRENEW / DHCPACK event, dhcpleases will restart unbound. You have a lot of devices ? unbound will get restarted a lot ! Thus the change a zombie gets created grows a lot.
This is a very known situation.To stop that issue right away :Goto the resolver setting page and un check this option :
Save and apply.
if you need some of your devices to be known with their host name and IP addresses, make a couple of static MAC DHCP leases on the Services > DHCP Server > LAN page.
Btw : adding pfBlockerng-devel will restart unbound every day, only if feeds got updated.
When pfBlockerng-devel is sued , the overall stop start time of unbound gets longer, from a second or two (default pfSense system with no packages) to several tens of seconds or even minutes. For those who use a lot of feeds : it can takes minutes.
Side effect : during the restart time, DNS doesn't work
Now, throw in some devices on your LAN that trigger DHCP events a bit more then 'frequent' and your set for the situation that makes you think that "DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0".I'm using pfBlockerng-devel (latest version) with some feeds (about 5 ?).
No DHCP lease related unbound restarts as I de activated "DHCP Registration".Still, my unbound restarts often. See here for yourself.
That's me, fiddling with the system ;)
-
@gertjan
thank you for your help. Looks like possible reason. I check it now and I haven't any zombie, I'll wait for next fail to check your theory.@gertjan said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
To stop that issue right away :Goto the resolver setting page and un check this option :
How I previously say, solutions like this help to avoid problem, but do not fix it. Yes, if unbound will restart not so frequently it will fail with lower probability. But if it fail one time in half of year versus one time in week does that mean that problem fixed? I think - no. It still can fail. And on 2.5.2 it do not fail with same amount srestarts at day.
@gertjan said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
your set for the situation that makes you think that "DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0".
I think this is irrelevant to me:
- I delete pfBlockerng and unbound fail anyway
- When I visit DNS resolver status page it say that it is not running. And it was in this state ~ 3 hours before I start it again last two times.
-
@gertjan
Today morning it fail again.No zombies while unbound stoped:
ps ax | grep unbound
44241 - S 0:00.00 sh -c ps ax | grep unbound 2>&1 44464 - S 0:00.00 grep unbound 64935 - Is 0:01.14 /usr/local/sbin/dhcpleases -l /var/dhcpd/var/db/dhcpd.leases -d lan -p /var/run/unbound.pid -u /var/unbound/dhcpleases_entries.conf -h /etc/hosts
cat /var/run/unbound.pid
91903
Logs:
Mar 22 09:28:56 unbound 91903 [91903:0] info: 4.000000 8.000000 36 Mar 22 09:28:56 unbound 91903 [91903:0] info: 8.000000 16.000000 14 Mar 22 09:28:56 unbound 91903 [91903:0] notice: Restart of unbound 1.13.2. Mar 22 09:28:56 unbound 91903 [91903:0] notice: init module 0: validator Mar 22 09:28:56 unbound 91903 [91903:0] notice: init module 1: iterator Mar 22 09:28:56 unbound 91903 [91903:0] info: start of service (unbound 1.13.2). Mar 22 09:29:06 unbound 91903 [91903:3] info: generate keytag query _ta-4f66. NULL IN Mar 22 09:49:53 dhcpleases 64935 Sending HUP signal to dns daemon(91903) Mar 22 09:49:53 dhcpleases 64935 Could not deliver signal HUP to process 91903: No such process. Mar 22 09:49:53 dhcpleases 64935 Sending HUP signal to dns daemon(91903) Mar 22 09:49:53 dhcpleases 64935 Could not deliver signal HUP to process 91903: No such process
Looks like this is something else. Or I'm wrong?
-
@jabacrack said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
Looks like this is something else. Or I'm wrong?
Your "ps ax | grep unbound" lists some stuff, like the command itself (normal) and the dhcpleases process (the one that is paid to shoot unbound in the head as much as possible).
Look again.
I was asking you to look for multiple instances of unbound.
You found the contrary.
There was - at that moment, or a couple of seconds later, a unbound pid file. This file is created by the process (unbound) when it starts - or to be more precise : when it finished starting).
Or, the ps command did not list a unbound process.That's what 'dhcpleases' is telling you : unbound is so slow to start up, that it signals (logs) you that the known unbound pid file "/var/run/unbound.pid" contains a process number that is not in the process list any more.
Or, there are so many DHCP events on you LAN that 'dhcpleases' is chain gunning the unbound process. It tries to sand more HUPs then that unbound can handle.
The result is what you experience : DNS is more down then up.Btw : you could try to make the default lease time of the DHCP lease longer. This way, less DHCP renegotiations will happen. But this will out only for cabled connections. Wifi connections come and go, and on every wifi reconnect a DHCP event will start.
When your wifi device is on the edge of the zone wifi zone covered by the AP, , you trigger a lot of DHCP events. You wind up 'killing' unbound.Check your logs, how many "Sending HUP signal to dns daemon" from the "dhcpleases" did you find ?
Check the unbound (resolver) log file, and look up how long it took for unbound to start. When it's done, the process should be present (the ps command), a pid file should exists, and very important : the number in the file should be the process id of the unbound process found with 'ps'.
Example :Feb 16 15:07:08 unbound 7464 [7464:0] info: service stopped (unbound 1.13.2).
2 seconds later :
Feb 16 15:07:10 unbound 59164 [59164:0] info: start of service (unbound 1.13.2).
( I do have pfBlockerng-devel using the Python mode - a couple of DNSBL feeds)
Final test :
dig @127.0.0.1 a.b.c +short
where a.b.c is a host name that shouldn't be in the DNS cache file (a file you - you and all LAN devices - didn't look up, just find some random one, as millions exist). This 'dig' command forces unbound to do resolve the a.b.c. hostname. If you get an IP back, you know your DNS is ok.
edit :
Or the very informative :dig @127.0.0.1 a.b.c +trace
@jabacrack said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
How I previously say, solutions like this help to avoid problem, but do not fix it
Sure thing.
Still, a good work around (you can do it a coup;e of seconds) is far better then waiting for a solution ( going for a couple of years now ).
The work around can even be made better, close to a perfect solution, when you add all your known LAN devices as a DHCP MAC leases list.unbound is a process that starts during system start.
Like a nginx or apache2 web server, or the big brother 'bind' or postfix, world's most used mail server, these are actually never restarted.
Only during setup or testing the admin restarts them, for testing purposes.
unbound is the same. It's not meant to be restarted like this. -
@gertjan said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
I was asking you to look for multiple instances of unbound.
If it fail, there is no any instance of unboud. If it working - only one instance.@gertjan said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
Check your logs, how many "Sending HUP signal to dns daemon" from the "dhcpleases" did you find ?
Today, from morning to evening, when it fail again - 70.
@gertjan said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
Check the unbound (resolver) log file, and look up how long it took for unbound to start.
second or less.
@gertjan said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
Final test :
Sorry, I don't understand how it help me. When it fail,
Status \ DNS Resolver
page say that it is stopped, like this
And how I check previously there no any zombie process in this case. So 100% dns resolving do not work, I see when try to visit sites. And it do not start working until I manually run it.@gertjan said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
Sure thing.
I still hope that with your and other guys help I can localize this problem and add corresponded issue to bug tracker. And, maybe, in future it will be fixed. If it impossible I use workarounds.
-
@jabacrack said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
Sorry, I don't understand how it help me.
When you see :
you know DNS, the resolver, isn't answering.
But that's just a GUI message. The GUI is very nice when everything goes well. When you have to look for issues, forget about the GUI.
Unbound should listen on most if not all interfaces, and the most important one is 127.0.0.1That's why I propose :
dig @127.0.0.1 a.b.c +short
Run this in a console / SSH and leave it there for a while :
tail -f /var/log/resolver.log | grep 'start\|stop'
Only stop and start events will be shown.
Every logged 'stop' event should be followed by a start.I'll post here a small shell script that compares the content of the /var/run/unbound.pid file with the process ID of the running unbound instance. They should be the same.
And if there is no pid file, then unbound isn't running. This can happens, but just for a short while.
I'll work on that.@jabacrack said in DNS resolver Stop Working after upgrade from 2.5.2 to 2.6.0:
add corresponded issue to bug tracker. And, maybe, in future it will be fixed.
No need.
Unbound runs on all pfSense systems just fine. Tens of thousands, maybe hundreds of thousands. If it gets restarted, it restarts = it stops and then starts. Most of us won't never notice this.
These is only one thing to do : fin why yours stops doing it's work.
Because was stopped and resonating failed ?
The running instance freezes ?
It could be interface related (yep, they can die, drivers can fail, etc) but normally, it should stay 'up' on 127.0.0.1 as this is not a physical interface.The GUI won't help you here to discover the issue. It's a command line task. And most of it is looking at the resolver log, and other logs like the system log to determine what event provokes your issue.