[Solved] Unbound fails on restart after pfBlockerNG updates

BBcan177

That error if from unbound itself. Are you using the service watchdog program to restart the resolver? If so, that could cause issues.

Similar thread here:
https://forum.pfsense.org/index.php?topic=123554.0

I don't think it's a package related issue, but something with Unbound itself.

bole5

I am not using to restarting resolver. It is pfBlockerNG that does the reload which seems to fail. Below are the logs of the cron job in pfBlockerNG. Note the: "Reloading Unbound…. Not completed."
The weird thing is that the reload starts (and fails) and: 01:15:33 but DNSBL Update Competed message comes a minute later at 01:16:48.

If I restart the process manually by stopping/starting unbound, everything works fine.

pfBlockerNG Cron Logs


===[  DNSBL Process  ]================================================

[ EasyList ]		 Downloading update .. 200 OK.
  Whitelist: 10fbb07a4b0.se|32d1d3b9c.se|82d914.se|adaction.se|adrotate.se|ccebba93.se|contexlink.se|d1110e4.se|emediate.se|
  ----------------------------------------------------------------------
  Orig.    Unique     # Dups     # White    # Alexa    Final                
  ----------------------------------------------------------------------
  7312     6970       0          9          0          6961                 
  ----------------------------------------------------------------------

[ EasyPrivacy ]		 Downloading update [ 02/18/17 01:15:27 ] .. 200 OK.
  Whitelist: clicksen.se|easyresearch.se|getstatistics.se|ip2location.com|optimizely.com|publish-int.se|research-int.se|vastpaketet.se|webstat.se|webtraffic.se|
  ----------------------------------------------------------------------
  Orig.    Unique     # Dups     # White    # Alexa    Final                
  ----------------------------------------------------------------------
  2759     2738       21         10         0          2707                 
  ----------------------------------------------------------------------

[ Malware_Corpus_Tracker ] Downloading update [ 02/18/17 01:15:28 ] .. 200 OK
  Remote timestamp missing .
  ----------------------------------------------------------------------
  Orig.    Unique     # Dups     # White    # Alexa    Final                
  ----------------------------------------------------------------------
  140      120        0          0          0          120                  
  ----------------------------------------------------------------------

[ MalwareDomainList ]	 exists. [ 02/18/17 01:15:32 ]
------------------------------------------
Assembling database... completed
Validating database... completed
Reloading Unbound.... Not completed.
DNSBL update [ 10941 | PASSED  ]... completed [ 02/18/17 01:16:48 ]
------------------------------------------

System logs:

Feb 18 01:17:45	php		[pfBlockerNG] No changes to Firewall rules, skipping Filter Reload
Feb 18 01:15:25	php		[pfBlockerNG] Starting cron process.

BBcan177

As per the link I posted above.
If you run this command from the pfSense shell, does it complete properly?

@BBcan177:

Try the following command from the shell:

/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf status

and also try this command:

/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf reload

Are you using DHCP Registration in Unbound, or IPv6? Maybe post some screenshots of the unbound settings.

There is some issue in the pfSense unbound code that is causing these corner-case issues…

bole5

You are right. It is something with unbound and my setup.
If I run the "status" command it runs normally

version: 1.5.9
verbosity: 1
threads: 4
modules: 2 [ validator iterator ]
uptime: 48303 seconds
options: control(ssl)
unbound (pid 47565) is running...

Then I run "reload" and it returns "ok" immediately. If I run another "reload" shortly thereafter I get:

[1487451259] unbound-control[22739:0] error: connect: Operation timed out for 127.0.0.1

and in the unbound logs I see: ```
fatal error: Could not read config file: /unbound.conf


I will modify pfblockerng.inc file as to stop/start service as suggested by the thread you referenced.

BBcan177

Could you check a few things out for me please?

Does this folder exist in your box? ls -lah /var/tmp
and does this file exist? ls -lah /var/tmp/unbound_cache

So basically it runs thru these steps to reload the new DNSBL changes into Unbound…
If Unbound is running:

/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf dump_cache > /var/tmp/unbound_cache
/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf reload
/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf load_cache < /var/tmp/unbound_cache
/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf status

Its really strange that it gives that error on multiple reload attempts…

Could you check to see if the /var/unbound/unbound.conf file is changed after each "Reload"?
or PM me a copy of the unbound.conf before each reload attempt…

Another thought is that the dump/load cache might have some issue... But can't tell until you run these commands and see if you can find another clue...

And is you pause for a second or two, do you still get the error?

Thanks!

bole5

Could you check a few things out for me please?

@BBcan177:

Does this folder exist in your box? ls -lah /var/tmp
and does this file exist? ls -lah /var/tmp/unbound_cache

Yes. The file is there with correct timestamp:

ls -al /var/tmp/unbound_cache
-rw-r--r--  1 root  wheel  509698 Feb 19 01:15 /var/tmp/unbound_cache

@BBcan177:

So basically it runs thru these steps to reload the new DNSBL changes into Unbound…
If Unbound is running:

/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf dump_cache > /var/tmp/unbound_cache
/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf reload
/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf load_cache < /var/tmp/unbound_cache
/usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf status

Running commands in sequence gives following:

[2.3.2-RELEASE][root@fortet.localdomain]/var/log: /usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf dump_cache > /var/tmp/unbound_cache
[2.3.2-RELEASE][root@fortet.localdomain]/var/log: /usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf reload
ok
[2.3.2-RELEASE][root@fortet.localdomain]/var/log: /usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf load_cache < /var/tmp/unbound_cache
error: SSL handshake failed
[2.3.2-RELEASE][root@fortet.localdomain]/var/log: /usr/sbin/chroot -u unbound -g unbound / /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf status
[1487493495] unbound-control[69667:0] error: connect: Operation timed out for 127.0.0.1

The unbound crashes on "reload" command, so the "load_cache" and "status" fail. During the reload I could see the: "fatal error: Could not read config file: /unbound.conf" in resolver log file.

@BBcan177:

Its really strange that it gives that error on multiple reload attempts…

Based on the latest tests, my explanation is that it crashes on the first reload and the second time when I run the reload it has a long timeout before it returns error.

@BBcan177:

Could you check to see if the /var/unbound/unbound.conf file is changed after each "Reload"?
or PM me a copy of the unbound.conf before each reload attempt…

Strangely enough the timestamp on the unbound.conf is changed, so the file is touched on reload. I copied the conf file before reload and run diff which gave me no changes.

@BBcan177:

Another thought is that the dump/load cache might have some issue… But can't tell until you run these commands and see if you can find another clue...

Based on the above, there is something in my setup that causes unbound to crash on "reload" command. I modified the pfblockerng.inc file and replaced:


                        exec("{$chroot_cmd} reload 2>&1", $result, $retval);
                        pfb_logger('.', 1);

with:


                        exec("{$chroot_cmd} stop");
                        sleep(1);
                        pfb_logger('.', 1);
                        mwexec("/usr/local/sbin/unbound -c {$g['unbound_chroot_path']}/unbound.conf 2>&1", $result, $retval);
                        pfb_logger('.', 1);

This procedure worked and didn't cause unbound to crash.

Since this is clearly caused by unbound, may I suggest that you add a flag in pfBlockerNG with option to replace reload with stop/start for people experiencing this problem?

bole5

My problem is now solved after following advice from BBcan177.
It turns out that all includes in unbound.conf options need to point to the same folder: /var/unbound.

I had following in my configuration:

My domain overrides

server:include: /usr/local/etc/unbound/se-servers.conf

All I needed to do is to change the /usr/local/etc/unbound to /var/unbound, and unbound stopped crashing on "reload".

My domain overrides

server:include: /var/unbound//se-servers.conf

Thank you very much BBcan177 for this great package!

renegade

Sorry, but i don't understand the solution.

My Logs are full of:

Mar 1 17:56:37 	unbound 	30737:0 	info: start of service (unbound 1.6.0).
Mar 1 17:56:33 	unbound 	30737:0 	error: cannot chdir to directory: (No such file or directory)
Mar 1 17:56:33 	unbound 	30737:0 	info: service stopped (unbound 1.6.0).
Mar 1 17:50:18 	unbound 	30737:0 	info: start of service (unbound 1.6.0).
Mar 1 17:50:14 	unbound 	30737:0 	error: cannot chdir to directory: (No such file or directory)
Mar 1 17:50:14 	unbound 	30737:0 	info: service stopped (unbound 1.6.0).
Mar 1 17:46:30 	unbound 	30737:0 	info: start of service (unbound 1.6.0).
Mar 1 17:46:25 	unbound 	30737:0 	error: cannot chdir to directory: (No such file or directory)
Mar 1 17:46:25 	unbound 	30737:0 	info: service stopped (unbound 1.6.0).

and very often in the morning i cannot resolve any dns, because the unbound did not startup.

bole5

@renegade: In my particular case I had my own domain overrides imported under DNS Resolver/Custom options. The file imported was located in /usr/local/etc/unbound/ folder. The unbound was crashing each time I would issue reload command and solution was to move my files to /var/unbound folder.

Please bear in mind that the error message I was getting was different than yours:

fatal error: Could not read config file: /unbound.conf

You can try running commands listed by BBcan177 and in his post: https://forum.pfsense.org/index.php?topic=125773.msg694881#msg694881 and see if it provides more clues. After running commands check unbound log and dmesg.

You can also force restart of unblocker which worked for me (before my problem was solved). For details about that read the end my post where I describe what needs to be modified in pfblockerng.inc file: https://forum.pfsense.org/index.php?topic=125773.msg694915#msg694915

Good luck!

kpa

Unbound is chrooted under /var/unbound in pfSense and that means all configuration files and files referenced by the configuration files must be placed under /var/unbound.

renegade

tried the commands but there was no error.
this morning i had to start unbound again manually. after the 24h reconnect.

RonpfS

Should be fixed at some point https://redmine.pfsense.org/issues/7326

renegade

I removed many feeds from DNSBL and it seems to work stable… until 2.3.3p1 came. the last two mornings same effect. I could not resolve any ip adress until i restarted bind manually by logging in and restarting it.

BBcan177

I have posted a patch in redmine for this issue:

https://redmine.pfsense.org/issues/7326