Bind upgrade producing errors on pfsense 2.5 upgrade
-
@gertjan
I need to come back to this. Without any restart of pfSense, my users again had issues, and I again found that pfSense was not answering any DNS requests. Entering the start command above in the shell resolved the issue.
Strangely, pfSense showed the service as running in the GUI, and the Service Watchdog also did not detect that bind was not running.
I really need to fix the root cause of this happening.
The "named.sh" script in /usr/local/etc/rc.d uses exactly the same start command you posted, but the stop command is completely different:
```sh
#!/bin/sh
# This file was automatically generated
# by the pfSense service handler.

rc_start() {
	if [ -z "`/bin/ps auxw | /usr/bin/grep "[n]amed " | /usr/bin/awk '{print $2}'`" ]; then
		/usr/local/sbin/named -4 -c /etc/namedb/named.conf -u bind -t /cf/named/
	fi
}

rc_stop() {
	/usr/local/sbin/rndc -q -c "/usr/local/etc/rndc.conf" sync -clean 2>/dev/null
	/usr/local/sbin/rndc -q -c "/usr/local/etc/rndc.conf" stop -clean 2>/dev/null
	sleep 5
	/usr/bin/killall -TERM named 2>/dev/null
	sleep 2
}

case $1 in
	start)
		rc_start
		;;
	stop)
		rc_stop
		;;
	restart)
		rc_stop
		rc_start
		;;
esac
```
I'm completely puzzled, and this issue is mission critical. Do you have any idea what's going wrong here?
When I hit "Restart" for named in "Status" --> "Services", bind (named) is stopped according to the log, but never started again:
```
Apr 27 10:51:41 named 20303 general: notice: exiting
Apr 27 10:51:41 named 20303 general: notice: stopping command channel on 127.0.0.1#8953
Apr 27 10:51:41 named 20303 general: info: shutting down: flushing changes
Apr 27 10:51:41 named 20303 network: info: no longer listening on 80.152.208.158#53
Apr 27 10:51:41 named 20303 network: info: no longer listening on 10.0.0.5#53
Apr 27 10:51:41 named 20303 general: info: received control channel command 'stop -clean'
Apr 27 10:51:41 named 20303 general: info: dumping all zones, removing journal files: success
Apr 27 10:51:41 named 20303 general: info: received control channel command 'sync -clean'
```
"Status" --> "Services" still shows it as running, not stopped.
Executing "named.sh" shows no error messages, but bind still does not start properly:
[2.5.1-RELEASE][root@router.mydomain.de]/usr/local/etc/rc.d: ./named.sh
I don't fully understand whether the condition in the start script could prevent named from being started:
if [ -z "`/bin/ps auxw | /usr/bin/grep "[n]amed " | /usr/bin/awk '{print $2}'`" ]; then
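For reference, the test in rc_start() only launches named when that pipeline prints nothing, so any leftover named process, even a hung one, turns the start into a silent no-op. A minimal sketch of the same filter, wrapped in two hypothetical helper functions that read `ps auxw`-style text on stdin so the logic can be tried against captured output:

```shell
#!/bin/sh
# named_pids: print the PID (column 2 of "ps auxw" output) of every line that
# contains "named ". Same grep/awk filter as rc_start(); the "[n]amed"
# bracket keeps the grep process itself out of the match.
named_pids() {
    grep "[n]amed " | awk '{ print $2 }'
}

# would_start: succeed only if the filter finds nothing, i.e. only in the
# case where rc_start() would actually launch a new named.
would_start() {
    [ -z "$(named_pids)" ]
}
```

On the firewall this would be run as `/bin/ps auxw | named_pids`; any PID it prints is a process that makes rc_start() skip the start.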
Update:
With "ps auxw | grep named" I found a second, older "named" process still running, which did not respond to a normal "kill" command. After I killed it with "kill -9", the GUI correctly showed named as stopped. I then started bind via the GUI, and for now it's running.
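The generated rc_stop() ends with `killall -TERM named` and never escalates, which fits a process that only went away with `kill -9`. A minimal sketch of a TERM-then-KILL escalation in POSIX sh; the function name and the default 5-second grace period are illustrative assumptions, not part of the pfSense package:

```shell
#!/bin/sh
# kill_escalate PID [GRACE]: ask PID to exit with SIGTERM, wait up to GRACE
# seconds (default 5), then force it with SIGKILL if it still exists.
kill_escalate() {
    pid=$1
    grace=${2:-5}
    # kill fails if the PID no longer exists: nothing left to do.
    kill -TERM "$pid" 2>/dev/null || return 0
    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 sends no signal; it only tests whether the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    echo "PID $pid ignored SIGTERM for ${grace}s, sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

It could be fed the PIDs from the same `ps auxw | grep "[n]amed "` filter that rc_start() uses. SIGKILL cannot be caught, so named gets no chance to flush its zones; it is only a last resort.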
I'll watch whether it keeps working...
-
@jacotec said in Bind upgrade producing errors on pfsense 2.5 upgrade:
Executing "named.sh" shows no error messages, but bind still does not start properly:
[2.5.1-RELEASE][root@router.mydomain.de]/usr/local/etc/rc.d: ./named.sh

The script tells you that:
@jacotec said in Bind upgrade producing errors on pfsense 2.5 upgrade:
case $1 in
This $1 is the first parameter on the command line.
I suggest you pass stop, start or restart, like:
/usr/local/etc/rc.d: ./named.sh restart
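Since named.sh only has `start`, `stop` and `restart` branches and no default branch, running it without an argument matches nothing and exits silently, which is why it printed no error. A minimal sketch of the same `case $1` dispatch with a hypothetical usage branch added; the `echo` lines stand in for the real rc_start/rc_stop calls:

```shell
#!/bin/sh
# Same "case $1" dispatch as named.sh. The final *) branch is a hypothetical
# addition that reports a missing or unknown argument instead of doing nothing.
dispatch() {
    case "$1" in
        start)   echo "rc_start" ;;
        stop)    echo "rc_stop" ;;
        restart) echo "rc_stop"; echo "rc_start" ;;
        *)
            echo "usage: named.sh {start|stop|restart}" >&2
            return 1
            ;;
    esac
}
```

In a script the call would simply be `dispatch "$1"`.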
-
Just as a followup:
Since I read somewhere in the forums that bind 9.16.12 is supposed to have a memory leak (which might have caused my crashes), I manually updated bind to the current 9.16.15, where this was fixed:
pkg add -f https://pkg.freebsd.org/FreeBSD:12:amd64/latest/All/bind916-9.16.15.txz
I haven't had any issues since this update.
9.16.15 is not available via the package manager yet (no idea how long it takes before the package is updated in the pfSense package manager).
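To see whether a box already runs 9.16.15 or later, a minimal sketch that compares dotted version strings field by field. The helper name is hypothetical, and the `named -v` output format in the usage note is an assumption based on BIND's usual "BIND 9.16.x ..." banner:

```shell
#!/bin/sh
# version_lt A B: succeed (exit 0) if dotted version A is lower than B.
# Fields are compared numerically; a suffix such as "-P1" is ignored because
# awk's "+ 0" keeps only the leading number of each field.
version_lt() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 < y[i] + 0) exit 0
            if (x[i] + 0 > y[i] + 0) exit 1
        }
        exit 1   # equal versions are not "lower"
    }'
}
```

For example: `installed=$(/usr/local/sbin/named -v | awk '{ print $2 }')`, then `version_lt "$installed" 9.16.15 && echo "older than 9.16.15"`.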
-
@freebsd-man said in Bind upgrade producing errors on pfsense 2.5 upgrade:
After deleting the manually installed bind and lmdb packages and the old pfSense-pkg-bind package via the shell, I installed the updated package pfSense-pkg-bind-9.16_10 via the GUI.
While installing, I used "tail -f /var/log/resolver.log" to inspect the startup of the new bind.
I got rndc timeout messages in the GUI install log and errors from the bind startup in the tail output.
After the GUI timeouts, the install finished successfully. After deleting corrupted journal files with "rm /cf/named/etc/namedb/*jnl" I was finally able to start bind via the GUI.
Now it is up and running.

Just upgraded to 2.5.2. Nothing I did (restore config, reinstall package) would start bind; it was simply hung in la-la land. Deleting the journal files brought it back to life immediately.
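A minimal sketch of that journal cleanup which first checks that named is no longer running, since named keeps its .jnl files open (as discussed further down the thread). The directory /cf/named/etc/namedb is the chroot path used in this thread; the helper name is hypothetical:

```shell
#!/bin/sh
# clean_journals DIR: delete BIND journal files (*.jnl) anywhere below DIR,
# but refuse to do so while a named process is still running.
clean_journals() {
    dir=$1
    if ps auxw 2>/dev/null | grep -q "[n]amed "; then
        echo "named is still running; stop it first (named.sh stop)" >&2
        return 1
    fi
    # -print lists each journal file before it is removed.
    find "$dir" -type f -name '*.jnl' -print -exec rm -f {} +
}
```

The order on pfSense would be `/usr/local/etc/rc.d/named.sh stop`, then `clean_journals /cf/named/etc/namedb`, then start bind again (from the shell or the GUI).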
-
Hi, this bind issue is still not fixed.
When I reinstall the bind package, it takes a very long time (5 minutes or so), and I see the following in the installation log multiple times:
```
Executing custom_php_resync_config_command()...rndc: connect failed: 127.0.0.1#8953: timed out
rndc: connect failed: 127.0.0.1#8953: timed out
rndc: connect failed: 127.0.0.1#8953: timed out
rndc: connect failed: 127.0.0.1#8953: timed out
rndc: connect failed: 127.0.0.1#8953: timed out
```
Maybe the issue can be reproduced by using a non-default rndc port (8953 in my case).
The same happens when I reboot pfSense: it takes a long time before the bind service is started, for the same reason (rndc not available on 127.0.0.1:8953). After 5 minutes or so bind finally starts and everything works fine.
-
@matthijs did "rm /cf/named/etc/namedb/*jnl" work for you?
-
@de0xyrib0se
No, that did not work for me. I don't have problems with BIND itself: I can start/stop BIND and it runs fine. The problem I still have is a very slow start after a reboot (5 minutes) and a very slow package reinstall (also 5 minutes). This is caused by "rndc: connect failed: 127.0.0.1#8953: timed out" appearing 5 times. After the fifth timeout, BIND starts successfully (after 5 minutes or so). This is a problem for me because services such as pfBlockerNG and the VMware guest services only start after BIND has started successfully, so it takes a long time until all services are up and running after a restart/reboot.
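The timeouts occur whenever rndc is invoked before named has opened its control channel on 127.0.0.1#8953. One way to avoid them, sketched here as an assumption rather than what the package actually does, is to poll a readiness check for a bounded time before running rndc. The helper name is hypothetical:

```shell
#!/bin/sh
# wait_until TIMEOUT CMD [ARGS...]: run CMD up to TIMEOUT times, one second
# apart, and stop as soon as it succeeds. Returns 0 on success, 1 if CMD
# never succeeded.
wait_until() {
    timeout=$1
    shift
    tries=1
    while ! "$@" >/dev/null 2>&1; do
        if [ "$tries" -ge "$timeout" ]; then
            return 1
        fi
        tries=$((tries + 1))
        sleep 1
    done
    return 0
}
```

On pfSense it could wrap a check such as `sh -c 'sockstat -4 -l -p 8953 | grep -q named'` (FreeBSD sockstat: -4 IPv4, -l listening sockets, -p port) before calling `/usr/local/sbin/rndc -c /usr/local/etc/rndc.conf status`.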
-
@matthijs I have the same issue, since I am also running the resolver. Even worse, I have many zones in bind, and it takes up to 40 minutes for all my services to come up.
-
Hi Netgate, is this going to be fixed anytime in the near future?
We have had this issue since the 2.5.0 release.
-
I presume 'bind' is a package maintained by someone who is not a Netgate member, like pfBlockerNG, Suricata, postfix, etc.
As such, it's done by people like 'you' and 'me': pfSense users. I couldn't find who is maintaining it right now (but I didn't really look for more than 30 seconds either ;) )
Find him and send him a PM?
Edit: also check the Redmine tickets; there are 10 open tickets for BIND.
-
@gertjan OK, thanks for the info, I did not know that :-)
-
@matthijs was bind turned off when you tried to remove the journal files?
-
@de0xyrib0se
No. The first thing I did was raise the SOA serial number of my (master) zones to a number higher than the one in the last .jnl zone update (I use the date serial format yyyymmddnn).
After that, I logged in to the pfSense host with ssh and went to /cf/named/etc/namedb/master/mymastername/
rm *.jnl
and then restarted bind
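For the yyyymmddnn serial format mentioned above, a minimal sketch that computes the next serial: today's date followed by 00, or the current serial plus one if that is already at least as high. The function name is hypothetical; after 99 changes in one day the result no longer matches the date, although it still increases:

```shell
#!/bin/sh
# next_serial CURRENT [YYYYMMDD]: print the next SOA serial in yyyymmddnn form.
# The date defaults to today; pass it explicitly for reproducible output.
next_serial() {
    current=$1
    today=${2:-$(date +%Y%m%d)}
    candidate=$((today * 100))      # yyyymmdd00
    if [ "$current" -ge "$candidate" ]; then
        echo $((current + 1))       # already changed today: bump nn
    else
        echo "$candidate"
    fi
}
```

For example, `next_serial 2021101102 20211011` prints `2021101103`, and `next_serial 2021100105 20211011` prints `2021101100`.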
I think it is not related to my issue. My problem with bind (I think) is that during startup/boot, and also during package install/reinstall, rndc tries to connect to 127.0.0.1#8953 for some reason while named is not running at that very moment. This results in the "rndc: connect failed: 127.0.0.1#8953: timed out" message, and it retries 5 times or so, which takes a long time.
-
This:
@matthijs said in Bind upgrade producing errors on pfsense 2.5 upgrade:
rm *.jnl
and then restarted bind

is a major no-go. These .jnl files are database-like files in a binary format that bind keeps open permanently.
You can't 'delete' them while bind9 has them open for writing.
If the rm and restart really had to be done (I doubt it), I would do it like this:
(old-fashioned Debian service handling)
service bind9 stop
Now I edit zone files, config files, whatever.
When done, I check my config and zone files:
named-checkconf -z
When there are no errors and all looks dandy:
service bind9 start
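pfSense has no `service bind9`; the generated named.sh and the chroot it passes to named (-t /cf/named/, -c /etc/namedb/named.conf) play that role. A minimal sketch of the same stop / check / start sequence under that assumption. named-checkconf's -t option checks the configuration as the chrooted named would see it; the function and variable names are hypothetical, and the variables let the commands be swapped out:

```shell
#!/bin/sh
# Hypothetical pfSense variant of "stop, edit, named-checkconf -z, start".
# Defaults come from the generated named.sh (rc script location, -t /cf/named/,
# -c /etc/namedb/named.conf); override the variables to use other commands.
NAMED_RC=${NAMED_RC:-/usr/local/etc/rc.d/named.sh}
CHECKCONF=${CHECKCONF:-"named-checkconf -z -t /cf/named /etc/namedb/named.conf"}

# apply_bind_change CMD [ARGS...]: stop named, run the edit command, check the
# config and zones, and start named again only if the check passes.
apply_bind_change() {
    "$NAMED_RC" stop || return 1
    if ! "$@"; then
        echo "edit step failed; named left stopped" >&2
        return 1
    fi
    # Unquoted on purpose: CHECKCONF holds a command plus its arguments.
    if ! $CHECKCONF; then
        echo "named-checkconf reported errors; fix them, then run: $NAMED_RC start" >&2
        return 1
    fi
    "$NAMED_RC" start
}
```

For example, `apply_bind_change vi /cf/named/etc/namedb/named.conf`. If the check fails, named is deliberately left stopped, so DNS stays down until the errors are fixed.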
Btw: when I need to update a zone, for example to change the SOA:
```
root@ns311465:~# rndc freeze test-domaine.fr
root@ns311465:~# nano /etc/bind/zones/db.test-domaine.fr
root@ns311465:~# rndc reload test-domaine.fr
zone reload queued
root@ns311465:~# rndc thaw test-domaine.fr
A zone reload and thaw was started.
Check the logs to see the result.
```
No need to restart bind, no journal file issues.
Btw: journal files exist if the zone files are modified by something other than the admin.
For example, when the zone contains records that are updated using RFC 2136.
Or when the zone is signed for DNSSEC.
Simple zones don't have .jnl and .jbk files.
Btw: I'm using the somewhat older
BIND 9.9.5-9+deb8u19-Debian (Extended Support Version)
-
@matthijs said in Bind upgrade producing errors on pfsense 2.5 upgrade:
@de0xyrib0se
No. The first thing I did was raise the SOA serial number of my (master) zones to a number higher than the one in the last .jnl zone update (I use the date serial format yyyymmddnn).
After that, I logged in to the pfSense host with ssh and went to /cf/named/etc/namedb/master/mymastername/
rm *.jnl
and then restarted bind
I think it is not related to my issue. My problem with bind (I think) is that during startup/boot, and also during package install/reinstall, rndc tries to connect to 127.0.0.1#8953 for some reason while named is not running at that very moment. This results in the "rndc: connect failed: 127.0.0.1#8953: timed out" message, and it retries 5 times or so, which takes a long time.

Shut down bind (the command is listed above) and then do the rm; you cannot remove the files while it holds a lock on them. Restart bind afterwards and it will rebuild the journal files automatically.
This is what I did and it worked like a charm.
-
Thanks for your reply and suggestion. I did exactly as you described (I probably also stopped bind beforehand; I forgot to mention that). Your suggestion did not solve the "rndc: connect failed: 127.0.0.1#8953: timed out" message during startup/boot and package install/reinstall.
rndc tries to connect to 127.0.0.1#8953 (during startup/boot and package install/reinstall) at a moment it is not running (hence the timeout). After bind is started it is running with no problems. Also rndc runs perfectly without errors AFTER bind is started.
-
@matthijs said in Bind upgrade producing errors on pfsense 2.5 upgrade:
rndc tries to connect to 127.0.0.1#8953 (during startup/boot and package install/reinstall) at a moment it is not running (hence the timeout). After bind is started it is running with no problems. Also rndc runs perfectly without errors AFTER bind is started.
If bind (named) is not running, rndc cannot contact it, hence the
rndc tries to connect to 127.0.0.1#8953
error.
If named is running, the error won't show (because rndc can now contact named on port 8953).
Again: running unbound and bind on the same device is something I wouldn't advise.
-
"If bind (named) is not running, rndc cannot contact it, hence the rndc tries to connect to 127.0.0.1#8953"
So why is rndc trying to connect to named (during reboot) when named is not started yet? And why is there no problem at all as soon as named is started?
"Again: running unbound and bind on the same device is something I wouldn't advise."
Why not? I have unbound and named running on separate interfaces and separate ports.
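To confirm that the two control channels really listen on different ports, a minimal sketch that extracts each daemon's control port from its configuration file. The helper names are hypothetical, the parsing only covers common layouts, and the defaults (8953 for unbound's control-port, 953 for BIND's controls statement) apply when no port is set:

```shell
#!/bin/sh
# unbound_control_port FILE: print the "control-port:" value from an
# unbound.conf, or unbound's default (8953) if the option is not set.
unbound_control_port() {
    port=$(awk '$1 == "control-port:" { print $2; exit }' "$1")
    echo "${port:-8953}"
}

# named_control_port FILE: print the port of the first "inet ... port N"
# clause in a named.conf "controls" statement, or BIND's default (953).
# "port" outside an inet clause (e.g. listen-on port 53) is ignored.
named_control_port() {
    port=$(awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "inet") in_inet = 1
            else if (in_inet && $i == "port") {
                p = $(i + 1); sub(/;$/, "", p); print p; exit
            }
            else if (in_inet && ($i == "allow" || $i == "keys")) in_inet = 0
        }
    }' "$1")
    echo "${port:-953}"
}
```

On pfSense the files would presumably be /var/unbound/unbound.conf and /cf/named/etc/namedb/named.conf (assumed paths); equal numbers from the two helpers mean the control ports conflict.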
-
@matthijs said in Bind upgrade producing errors on pfsense 2.5 upgrade:
Why not? I have unbound and named running on separate interfaces and separate ports.
As long as the control ports (yours are 8953 for bind and 853 for unbound) don't conflict, which is the case here, and you 'bind' unbound to "interface 1" and named to "interface 2", they can co-exist.
The control ports are normally only bound to 127.0.0.1 or ::1.

@matthijs said in Bind upgrade producing errors on pfsense 2.5 upgrade:
So why is rndc trying to connect to named (during reboot) when named it is not started yet ?
I'm not using bind, the pfSense package, myself.
I don't know why and when rndc is used.
Check the logs to see whether bind (named) has already started when this happens.

@matthijs said in Bind upgrade producing errors on pfsense 2.5 upgrade:
And why is there no problem at all as soon as named is started ?
rndc is a program that controls the behaviour of bind (named) during run time.
Like unbound-control for unbound.
rndc won't produce error messages if named is running.
And it complains if it isn't.
The real question (for me) is why rndc is executed if named isn't running yet.
Btw: like a web server or mail server, a DNS resolver + domain server like bind isn't really a service that can be made accessible through a GUI. There are just too many settings, options and different cases.
You wind up using the config files.
Take note: I use bind (named) a lot, as a master domain server and several slaves, for all my domain names and DNSSEC stuff. The pfSense acme package works perfectly using RFC 2136, something bind supports very well. But I use it on my web/mail/whatever servers, all dedicated servers on the Internet, not on my local firewall.
Again, this is my opinion, of course.
-
I did a complete reinstall of pfSense 2.5.2 and restored my last configuration.
It did not automatically reinstall all of the packages, including bind/named, after the first reboot.
I got the following notice in the upper right corner:
General
Package named does not exist in current pfSense version and it has been removed. @ 2021-10-11 13:44:20
Package reinstall process finished successfully @ 2021-10-11 13:44:45
It did automatically reinstall all the other packages.