DNS randomly stops working

collinatselect

I am also having this problem after upgrading directly from 2.4.5 CE to 2.5.1 on a Sophos SG-210.
In my case, adding unbound to the Service Watchdog list restarts the service, but then the CPU is pegged at 100% and resolution still doesn't happen. Restarting the firewall works. I have not yet checked the PID or socket status during an outage, but I suspect unbound crashes, thinks it's still running, and can't clean up after itself.
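
For anyone who wants to check the PID during an outage, here is a minimal Python sketch. It only tests whether the process named in a pid file still exists. The pid file path and both function names are assumptions for illustration, not something confirmed in this thread, so adjust them to your system.

```python
import os
from pathlib import Path

# Assumed location of unbound's pid file; adjust if your install differs.
PID_FILE = Path("/var/run/unbound.pid")


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def check_pid_file(pid_file: Path = PID_FILE) -> str:
    """Describe the state implied by the pid file."""
    if not pid_file.exists():
        return "no pid file"
    text = pid_file.read_text().strip()
    if not text.isdigit():
        return "pid file unreadable"
    pid = int(text)
    if pid_is_alive(pid):
        return f"pid {pid} is running"
    return f"stale pid file: pid {pid} is gone"


if __name__ == "__main__":
    print(check_pid_file())
```

A "stale pid file" result would fit the suspicion that unbound died without cleaning up. A "running" result alongside failed lookups would point to a hung process instead.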

One thing I noticed on my system is pkg info unbound says Python is enabled, even though it is disabled in the configuration. I manually restarted after toggling Python on and off. Is this even relevant?

[2.5.1-RELEASE][admin@myfirewallnotyours]/root: pkg info unbound
unbound-1.13.1
Name           : unbound
Version        : 1.13.1
Installed on   : Thu Apr 15 03:10:26 2021 CDT
Origin         : dns/unbound
Architecture   : FreeBSD:12:amd64
Prefix         : /usr/local
Categories     : dns
Licenses       : BSD3CLAUSE
Maintainer     : jaap@NLnetLabs.nl
WWW            : https://www.nlnetlabs.nl/projects/unbound
Comment        : Validating, recursive, and caching DNS resolver
Options        :
        DNSCRYPT       : off
        DNSTAP         : off
        DOCS           : off
        DOH            : on
        ECDSA          : on
        EVAPI          : off
        FILTER_AAAA    : off
        GOST           : on
        HIREDIS        : off
        LIBEVENT       : on
        MUNIN_PLUGIN   : off
        PYTHON         : on
        SUBNET         : off
        TFOCL          : off
        TFOSE          : off
        THREADS        : on
Shared Libs required:
        libexpat.so.1
        libnghttp2.so.14
        libpython3.7m.so.1.0
        libevent-2.1.so.7
Shared Libs provided:
        libunbound.so.8
Annotations    :
        FreeBSD_version: 1202504
        cpe            : cpe:2.3:a:nlnetlabs:unbound:1.13.1:::::freebsd12:x64
        repo_type      : binary
        repository     : pfSense
Flat size      : 7.79MiB

Tried all recommendations on this post but nothing is working so far.

JasonAU

Eagerly following any threads about DNS. My watchdog is restarting unbound all the time.

2.5.1-RELEASE (amd64)
pfBlockerNG-devel: 3.0.0_16
snort: 4.1.3_5
Telegraf: 0.9_5

Just wanted to share what I see, on the off chance it helps the group. I noticed the "address already in use" error in my system logs when the watchdog tries to start unbound back up:

Apr 23 10:15:04 pfsense php[92161]: servicewatchdog_cron.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1619136904] unbound[68018:0] debug: creating udp4 socket 192.168.1.1 53 [1619136904] unbound[68018:0] error: bind: address already in use [1619136904] unbound[68018:0] fatal error: could not open ports'
Apr 23 10:15:04 pfsense php[73303]: notify_monitor.php: Message sent to XXXXX@hotmail.com OK
Apr 23 10:15:01 pfsense php[92161]: servicewatchdog_cron.php: Service Watchdog detected service unbound stopped. Restarting unbound (DNS Resolver)

collinatselect

Hi, so this is a documented upstream bug.
https://redmine.pfsense.org/issues/11316
I just found out about it because I submitted a trouble ticket.
Unfortunately, until this regression is fixed, the options are:

1) Turn off "Register DHCP leases in DNS"
2) Downgrade to 2.4.5
3) Downgrade the package
4) Use the DNS forwarder

Unfortunately 1) and 4) don't help if you need to register DHCP leases in DNS in your organization.

So here's hoping the unbound developers have an easy fix.

netblues

Service watchdog and unbound don't play well together.
Especially if pfblockerng is also used (since it does take time to come up)
In various situations, it ends up in unbound restart loops.

By enabling unbound python mode, and disabling dhcp integration, unbound is stable.
However, if wan ip changes due to pppoe restarting, unbound will die.
Always.
And since service watchdog is a no go for unbound, it has to be restarted manually.
Yikes!
At the time of the ppp restart I get this:
Apr 19 11:18:24 unbound 19913 [19913:0] info: service stopped (unbound 1.13.1).
2.5.1, pfBlockerNG 3.0.16

bingo600

@netblues

I'm using unbound & service watchdog, and have no issues.
Not using pfblocker though.

JasonAU

@netblues said in DNS randomly stops working:

However, if wan ip changes due to pppoe restarting, unbound will die.
Always.

Hmm, that's interesting. When the watchdog finds unbound broken, I also see logs saying my Nord VPN got a new IP. The core WAN (PPPoE) is up, but one of the outbound VPNs drops or loses some packets around the same time.

grumple

I notice on my 2.5.1 box (I upgraded from 2.4.5 tonight) that after a reboot, if I check unbound, it's only listening on LOCALHOST and not on my LAN interface (I have it set to listen only on LAN and LOCALHOST). After a manual restart, it listens properly. I've disabled "Register DHCP leases in DNS" for now, and that still doesn't help on a reboot. If my server gets rebooted, my LAN is dead in the water until I can log in and restart unbound.
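
One way to confirm which addresses the resolver actually answers on is to send it a query per address. This is a minimal Python sketch, not a pfSense tool. The function names, the test name `example.com` and the 2-second timeout are all assumptions for illustration.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet asking for an A record."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)


def resolver_answers(server: str, name: str = "example.com",
                     port: int = 53, timeout: float = 2.0) -> bool:
    """Return True if `server` replies with a matching query id."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable/refused errors
            return False
    return len(reply) >= 12 and reply[:2] == query[:2]
```

Calling `resolver_answers("127.0.0.1")` and then `resolver_answers(...)` with the LAN address of the firewall, right after a reboot, would show whether only the localhost listener is responding.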

Gertjan

@netblues said in DNS randomly stops working:

Service watchdog and unbound don't play well together.

A good example:

@jasonau said in DNS randomly stops working:

Apr 23 10:15:04 pfsense php[92161]: servicewatchdog_cron.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1619136904] unbound[68018:0] debug: creating udp4 socket 192.168.1.1 53 [1619136904] unbound[68018:0] error: bind: address already in use [1619136904] unbound[68018:0] fatal error: could not open ports'
Apr 23 10:15:04 pfsense php[73303]: notify_monitor.php: Message sent to XXXXX@hotmail.com OK
Apr 23 10:15:01 pfsense php[92161]: servicewatchdog_cron.php: Service Watchdog detected service unbound stopped. Restarting unbound (DNS Resolver)

This is probably a perfect example of where the Service Watchdog package makes the issue worse.
It is 'normal' for unbound (the DNS Resolver) to be restarted, as that is exactly what the "Register DHCP leases" option instructs it to do.

So, it received a restart (a DHCP lease came in): it stops and starts again. This can be seen in the logs!
The Service Watchdog sees the unbound process stopping (it doesn't know the difference between dying and stopping) and tries to start an instance of unbound.
But unbound was already in a restart phase: the instance that was running before is just restarting. Now TWO instances of unbound are competing to start up.
The first one (the instance that was already running) binds to port 53 again on all selected interfaces, and the second, started by the Service Watchdog, sees that these ports are already in use and complains. The error log message is now perfectly explained. In the log, unbound[68018] is the instance started by the watchdog: it logs the error and exits. (92161 and 73303 are the PIDs of the PHP scripts servicewatchdog_cron.php and notify_monitor.php, not of unbound.)
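
The "address already in use" error can be reproduced in isolation. This is a minimal Python sketch that assumes nothing about unbound itself. It binds two UDP sockets to the same loopback address and port, with an OS-chosen port standing in for port 53. The function name is made up for illustration.

```python
import errno
import socket


def second_bind_errno() -> int:
    """Bind two UDP sockets to one address and port.

    Returns the errno raised by the second bind, or 0 if it succeeded.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 lets the OS pick a free port; it stands in for port 53.
        first.bind(("127.0.0.1", 0))
        address = first.getsockname()
        try:
            # The "second instance" tries to take the same address and port.
            second.bind(address)
        except OSError as exc:
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    code = second_bind_errno()
    print(errno.errorcode.get(code, "bind succeeded"))
```

On FreeBSD and Linux the second bind should fail with EADDRINUSE, which is the same condition unbound reports as "bind: address already in use" before giving up with "could not open ports".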

People : please, for your own sake : do not make the issue worse.

Unbound can be a lightweight process, and your device has little else to do than start and stop processes. But then pfBlockerNG came along, and it can put a "huge block of concrete" around unbound's feet. Now unbound has difficulty starting 'fast', as it needs to parse thousands of (DNSBL) lines first.
Or: you have thousands of devices on your network and they, as usual, all use DHCP to obtain an IP (lease).
Or, what's also well known now: networks contain very low-budget devices (the gadgets). They use badly behaved code, like DHCP clients that ask for a new lease every xx seconds (see the DHCP logs to track them down), so unbound restarts every xx seconds. So locate these devices and bin them, or better: don't use them, don't buy them.
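
Tracking down such chatty clients in the DHCP logs can be sketched in a few lines of Python. This assumes ISC dhcpd-style "DHCPACK on <ip> to <mac>" log lines. The sample lines below are made up for illustration and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches ISC dhcpd-style lines such as
# "DHCPACK on 192.168.1.50 to aa:bb:cc:dd:ee:01 (gadget) via em1".
ACK_RE = re.compile(
    r"DHCPACK on (\S+) to ((?:[0-9a-f]{2}:){5}[0-9a-f]{2})", re.I
)


def acks_per_client(lines):
    """Count DHCPACK log lines per client MAC address."""
    counts = Counter()
    for line in lines:
        match = ACK_RE.search(line)
        if match:
            counts[match.group(2).lower()] += 1
    return counts


# Made-up sample lines, for illustration only.
sample = [
    "Apr 23 10:00:01 dhcpd: DHCPACK on 192.168.1.50 to aa:bb:cc:dd:ee:01 (gadget) via em1",
    "Apr 23 10:00:31 dhcpd: DHCPACK on 192.168.1.50 to aa:bb:cc:dd:ee:01 (gadget) via em1",
    "Apr 23 10:01:01 dhcpd: DHCPACK on 192.168.1.50 to aa:bb:cc:dd:ee:01 (gadget) via em1",
    "Apr 23 10:05:00 dhcpd: DHCPACK on 192.168.1.60 to aa:bb:cc:dd:ee:02 (laptop) via em1",
]

# Most frequent requesters first.
for mac, count in acks_per_client(sample).most_common():
    print(mac, count)
```

A client that tops this list by a wide margin over a short period is a likely candidate for the renew-every-xx-seconds behaviour.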

The Service Watchdog package can be useful if you have a dying system and have decided not to invest any more time or money in it, as it will be replaced ASAP.
You put the Service Watchdog on it, and let it start to dig its own hole.

Btw: the person who wrote the Service Watchdog (jimp from Netgate) said himself on this forum: do not use this package, I'm not using it myself. The use case is very, very rare, and only those who know how to deal with it should use it.
This excludes 99.99% of all of us on the forum.

Servers and other 'always on' devices that are designed to stay on for days, months or even years do not need such a restart service. Well-known programs like bind9, nginx, postfix and also unbound can run for thousands of hours.

IF you can see messages in the logs (the system log and the unbound log) showing that unbound stops, or even segfaults or makes the kernel "dump", and does not start any more, then, and only then, could you consider using the Service Watchdog. But in this case, I would take the time to see if this happens often and, if so, go back to the previous version of pfSense.
I have never seen unbound die on me, EVER. I do use the amd64 'intel' version.
I know there was an 'arm' compiler issue recently that could produce a faulty executable, as discussed on the forum.

netblues

Clearly, unbound isn't a candidate for service watchdog.
When it restarts, if it's fast, one might get away with this. If it's not, all hell breaks loose.

Unbound without pfblockerng doesn't SEEM to crash; however, stressing a program does reveal issues.

JasonAU

@gertjan said in DNS randomly stops working:

This is probably a perfect example where the Service watchdog cron pfSense package makes the issues worse.
It is 'normal' that the DNS unbound / the resolver is restarted as that this option exactly instructs that to happen :

So, it received a restart (a DHCP lease came in) : it stop, and start again : this can be seen in the logs !

Thanks for the reply. I'm going to disable my watchdog. However, the box to register DHCP leases in DNS is not checked on any of my ranges (LAN/Wifi), so (AFAIK) pfsense should not be restarting as dhcp leases come in? Even without the watchdog I still notice drops in DNS responses: Chrome will say it is unable to resolve, then moments later the page loads.

Gertjan

@jasonau said in DNS randomly stops working:

pfsense should not be restarting as dhcp leases come in ?

"pfSense" is just the name of the box - an OS (FreeBSD) and some processes or programs.

It's known that DHCP registration restarts the DNS Resolver, if you want it to do so.
If this option is checked, a process called "dhcpleases" is created that monitors the list of known leases, which the DHCP server maintains. You can see it here: Status > DHCP Leases
As soon as this list changes, because a lease is renewed, dhcpleases detects this, writes the lease list out to the system's /etc/hosts file, and restarts unbound.
That is ONE reason why unbound could get restarted regularly, or even very often.

There are other processes or events that restart unbound.
Like: an interface goes down and up (take out the LAN cable, put it back in again). Have a look at the system logs and the other logs: A LOT happened. Many processes can get restarted because a 'major' system event occurred.

Or, also popular: people use pfBlockerNG and ask feeds to update every xx minutes (totally nuts, I know, but hey, there are people out there who do so because they "can do so", and don't understand that there might be consequences).

It all boils down to: check the logs. Learn how to read them. Check why unbound gets restarted: what event triggered the restart.
Now you can ask yourself: can I influence this event? Do I need it? Can I change it?
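
Finding what happened just before each stop can be partly automated. This is a rough Python sketch, assuming syslog-style lines that start with a "Mon DD HH:MM:SS" timestamp, like the ones quoted in this thread. The function names, the 60-second window and the first sample line are assumptions for illustration.

```python
import re
from datetime import datetime, timedelta

# Timestamp at the start of a syslog line, e.g. "Apr 19 11:18:24".
STAMP_RE = re.compile(r"^([A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d) ")


def parse_stamp(line, year):
    """Return the line's timestamp, or None if it has none.

    Syslog lines omit the year, so the caller has to supply it.
    """
    match = STAMP_RE.match(line)
    if not match:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def events_before_stops(lines, year, window_seconds=60):
    """Pair each unbound 'service stopped' line with the log lines
    recorded in the `window_seconds` before it."""
    stamped = [(parse_stamp(text, year), text) for text in lines]
    stamped = [(when, text) for when, text in stamped if when is not None]
    window = timedelta(seconds=window_seconds)
    result = []
    for stop_time, stop_line in stamped:
        if "unbound" in stop_line and "service stopped" in stop_line:
            nearby = [
                text for when, text in stamped
                if text is not stop_line
                and timedelta(0) <= stop_time - when <= window
            ]
            result.append((stop_line, nearby))
    return result


# The first line is made up; the second is the stop message quoted earlier.
log = [
    "Apr 19 11:18:20 ppp: hypothetical WAN reconnect message",
    "Apr 19 11:18:24 unbound 19913 [19913:0] info: service stopped (unbound 1.13.1).",
]

for stop_line, nearby in events_before_stops(log, year=2021):
    print(stop_line)
    for text in nearby:
        print("   preceded by:", text)
```

Fed with the merged system, DHCP and resolver logs, the "preceded by" lines are the candidates for the triggering event.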

I know, this is more basic system administration, and isn't really related to 'simple' firewall/router management.
But hey, if you want a simple system, go for that $20 DLINK router box, or better: go for that Netgear thing.
pfSense doesn't have the same price tag for sure, and is on many fronts a quasi-industrial product.
It needs some work, and understanding. We all have to fckg learn things again, and stay on it.

JasonAU

@gertjan said in DNS randomly stops working:

It all boils down to : check the logs. Learn how to read them. Check why unbound get restarted : what event triggered the restart.
Now, you can ask yourself : can I influence this event. Do I need it ? Can I change it ?

Very, very true. Some time ago I found that the cron job for pfBlockerNG was running just before each DNS drop. I'll keep an eye on things after the recent change. I have Grafana set up, so I may even try to set something up to log what's happening around the time of the issues and make it pretty.

Thanks for taking the time & effort in your replies