Unbound crashes periodically with signal 11

SuperMaster

Yes, I mean the dhcpd process. The problem is not that dhcpd restarts unbound too often.

If I understand the code correctly (https://github.com/pfsense/FreeBSD-ports/blob/devel/sysutils/dhcpleases/files/dhcpleases.c):

	syslog(LOG_INFO, "Sending HUP signal to dns daemon(%u)", pidno);
	if (kill((pid_t)pidno, SIGHUP) < 0)
		syslog(LOG_ERR,
		    "Could not deliver signal HUP to process %d: %m.", pidno);

it sends kill -HUP (kill -1 is the same as HUP) to unbound, but I can't find where unbound is started again after that.

This works without problems in version 2.4.5, because unbound just starts again when you do a kill -HUP, but it does not work in 2.5 since the update to unbound 1.13.
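A rough Python sketch of the step the quoted C performs, for anyone who wants to try it in isolation: read a PID from a pidfile, send it SIGHUP, and report whether the process was still there. The function name `hup_from_pidfile` is made up for illustration and is not part of dhcpleases. On pfSense the pidfile would be /var/run/unbound.pid, the path passed to dhcpleases with -p.

```python
import os
import signal


def hup_from_pidfile(pidfile, sig=signal.SIGHUP):
    """Send `sig` to the PID stored in `pidfile` (hypothetical helper).

    Returns True if the signal was delivered, False if the PID no
    longer exists (the case kill() reports as "No such process").
    """
    with open(pidfile) as fh:
        pid = int(fh.read().strip())
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        # ESRCH: the pidfile is stale, the daemon is already gone.
        return False
    return True
```

A True result only means the signal was delivered. Whether unbound survives the reload that follows is a separate question.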

jimp

I checked a few different systems here and I have no problem doing a HUP to unbound. I tried it multiple times multiple ways. It stays running and operating properly.

: ps uxaww | grep unbound
unbound 36369   0.0  1.0  55904  4696  -  Is   14:05     0:01.58 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root    39523   0.0  0.0  10844     0  -  IWs  -         0:00.00 /usr/local/sbin/dhcpleases -l /var/dhcpd/var/db/dhcpd.leases -d lab.jimp.pw -p /var/run/unbound.pid -u /var/unbound/dhcpleases_entries.conf -h /etc/hosts
root    33981   0.0  0.4  11044  1984  0  R+   16:00     0:00.00 grep unbound
: killall -HUP unbound
: ps uxaww | grep unbound
unbound 36369   3.9  5.0  51116 22852  -  Ss   14:05     0:01.82 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root    39523   0.0  0.0  10844     0  -  IWs  -         0:00.00 /usr/local/sbin/dhcpleases -l /var/dhcpd/var/db/dhcpd.leases -d lab.jimp.pw -p /var/run/unbound.pid -u /var/unbound/dhcpleases_entries.conf -h /etc/hosts
root    34630   0.0  0.4  10988  1940  0  R+   16:00     0:00.00 grep unbound
: kill -HUP `cat /var/run/unbound.pid`
: ps uxaww | grep unbound
unbound 36369   2.0  4.6  51836 20828  -  Ss   14:05     0:02.03 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root    39523   0.0  0.0  10844     0  -  IWs  -         0:00.00 /usr/local/sbin/dhcpleases -l /var/dhcpd/var/db/dhcpd.leases -d lab.jimp.pw -p /var/run/unbound.pid -u /var/unbound/dhcpleases_entries.conf -h /etc/hosts
root    20156   0.0  0.4  11036  1956  0  R+   16:01     0:00.00 grep unbound
: kill -1 `cat /var/run/unbound.pid`
: ps uxaww | grep unbound
unbound 36369   2.0  5.1  53956 23404  -  Ss   14:05     0:02.18 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root    39523   0.0  0.0  10844     0  -  IWs  -         0:00.00 /usr/local/sbin/dhcpleases -l /var/dhcpd/var/db/dhcpd.leases -d lab.jimp.pw -p /var/run/unbound.pid -u /var/unbound/dhcpleases_entries.conf -h /etc/hosts
root    21124   0.0  0.5  11192  2064  0  S+   16:01     0:00.00 grep unbound

Perhaps there is something specific to your unbound configuration that is making it happen (custom options? pfblocker? python module?)

Try to narrow it down more.

Gertjan

@jimp said in pfSense 2.50 snapshots have been dying for the past couple of days:

killall -HUP unbound

Hummmm. Sounds great.
Out of curiosity, I'll check unbound's source to see what it does with a HUP. The "2.4.5p1 version" (unbound 1.10.1) was simply restarting itself. This could have (by now well-known) consequences.

This is a strong signal for me to try 2.5.0.xxxxx

And true: it's not only "dhcpleases" that can restart unbound.
If dhcpleases still exists under 2.5.0....
And if so, is it sending a HUP to unbound?
Etc.

johnpoz

@gertjan said in pfSense 2.50 snapshots have been dying for the past couple of days:

This is a strong signal for me to try 2.5.0.xxxxx

There are many things to look forward to in 2.5 ;) But I am just going to wait until release. I am going to go with a clean install, since I'm going to take the opportunity to move my sg4860 to zfs.

But yeah, the unbound updates and being able to do dhcp registrations without issues will be big for many users. I personally am looking forward to the openssl update and the gui using tls 1.3. Plus many more. 2.5 has lots of good stuff in it.

Was just looking over the release notes again and hadn't noticed the usb tethering. So it seems it would be easy enough to plug my iPhone in and have network-wide internet. That will be slick during power outages, when my network is still up via UPS, or when the ISP goes out.

Yeah 2.5 looks very nice.. Well worth the wait ;)

Salander27 0

@gertjan Did you ever figure this out? I just updated to 2.5.0-RELEASE and started having this issue (DNS was completely stable before the update). I've resorted to having Service Watchdog restart it when it goes down as a temporary measure.

I do not have "Enable registration of DHCP client names in DNS." enabled for either DHCP or DHCPv6.

Gertjan

@salander27-0 said in pfSense 2.50 snapshots have been dying for the past couple of days:

Did you ever figure this out?

Nope, sorry.
I was busy with other things.

Fry-kun

@salander27-0 I just upgraded and I'm having the same problem.
I have both "Register DHCP leases in the DNS Resolver" and "Register DHCP static mappings in the DNS Resolver" enabled.

Edit: Enabled watchdog, too. This is ridiculous, hope it gets fixed for real very soon!

Salander27 0

@fry-kun Ah, I have both of those settings enabled too. I was thinking "Enable registration of DHCP client names in DNS" was what the above posters were referring to but I was mistaken.

The issue certainly seems to be DHCP-related. I had 4 crashes in a 10 minute span and then upped the DHCP lease time from 15 minutes to 6 hours and haven't seen any crashes yet (though I would expect a few in a few hours once the initial 6hr leases expire and get renewed). You may wish to increase your lease time as well just to help reduce the crash frequency.

Gertjan

@salander27-0 said in pfSense 2.50 snapshots have been dying for the past couple of days:

though I would expect a few in a few hours once the initial 6hr leases expire and get renewed

A lease gets renewed after half the duration of the lease.
A 15-minute lease will get renewed after about 7.5 minutes.
Common OSes like Windows, macOS, etc. are set up to ask for leases lasting a day or two.

Why 15 minutes?
Although, pfSense - the DHCP server - should handle that just fine.
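To put rough numbers on the renewal timing: the sketch below does the arithmetic for a few lease times, assuming every client renews exactly at the halfway point (the usual T1 default) and never earlier. The function names are made up for illustration, and real clients will drift from these figures.

```python
def renewal_interval(lease_seconds, t1_fraction=0.5):
    """Seconds until a client first tries to renew its lease (T1)."""
    return lease_seconds * t1_fraction


def renewals_per_day(lease_seconds, clients=1, t1_fraction=0.5):
    """Approximate lease renewals per day across `clients` identical clients."""
    return clients * 86400 / renewal_interval(lease_seconds, t1_fraction)


# 15 minutes, 2 hours and 6 hours, expressed in seconds.
for lease in (900, 7200, 21600):
    print(lease, renewal_interval(lease), renewals_per_day(lease))
```

Under those assumptions a single client on a 15-minute lease renews about 192 times a day, against 8 times a day on a 6-hour lease. If each lease change can lead to a HUP, a short lease multiplies the number of chances to hit the crash.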

jimp

We need a lot more detail about configurations. Nobody here can reproduce this in the lab or on our edge systems.

At a minimum we need:

List of installed and in-use packages (e.g. pfBlockerNG, DNSBL)
Contents of /var/unbound/unbound.conf
Whether or not DHCP lease registration is enabled or other similar features like "Register connected OpenVPN clients in the DNS Resolver"
If DHCP lease registration is enabled, we also need to know the lease time

e1219

@jimp

I updated to the latest stable release 2.5.0 from 2.4.5 last night and have started experiencing this issue as well. I have made no changes to my config since updating.

I see this error in my log which brought me to this thread.
Feb 18 09:25:46 kernel pid 62259 (unbound), jid 0, uid 59: exited on signal 11

Afterwards I see the following error as unbound pid 62259 died.
Feb 18 09:27:06 dhcpleases 52367 Could not deliver signal HUP to process 62259: No such process.
Feb 18 09:28:04 dhcpleases 52367 Could not deliver signal HUP to process 62259: No such process.
Feb 18 09:30:40 dhcpleases 81970 Could not deliver signal HUP to process 62259: No such process.
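The pattern in those lines can be checked mechanically. This is a minimal sketch that finds unbound crash lines and counts the failed HUP deliveries later aimed at the same PID. The regexes only cover the exact message shapes pasted above, and the function name is hypothetical.

```python
import re

CRASH_RE = re.compile(r"pid (\d+) \(unbound\).*exited on signal (\d+)")
STALE_HUP_RE = re.compile(
    r"Could not deliver signal HUP to process (\d+): No such process"
)


def stale_hups_after_crash(lines):
    """Map each crashed unbound PID to the number of failed HUPs sent to it."""
    crashed = {}
    for line in lines:
        m = CRASH_RE.search(line)
        if m:
            crashed.setdefault(m.group(1), 0)
            continue
        m = STALE_HUP_RE.search(line)
        if m and m.group(1) in crashed:
            crashed[m.group(1)] += 1
    return crashed


log = """\
Feb 18 09:25:46 kernel pid 62259 (unbound), jid 0, uid 59: exited on signal 11
Feb 18 09:27:06 dhcpleases 52367 Could not deliver signal HUP to process 62259: No such process.
Feb 18 09:28:04 dhcpleases 52367 Could not deliver signal HUP to process 62259: No such process.
Feb 18 09:30:40 dhcpleases 81970 Could not deliver signal HUP to process 62259: No such process.
""".splitlines()

print(stale_hups_after_crash(log))  # {'62259': 3}
```

A non-zero count means dhcpleases kept signalling a PID that had already died, so nothing restarted unbound in the meantime.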

Here are some of the config details you mentioned, please let me know if there are any other details that might help.

Installed packages:
Avahi 2.1_1
pfBlockerNG-devel 3.0.0_10
Service_Watchdog 1.8.7_1

Contents of /var/unbound/unbound.conf
unbound.conf

Enabled:
Register DHCP leases in the DNS Resolver
Register DHCP static mappings in the DNS Resolver
Disabled:
Register connected OpenVPN clients in the DNS Resolver

My DHCP server is using the default default-lease-time (7200s) and default maximum lease time (86400s). Looking at my current lease table, my devices are respecting the 2hr lease duration, but register at different times.

Salander27 0

@jimp

Installed Packages:
acme 0.6.9_3
arping 1.2.2_2
iperf 3.0.2_5
nmap 1.4.4_2
openvpn-client-export 1.5_5
Service_Watchdog 1.8.7_1
softflowd 1.2.6_1
sudo 0.3_6

/var/unbound/unbound.conf
[0_1613672909675_unbound.conf](Uploading 100%)

Enabled:
Register DHCP leases in the DNS Resolver
Register DHCP static mappings in the DNS Resolver
Disabled:
Register connected OpenVPN clients in the DNS Resolver

Lease time is currently 6hrs (which is helping as I see there was only one crash in the last 12 hours) up from 15 minutes (which was crashing constantly).

mkernalcon

After updating just a few hours ago to the 2.5.0 release on our main router, I can confirm that I am having the same issue. I've temporarily fixed it by disabling the "Register DHCP leases in the DNS Resolver" option. Looking through the logs, I can confirm that several HUPs get sent properly, all to the same PID, before they finally start to fail with "No such process". The last HUP in the DHCP logs that doesn't immediately fail has exactly the same timestamp as "pid 55598 (unbound), jid 0, uid 59: exited on signal 11" in the general log. There is no information in the DNS resolver logs.

DHCP Leases are default (7200s) for all vlans except my "main" lan, which is 691200s. Looks like the HUPs that kill it come from the 7200s vlans, but this is probably just coincidence.
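One way to test the "same timestamp" observation on other systems is to merge the DHCP and general logs and look for crashes that share a timestamp with a HUP sent to the same PID. The sketch below assumes the HUP lines follow the syslog format string in dhcpleases.c ("Sending HUP signal to dns daemon(%u)"). The sample lines, their timestamps, the year default and the function names are invented for illustration.

```python
import re
from datetime import datetime

STAMP = r"^(\w{3} +\d+ [\d:]{8}) "
HUP_RE = re.compile(STAMP + r".*Sending HUP signal to dns daemon\((\d+)\)")
CRASH_RE = re.compile(STAMP + r".*pid (\d+) \(unbound\).*exited on signal 11")


def parse_ts(stamp, year):
    # Syslog stamps carry no year, so one has to be supplied.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def hups_at_crash(lines, year=2021):
    """Return (pid, timestamp) for crashes that coincide with a HUP to that PID."""
    hups, crashes = set(), []
    for line in lines:
        m = HUP_RE.search(line)
        if m:
            hups.add((m.group(2), parse_ts(m.group(1), year)))
            continue
        m = CRASH_RE.search(line)
        if m:
            crashes.append((m.group(2), parse_ts(m.group(1), year)))
    return [c for c in crashes if c in hups]


# Invented sample lines; only the message texts come from this thread.
sample = [
    "Feb 18 10:00:00 dhcpleases 52367 Sending HUP signal to dns daemon(55598)",
    "Feb 18 10:07:30 dhcpleases 52367 Sending HUP signal to dns daemon(55598)",
    "Feb 18 10:07:30 kernel pid 55598 (unbound), jid 0, uid 59: exited on signal 11",
]
print(hups_at_crash(sample))
```

Syslog timestamps have one-second resolution, so a match shows the HUP and the crash landed in the same second. It does not prove which came first.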

Installed Packages:
darkstat
iperf
nmap
nut
openvpn-client-export
Status_Traffic_Totals

unbound.conf (has not been edited manually):

##########################
# Unbound Configuration
##########################

##
# Server configuration
##
server:

chroot: /var/unbound
username: "unbound"
directory: "/var/unbound"
pidfile: "/var/run/unbound.pid"
use-syslog: yes
port: 53
verbosity: 2
hide-identity: yes
hide-version: yes
harden-glue: yes
do-ip4: yes
do-ip6: no
do-udp: yes
do-tcp: yes
do-daemonize: yes
module-config: "validator iterator"
unwanted-reply-threshold: 0
num-queries-per-thread: 512
jostle-timeout: 200
infra-host-ttl: 900
infra-cache-numhosts: 10000
outgoing-num-tcp: 10
incoming-num-tcp: 10
edns-buffer-size: 4096
cache-max-ttl: 86400
cache-min-ttl: 0
harden-dnssec-stripped: yes
msg-cache-size: 4m
rrset-cache-size: 8m

num-threads: 12
msg-cache-slabs: 8
rrset-cache-slabs: 8
infra-cache-slabs: 8
key-cache-slabs: 8
outgoing-range: 4096
#so-rcvbuf: 4m
auto-trust-anchor-file: /var/unbound/root.key
prefetch: no
prefetch-key: no
use-caps-for-id: no
serve-expired: no
aggressive-nsec: no
# Statistics
# Unbound Statistics
statistics-interval: 0
extended-statistics: yes
statistics-cumulative: yes

# TLS Configuration
tls-cert-bundle: "/etc/ssl/cert.pem"

# Interface IP(s) to bind to
interface: 192.168.2.1
interface: 192.168.3.1
interface: 192.168.4.1
interface: 192.168.11.1
interface: 192.168.99.1
interface: 127.0.0.1
interface: ::1

# Outgoing interfaces to be used

# DNS Rebinding
# For DNS Rebinding prevention
private-address: 127.0.0.0/8
private-address: 10.0.0.0/8
private-address: ::ffff:a00:0/104
private-address: 172.16.0.0/12
private-address: ::ffff:ac10:0/108
private-address: 169.254.0.0/16
private-address: ::ffff:a9fe:0/112
private-address: 192.168.0.0/16
private-address: ::ffff:c0a8:0/112
private-address: fd00::/8
private-address: fe80::/10
# Set private domains in case authoritative name server returns a Private IP address



# Access lists
include: /var/unbound/access_lists.conf

# Static host entries
include: /var/unbound/host_entries.conf

# dhcp lease entries
include: /var/unbound/dhcpleases_entries.conf



# Domain overrides
include: /var/unbound/domainoverrides.conf




###
# Remote Control Config
###
include: /var/unbound/remotecontrol.conf
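Since several configs are being compared by eye in this thread, a small parser can make the differences easier to spot. This is a minimal sketch that only understands plain `key: value` lines and `#` comments; it is not a full unbound.conf parser, and the function name is hypothetical.

```python
from collections import defaultdict


def parse_unbound_conf(text):
    """Collect `key: value` options; repeated keys keep every value in order."""
    options = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip().strip('"')
        if value:  # skip clause headers such as "server:"
            options[key.strip()].append(value)
    return dict(options)


conf = """\
server:
num-threads: 12
msg-cache-slabs: 8
#so-rcvbuf: 4m
interface: 127.0.0.1
interface: ::1
pidfile: "/var/run/unbound.pid"
include: /var/unbound/dhcpleases_entries.conf
"""
parsed = parse_unbound_conf(conf)
print(parsed["num-threads"], parsed["interface"])
```

Parsing two posted configs this way and comparing the resulting dicts shows which options differ between a crashing system and a stable one.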

jimp

OK so nothing jumps out in those configs. I still can't make it crash here even hammering on it.

I see that Unbound 1.13.1 is out now, we might need to pull that in and test against it. I reopened https://redmine.pfsense.org/issues/11316 which was initially closed since we didn't have enough information.

Keep the details coming here on this forum post, we may still be able to spot a pattern.

mxw39

@jimp thanks for the support! Is there some debug command that helps collect logs, or something that dumps syscalls before the segfault? I hope to contribute my crashing unbound somehow.

jkv

After upgrading two systems (one to CE 2.5.0 and the other, running on Netgate hardware (SG-5100), to pfSense Plus 21.02), I have also started seeing this issue.

Just in case it helps in spotting any common patterns - I note that:

On both systems, every 5 to 10 minutes, I see unbound stopping and restarting in the DNS Resolver log.
Only on the SG-5100 (pfSense Plus 21.02) am I also seeing, probably 8 times a day or so, in the General System log that unbound exited on signal 11 (for example "pid 73090 (unbound), jid 0, uid 59: exited on signal 11").
The only packages installed in common on both systems are Cron (0.3.7_5), openvpn-client-export (1.5_5) and Service_Watchdog (1.8.7_1).
The SG-5100 (pfSense Plus 21.02) also has arpwatch (0.2.0_4), freeradius3 (0.15.7_29), and pfBlockerNG-devel (3.0.0_10) installed.
Both systems have WAN connections with dynamic IPs (so ddns is in use on the WAN side).
Both systems also have some Static DHCP entries set (with "Register DHCP static mappings in the DNS Resolver" enabled).

Gertjan

@jkv said in pfSense 2.50 snapshots have been dying for the past couple of days:

I see unbound stopping

&

@jkv said in pfSense 2.50 snapshots have been dying for the past couple of days:

General System log that unbound exited on signal 11

You see it dying.
You use Service_Watchdog to restart it, right?

@jkv said in pfSense 2.50 snapshots have been dying for the past couple of days:

pfBlockerNG-devel (3.0.0_10).

How often does pfBlockerNG-devel run its cron task? This task is logged. Does it restart unbound?
What happens when you stop "Service_Watchdog", so it doesn't restart unbound?

What I'm trying to find out: if Service_Watchdog detects that unbound has stopped, it launches another instance. But unbound was actually just stopping and restarting, as ordered by pfBlockerNG-devel. So two instances are started, and one dies.
This is just a theory, as I'm not using Service_Watchdog myself.
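The two-instance theory is easy to check from `ps uxaww` output: count how many processes are actually running the unbound binary, ignoring dhcpleases and grep, which only mention "unbound" in their arguments. The sketch assumes the 11-column FreeBSD layout seen in the ps output posted earlier in this thread, and the function name is made up.

```python
def unbound_pids(ps_output):
    """Return PIDs of processes whose executable is named `unbound`."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # 10 fixed columns, then the command
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header or malformed line
        executable = fields[10].split()[0]
        if executable == "unbound" or executable.endswith("/unbound"):
            pids.append(int(fields[1]))
    return pids


ps_output = """\
unbound 36369   0.0  1.0  55904  4696  -  Is   14:05     0:01.58 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root    39523   0.0  0.0  10844     0  -  IWs  -         0:00.00 /usr/local/sbin/dhcpleases -l /var/dhcpd/var/db/dhcpd.leases -d lab.jimp.pw -p /var/run/unbound.pid -u /var/unbound/dhcpleases_entries.conf -h /etc/hosts
root    33981   0.0  0.4  11044  1984  0  R+   16:00     0:00.00 grep unbound
"""
print(unbound_pids(ps_output))  # [36369]
```

More than one PID in the result at the moment of a restart would support the theory. A single PID that changes between runs points to a plain stop and start.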

Also, the SG-5100 is an Intel-based machine, so "you and I" are using the same executable, the same binary. Only our "config" differs. I don't know anything about ARM-based binaries, but I tend to say the "Intel" ones are pretty solid.

These :

@jkv said in pfSense 2.50 snapshots have been dying for the past couple of days:

have some Static DHCP entries

do nothing to unbound. The "static DHCP settings" (host name to IP relation) are copied into the /etc/hosts file during boot. This file is (also) read by unbound during its initial startup. These "static DHCP settings" rarely change, that is, only if you delete, modify or add one. Look at this file and you'll see what I mean.
(In the past) the "DHCP Registration / Register DHCP leases in the DNS Resolver" option could be problematic. The "static DHCP settings" were never a source of issues.

jkv

@gertjan

The cron job for pfBlockerNG-devel is hourly, and there does not appear to be any correlation between this cron job and unbound exiting. I will do some testing with Service_Watchdog disabled to see what happens to unbound.

Salander27 0

I doubt that service watchdog is the cause of the issue. It wasn't even present on my installation until I installed it so I wouldn't have to manually restart unbound after the crashes.

Salander27 0

If there was a way for me to get a testing version of pfSense with Unbound 1.13.1 I would be more than happy to install that promptly and give feedback as to whether or not it is helpful at dealing with the issue.

Also, can we get the title of this forum post updated to something like "DNS Resolver/Unbound crashing on pfSense 2.5" so that we can attract the attention of anyone else searching for this issue?