23.09 Unbound killed failing to reclaim memory

Mission-Ghost

@stephenw10, ok: on the 23.09 box I disabled all services in service_watchdog and restarted unbound. Nothing at all was logged in System/General after the unbound restart, so the service_watchdog messages about unbound did not appear.

Under System/DNS Resolver, just five messages:

notice: init module 0: python
info: [pfBlockerNG]: pfb_unbound.py script loaded
info: [pfBlockerNG]: init_standard script loaded
notice: init module 1: iterator
info: start of service (unbound 1.18.0).

That seems to isolate that issue.

Now I just have to nail down the cause of the "failed to reclaim memory" error that started appearing after the 23.09 update. I'll leave this box running and see if it throws the error. It's not connected to any WAN so that might or might not affect the diagnosis, but we'll see. It could take some hours or overnight to throw the error, if it does.

stephenw10

The unreclaimed memory is less concerning than the fact it was killed by something and seemingly not restarted. Though possibly the watchdog simply triggered before it could restart.
I would wait to see if it still gets killed after some time and whether it restarts. And what's logged when it does.

Gertjan

@Mission-Ghost

Compare your :

@Mission-Ghost said in 23.09 Unbound killed failing to reclaim memory:

Nov 15 11:15:03 kernel pid 26339 (unbound), jid 0, uid 59, was killed: failed to reclaim memory

With :
@jorgek said in Suricata 7.0.0 being killed by kernel in 23.09.:

Nov 15 16:01:28 kernel pid 28473 (suricata), jid 0, uid 0, was killed: failed to reclaim memory

from this thread, Suricata 7.0.0 being killed by kernel in 23.09. started yesterday.

I've not seen this "failed to reclaim memory" error myself while using 23.09,
Let's see :

[23.09-RELEASE][root@pfSense.bhf.tld]/root: cd /var/log
[23.09-RELEASE][root@pfSense.bhf.tld]/var/log: grep 'reclaim' *.log

return nothing, so no lines with 'reclaim' in any log files.

In the other thread with the same 'reclaim' error it was Surricata that manipulates pf firewall rules - and pf changed recently.
AFAIK : Unbound doesn't manipulate pf rules.
So, not related ?

@Mission-Ghost said in 23.09 Unbound killed failing to reclaim memory:

[1699845727] unbound[12876:0] error: can't bind socket: Can't assign requested address for 192.168.20.1 port 53 [1699845727] unbound[12876:0] fatal error: could not open ports'

I don't know if the above is related to the errors now manifesting in the -09 version but it seems plausible. Could this be getting smoked out by -09 version?

What does this mean and what does one do about it?

What it means : unbound is started and want to use the selected interface(s), and bind (== use) port 53, UDP and TCP.
But it couldn't use these port(s) as it was already "in use" by another process.
There is of course just one ** process that uses 'port 53' on pfSense ; unbound.
The message means : there is already instance of unbound running, so the new process, with PID 12876, will bail out.

If there is a "service watchdog" in play, then don't look any further : the "service watchdog" is a nuclear war head peace keeper solution : when it is used, there will be some fall out.

@Mission-Ghost said in 23.09 Unbound killed failing to reclaim memory:

I manually stopped unbound and then manually started it, and service_watchdog appears to have tried to start it:

There you have it.
"unbound" itself needs some time to start, as it has lot to do.
"unbound with python mode" has a little bit, like several ms more, work to do.
"unbound with python mode with a load of DNSBL entries" has even more to do. A startup time of "second" or more isn't unusual.

Example : an extract of my nightly pfBlockerng DNSBL cron based reload :

....
----------------------------------------
Original    Matches    Removed    Final     
----------------------------------------
184025      41667      45702      138323    
-----------------------------------------
TLD finalize... completed [ 04/10/23 00:45:24 ]

Saving DNSBL statistics... completed [ 04/10/23 00:45:26 ]
Reloading Unbound Resolver (DNSBL python).
Stopping Unbound Resolver.
Unbound stopped in 2 sec.
Additional mounts (DNSBL python):
 No changes required.
Starting Unbound Resolver... completed [ 04/10/23 00:45:28 ]
Resolver cache restored
DNSBL update [ 138323 | PASSED  ]... completed [ 04/10/23 00:45:29 ]
------------------------------------------------------------------------
....

I've 138323 DNSBL entries - just "StevenBlack_ADs" list.
It took 00:45:28 - 00:45:26 = 2 seconds to stop, and then start.

What will happen if this delay is a bit longer ? and/or "service watchdog" was testing at that moment (it does so every 60 seconds) if unbound was running ?
It will also start unbound, so a second instance starts.
This will .... at best, fail and produce the log message you saw.
At worst : as "starting a service" will regenerate all config files just before the service is actually started, the first starting instance is reading the config files ... that are wiped and regenerated at that very moment.
This will create very unpredictable results.

That's why this :

@stephenw10 said in 23.09 Unbound killed failing to reclaim memory:

Generally speaking you shouldn't use the service watchdog unless you're debugging something,

is, IMHO, a huge understatement.
"service watchdog" just isn't smart enough to offer a useful start facility for (especially) unbound.

** you'll see this error if you start unbind and dnsmasq using the same interfaces.
Both want to use port 53, the one that starts last will bail out, as it will f=ind the port already 'occupied'.

Mission-Ghost

@stephenw10 an overnight run with service_watchdog disabled resulted in no 'killed failing to reclaim memory' errors. However, conditions of the test were different than production, namely, no WAN connection.

So, I've connected the box to "WAN" in a lab setting, which is a LAN on my network served by the 23.05.1 box and my normal ISPs.

And, it wouldn't be science without changing more than one variable at at a time, so I'm also testing a hypothesis that the elevated advanced options I'd set for unbound (bigger message cache size, more in and out TCP buffers, for example) in 23.05.1 were causing the "warning: the ulimit(data seg size) is smaller than the expected memory usage..." messages. To do this, I've reset all those memory related advanced settings back to default. Perhaps this is a factor in the "...reclaim memory..." errors.

The restart of unbound after applying the changes resulted in no ulimit warning messages. I'll run in this state for more time and see if those errors no longer appear. If so, then some of the high cache and TCP buffer amounts possible in advanced settings in the pull-downs may not be suitable and might benefit from a review to see if they should be culled.

Thank you for your help. While I don't know the internals well, your ideas have gotten me thinking and maybe I'll be able to at least stop the problem even if I don't understand it. It is curious that the 23.09 update has surfaced these issues.

tim_co

@Mission-Ghost

I'm also seeing this same error.

"kernel pid 47012 (unbound), jid 0, uid 59, was killed: failed to reclaim memory "

It only started after I updated Suricata yesterday (11/17/23). See the thread below for all the details.

https://forum.netgate.com/topic/184101/suricata-process-dying-due-to-hyperscan-problem/17

stephenw10

Also on an 1100?

Mission-Ghost

@Mission-Ghost @tim_co

I've been running 23.09 in production for 24+ hours after reducing the unbound Message Cache Size to 4mb (default) from 512mb, and the Incoming and Outgoing buffers to 10 (default) from 50. I have not seen the 'killed failing to reclaim memory" error since and the routers have had a very stable run so far.

This experience is consistent on both my A and B Netgate 1100 boxes. I ran the B box for about 18 hours without incident, then swapped in the A box and restored the configuration of the B box to the A box so they're the same except for the Hostname. Service_watchdog is installed but set up to watch no services.

So it appears the error in my case is probably related to advanced memory parameters of unbound. Even though my 1100s have sufficient memory, setting up unbound to use more seems to stimulate a latent defect that did not surface in 23.05.1.

Thanks again for your thoughtful comments and help.

Gertjan

@Mission-Ghost said in 23.09 Unbound killed failing to reclaim memory:

unbound Message Cache Size to 4mb (default) from 512mb,

Isn't that half the system memory > Memory: 1GB DDR4 ?

stephenw10

Yes that does seem large for a 1G device. Yet presumably it was running like that in 23.05.1 so some other change has caused that to now become too large.

Unfortunately there's no way we can test every combination of custom values like that. So whilst we ran many upgrade tests on the 1100 none of them would have included that.

Anyway I'm glad you found what looks like the cause here.

Mission-Ghost

@stephenw10 In retrospect it does seem a bit high. But even so it didn't use up all the memory on the box. Memory usage was at about 80%, and, well, why waste it?

Memory usage had gone down some with 23.09 and the old parameters, but with these parameters reset to defaults, it's stable at around 60%.

The error message about being killed for not releasing memory is still interesting. The box wasn't short of memory when these happened. Are there process quotas for memory usage?

stephenw10

There are for some things, like individual php processes for example. I suspect you might have hit that somehow since I'm not aware of a limit for Unbound directly.

tim_co

@stephenw10

Version 2.7.0-RELEASE (amd64)
built on Wed Jun 28 03:53:34 UTC 2023
FreeBSD 14.0-CURRENT

The system is on the latest version.
Version information updated at Mon Nov 20 10:38:18 MST 2023
CPU Type Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
Current: 1204 MHz, Max: 2700 MHz
4 CPUs: 1 package(s) x 2 core(s) x 2 hardware threads
AES-NI CPU Crypto: Yes (active)
QAT Crypto: No
Hardware crypto AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS
Kernel PTI Enabled
MDS Mitigation VERW

stephenw10

That's not the latest pfSense version. But yes there is a known issue with Suricata. There should be an update for that shortly.

jimp

Might not be related but there is a newer version of unbound up now for 23.09 (1.18.0_1) which fixes a bug where unbound could get stuck, which may manifest itself in different ways.

See https://redmine.pfsense.org/issues/14980 for details there.

On 23.09 you can pull in the new version with pkg-static update; pkg-static upgrade -y unbound and then after that make sure you save/apply in the DNS Resolver settings (or reboot) to make sure the new daemon is running.

Mission-Ghost

@jimp thank you very much for following up. I'll go get this update and apply.

Seemed to go in ok:

Updating pfSense-core repository catalogue...
Fetching meta.conf:
pfSense-core repository is up to date.
Updating pfSense repository catalogue...
Fetching meta.conf:
pfSense repository is up to date.
All repositories are up to date.
Updating pfSense-core repository catalogue...
Fetching meta.conf:
pfSense-core repository is up to date.
Updating pfSense repository catalogue...
Fetching meta.conf:
pfSense repository is up to date.
All repositories are up to date.
The following 1 package(s) will be affected (of 0 checked):

Installed packages to be UPGRADED:
	unbound: 1.18.0 -> 1.18.0_1 [pfSense]

Number of packages to be upgraded: 1

1 MiB to be downloaded.
[1/1] Fetching unbound-1.18.0_1.pkg: .......... done
Checking integrity... done (0 conflicting)
[1/1] Upgrading unbound from 1.18.0 to 1.18.0_1...
===> Creating groups.
Using existing group 'unbound'.
===> Creating users
Using existing user 'unbound'.
[1/1] Extracting unbound-1.18.0_1: .......... done
------System/General log:
Nov 20 18:43:08 	check_reload_status 	342 	Syncing firewall
Nov 20 18:43:07 	php-fpm 	51606 	/services_unbound.php: Configuration Change: admin@192.168.10.110 (Local Database): DNS Resolver configured.
Nov 20 18:41:46 	pkg-static 	51529 	unbound upgraded: 1.18.0 -> 1.18.0_1

stephenw10

Let us know if that helps.

Mission-Ghost

[Edit 1: Ooma box started flashing again for no apparent reason, but the 'not found' error discussed below isn't in the DHCP server logs this time. Confounding problem so far. I have tested the drop cable and cable run from the switch--both ok. I've cold-booted the Ooma box multiple times.]

[Edit 2: Could the ISC DHCP binary being different in 23.09 than in 23.05.1 be the issue, as in these discussions: https://www.reddit.com/r/PFSENSE/comments/17xi9ut/isc_dhcp_server_in_2309/ and https://redmine.pfsense.org/issues/15011 ? If so, the switch to Kea may resolve this. I'll report back.]

@stephenw10 unbound seems to continue to work ok with the patch, although it already was working ok after I lowered the advanced settings for buffers and memory.

Oddly, since the update my Ooma Telo2 box has periodically stopped responding to pings and the status light goes from steady blue (everything's fine) to flashing blue-red (not ok). During these episodes the telephone service still works but it's concerning that at some point it won't work. Prior to 23.09 it has worked fine for many months.

I can find no clear cause-effect of the pfSense update to this Ooma box problem beyond the coincidence. I've made no changes to the DHCP configurations nor firewall rules before, during or after the update.

I'm having a bit of a difficult time diagnosing this issue too. Ooma box has no logs, and the pfSense logs are all clear except DHCP server (still ISC DHCP) which while reissuing the address periodically seems to be saying it's "not found" occasionally at times that correspond with the loss of ping. The Ooma box is 192.168.30.103 (on VLAN 30):

Nov 21 10:33:08 	dhcpd 	89986 	DHCPACK on 192.168.30.103 to 00:18:61:59:56:75 via mvneta0.30
Nov 21 10:33:08 	dhcpd 	89986 	DHCPREQUEST for 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30
Nov 21 09:33:07 	dhcpd 	89986 	DHCPACK on 192.168.30.103 to 00:18:61:59:56:75 via mvneta0.30
Nov 21 09:33:07 	dhcpd 	89986 	DHCPREQUEST for 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30
Nov 21 09:33:07 	dhcpd 	89986 	DHCPRELEASE of 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30 (not found)
Nov 21 09:22:24 	dhcpd 	89986 	DHCPACK on 192.168.30.103 to 00:18:61:59:56:75 via mvneta0.30
Nov 21 09:22:24 	dhcpd 	89986 	DHCPREQUEST for 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30
Nov 21 09:20:31 	dhcpd 	89986 	DHCPACK on 192.168.30.103 to 00:18:61:59:56:75 via mvneta0.30
Nov 21 09:20:31 	dhcpd 	89986 	DHCPREQUEST for 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30

I am not observing this "not found" error from a sampling of other devices on the .30 or other VLANs on the network. Just this one.

stephenw10

The timing on those updates is interesting. Exactly 1h between the last updates looks expected. Where it eventually shows 'not found' looks like failed lease renewals. It would be interesting to see what the interval was before that initial failure. Did it go past the lease time for some reason maybe.

Mission-Ghost

@stephenw10 I switched to the Kea DHCP server at at 11:54:53, and I have not had the Ooma box fail to answer pings or function properly since:

Nov 21 15:43:53 	kea-dhcp4 	41486 	INFO [kea-dhcp4.leases.0x7f70e8815f00] DHCP4_LEASE_ALLOC [hwtype=1 00:18:61:59:56:75], cid=[01:00:18:61:59:56:75], tid=0xa77b51c9: lease 192.168.30.103 has been allocated for 7200 seconds
Nov 21 14:43:53 	kea-dhcp4 	41486 	INFO [kea-dhcp4.leases.0x7f70e8818200] DHCP4_LEASE_ALLOC [hwtype=1 00:18:61:59:56:75], cid=[01:00:18:61:59:56:75], tid=0xa77b51c9: lease 192.168.30.103 has been allocated for 7200 seconds
Nov 21 13:43:53 	kea-dhcp4 	41486 	INFO [kea-dhcp4.leases.0x7f70e8817400] DHCP4_LEASE_ALLOC [hwtype=1 00:18:61:59:56:75], cid=[01:00:18:61:59:56:75], tid=0xa77b51c9: lease 192.168.30.103 has been allocated for 7200 seconds
Nov 21 12:43:53 	kea-dhcp4 	41486 	INFO [kea-dhcp4.leases.0x7f70e8816600] DHCP4_LEASE_ALLOC [hwtype=1 00:18:61:59:56:75], cid=[01:00:18:61:59:56:75], tid=0xa77b51c9: lease 192.168.30.103 has been allocated for 7200 seconds
Nov 21 11:43:52 	dhcpd 	89986 	DHCPACK on 192.168.30.103 to 00:18:61:59:56:75 via mvneta0.30
Nov 21 11:43:52 	dhcpd 	89986 	DHCPREQUEST for 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30
Nov 21 11:43:52 	dhcpd 	89986 	DHCPRELEASE of 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30 (not found)
Nov 21 11:33:08 	dhcpd 	89986 	DHCPACK on 192.168.30.103 to 00:18:61:59:56:75 via mvneta0.30
Nov 21 11:33:08 	dhcpd 	89986 	DHCPREQUEST for 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30
Nov 21 11:31:15 	dhcpd 	89986 	DHCPACK on 192.168.30.103 to 00:18:61:59:56:75 via mvneta0.30
Nov 21 11:31:15 	dhcpd 	89986 	DHCPREQUEST for 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30
Nov 21 11:29:22 	dhcpd 	89986 	DHCPACK on 192.168.30.103 to 00:18:61:59:56:75 via mvneta0.30
Nov 21 11:29:22 	dhcpd 	89986 	DHCPREQUEST for 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30
Nov 21 11:25:38 	dhcpd 	89986 	DHCPACK on 192.168.30.103 to 00:18:61:59:56:75 via mvneta0.30
Nov 21 11:25:38 	dhcpd 	89986 	DHCPREQUEST for 192.168.30.103 from 00:18:61:59:56:75 via mvneta0.30

This certainly supports the theory there is something wrong with the 23.09 ISC DHCP server. This is a weird regression. I wouldn't think the ISC binary would have been fiddled with during the 23.09 development given it's being retired. Someone in the previously quoted forum threads replaced their 23.09 ISC binary with a 23.05.1 binary and their problem stopped.

Did something go wrong in the version control system for this module in the 23.09 release?

Something I don't quite get looking at this log why only the Ooma box is logging a lease reallocation every hour. The .30 network alone has half a dozen devices on it, never mind the other five VLANs.

So far the Kea server has been working fine, so I imagine I'll stay with it unless something malfunctions.

SteveITS

@Mission-Ghost said in 23.09 Unbound killed failing to reclaim memory:

something wrong with the 23.09 ISC DHCP server

It's noted in the redmine but the file did change even though the version number didn't. Per the redmine reporters it seems the updated program responds on a random port. Some devices don't particularly care, apparently, and others ignore the response, I suppose until happens to use the correct port, or a port to which it pays attention.

Overall some devices retry quite a bit at each renewal, and per reports others try and try but eventually lose their lease.

Also annoyingly, I found some IoT type devices renew whenever they heck they want to (e.g. 10 minutes not 1 day) so the log spam can be high in some cases.

For me at home, I don't need the things Kea DHCP doesn't have so switched to Kea.

This is all in forum thread https://forum.netgate.com/topic/184104/dhcp-weirdness-after-23-09-upgrade/ :)