23.09 Unbound killed failing to reclaim memory
-
@Mission-Ghost said in 23.09 Unbound killed failing to reclaim memory:
This is a test run of my 'clean install' 23.09
Is that a clean install and config restore or really a bare clean install of 23.09?
-
@stephenw10, clean install and config restore. There's a substantial amount of effort in the configuration. A do-over there isn't a good option.
The in-place upgrade 1100 box is also throwing the same error. I'm going to revert to 23.05.1 and wish for better luck next time:
Nov 15 14:14:35 kernel pid 63112 (unbound), jid 0, uid 59, was killed: failed to reclaim memory
I'm grateful this box has a 23.05.1 boot environment to go back to.
-
What do you have configured there? It's almost certainly something in the config that for some reason Unbound is choking on in 23.09. I'd guess it's pfBlocker-NG with DNS-BL enabled.
-
@stephenw10, thanks for your engagement on this issue. Yes, I do have pfBlocker-NG running with DNS-BL enabled in Unbound python mode.
I've got the 1100 running again in 23.05.1 now, and looking back in the log, I found no 23.05.1 "failed to reclaim memory" errors going back past July. However, I see the following errors both in -05.1 and -09 revs:
Nov 12 20:22:07 php-cgi 44902 servicewatchdog_cron.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1699845727] unbound[12876:0] warning: the ulimit(data seg size) is smaller than the expected memory usage (added size of caches). 1073741824 < 1619767040 bytes [1699845727] unbound[12876:0] error: can't bind socket: Can't assign requested address for 192.168.20.1 port 53 [1699845727] unbound[12876:0] fatal error: could not open ports'
I don't know if the above is related to the errors now manifesting in the -09 version, but it seems plausible. Could this be getting smoked out by the -09 version?
What does this mean and what does one do about it?
-
That seems more likely to be the watchdog attempting to start Unbound when it's already running.
Generally speaking you shouldn't use the Service Watchdog unless you're debugging something. Did you enable it because Unbound was stopping?
Try manually stopping and starting Unbound and see what's logged.
-
@stephenw10, I think you're correct.
I manually stopped unbound and then manually started it, and service_watchdog appears to have tried to start it:
Nov 15 17:18:11 php-cgi 18042 servicewatchdog_cron.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1700093891] unbound[80102:0] warning: the ulimit(data seg size) is smaller than the expected memory usage (added size of caches). 1073741824 < 1619767040 bytes [1700093891] unbound[80102:0] error: bind: address already in use [1700093891] unbound[80102:0] fatal error: could not open ports'
Nov 15 17:18:06 php-cgi 45999 notify_monitor.php: Message sent to [redacted] OK
Nov 15 17:18:01 php-cgi 18042 servicewatchdog_cron.php: Service Watchdog detected service unbound stopped. Restarting unbound (DNS Resolver)
I've been running service watchdog to keep the router in production if something fails and I'm not here to fix it. It doesn't happen often, but it has happened to me.
I believe pfBlocker restarts unbound when it does its nightly update, so some of these errors may be related to that, although pfBlocker update doesn't happen at 8pm, more like 1am.
If this is all the above means, I don't think it's a significant issue. Either way, I've disabled service_watchdog for now on my 23.05.1 box which is the one in service. My 23.09 box is cold on my desk at the moment, although I can set it up as a lab box connected to my laptop if we need to get into it.
-
Ok I would try disabling the watchdog on the 23.09 box and then manually restarting Unbound to see if it shows that same failure. Also note what else is logged at that time.
-
@stephenw10, ok: on the 23.09 box I disabled all services in service_watchdog and restarted unbound. Nothing at all was logged in System/General after the unbound restart, so the service_watchdog messages about unbound did not appear.
Under System/DNS Resolver, just five messages:
notice: init module 0: python
info: [pfBlockerNG]: pfb_unbound.py script loaded
info: [pfBlockerNG]: init_standard script loaded
notice: init module 1: iterator
info: start of service (unbound 1.18.0).
That seems to isolate that issue.
Now I just have to nail down the cause of the "failed to reclaim memory" error that started appearing after the 23.09 update. I'll leave this box running and see if it throws the error. It's not connected to any WAN so that might or might not affect the diagnosis, but we'll see. It could take some hours or overnight to throw the error, if it does.
-
The unreclaimed memory is less concerning than the fact it was killed by something and seemingly not restarted. Though possibly the watchdog simply triggered before it could restart.
I would wait to see if it still gets killed after some time, whether it restarts, and what's logged when it does.
-
Compare yours:
@Mission-Ghost said in 23.09 Unbound killed failing to reclaim memory:
Nov 15 11:15:03 kernel pid 26339 (unbound), jid 0, uid 59, was killed: failed to reclaim memory
With :
@jorgek said in Suricata 7.0.0 being killed by kernel in 23.09.:
Nov 15 16:01:28 kernel pid 28473 (suricata), jid 0, uid 0, was killed: failed to reclaim memory
from the thread Suricata 7.0.0 being killed by kernel in 23.09, which started yesterday.
I've not seen this "failed to reclaim memory" error myself while using 23.09.
Let's see:
[23.09-RELEASE][root@pfSense.bhf.tld]/root: cd /var/log
[23.09-RELEASE][root@pfSense.bhf.tld]/var/log: grep 'reclaim' *.log
returns nothing, so no lines with 'reclaim' in any log file.
In the other thread with the same 'reclaim' error it was Suricata, which manipulates pf firewall rules - and pf changed recently.
AFAIK, Unbound doesn't manipulate pf rules.
So, not related?
@Mission-Ghost said in 23.09 Unbound killed failing to reclaim memory:
[1699845727] unbound[12876:0] error: can't bind socket: Can't assign requested address for 192.168.20.1 port 53 [1699845727] unbound[12876:0] fatal error: could not open ports'
I don't know if the above is related to the errors now manifesting in the -09 version but it seems plausible. Could this be getting smoked out by -09 version?
What does this mean and what does one do about it?
What it means: unbound starts and wants to use the selected interface(s), binding (== using) port 53, UDP and TCP.
But it couldn't, as the port was already "in use" by another process.
There is of course just one** process that uses port 53 on pfSense: unbound.
The message means: there is already an instance of unbound running, so the new process, with PID 12876, bails out.
If there is a "service watchdog" in play, don't look any further: the "service watchdog" is a nuclear-warhead peacekeeper solution: when it is used, there will be some fallout.
@Mission-Ghost said in 23.09 Unbound killed failing to reclaim memory:
I manually stopped unbound and then manually started it, and service_watchdog appears to have tried to start it:
There you have it.
"unbound" itself needs some time to start, as it has lot to do.
"unbound with python mode" has a little bit, like several ms more, work to do.
"unbound with python mode with a load of DNSBL entries" has even more to do. A startup time of "second" or more isn't unusual.Example : an extract of my nightly pfBlockerng DNSBL cron based reload :
....
----------------------------------------
 Original   Matches   Removed   Final
----------------------------------------
 184025     41667     45702     138323
-----------------------------------------
TLD finalize... completed [ 04/10/23 00:45:24 ]
Saving DNSBL statistics... completed [ 04/10/23 00:45:26 ]
Reloading Unbound Resolver (DNSBL python).
Stopping Unbound Resolver.
Unbound stopped in 2 sec.
Additional mounts (DNSBL python): No changes required.
Starting Unbound Resolver... completed [ 04/10/23 00:45:28 ]
Resolver cache restored
DNSBL update [ 138323 | PASSED ]... completed [ 04/10/23 00:45:29 ]
------------------------------------------------------------------------
....
I've 138323 DNSBL entries - just "StevenBlack_ADs" list.
It took 00:45:28 - 00:45:26 = 2 seconds to stop and then start.
What will happen if this delay is a bit longer, and/or "service watchdog" was checking at that moment (it does so every 60 seconds) whether unbound was running?
It will also start unbound, so a second instance starts.
At best, this will fail and produce the log message you saw.
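Back-of-the-envelope, the odds of the watchdog's 60-second poll landing inside such a restart window aren't negligible. A small sketch (the 2-second window is from the reload log above; treating the poll as landing uniformly at random, and the count of nightly reloads, are simplifying assumptions on my part):

```python
# Illustrative numbers: the Service Watchdog cron polls every 60 s, and an
# Unbound DNSBL reload keeps the service "stopped" for about 2 s.
POLL_INTERVAL_S = 60      # watchdog check period
RESTART_WINDOW_S = 2      # time Unbound reports "stopped" during a reload

# Chance a given poll lands inside one reload window:
p_per_reload = RESTART_WINDOW_S / POLL_INTERVAL_S
print(f"chance per reload: {p_per_reload:.1%}")   # ~3.3%

# Over a month of nightly pfBlocker reloads (assumed 30 reloads):
reloads = 30
p_month = 1 - (1 - p_per_reload) ** reloads
print(f"chance of at least one collision in {reloads} nights: {p_month:.0%}")
```

So even a short stop window, polled nightly, makes an eventual collision with the watchdog more likely than not over a month.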
At worst: since "starting a service" regenerates all config files just before the service actually starts, the first starting instance may be reading config files that are being wiped and regenerated at that very moment.
This will create very unpredictable results.
That's why this:
@stephenw10 said in 23.09 Unbound killed failing to reclaim memory:
Generally speaking you shouldn't use the service watchdog unless you're debugging something,
is, IMHO, a huge understatement.
"service watchdog" just isn't smart enough to offer a useful start facility for (especially) unbound.** you'll see this error if you start unbind and dnsmasq using the same interfaces.
Both want to use port 53, the one that starts last will bail out, as it will f=ind the port already 'occupied'. -
@stephenw10 an overnight run with service_watchdog disabled resulted in no 'killed failing to reclaim memory' errors. However, conditions of the test were different than production, namely, no WAN connection.
So, I've connected the box to "WAN" in a lab setting, which is a LAN on my network served by the 23.05.1 box and my normal ISPs.
And, it wouldn't be science without changing more than one variable at a time, so I'm also testing a hypothesis: that the elevated advanced options I'd set for unbound in 23.05.1 (bigger Message Cache Size, more incoming and outgoing TCP buffers, for example) were causing the "warning: the ulimit(data seg size) is smaller than the expected memory usage..." messages. To do this, I've reset all those memory-related advanced settings back to defaults. Perhaps this is a factor in the "...reclaim memory..." errors.
The restart of unbound after applying the changes resulted in no ulimit warning messages. I'll run in this state for more time and see if those errors no longer appear. If so, then some of the high cache and TCP buffer amounts possible in advanced settings in the pull-downs may not be suitable and might benefit from a review to see if they should be culled.
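The numbers in the earlier ulimit warning line up with that hypothesis. A sketch of the arithmetic (the rrset-cache doubling and the makeup of the leftover few MiB are assumptions on my part, not confirmed from the pfSense source):

```python
MiB = 1024 * 1024

# Advanced settings before the reset (assumption: the generated unbound.conf
# sets rrset-cache-size to twice the configured Message Cache Size).
msg_cache = 512 * MiB        # Message Cache Size pull-down
rrset_cache = 2 * msg_cache  # 1024 MiB

expected = msg_cache + rrset_cache   # 1610612736 bytes
ulimit_data = 1024 * MiB             # the 1073741824-byte limit from the log

print(f"caches {expected} > ulimit {ulimit_data}: {expected > ulimit_data}")

# The logged estimate, 1619767040 bytes, is this sum plus roughly 8.7 MiB,
# presumably smaller caches and overhead - still well over the 1 GiB limit.
print(f"remainder: {1619767040 - expected} bytes")
```

That would explain why the warning appeared with the elevated settings and disappeared once they were back at defaults (4 MiB + 8 MiB of caches is nowhere near the limit).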
Thank you for your help. While I don't know the internals well, your ideas have gotten me thinking and maybe I'll be able to at least stop the problem even if I don't understand it. It is curious that the 23.09 update has surfaced these issues.
-
I'm also seeing this same error.
"kernel pid 47012 (unbound), jid 0, uid 59, was killed: failed to reclaim memory "
It only started after I updated Suricata yesterday (11/17/23). See the thread below for all the details.
https://forum.netgate.com/topic/184101/suricata-process-dying-due-to-hyperscan-problem/17
-
Also on an 1100?
-
I've been running 23.09 in production for 24+ hours after reducing the unbound Message Cache Size to 4 MB (default) from 512 MB, and the Incoming and Outgoing TCP buffers to 10 (default) from 50. I have not seen the "killed failing to reclaim memory" error since, and the routers have had a very stable run so far.
This experience is consistent on both my A and B Netgate 1100 boxes. I ran the B box for about 18 hours without incident, then swapped in the A box and restored the configuration of the B box to the A box so they're the same except for the Hostname. Service_watchdog is installed but set up to watch no services.
So it appears the error in my case is probably related to advanced memory parameters of unbound. Even though my 1100s have sufficient memory, setting up unbound to use more seems to stimulate a latent defect that did not surface in 23.05.1.
Thanks again for your thoughtful comments and help.
-
@Mission-Ghost said in 23.09 Unbound killed failing to reclaim memory:
unbound Message Cache Size to 4mb (default) from 512mb,
Isn't that half the system memory? (Memory: 1 GB DDR4)
-
Yes that does seem large for a 1G device. Yet presumably it was running like that in 23.05.1 so some other change has caused that to now become too large.
Unfortunately there's no way we can test every combination of custom values like that. So whilst we ran many upgrade tests on the 1100 none of them would have included that.
Anyway I'm glad you found what looks like the cause here.
-
@stephenw10 In retrospect it does seem a bit high. But even so it didn't use up all the memory on the box. Memory usage was at about 80%, and, well, why waste it?
Memory usage had gone down some with 23.09 and the old parameters, but with these parameters reset to defaults, it's stable at around 60%.
The error message about being killed for not releasing memory is still interesting. The box wasn't short of memory when these happened. Are there process quotas for memory usage?
-
There are for some things, like individual php processes for example. I suspect you might have hit that somehow since I'm not aware of a limit for Unbound directly.
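For reference, the limit the earlier ulimit warning compares against is the per-process data-segment rlimit ("ulimit -d"), which can be inspected on any POSIX system. A hedged sketch in Python (values shown will differ per system; this only shows where the 1073741824-byte figure comes from, not how pfSense sets it):

```python
import resource

# Inspect the data-segment limit that Unbound's startup warning compares
# its cache-size estimate against.
soft, hard = resource.getrlimit(resource.RLIMIT_DATA)

def fmt(v: int) -> str:
    """Render an rlimit value, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if v == resource.RLIM_INFINITY else f"{v} bytes"

print("data seg size  soft:", fmt(soft), " hard:", fmt(hard))
```

If the soft limit is below the summed cache sizes, Unbound logs exactly the "ulimit(data seg size) is smaller than the expected memory usage" warning seen above.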
-
Version 2.7.0-RELEASE (amd64)
built on Wed Jun 28 03:53:34 UTC 2023
FreeBSD 14.0-CURRENT
The system is on the latest version.
Version information updated at Mon Nov 20 10:38:18 MST 2023
CPU Type Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
Current: 1204 MHz, Max: 2700 MHz
4 CPUs: 1 package(s) x 2 core(s) x 2 hardware threads
AES-NI CPU Crypto: Yes (active)
QAT Crypto: No
Hardware crypto AES-CBC,AES-CCM,AES-GCM,AES-ICM,AES-XTS
Kernel PTI Enabled
MDS Mitigation VERW
-
That's not the latest pfSense version. But yes there is a known issue with Suricata. There should be an update for that shortly.