Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory
-
@jrey said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
@bmeeks said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
Perhaps not all the bugs have been slain ???
He seems to be running 24.03, which should be running unbound 1.19.3 at start of service (at least it does on my 2100).
It might be a fair question, even though @Mission-Ghost doesn't think it is related to pfBlocker based on the time stamps of the events, to also ask which version of pfBlocker is installed on the system.
Here's my memory chart for the last month; uptime now is 28 days:
I still also say - turn the service watchdog off.
I'd love to...once this stops killing production at random times and requiring manual intervention.
-
@jrey said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
seems to be a recurring issue for you going back to 23.05 and 23.09?
So has the problem ever gone away?
https://forum.netgate.com/topic/184130/23-09-unbound-killed-failing-to-reclaim-memory
No. It ebbs and flows. It seemed better enough for a while that I turned off service watchdog and then it recurred killing the network, so I turned service watchdog back on. So I'm trying again to figure it out with this forum's generous help.
So, after my last post I have changed in unbound the following:
- memory cache size down from 10 MB to 4 MB
- incoming and outgoing TCP buffers down from 20 to 10
- EDNS buffer size from Automatic back to the default
We'll see if this has a material effect on the errors posting to the log and unbound being killed for failing to reclaim memory.
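For reference, those GUI settings map onto unbound.conf directives roughly like this. This is only an illustrative sketch: pfSense generates /var/unbound/unbound.conf itself, so these values are not something to hand-edit, and the exact values it writes may differ.

```
server:
    msg-cache-size: 4m       # "Message Cache Size" lowered from 10 MB to 4 MB
    rrset-cache-size: 8m     # pfSense sets this to twice the message cache
    incoming-num-tcp: 10     # incoming TCP buffers, lowered from 20
    outgoing-num-tcp: 10     # outgoing TCP buffers, lowered from 20
    edns-buffer-size: 1232   # unbound's current default, instead of "Automatic"
```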
-
@Mission-Ghost said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
We'll see if this has a material effect on the errors posting to the log and unbound being killed for failing to reclaim memory.
Do not be misled by the kernel's unfortunately worded error message. The OOM killer is a kernel process that is launched to unconditionally terminate the largest consumer of user-space memory in order to prevent the system from becoming unstable. It's the kernel saying "unbound has continued to consume more and more memory and has not returned any back to the system pool". The fault here lies with unbound, which keeps consuming memory and never releases it back to the system.
-
@Mission-Ghost said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
The only Advanced check boxes I have in the DNS Resolver (unbound) are Prefetch Support and Keep Probing.
Just for fun, try turning off Prefetch Support and restarting unbound.
-
@jrey said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
@Mission-Ghost said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
The only Advanced check boxes I have in the DNS Resolver (unbound) are Prefetch Support and Keep Probing.
Just for fun, try turning off Prefetch Support and restarting unbound.
This was my first experiment. Unfortunately it was ineffective, so I turned it back on.
So far no failures following the changes I mentioned a couple of posts ago. I may return them to original values one at a time until a failure is noted so I can isolate the issue.
Someday a project to substantially improve the log messages to aid in end-user understanding of what is happening and proper recovery procedures would add tremendous value to the product.
-
@Mission-Ghost said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
This was my first experiment.
Ah sorry, I missed that. How long did it run then?
I just mentioned it because it came up in an earlier bug report at NLnet Labs that should have been fixed by 1.19.3 (the version I am running, and that you should be on with 24.03).
Do you keep a log of your settings and the runtime you get from each test before it fails? If you don't, starting one may or may not prove helpful in trying to find the answer.
Although it is manifesting as an unbound memory error, I wouldn't rule out something else reducing the available memory over time, or running past a limit at a given point in time. You have established (by the lack of matching event times) that the occurrences are not consistent with, say, pfBlockerNG, but what about, say, mailreport? I've never used it, but if it does something that requires lookups when generating its reports (again, I don't know that it does), then on a system with limited memory it might stretch things to the point where unbound isn't releasing memory fast enough from a limited pool, and unbound gets whacked because that is what the kernel sees as the "bad guy" at that time.
At a glance, mailreport doesn't really say much about what it reports on (or what you are reporting on), saying only that it "Allows you to setup periodic e-mail reports containing command output, and log file contents".
Huge spikes in memory likely wouldn't show on a 28-day memory graph with a 1-day sample resolution.
When it happens, try narrowing the graph to less time and a finer resolution and see what is going on, maybe something like an hour with 1-minute resolution.
OR
Since the data is still there, you could use the custom feature of the graph and specify the time range and resolution around a previous time it gave up. Again, not too small a time slice; lean toward more time leading up to the logged event.
Again, this may or may not be helpful or give some clues. I'm really just trying to see why unbound is the unwilling victim getting killed so randomly. It should run for days and days, weeks and weeks without this, restarting only when something like pfBlocker tells it to (those restart events will be listed in pfblockerng.log as well as the resolver log). On most systems that restart is quick and you wouldn't normally notice it. One would also assume that unbound flushes everything during the restart: the service is stopping and starting, so any memory it held would/should be cleared at that point.
It is at that point (while unbound is restarting) that, if things are a little sluggish, the watchdog may see the service down and try to start another copy, or worse, multiple copies; then things usually go really bad, really fast. I understand that you believe the watchdog is in your best interest, but in this case it really is not. Still up to you how to proceed, of course.
Edit: if you are looking for the restarts in pfblockerng.log, they will look like this (but may not be there every time pfBlocker runs, as it only restarts unbound when it has to, i.e. when a list has changed):
Saving DNSBL statistics... completed
Reloading Unbound Resolver (DNSBL python)
Stopping Unbound Resolver. Unbound stopped in 2 sec.
Additional mounts (DNSBL python): No changes required.
Starting Unbound Resolver... completed [ 09/13/24 14:15:09 ]
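Pulling those restart events out of the log can be done with a simple grep. A sketch follows, using a throwaway sample file so it runs anywhere; on a live pfSense box the actual pfblockerng.log path is an assumption to verify on your system (commonly under /var/log/pfblockerng/).

```shell
# Find unbound restart events in a pfBlockerNG-style log.
# LOG points at a canned sample here; on a real system substitute
# the actual pfblockerng.log path (location may vary by version).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Saving DNSBL statistics... completed
Reloading Unbound Resolver (DNSBL python)
Stopping Unbound Resolver. Unbound stopped in 2 sec.
Additional mounts (DNSBL python): No changes required.
Starting Unbound Resolver... completed [ 09/13/24 14:15:09 ]
DNSBL lists unchanged, no restart needed
EOF

# Each reload shows up as a Reloading/Stopping/Starting sequence.
grep -E 'Reloading Unbound|Unbound stopped|Starting Unbound' "$LOG"
rm -f "$LOG"
```

Comparing the timestamps of those lines against the kernel's "failed to reclaim memory" events is a quick way to confirm or rule out pfBlocker-triggered restarts.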
-
Thank you for your thoughtful reply. It's interesting and gave me some things to think about.
@jrey said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
@Mission-Ghost said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
This was my first experiment.
Ah sorry, I missed that. How long did it run then?
Just a day or two before the next event occurred.
Do you keep a log of your settings and the runtime you get from each test before it fails? It may or may not prove helpful in trying to find the answer if you don't.
No. I imagine it might be helpful.
Although it is manifesting as an unbound memory error, I wouldn't rule out something else reducing the available memory over time, or running past a limit at a given point in time.
An interesting idea. Without logs or symptoms pointing specifically to an item it seems it would be a bit of a fishing expedition.
at a glance mailreport doesn't really say much about what it is reporting on (or what you are reporting on) saying only that:
"Allows you to setup periodic e-mail reports containing command output, and log file contents"
I use mailreport to email the last several lines of some logs to myself once a day, to keep tabs on possible problems without having to sign in every day and review the logs in detail. It's how I flagged this problem.
I'm really just trying to see why unbound is the unwilling victim getting killed so randomly. It should run for days and days, weeks and weeks without this, restarting only when something like pfBlocker tells it to.
Indeed, I have not had a failure event since I lowered the memory cache size, in/out buffers and edns buffer three days ago.
So far the evidence supports the hypothesis that while these higher settings didn't precipitate an unbound failure right away, they did allow unbound to grow later beyond the resources available on the 1100. Clearly the 1100 can't support more than the default settings in at least a couple of areas. I previously had an update fail because I'd used non-default settings in another part of the system; it was tricky to diagnose.
It would be helpful if the manuals included more discussion recommending against adjusting these defaults upward on an 1100, given its resource limitations.
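To keep an eye on whether the caches stay within the configured limits between tests, you can poll unbound's statistics counters. A minimal sketch follows; "sample" is canned unbound-control output so it runs anywhere, and on pfSense the real command and its config path are assumptions to verify, something like `unbound-control -c /var/unbound/unbound.conf stats_noreset`.

```shell
# Inspect unbound's memory counters from its statistics output.
# "sample" is canned unbound-control output so the sketch runs anywhere;
# on a live box you would pipe the real stats_noreset output instead.
sample='mem.cache.rrset=131072
mem.cache.message=65536
mem.mod.iterator=16588
total.num.queries=1024'

# Keep only the mem.* counters and print them in bytes.
printf '%s\n' "$sample" | awk -F= '/^mem\./ { printf "%-20s %10d bytes\n", $1, $2 }'
```

Logging these values alongside each settings change would also give the per-test record suggested earlier in the thread.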
-
Do you use swap? If so, you should have a crash report in the crash logs.
-
@JonathanLee said in Repeating problem: (unbound), jid 0, uid 59, was killed: failed to reclaim memory:
Do you use swap? If so, you should have a crash report in the crash logs.
Best I can tell from documentation and forum posts is the 1100 does not have/use swap space.
-
@Mission-Ghost you could always configure a USB drive to be your swap.
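If you go that route, the generic FreeBSD steps look roughly like this. A sketch only: da0 is an assumed device name, the commands destroy the drive's contents, and pfSense may not preserve such manual changes across upgrades, so treat it as an experiment rather than a supported configuration.

```shell
# Generic FreeBSD sketch for turning a spare USB drive into swap.
# WARNING: destroys all data on the drive. Device name da0 is an
# assumption; check dmesg after plugging the drive in.
gpart destroy -F da0              # wipe any existing partition table
gpart create -s gpt da0           # create a new GPT scheme
gpart add -t freebsd-swap da0     # one swap partition spanning the drive
swapon /dev/da0p1                 # enable it immediately
swapinfo -h                       # verify the swap is active
# To persist across reboots, add a line like this to /etc/fstab:
#   /dev/da0p1  none  swap  sw  0  0
```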