KEA service stopping through the day
-
@marcosm I have uploaded the file I downloaded the other day, and will also upload another from today, which definitely lines up with when the service unexpectedly stopped this morning. I resolved it by simply restarting the service. I was surprised that the watchdog did not restart the service for me.
-
@DavidIr It would help to have some additional info about the system. You can get that by going to /status.php.
-
@marcosm status_output.tgz uploaded to the same link provided above.
Since the previous messages I have installed and configured the Service Watchdog plugin.
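For anyone unfamiliar with what a watchdog does, here is a minimal sketch of a single check-and-restart pass. It is not the Service Watchdog code, only an illustration of the idea. The names `process_is_running` and `watchdog_pass` are made up for this example, and the restart action is passed in as a callable because the actual restart command on the box is not something I want to guess at.

```python
import subprocess
from typing import Callable


def process_is_running(name: str) -> bool:
    """Return True if pgrep finds at least one process with exactly this name."""
    result = subprocess.run(
        ["pgrep", "-x", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # pgrep exits with status 0 only when it found a matching process.
    return result.returncode == 0


def watchdog_pass(is_running: Callable[[], bool],
                  restart: Callable[[], None]) -> bool:
    """Run one check and call restart() if the service is down.

    Returns True if a restart was attempted, False if the service was up.
    """
    if is_running():
        return False
    restart()
    return True
```

A real watchdog would call something like `watchdog_pass` on a timer and log each restart, so a service that keeps dying does not go unnoticed just because it keeps coming back.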
-
In case you need any additional info: I am now on holiday until Jan 5th, so I will not see or be able to respond to any posts or requests for info until I return.
-
Hi
This weekend exactly the same core dump happened on my Netgate 3100. I wonder if there is a solution for this problem?
Regards,
-
"Good news" is that the reason for the core dump was a signal 6, which means the process itself has chosen to 'pull the brakes', most probably because resources were missing, like not enough RAM to name one.
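For reference, signal 6 is SIGABRT, the signal a process normally raises against itself by calling abort(). A small sketch to see this on any POSIX system: it starts a throwaway child process that aborts itself, then decodes the exit status. The helper names `describe_exit` and `run_aborting_child` are only for this illustration and have nothing to do with Kea itself.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        sig = signal.Signals(-returncode)
        return f"killed by signal {sig.value} ({sig.name})"
    return f"exited normally with status {returncode}"


def run_aborting_child() -> int:
    """Start a child Python process that calls abort() on itself."""
    child = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stderr=subprocess.DEVNULL,
    )
    return child.returncode


if __name__ == "__main__":
    print(describe_exit(run_aborting_child()))
```

The point is that the kernel did not kill the process from outside. Something inside the process (an assertion, an uncaught exception, or the memory allocator detecting a problem) decided to stop and leave a core dump behind.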
-
@Gertjan said in KEA service stopping through the day:
"Good news" is that the reason for the core dump was a signal 6, which means the process itself has chosen to 'pull the brakes', most probably because resources were missing, like not enough RAM to name one.
Yes, heap corruption in this case. This is turning into quite the rabbit hole. Unfortunately, this looks like an issue deeper than Kea, like a failure in libcxxrt or jemalloc. We've got some test hardware set up with some additional logging and tuning to jemalloc to try to get a better view of the state of the world before the abort. But the core dump is gnarly; the heap is trashed. The effort required to fix this might be out of scope for an EOL platform, both for us and for upstream. Will know more soon.
-
@cmcdonald Thank you for looking into this. I was hoping that my submissions would help others, but it sounds quite challenging, and as you say, the EOL hardware (and no doubt the additional challenges of it running 32-bit) may bring an end to the investigations.
If I can contribute anything to help let me know.
Not sure if this is relevant or helpful, but I do seem to have reduced the frequency of the service failing by removing NUT from the box (it was having issues talking to my UPS over the USB port). This may be a coincidence rather than anything actually linked.
-
@DavidIr That sounds interesting. I also installed the NUT package recently. Maybe it is correlated?
I will uninstall it just in case.
-
@rafal-arciszewski There is no evidence of a connection to NUT. It was only an idea I had last night, so don't read too much into that bit. The more relevant post is the one from cmcdonald, who is trying to analyse the error at a deep technical level.