KEA service stopping through the day

Markito

@SteveITS Thanks :) I had not realized that.

ThomasDr

@w0w
This can happen if you have switched from dhcpd to kea but have not changed the service watchdog.

chudak

@jimp Today I see TS state unexpected state: NoState, and removing /tmp/kea4-ctrl-socket.lock does not help.

Something new?

marcosm

This issue should be handled with the 24.11-RC. Feedback on it would be helpful if you were hitting this previously.

darkrolder

@marcosm This issue is happening to me.
A few nights ago I woke up to some "IoT" things flashing because they couldn't connect to their Wi-Fi.
I found I didn't have internet; however, when I got up at 6 it was working again without user intervention, so I am not sure.

This morning I woke up to no "internet"
(some statically configured things on Ethernet were working, obviously), but everything on Wi-Fi was offline.

On the router, Kea IPv4 was offline and I had to click the start button. For now I have installed the Service Watchdog package to restart it automatically. I'd send logs if I knew where they are and which ones you want to help diagnose this. Or is this even related? (Running 24.11.)

best regards
-Rolder

DavidIr

@darkrolder said in KEA service stopping through the day:

On the router, Kea IPv4 was offline and I had to click the start button. For now I have installed the Service Watchdog package to restart it automatically. I'd send logs if I knew where they are and which ones you want to help diagnose this. Or is this even related? (Running 24.11.)

I have also had this experience on my Netgate 3100. The only detail I could find in the logs was:

Dec 5 20:58:54	kernel		pid 67465 (kea-dhcp4), jid 0, uid 0: exited on signal 6 (core dumped)

For some reason the kea-dhcp4 process does not seem to be generating any log entries on my device, so it is really hard to work out whether it's connected.
After reading a few posts, I have increased the size of my DHCP pool in case some IoT devices are doing something odd (I saw this as a possibility in another thread).

I am now in the monitoring phase, but I would not have expected a DHCP service to fail so spectacularly if it ran out of addresses in its pool to give out.
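
For anyone who wants to pull these crash entries out of a saved system log, here is a minimal Python sketch. It assumes the kernel message always has the layout of the single line quoted above; the function name `parse_crash_line` and the returned keys are made up for illustration and are not part of pfSense or Kea.

```python
import re
import signal

# Assumption: the field layout is inferred from the one kernel log line
# quoted in this thread; other messages may look different.
CRASH_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<name>[^)]+)\), jid (?P<jid>\d+), "
    r"uid (?P<uid>\d+): exited on signal (?P<signal>\d+)"
    r"(?P<core> \(core dumped\))?"
)


def parse_crash_line(line):
    """Return a dict describing the crash, or None if the line does not match."""
    match = CRASH_RE.search(line)
    if match is None:
        return None
    signum = int(match.group("signal"))
    try:
        # Map the number to a name using the local platform's signal table.
        signame = signal.Signals(signum).name
    except ValueError:
        signame = "unknown"
    return {
        "pid": int(match.group("pid")),
        "name": match.group("name"),
        "signal": signum,
        "signal_name": signame,
        "core_dumped": match.group("core") is not None,
    }


if __name__ == "__main__":
    sample = ("Dec 5 20:58:54\tkernel\t\tpid 67465 (kea-dhcp4), jid 0, uid 0: "
              "exited on signal 6 (core dumped)")
    print(parse_crash_line(sample))
```

This only summarizes what the kernel already logged; it says nothing about why the process aborted.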

darkrolder

@DavidIr I don't think I have that many IoT devices, maybe 10-15? Mostly light switches, plus my cell phones, TV, etc. I have them on their own VLAN with a DHCP pool of about 150 IPs. My brother (an IT admin) suggested changing the DHCP lease time from the default 2 hours to 24 hours. I have done this and so far so good, but it's only been 1 day, so only time will tell if this helps.

-Rolder

marcosm

@DavidIr Did this happen on 24.11? The core dump file should be in /root - sharing that would help determine what happened.

DavidIr

@marcosm

@marcosm said in KEA service stopping through the day:

@DavidIr Did this happen on 24.11? The core dump file should be in /root - sharing that would help determine what happened.

Yes it did.

I assume the kea-dhcp4.core file?
I'm not familiar with these files - is it safe to just attach to the forum, or should I send in some other way?

marcosm

@DavidIr Yes. Check the timestamps of the core files, e.g. with ls -lha /root/*.core, and if they are similar (i.e. potentially related), upload them here.
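
If you have several core files and want to line them up against the time the service stopped, a rough Python sketch of the same check could look like this. The helper names `list_core_files` and `cores_near` and the 5-minute window are assumptions for illustration; only the /root/*.core location comes from this thread.

```python
import glob
import os
from datetime import datetime


def list_core_files(directory="/root"):
    """Return (mtime, path) pairs for every *.core file, oldest first."""
    paths = glob.glob(os.path.join(directory, "*.core"))
    return sorted((os.path.getmtime(path), path) for path in paths)


def cores_near(entries, when, window_seconds=300):
    """Return the paths whose modification time is within the window of `when`.

    `when` is a Unix timestamp, e.g. the time of the "exited on signal" log line.
    The default window of 300 seconds is an arbitrary choice.
    """
    return [path for mtime, path in entries if abs(mtime - when) <= window_seconds]


if __name__ == "__main__":
    for mtime, path in list_core_files():
        stamp = datetime.fromtimestamp(mtime).isoformat(sep=" ", timespec="seconds")
        print(stamp, path)
```

A core file whose timestamp matches the kernel log entry is the one worth uploading first.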

DavidIr

@marcosm I have uploaded the file I downloaded the other day, and will also upload another from today which definitely aligns with when the service unexpectedly stopped this morning. I resolved it by simply restarting the service. I was surprised that the watchdog did not restart the service for me.

DavidIr

@DavidIr Hi @marcosm any joy looking at the dump files?

I have implemented the changes in https://forum.netgate.com/post/1199521 for now in the hope this will restart the DHCP service when it fails, but would love to understand what's going on and help solve the potentially wider issue.

Thank you

marcosm

@DavidIr It would help to have some additional info about the system. You can get that by going to /status.php.

DavidIr

@marcosm status_output.tgz uploaded to the same link provided above.

Since the previous messages I have installed and configured the Service Watchdog package.

DavidIr

In case you need any additional info I am now on holiday until Jan 5th so will not see or be able to respond to any posts or requests for info until I return.

rafal.arciszewski

Hi
This weekend exactly the same core dump happened on my Netgate 3100.

I wonder if there is a solution for this problem?

Regards,

Gertjan

@rafal-arciszewski

"Good news" is that the cause of the core dump was signal 6 (SIGABRT), which means the process itself chose to 'pull the brakes', most probably because a resource was missing, not enough RAM to name one.

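To illustrate what "exited on signal 6" looks like from the outside, here is a small Python sketch for a POSIX system. It starts a throwaway child process that deliberately calls abort() and then reads back its exit status. The function name `run_and_abort` is made up, and this says nothing about why kea-dhcp4 itself aborted.

```python
import signal
import subprocess
import sys


def run_and_abort():
    """Run a child Python process that calls os.abort() and return its exit status.

    On POSIX systems subprocess reports death by signal N as return code -N.
    """
    child = subprocess.run(
        [sys.executable, "-c", "import os; os.abort()"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return child.returncode


if __name__ == "__main__":
    status = run_and_abort()
    # A negative status means the child was killed by that signal number.
    print(status, signal.Signals(-status).name)
```

The point is only that an abort is something the process (or a library inside it) triggers on itself, as opposed to being killed from outside.
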
cmcdonald

@Gertjan said in KEA service stopping through the day:

@rafal-arciszewski

"Good news" is that the cause of the core dump was signal 6 (SIGABRT), which means the process itself chose to 'pull the brakes', most probably because a resource was missing, not enough RAM to name one.

Yes, heap corruption in this case. This is turning into quite the rabbit hole. Unfortunately, this looks like an issue deeper than Kea, like a failure in libcxxrt or jemalloc. We've got some test hardware set up with some additional logging and jemalloc tuning to try to get a better view of the state of the world before the abort. But the core dump is gnarly; the heap is trashed. The effort required to fix this might be out of scope for an EOL platform, both for us and for upstream. Will know more soon.

DavidIr

@cmcdonald Thank you for looking into this. I was hoping that my submissions would help others, but it sounds quite challenging, and as you say the EOL hardware (and no doubt the additional challenges of it running 32-bit) may bring an end to the investigations.

If I can contribute anything to help let me know.

Not sure if this is relevant or helpful, but I do seem to have managed to reduce the frequency of the service failing by removing NUT from the box (it was having issues talking to my UPS on the USB port), although this may be a coincidence rather than anything linked.

rafal.arciszewski

@DavidIr That sounds interesting. I also installed the NUT package recently. Maybe it is correlated?
I will uninstall it just in case.