Kea DHCP Server Constantly Fails to Give out Leases in HA Mode - Unusable in Production in HA
-
Netgate Support Ticket — Kea DHCP HA Lease Conflict Failures
Environment
- pfSense Version: CE 2.8.1-RELEASE
- Hardware: Two-node HA pair
- Primary (pfs601i): 192.168.156.1 / 10.200.0.2
- Standby (pfs602i): 192.168.156.2 / 10.200.0.3
- HA Mode: Hot-standby
- DHCP Backend: Kea DHCP
- Subnets: 10.200.0.0/16, 172.16.108.0/22, 172.16.0.0/22, 192.168.200.0/24
- Client Base: Mixed — Apple devices (iOS 14+/macOS Ventura+ with MAC randomization), Android, Windows, UniFi U7 Pro APs
Problem Description
Every morning during peak hours, users are unable to obtain IP addresses via DHCP, resulting in complete loss of connectivity. The failures are caused by HA_LEASE_UPDATE_CONFLICT errors with ResourceBusy (error code 4) on the standby node, causing the HA pair to accumulate conflicts until it reaches the terminated state and stops serving leases entirely.
The issue is reproducible daily and affects all client types on the primary subnet.
Root Causes Identified
1. HA Resync Failure After Any Interruption
Any restart of either Kea node, whether planned or unplanned, causes the lease databases to diverge. The standby node does not automatically resync from the primary when it comes back online. Subsequent lease update attempts from the primary are rejected by the standby with ResourceBusy because the standby holds conflicting lease records. There is no automatic mechanism to resolve this divergence without manual intervention.
2. Apple MAC Address Randomization Interaction
iOS 14+ and macOS Ventura+ rotate MAC addresses per network by default. With the default 24-hour maximum lease time, old leases from rotated MACs accumulate in the standby's database overnight. When clients reconnect in the morning with new MACs, the primary attempts to issue new leases but the standby rejects updates because it still holds the old conflicting records.
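One way to check how many conflicting leases actually involve randomized addresses is to test the locally administered bit (0x02 of the first octet), which randomized Wi-Fi addresses set and vendor-assigned addresses leave clear. The sketch below is only a diagnostic aid: the helper name is made up for illustration, the second sample address is invented, and a set bit does not prove randomization, since virtual machines and some other software also use locally administered addresses.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the locally administered bit (0x02) is set in the first octet."""
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    return bool(first_octet & 0x02)


# The first address appears in the log excerpts of this report.
# The second is an invented example of a randomized-style address.
for mac in ("60:3e:5f:80:81:8b", "da:a1:19:00:00:01"):
    kind = "locally administered" if is_locally_administered(mac) else "vendor-assigned"
    print(mac, kind)
```

Running this over the MACs that show up in conflict log lines would indicate whether the conflicts are concentrated on randomized clients or spread across all device types.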
3. wait-backup-ack Defaults to True With No GUI Exposure
The default wait-backup-ack: true causes the primary to hold DHCP responses until the standby acknowledges each lease update. When the standby is in a conflict state or restarting, this blocks all DHCP responses to clients, which is a complete outage rather than a degraded state. This setting is not exposed anywhere in the pfSense GUI and requires direct editing of /etc/inc/services.inc to change.
4. ip-reservations-unique Hardcoded to False
The pfSense-generated kea-dhcp4.conf hardcodes ip-reservations-unique: false with three identifier types (hw-address, client-id, duid). For networks with no static reservations this is unnecessary and actively harmful: it allows Kea to maintain multiple conflicting lease records for the same client across different identifier types, compounding the MAC rotation problem. This setting is also not exposed in the GUI.
5. Low Default HA Thresholds
The default max-rejected-lease-updates of 10 (GUI default 15) is far too low for environments with Apple devices or infrastructure that reprovisions simultaneously. On a busy morning, this threshold is reached within seconds, causing the HA pair to transition to the terminated state and stop serving leases to all clients.
6. No Automatic Post-Restart Resync
When a node restarts and rejoins the HA pair, there is no automatic mechanism to sync the lease database from the primary to the standby before the standby begins processing lease updates. The standby immediately starts rejecting updates it cannot reconcile, rather than completing a sync first.
Impact
- Complete DHCP outage for all clients during morning peak hours
- Manual intervention required daily (service restart or forced ha-sync)
- Restarting services to resolve conflicts triggers additional conflict storms, worsening the outage
- ISP environment with paying customers affected
Workarounds Applied
The following changes partially mitigated the issue but did not fully resolve it:
| Change | Method | Result |
|---|---|---|
| Lease times reduced to 14400 (4hr) | pfSense GUI | Reduced overnight stale lease accumulation |
| Max Rejected Updates raised to 100 | pfSense GUI | Prevented premature terminated state |
| Max Unacked Clients raised to 50 | pfSense GUI | Reduced false partner-down transitions |
| wait-backup-ack: false | Direct edit of /etc/inc/services.inc | Prevented client blocking during standby issues |
| ip-reservations-unique: true | Direct edit of /etc/inc/services.inc | Reduced duplicate lease record conflicts |
| Static reservations for all infrastructure APs | pfSense GUI | Eliminated AP reprovisioning conflicts |
| Kea HA disabled entirely | pfSense GUI | Final resolution: single node now serving DHCP |

Note: The services.inc edits are overwritten by firmware upgrades and require reapplication after every pfSense update.
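For reference, the workaround values above can be expressed as the Kea keys they appear to correspond to. This is a sketch, not the configuration pfSense generates: the mapping from GUI labels to Kea keys (for example, lease time to valid-lifetime) is an assumption, the library path is a placeholder, and the fragment omits the server name, mode, and peer list that a working HA configuration needs.

```python
import json

# Kea-side keys the workarounds appear to map to (assumed mapping).
global_overrides = {
    "valid-lifetime": 14400,          # lease time reduced to 4 hours
    "ip-reservations-unique": True,
}
ha_overrides = {
    "max-rejected-lease-updates": 100,
    "max-unacked-clients": 50,
    "wait-backup-ack": False,
}


def build_fragment(global_params: dict, ha_params: dict) -> dict:
    """Nest HA settings inside the HA hook's parameters, globals under Dhcp4."""
    return {
        "Dhcp4": {
            **global_params,
            "hooks-libraries": [
                {
                    # Placeholder path, not the path pfSense uses.
                    "library": "<path-to>/libdhcp_ha.so",
                    "parameters": {"high-availability": [ha_params]},
                }
            ],
        }
    }


print(json.dumps(build_fragment(global_overrides, ha_overrides), indent=2))
```

Printing the fragment makes it easy to compare against the generated kea-dhcp4.conf after a firmware upgrade and see which of the manual edits have been reverted.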
Suggested Fixes / Feature Requests
Fix 1 — Automatic Resync on Standby Recovery
When the standby node transitions from any non-operational state back to ready, it should automatically perform a full lease sync from the primary before entering load-balancing or hot-standby mode. This would prevent the database divergence that causes conflict storms after any restart.
Kea config parameter: sync-leases: true should be enforced on standby recovery, not just at initial startup.
Fix 2 — Expose wait-backup-ack in the GUI
wait-backup-ack should be a configurable option in the pfSense HA settings GUI. The default of true is inappropriate for most production environments: when the standby has issues, clients should not be blocked from receiving IP addresses. Defaulting to false or exposing the setting prominently would prevent outages caused by this behavior.
Suggested GUI location: Services > DHCP Server > Settings > High Availability > Advanced Options
Fix 3 — Expose ip-reservations-unique in the GUI
ip-reservations-unique should not be hardcoded to false in services.inc. For networks without static reservations this setting actively causes harm. It should default to true and only be set to false when the operator explicitly configures multiple reservation identifier types.
Suggested GUI location: Services > DHCP Server > Settings > General or Advanced
Fix 4 — Nightly Lease Reclamation and Resync
pfSense should provide a built-in scheduled task option to perform lease reclamation and ha-sync from primary to standby on a configurable schedule (e.g., 4am daily). This would clear stale lease records before morning peak hours and prevent the overnight accumulation that causes conflict storms.
Suggested GUI location: Services > DHCP Server > Settings > Maintenance
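Until such an option exists, the two steps could in principle be scripted. The sketch below only builds the JSON command payloads for Kea's leases-reclaim and ha-sync commands; how they are delivered (control socket or Control Agent, which additionally expects a service field) depends on the installation and is left out. The function names are illustrative, and the max-period value of 60 seconds is an assumed example rather than a recommendation.

```python
import json


def leases_reclaim_command(remove: bool = True) -> dict:
    """Build a leases-reclaim command; remove=True deletes reclaimed leases."""
    return {"command": "leases-reclaim", "arguments": {"remove": remove}}


def ha_sync_command(partner_name: str, max_period: int = 60) -> dict:
    """Build an ha-sync command: the receiving server pulls leases from partner_name."""
    return {
        "command": "ha-sync",
        "arguments": {"server-name": partner_name, "max-period": max_period},
    }


# Proposed nightly sequence: reclaim expired leases, then have the standby
# resync from the primary (pfs601i is the primary in this report).
nightly = [leases_reclaim_command(), ha_sync_command("pfs601i")]
for cmd in nightly:
    print(json.dumps(cmd))
```

In this arrangement the ha-sync command would be sent to the standby, since the server that receives it is the one that fetches leases from the named partner.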
Fix 5 — Raise Default HA Thresholds
The default max-rejected-lease-updates of 10 is too aggressive for networks with any significant client churn (Apple devices, BYOD, IoT). Recommend raising the default to at least 50, with clear documentation on the consequences of the terminated state.
Fix 6 — Conflict Resolution on Resync
When ha-sync is executed, conflicts on the standby should be automatically resolved in favor of the primary rather than requiring manual lease4-del commands for each conflicting record. The primary should be authoritative and the standby should accept its state unconditionally during a sync operation.
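To illustrate the manual step this fix would remove, the sketch below generates one lease4-del command per conflicting address. It assumes the conflicting IPs have already been collected (here, the single address from the log excerpt in this report); the helper names are illustrative, and delivery of the commands to the standby is not shown.

```python
import ipaddress
import json


def lease4_del_command(ip_address: str) -> dict:
    """Build a lease4-del command for one IPv4 address (raises ValueError if invalid)."""
    ipaddress.IPv4Address(ip_address)  # validate before building the command
    return {"command": "lease4-del", "arguments": {"ip-address": ip_address}}


def cleanup_commands(conflicting_ips):
    """Return one command per unique address, preserving first-seen order."""
    return [lease4_del_command(ip) for ip in dict.fromkeys(conflicting_ips)]


# Address taken from the log excerpt in this report, repeated to show deduplication.
for cmd in cleanup_commands(["10.200.14.140", "10.200.14.140"]):
    print(json.dumps(cmd))
```

With dozens of stale records each morning, this is the per-record work that an authoritative sync from the primary would make unnecessary.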
Relevant Log Entries
Typical morning conflict storm (pfs601i logs):
WARN [kea-dhcp4.ha-hooks] HA_LEASE_UPDATE_CONFLICT pfs601i: lease update [hwtype=1 60:3e:5f:80:81:8b], cid=[01:60:3e:5f:80:81:8b] sent to pfs602i returned conflict status code: ResourceBusy: IP address:10.200.14.140 could not be updated. (error code 4)
Standby rejecting updates after restart (pfs602i logs):
WARN [kea-dhcp4.lease-cmds-hooks] LEASE_CMDS_UPDATE4_CONFLICT lease4-update command failed due to conflict (parameters: { "hostname": "pauls-mbp", "hw-address": "60:3e:5f:80:81:8b", "ip-address": "10.200.14.140", "origin": "ha-partner", "valid-lft": 86400 }, reason: ResourceBusy: IP address:10.200.14.140 could not be updated.)
Heartbeat failure triggering cascade:
WARN [kea-dhcp4.ha-hooks] HA_HEARTBEAT_COMMUNICATIONS_FAILED pfs602i: failed to send heartbeat to pfs601i: Operation timed out
WARN [kea-dhcp4.ha-hooks] HA_COMMUNICATION_INTERRUPTED pfs602i: communication with pfs601i is interrupted
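To gauge the size of a conflict storm, lines like the first entry above can be tallied per address. The sketch below is a rough aid that assumes the log line layout matches the excerpt exactly; the regular expression and function name are illustrative and may need adjusting for other Kea versions or syslog prefixes.

```python
import re
from collections import Counter

# Pattern written against the HA_LEASE_UPDATE_CONFLICT line quoted in this report.
CONFLICT_RE = re.compile(
    r"HA_LEASE_UPDATE_CONFLICT (?P<local>\S+): lease update "
    r"\[hwtype=\d+ (?P<mac>[0-9a-fA-F:]+)\].*?sent to (?P<partner>\S+) "
    r"returned conflict.*?IP address:(?P<ip>[0-9.]+)"
)


def count_conflicts(lines):
    """Count conflict log lines per (IP address, MAC address) pair."""
    counts = Counter()
    for line in lines:
        match = CONFLICT_RE.search(line)
        if match:
            counts[(match["ip"], match["mac"])] += 1
    return counts


sample = [
    "WARN [kea-dhcp4.ha-hooks] HA_LEASE_UPDATE_CONFLICT pfs601i: lease update "
    "[hwtype=1 60:3e:5f:80:81:8b], cid=[01:60:3e:5f:80:81:8b] sent to pfs602i "
    "returned conflict status code: ResourceBusy: IP address:10.200.14.140 "
    "could not be updated. (error code 4)",
    "WARN [kea-dhcp4.ha-hooks] HA_COMMUNICATION_INTERRUPTED pfs602i: "
    "communication with pfs601i is interrupted",
]
for (ip, mac), n in count_conflicts(sample).items():
    print(ip, mac, n)
```

Comparing the total against max-rejected-lease-updates shows how quickly a given morning's traffic would push the pair toward the terminated state.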
Community Reports of Same Issue
Multiple users are reporting the same Kea HA instability:
- https://forum.netgate.com/topic/197056/kea-dhcp-server-in-ha-mode-drops-50-of-dhcp-requests
- https://forum.netgate.com/topic/187408/so-many-issues-with-kea-dhcp
- https://forum.netgate.com/topic/188337/kea-dhcp-stops-working
- https://forum.pfsense.com/topic/195347/seeing-kea-dhcp-issues-after-upgrade-to-24-11/20
- https://redmine.pfsense.org/issues/15956
- https://redmine.pfsense.org/issues/15328
Current Status
Kea HA has been disabled. pfs601i is serving DHCP as a single node. DHCP is stable but redundancy has been sacrificed. We will re-enable HA when the above issues are addressed in a future pfSense release.
We are happy to provide full debug logs if useful for diagnosis.
-
Can the wait-backup-ack and ip-reservations-unique parameters not be modified via webConfigurator's 'Custom Configuration > JSON Configuration'? Assuming they can, what issue/s would be left?
-
@tinfoilmatt Fixing the failover so the damn service works in HA mode. It doesn’t work reliably, period. Read the issues others are having. Up to half the DHCP requests fail and the system goes into panic mode because of race conditions & sync failures. Read the whole analysis. It’s horribly broken.
-
I read the whole analysis, both here and on your Redmine. (You should also respond to @cmcdonald, one of the Netgate devs, on the other Redmine you left a comment on, that you intended "the latest release" to mean CE 2.8.1).
Do you have any Kea custom configuration entered whatsoever?
-
@tinfoilmatt No, every time I enter custom JSON and save it, the config breaks & the service won’t start. I manually modified the files that build the configs & verified the changes were in place. The service is unreliable in HA configuration no matter what settings are in place.
-
every time I enter custom JSON and save it, the config breaks & the service won’t start
There's a likelihood that that's a syntax issue. Could you post what you've tried so far, and also confirm which DHCP Server 'tab' it's being entered under?
For example (and don't quote me on this, since it's based only on some cursory reading I did after reading your analysis earlier), wait-backup-ack appears to be a global parameter and would therefore go under the "Settings" tab (i.e., the "Dhcp4" section of the Kea config file). The ip-reservations-unique parameter may or may not be different.
Once you enter any custom configuration and click Save, you should always manually review the generated Kea config to ensure proper syntax.
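A small script can make that review less error-prone. The sketch below parses a config as strict JSON and reports where the settings from this thread ended up. It assumes upstream Kea's layout, where HA settings such as wait-backup-ack sit inside the HA hook library's parameters while ip-reservations-unique sits directly under Dhcp4. The function name and the sample text are made up for illustration, and a config containing comments would need them stripped first because json.loads rejects them.

```python
import json

HA_KEYS = ("wait-backup-ack", "max-rejected-lease-updates", "max-unacked-clients")


def ha_settings(config_text: str) -> dict:
    """Report the discussed settings; raises ValueError if the JSON is malformed."""
    dhcp4 = json.loads(config_text)["Dhcp4"]
    found = {"ip-reservations-unique": dhcp4.get("ip-reservations-unique")}
    for hook in dhcp4.get("hooks-libraries", []):
        for ha in hook.get("parameters", {}).get("high-availability", []):
            for key in HA_KEYS:
                found[key] = ha.get(key)
    return found


# Invented sample config, not output from pfSense.
sample = """
{"Dhcp4": {"ip-reservations-unique": true,
           "hooks-libraries": [{"library": "libdhcp_ha.so",
             "parameters": {"high-availability": [
               {"wait-backup-ack": false, "max-rejected-lease-updates": 100}]}}]}}
"""
print(ha_settings(sample))
```

A value of None in the result means the key was not present at that location, which is a hint that the custom JSON was merged somewhere other than intended.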
-
@tinfoilmatt Still doesn't change the fact that the sync between nodes is broken. The minute it fails to communicate to the other node, all hell breaks loose. The master quits doling out IP addresses leading to a restart of the service, but the hung service doesn't exit gracefully before the new instance starts. The HA service is completely unreliable and has gotten worse with the latest version. I also had to hunt down misconfigurations in service pointers and correct them in the XML templates. So overall, not much attention has been given to this critical service. Anyone out there willing to dig in and fix it?