Netgate Discussion Forum

    Kea DHCP Server Constantly Fails to Give out Leases in HA Mode - Unusable in Production in HA

    • N0m0fud

      Netgate Support Ticket — Kea DHCP HA Lease Conflict Failures

      Environment

      • pfSense Version: CE 2.8.1-RELEASE
      • Hardware: Two-node HA pair
      • Primary (pfs601i): 192.168.156.1 / 10.200.0.2
      • Standby (pfs602i): 192.168.156.2 / 10.200.0.3
      • HA Mode: Hot-standby
      • DHCP Backend: Kea DHCP
      • Subnets: 10.200.0.0/16, 172.16.108.0/22, 172.16.0.0/22, 192.168.200.0/24
      • Client Base: Mixed — Apple devices (iOS 14+/macOS Ventura+ with MAC randomization), Android, Windows, UniFi U7 Pro APs

      Problem Description

      Every morning during peak hours, users are unable to obtain IP addresses via DHCP, resulting in complete loss of connectivity. The failures are caused by HA_LEASE_UPDATE_CONFLICT errors with ResourceBusy (error code 4) on the standby node, causing the HA pair to accumulate conflicts until it reaches the terminated state and stops serving leases entirely.

      The issue is reproducible daily and affects all client types on the primary subnet.


      Root Causes Identified

      1. HA Resync Failure After Any Interruption

      Any restart of either Kea node — whether planned or unplanned — causes the lease databases to diverge. The standby node does not automatically resync from the primary when it comes back online. Subsequent lease update attempts from the primary are rejected by the standby with ResourceBusy because the standby holds conflicting lease records. There is no automatic mechanism to resolve this divergence without manual intervention.
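A minimal sketch of how the divergence described above could be spotted, assuming the lease lists from both nodes have already been fetched (for example with Kea's lease4-get-all command) and parsed into Python dictionaries. The function name and the second MAC address are hypothetical. Treating "same IP, different hw-address" as the conflict is only one plausible form of divergence, not a confirmed diagnosis.

```python
def find_divergent(primary_leases, standby_leases):
    """Return (ip, primary_mac, standby_mac) for IPs held by different clients on each node."""
    standby_by_ip = {lease["ip-address"]: lease for lease in standby_leases}
    conflicts = []
    for lease in primary_leases:
        other = standby_by_ip.get(lease["ip-address"])
        # Only report IPs that both nodes know about but attribute to different MACs.
        if other is not None and other["hw-address"] != lease["hw-address"]:
            conflicts.append(
                (lease["ip-address"], lease["hw-address"], other["hw-address"])
            )
    return conflicts


# IP and first MAC are taken from the log excerpt in this post; the second MAC is made up.
primary = [{"ip-address": "10.200.14.140", "hw-address": "60:3e:5f:80:81:8b"}]
standby = [{"ip-address": "10.200.14.140", "hw-address": "02:00:00:aa:bb:cc"}]
print(find_divergent(primary, standby))
```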

      2. Apple MAC Address Randomization Interaction

      iOS 14+ and macOS Ventura+ rotate MAC addresses per network by default. With the default 24-hour maximum lease time, old leases from rotated MACs accumulate in the standby's database overnight. When clients reconnect in the morning with new MACs, the primary attempts to issue new leases but the standby rejects updates because it still holds the old conflicting records.
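The overnight timing can be checked with simple arithmetic. The sketch below assumes a lease granted at 18:00 and a client returning at 08:00 the next day. Those times and dates are hypothetical; the two lifetimes (86400 s and 14400 s) are the ones mentioned in this post.

```python
from datetime import datetime, timedelta


def lease_still_held(granted_at, valid_lft_s, checked_at):
    """True if a lease granted at granted_at has not yet expired at checked_at."""
    return checked_at < granted_at + timedelta(seconds=valid_lft_s)


evening = datetime(2025, 1, 6, 18, 0)  # hypothetical grant time
morning = datetime(2025, 1, 7, 8, 0)   # hypothetical reconnect time

print(lease_still_held(evening, 86400, morning))  # True: 24 h lease is still active
print(lease_still_held(evening, 14400, morning))  # False: 4 h lease expired at 22:00
```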

      3. wait-backup-ack Defaults to True With No GUI Exposure

      The default wait-backup-ack: true causes the primary to hold DHCP responses until the standby acknowledges each lease update. When the standby is in a conflict state or restarting, this blocks all DHCP responses to clients — a complete outage rather than a degraded state. This setting is not exposed anywhere in the pfSense GUI and requires direct editing of /etc/inc/services.inc to change.
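For reference, a sketch of where this setting sits in upstream Kea: wait-backup-ack is a parameter of the HA hook library, inside its "high-availability" block. The Python below only builds that fragment as a dictionary and prints it as JSON. The port number, the choice of the 192.168.156.x addresses as peer URLs, and the function name are assumptions; how pfSense merges such a fragment into the generated config is not covered here.

```python
import json


def ha_parameters(this_server, wait_backup_ack=False,
                  max_rejected=100, max_unacked=50):
    """Build the 'parameters' value for the Kea HA hook library (sketch)."""
    return {
        "high-availability": [{
            "this-server-name": this_server,
            "mode": "hot-standby",
            "sync-leases": True,
            "wait-backup-ack": wait_backup_ack,
            "max-rejected-lease-updates": max_rejected,
            "max-unacked-clients": max_unacked,
            "peers": [
                # Port 8765 is a placeholder, not a value taken from this thread.
                {"name": "pfs601i", "url": "http://192.168.156.1:8765/", "role": "primary"},
                {"name": "pfs602i", "url": "http://192.168.156.2:8765/", "role": "standby"},
            ],
        }]
    }


print(json.dumps(ha_parameters("pfs601i"), indent=2))
```

The defaults in this sketch mirror the workaround values listed later in the post (100 rejected updates, 50 unacked clients), not Kea's or pfSense's own defaults.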

      4. ip-reservations-unique Hardcoded to False

      The pfSense-generated kea-dhcp4.conf hardcodes ip-reservations-unique: false with three identifier types (hw-address, client-id, duid). For networks with no static reservations this is unnecessary and actively harmful — it allows Kea to maintain multiple conflicting lease records for the same client across different identifier types, compounding the MAC rotation problem. This setting is also not exposed in the GUI.
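A small sketch for confirming what the generated config actually contains, assuming the file has already been parsed into a dictionary. The function name is hypothetical. A value of None means the key is absent and Kea's built-in default applies.

```python
def reservation_settings(config):
    """Extract reservation-related globals from a parsed kea-dhcp4.conf."""
    dhcp4 = config.get("Dhcp4", {})
    return {
        "ip-reservations-unique": dhcp4.get("ip-reservations-unique"),
        "host-reservation-identifiers": dhcp4.get("host-reservation-identifiers"),
    }


# Mirrors the values this post reports for the pfSense-generated file.
generated = {
    "Dhcp4": {
        "ip-reservations-unique": False,
        "host-reservation-identifiers": ["hw-address", "client-id", "duid"],
    }
}
print(reservation_settings(generated))
```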

      5. Low Default HA Thresholds

      The default max-rejected-lease-updates of 10 (GUI default 15) is far too low for environments with Apple devices or infrastructure that reprovisions simultaneously. On a busy morning, this threshold is reached within seconds, causing the HA pair to transition to the terminated state and stop serving leases to all clients.
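A rough back-of-the-envelope sketch of how quickly the threshold can be hit. It uses a simplified model in which every rejected update counts toward the limit, and the rate of 5 rejections per second is a hypothetical figure, not a measurement from this network.

```python
def seconds_until_threshold(threshold, rejected_per_second):
    """Time for rejected updates to reach the threshold at a constant rate."""
    if rejected_per_second <= 0:
        raise ValueError("rejected_per_second must be positive")
    return threshold / rejected_per_second


for threshold in (10, 15, 100):
    # 5 rejections/second is an assumed morning-peak rate.
    print(threshold, seconds_until_threshold(threshold, 5))
```

Under that assumption a limit of 10 lasts 2 seconds, while a limit of 100 lasts 20 seconds, which buys time but does not remove the underlying conflicts.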

      6. No Automatic Post-Restart Resync

      When a node restarts and rejoins the HA pair, there is no automatic mechanism to sync the lease database from the primary to the standby before the standby begins processing lease updates. The standby immediately starts rejecting updates it cannot reconcile, rather than completing a sync first.


      Impact

      • Complete DHCP outage for all clients during morning peak hours
      • Manual intervention required daily (service restart or forced ha-sync)
      • Restarting services to resolve conflicts triggers additional conflict storms, worsening the outage
      • ISP environment with paying customers affected

      Workarounds Applied

      The following changes partially mitigated the issue but did not fully resolve it:

      Change | Method | Result
      Lease times reduced to 14400 s (4 h) | pfSense GUI | Reduced overnight stale lease accumulation
      Max Rejected Updates raised to 100 | pfSense GUI | Prevented premature terminated state
      Max Unacked Clients raised to 50 | pfSense GUI | Reduced false partner-down transitions
      wait-backup-ack: false | Direct edit of /etc/inc/services.inc | Prevented client blocking during standby issues
      ip-reservations-unique: true | Direct edit of /etc/inc/services.inc | Reduced duplicate lease record conflicts
      Static reservations for all infrastructure APs | pfSense GUI | Eliminated AP reprovisioning conflicts
      Kea HA disabled entirely | pfSense GUI | Final resolution: single node now serving DHCP

      Note: The services.inc edits are overwritten by firmware upgrades and require reapplication after every pfSense update.


      Suggested Fixes / Feature Requests

      Fix 1 — Automatic Resync on Standby Recovery

      When the standby node transitions from any non-operational state back to ready, it should automatically perform a full lease sync from the primary before entering load-balancing or hot-standby mode. This would prevent the database divergence that causes conflict storms after any restart.

      Kea config parameter: sync-leases: true should be enforced on standby recovery, not just at initial startup.

      Fix 2 — Expose wait-backup-ack in the GUI

      wait-backup-ack should be a configurable option in the pfSense HA settings GUI. The default of true is inappropriate for most production environments — when the standby has issues, clients should not be blocked from receiving IP addresses. Defaulting to false or exposing the setting prominently would prevent outages caused by this behavior.

      Suggested GUI location: Services > DHCP Server > Settings > High Availability > Advanced Options

      Fix 3 — Expose ip-reservations-unique in the GUI

      ip-reservations-unique should not be hardcoded to false in services.inc. For networks without static reservations this setting actively causes harm. It should default to true and only be set to false when the operator explicitly configures multiple reservation identifier types.

      Suggested GUI location: Services > DHCP Server > Settings > General or Advanced

      Fix 4 — Nightly Lease Reclamation and Resync

      pfSense should provide a built-in scheduled task option to perform lease reclamation and ha-sync from primary to standby on a configurable schedule (e.g., 4am daily). This would clear stale lease records before morning peak hours and prevent the overnight accumulation that causes conflict storms.

      Suggested GUI location: Services > DHCP Server > Settings > Maintenance
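Until something like this exists in the GUI, the same effect could be approximated with a scheduled script. The sketch below builds Kea's leases-reclaim and ha-sync commands and posts them to the standby's command endpoint. The endpoint URL, the function names, and the assumption that the endpoint accepts plain HTTP POSTs without authentication are all hypothetical and would need checking against the actual pfSense deployment.

```python
import json
import urllib.request


def build_leases_reclaim(remove=True):
    """Command asking Kea to reclaim expired leases (remove=True deletes them)."""
    return {"command": "leases-reclaim", "service": ["dhcp4"],
            "arguments": {"remove": remove}}


def build_ha_sync(partner_name, max_period=60):
    """Command asking the receiving server to sync its leases from partner_name."""
    return {"command": "ha-sync", "service": ["dhcp4"],
            "arguments": {"server-name": partner_name, "max-period": max_period}}


def post_command(url, command, timeout=30):
    """POST one JSON command and return the decoded JSON response."""
    data = json.dumps(command).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def nightly_maintenance(standby_url, primary_name="pfs601i", send=post_command):
    """Reclaim expired leases on the standby, then resync it from the primary."""
    commands = (build_leases_reclaim(), build_ha_sync(primary_name))
    return [send(standby_url, command) for command in commands]
```

Run from cron at a quiet hour (the post suggests 4am), this would be called as nightly_maintenance with the standby's endpoint URL. The send parameter exists so the sequence can be dry-run without touching a live server.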

      Fix 5 — Raise Default HA Thresholds

      The default max-rejected-lease-updates of 10 is too aggressive for networks with any significant client churn (Apple devices, BYOD, IoT). Recommend raising the default to at least 50, with clear documentation on the consequences of the terminated state.

      Fix 6 — Conflict Resolution on Resync

      When ha-sync is executed, conflicts on the standby should be automatically resolved in favor of the primary rather than requiring manual lease4-del commands for each conflicting record. The primary should be authoritative and the standby should accept its state unconditionally during a sync operation.
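For context, the manual cleanup currently amounts to one lease4-del per conflicting address. A minimal sketch that generates those command payloads from a list of IPs follows. The helper names are hypothetical, and sending the commands to the standby is left out.

```python
def build_lease4_del(ip_address):
    """Kea lease4-del command for a single IPv4 address."""
    return {"command": "lease4-del", "service": ["dhcp4"],
            "arguments": {"ip-address": ip_address}}


def cleanup_commands(conflicting_ips):
    """One lease4-del per distinct IP, preserving first-seen order."""
    return [build_lease4_del(ip) for ip in dict.fromkeys(conflicting_ips)]


# The address is the one from the log excerpt in this post, repeated to show deduplication.
for command in cleanup_commands(["10.200.14.140", "10.200.14.140"]):
    print(command)
```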


      Relevant Log Entries

      Typical morning conflict storm (pfs601i logs):

      WARN [kea-dhcp4.ha-hooks] HA_LEASE_UPDATE_CONFLICT pfs601i: lease update 
      [hwtype=1 60:3e:5f:80:81:8b], cid=[01:60:3e:5f:80:81:8b] sent to pfs602i 
      returned conflict status code: ResourceBusy: IP address:10.200.14.140 
      could not be updated. (error code 4)
      

      Standby rejecting updates after restart (pfs602i logs):

      WARN [kea-dhcp4.lease-cmds-hooks] LEASE_CMDS_UPDATE4_CONFLICT lease4-update 
      command failed due to conflict (parameters: { "hostname": "pauls-mbp", 
      "hw-address": "60:3e:5f:80:81:8b", "ip-address": "10.200.14.140", 
      "origin": "ha-partner", "valid-lft": 86400 }, 
      reason: ResourceBusy: IP address:10.200.14.140 could not be updated.)
      

      Heartbeat failure triggering cascade:

      WARN [kea-dhcp4.ha-hooks] HA_HEARTBEAT_COMMUNICATIONS_FAILED pfs602i: 
      failed to send heartbeat to pfs601i: Operation timed out
      WARN [kea-dhcp4.ha-hooks] HA_COMMUNICATION_INTERRUPTED pfs602i: 
      communication with pfs601i is interrupted
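To gauge how bad a given morning was, the conflict lines can be tallied per IP address. This is a minimal sketch that assumes each log event sits on a single line in the real log file (the excerpts above are wrapped for display). The function name is hypothetical.

```python
import re
from collections import Counter

# Matches one HA_LEASE_UPDATE_CONFLICT event and captures the affected IPv4 address.
CONFLICT_RE = re.compile(
    r"HA_LEASE_UPDATE_CONFLICT.*?IP address:\s*(\d{1,3}(?:\.\d{1,3}){3})"
)


def count_conflicts(log_text):
    """Count HA lease update conflicts per IP address."""
    return Counter(CONFLICT_RE.findall(log_text))


sample = (
    "WARN [kea-dhcp4.ha-hooks] HA_LEASE_UPDATE_CONFLICT pfs601i: lease update "
    "[hwtype=1 60:3e:5f:80:81:8b], cid=[01:60:3e:5f:80:81:8b] sent to pfs602i "
    "returned conflict status code: ResourceBusy: IP address:10.200.14.140 "
    "could not be updated. (error code 4)\n"
)
print(count_conflicts(sample * 3))  # Counter({'10.200.14.140': 3})
```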
      

      Community Reports of Same Issue

      Multiple users are reporting the same Kea HA instability:

      • https://forum.netgate.com/topic/197056/kea-dhcp-server-in-ha-mode-drops-50-of-dhcp-requests
      • https://forum.netgate.com/topic/187408/so-many-issues-with-kea-dhcp
      • https://forum.netgate.com/topic/188337/kea-dhcp-stops-working
      • https://forum.pfsense.com/topic/195347/seeing-kea-dhcp-issues-after-upgrade-to-24-11/20
      • https://redmine.pfsense.org/issues/15956
      • https://redmine.pfsense.org/issues/15328

      Current Status

      Kea HA has been disabled. pfs601i is serving DHCP as a single node. DHCP is stable but redundancy has been sacrificed. We will re-enable HA when the above issues are addressed in a future pfSense release.

      We are happy to provide full debug logs if useful for diagnosis.

      • tinfoilmatt

        Can the wait-backup-ack and ip-reservations-unique parameters not be modified via webConfigurator's 'Custom Configuration > JSON Configuration'? Assuming they can, what issue/s would be left?

        • N0m0fud @tinfoilmatt

          @tinfoilmatt Fixing the failover so the damn service works in HA mode. It doesn’t work reliably, period. Read the issues others are having. Up to half the DHCP requests fail and the system goes into panic mode because of race conditions & sync failures. Read the whole analysis. It’s horribly broken.

          • tinfoilmatt @N0m0fud

            I read the whole analysis, both here and on your Redmine. (You should also respond to @cmcdonald, one of the Netgate devs, on the other Redmine you left a comment on, to confirm that by "the latest release" you meant CE 2.8.1.)

            Do you have any Kea custom configuration entered whatsoever?

            • N0m0fud @tinfoilmatt

              @tinfoilmatt No. Every time I enter custom JSON and save it, the config breaks & the service won’t start. I manually modified the files that build the configs & verified the changes were in place. The service is unreliable in HA configuration no matter what settings are in place.

              • tinfoilmatt @N0m0fud

                "every time I enter custom JSON and save it, the config breaks & the service won’t start"

                That's likely a syntax issue. Could you post what you've tried so far, and also confirm which DHCP Server 'tab' it's being entered under?

                For example (and don't quote me on this, since it's based only on some cursory reading I did after reading your analysis earlier), wait-backup-ack appears to be a global parameter and would therefore go under the "Settings" tab (i.e., the "Dhcp4" section of the Kea config file). The ip-reservations-unique parameter may or may not be different.

                Once you enter any custom configuration and click Save, you should always manually review the generated Kea config to ensure proper syntax.
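One quick way to catch a syntax slip before pasting is to run the text through a strict JSON parser. This is a minimal sketch with a hypothetical function name. Note that Kea's own parser accepts comments that strict JSON does not, so a failure here is a hint rather than a verdict. Kea's built-in check (kea-dhcp4 -t with the config file path) is the more authoritative test of the generated file.

```python
import json


def json_problem(text):
    """Return a description of the first JSON syntax error, or None if the text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


print(json_problem('{"wait-backup-ack": false}'))   # None
print(json_problem('{"wait-backup-ack": false,}'))  # reports the trailing comma position
```

Trailing commas and missing quotes around keys are the usual culprits when a custom fragment stops the service from starting.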

                • N0m0fud @tinfoilmatt

                  @tinfoilmatt Still doesn't change the fact that the sync between nodes is broken. The minute it fails to communicate with the other node, all hell breaks loose. The master quits doling out IP addresses, leading to a restart of the service, but the hung service doesn't exit gracefully before the new instance starts. The HA service is completely unreliable and has gotten worse with the latest version. I also had to hunt down misconfigurations in service pointers and correct them in the XML templates. So overall, not much attention has been given to this critical service. Anyone out there willing to dig in and fix it?

                  Copyright 2026 Rubicon Communications LLC (Netgate). All rights reserved.