DHCP broken in HA setup w/ one node down due to a HW failure on primary node

Fluidtime_Support

Hello,

we have an issue w/ our DHCP service in an HA setup w/ two pfSense SG-4860 firewalls. A few days ago (on January 4th, 2021) the primary node had a hardware failure and had to be taken offline. The secondary pfSense node took over all CARP VIPs and everything seemed to be working as expected.

But yesterday (January 12th) clients suddenly stopped getting new IP addresses via DHCP. We tried to debug and troubleshoot the issue but were not able to get the service up and running again.
The Status => DHCP Leases shows this state for the two Pools:

The unknown state of the peer seems legit as the primary firewall is not reachable due to HW replacement. But we'd like to know if there is a possibility to bring the state to normal on the secondary node now that we know that the other node won't be available for another few days.

In the dhcpd.log a lot of these entries are visible:

clog /var/log/dhcpd.log | less

Jan 13 09:45:18 firewall_name dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb3: peer holds all free leases
Jan 13 09:45:19 firewall_name dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb3: peer holds all free leases
Jan 13 09:45:19 firewall_name dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb4.50: peer holds all free leases
Jan 13 09:45:24 firewall_name dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb3: peer holds all free leases
Jan 13 09:45:26 firewall_name dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb3: peer holds all free leases

=> So we assume we need to somehow tell the secondary node that it is alone now and can assign leases w/o respecting leases of the primary (now offline!) node.
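
To see which interfaces are affected and how often, the log lines above can be tallied with a small script. This is only a rough sketch: it assumes the log has been exported to plain text (e.g. from the clog output shown above), and the function name count_refused_discovers is made up for illustration.

```python
import re
from collections import Counter

# Matches dhcpd lines like the ones quoted above, e.g.
# "... dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb3: peer holds all free leases"
PATTERN = re.compile(
    r"DHCPDISCOVER from (\S+) via (\S+): peer holds all free leases"
)


def count_refused_discovers(lines):
    """Return a Counter mapping interface name -> refused DHCPDISCOVER count."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(2)] += 1  # group 2 is the interface name
    return counts


# The five log lines from the post, used as sample input.
sample = [
    "Jan 13 09:45:18 firewall_name dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb3: peer holds all free leases",
    "Jan 13 09:45:19 firewall_name dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb3: peer holds all free leases",
    "Jan 13 09:45:19 firewall_name dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb4.50: peer holds all free leases",
    "Jan 13 09:45:24 firewall_name dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb3: peer holds all free leases",
    "Jan 13 09:45:26 firewall_name dhcpd: DHCPDISCOVER from MAC-ADDRESS via igb3: peer holds all free leases",
]

per_interface = count_refused_discovers(sample)
for iface, count in sorted(per_interface.items()):
    print(iface, count)
```

In the excerpt, both DHCP-enabled interfaces (igb3 and igb4.50) show refused requests, which fits both failover pools being stuck.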

On startup these state transitions are visible in the log:

Jan 13 06:52:46 firewall_name dhcpd: Listening on BPF/igb3/00:08:a2:09:a6:6a/10.220.4.0/23
Jan 13 06:52:46 firewall_name dhcpd: Sending on   BPF/igb3/00:08:a2:09:a6:6a/10.220.4.0/23
Jan 13 06:52:46 firewall_name dhcpd: Listening on BPF/igb4.50/00:08:a2:09:a6:6b/10.220.2.0/24
Jan 13 06:52:46 firewall_name dhcpd: Sending on   BPF/igb4.50/00:08:a2:09:a6:6b/10.220.2.0/24
Jan 13 06:52:46 firewall_name dhcpd: Sending on   Socket/fallback/fallback-net
Jan 13 06:52:46 firewall_name dhcpd: failover peer dhcp_opt2: I move from recover to startup
Jan 13 06:52:46 firewall_name dhcpd: failover peer dhcp_opt5: I move from recover to startup
Jan 13 06:52:46 firewall_name dhcpd: Server starting service.
Jan 13 06:53:01 firewall_name dhcpd: failover peer dhcp_opt2: I move from startup to recover
Jan 13 06:53:01 firewall_name dhcpd: failover peer dhcp_opt5: I move from startup to recover

=> and the state never changes from recover to normal (we waited for many hours and restarted again,...)
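
To check the latest failover state per pool without reading the whole log, something like the following could be used. Again just a sketch under the same assumption (log exported as plain text); last_failover_states is a hypothetical helper name, and the regular expression is based only on the log lines quoted above.

```python
import re

# Matches dhcpd state transition lines as quoted above, e.g.
# "... dhcpd: failover peer dhcp_opt2: I move from startup to recover"
TRANSITION = re.compile(r"failover peer (\S+): I move from (\S+) to (\S+)")


def last_failover_states(lines):
    """Return a dict mapping failover peer name -> most recent local state."""
    states = {}
    for line in lines:
        match = TRANSITION.search(line)
        if match:
            peer, _old_state, new_state = match.groups()
            states[peer] = new_state  # later lines overwrite earlier ones
    return states


# The four transition lines from the startup log in the post.
sample = [
    "Jan 13 06:52:46 firewall_name dhcpd: failover peer dhcp_opt2: I move from recover to startup",
    "Jan 13 06:52:46 firewall_name dhcpd: failover peer dhcp_opt5: I move from recover to startup",
    "Jan 13 06:53:01 firewall_name dhcpd: failover peer dhcp_opt2: I move from startup to recover",
    "Jan 13 06:53:01 firewall_name dhcpd: failover peer dhcp_opt5: I move from startup to recover",
]

states = last_failover_states(sample)
not_normal = sorted(peer for peer, state in states.items() if state != "normal")
print(states)
print("not in normal state:", not_normal)
```

For the log above, both dhcp_opt2 and dhcp_opt5 end up in recover, matching what we see.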

At the moment we are using manual static address assignments on the clients as workaround.

Does anybody know what the best solution for us would be in this situation? Without totally breaking CARP as this would affect all users connected via VPN.

Could removing the Failover peer IP in the DHCP service configuration help?
Or could this help (could be done only on the secondary node):
https://docs.netgate.com/pfsense/en/latest/troubleshooting/ha-dhcp-failover.html

If all else fails, stop the DHCP daemon on both nodes, remove the DHCP lease database from /var/dhcpd/var/db/dhcpd.leases, then start the daemons again.

Unfortunately we do not have the dhcpd.log entries from the time it actually occurred, but I think this could be related to the primary node being down and the DHCP leases then expiring.

Any help would be greatly appreciated!
Regards,
Guenther

Fluidtime_Support

Hello,
does nobody know what we could do?
Could this be a valid approach, given that we know the downtime of our primary node will last longer?
https://forum.netgate.com/topic/82185/carp-promote-backup?_=1610532552415

If you know the primary node will be gone for quite some time, just grab a backup off both units, power it off, restore the primary backup file to the secondary, and now you just took your "secondary (formerly known as the primary)" is down for maintenance. :-) When the time comes to switch back, you could either restore the secondary node config to the repaired unit or swap them back around.

Thanks,
Guenther

Fluidtime_Support

Just to report what we did to re-establish DHCP service functionality with the one remaining working node:

In a maintenance window we first stopped the DHCP daemon on the only remaining working node, removed the /var/dhcpd/var/db/dhcpd.leases file and started the DHCP daemon again. This did not change anything; the same error messages appeared as before.

Then we went through the DHCP server settings and removed the "Failover peer IP" entry for every pool. After starting the DHCP server, it acted as a single DHCP server and started to issue IP addresses again as expected. The error message "peer holds all free leases" also went away.
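
One way to double-check that failover is really gone after such a change is to look for failover declarations in the generated dhcpd configuration. The sketch below rests on assumptions: that the generated file uses the usual ISC dhcpd syntax (failover peer "name" { ... }), and that its text has been read into a string. The sample config text, including the addresses, is hypothetical and heavily abbreviated, and declared_failover_peers is a made-up helper name.

```python
import re

# ISC dhcpd declares a failover relationship as: failover peer "name" { ... }
# Pools then reference it with: failover peer "name";
DECLARATION = re.compile(r'failover\s+peer\s+"([^"]+)"\s*\{')


def declared_failover_peers(config_text):
    """Return the names of all failover peer declarations found in the text."""
    return DECLARATION.findall(config_text)


# Hypothetical, abbreviated config WITH failover (addresses are invented).
with_failover = '''
failover peer "dhcp_opt2" {
  secondary;
  address 10.220.4.3;
  peer address 10.220.4.2;
  # remaining failover parameters omitted
}
subnet 10.220.4.0 netmask 255.255.254.0 {
  pool {
    failover peer "dhcp_opt2";
    # range omitted
  }
}
'''

# Hypothetical config after removing the failover peer IP from the pool.
without_failover = '''
subnet 10.220.4.0 netmask 255.255.254.0 {
  pool {
    # range omitted
  }
}
'''

print(declared_failover_peers(with_failover))
print(declared_failover_peers(without_failover))
```

An empty result for the live configuration would suggest dhcpd is running standalone, which is consistent with the "peer holds all free leases" messages disappearing.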

So this is fine for us in the current situation. When the primary pfSense has been repaired and put back into production, it will sync the settings and start with the DHCP failover pools again.

This issue can be closed. Thanks.