[SOLVED] DHCPd failover group issues

turtle

Hey,

I've got a pfSense HA setup with two machines and multiple VLANs.

Everything is working fine, except the DHCP failover groups.

When I create a failover group, its state is always stuck in "recover", "recover-done" or "recover-wait" and never changes, no matter how long I wait.

I also see some inconsistent behavior: sometimes, when I add another failover group on another VLAN, the state changes to "normal" on both peers.
This appears totally random, and the state goes back to "recover", "recover-done" or "recover-wait" whenever something else on the DHCP server is changed (like adding another failover group).

This is what my dhcpd.conf looks like:

Primary:

failover peer "dhcp_opt6" {
  primary;
  address 10.104.0.2;
  port 519;
  peer address 10.104.0.3;
  peer port 520;
  max-response-delay 10;
  max-unacked-updates 10;
  split 128;
  mclt 600;

  load balance max seconds 3;
}
subnet 10.104.0.0 netmask 255.255.0.0 {
  pool {
    option domain-name-servers 10.104.0.1;
    deny dynamic bootp clients;
    failover peer "dhcp_opt6";
    range 10.104.0.10 10.104.1.254;
  }

  option routers 10.104.0.1;
  option domain-name "test";
  option domain-name-servers 10.104.0.1;
  default-lease-time 86399;

}

Secondary:

failover peer "dhcp_opt6" {
  secondary;
  address 10.104.0.3;
  port 520;
  peer address 10.104.0.2;
  peer port 519;
  max-response-delay 10;
  max-unacked-updates 10;
  mclt 600;

  load balance max seconds 3;
}
subnet 10.104.0.0 netmask 255.255.0.0 {
  pool {
    option domain-name-servers 10.104.0.1;
    deny dynamic bootp clients;
    failover peer "dhcp_opt6";
    range 10.104.0.10 10.104.1.254;
  }

  option routers 10.104.0.1;
  option domain-name "test";
  option domain-name-servers 10.104.0.1;
  default-lease-time 86399;

}

Both machines can reach each other, so connectivity shouldn't be the problem.
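
For anyone who wants to verify more than a ping: the failover peers talk to each other over TCP on the ports from the configs above (519 on the primary, 520 on the secondary), so a quick connection test can confirm the peer's listener is actually reachable. This is only a minimal sketch. The function name can_connect is made up for illustration, and it assumes Python is available on a machine that sits in the same network as the nodes.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example (values from the configs above, run from the primary's side):
#   can_connect("10.104.0.3", 520)
# and from the secondary's side:
#   can_connect("10.104.0.2", 519)
```

A False result here would point at a firewall rule or a dhcpd that is not listening, rather than at the failover state machine itself.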

Thanks for your help!

turtle

I managed to fix this problem myself, but I still don't know why the failover groups initially broke.

The solution was very simple, though not very satisfying.

I deleted all failover groups and restarted both DHCP servers.
Afterwards I recreated all failover groups and restarted both DHCP servers again.

The failover groups still showed "recover" or "recover-wait" at first, but this time I waited a really long time (about 2 hours) and their state changed to "normal".

So consider my problem fixed.

jimp

When things like that happen out of the blue, the best fixes tend to be:

1. Check/fix the clocks and NTP on both nodes
2. Wipe the DHCP database and let them rebuild it

It may have been stuck for a while as they resynchronized their lease databases.
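
To watch a resync like this without clicking through the status page, the failover state can also be read from the lease database, where ISC dhcpd records "failover peer ... state" blocks. The following is a rough sketch, not an official tool. The function name failover_states is hypothetical, and the lease file path mentioned in the comment (/var/dhcpd/var/db/dhcpd.leases) is an assumption about where pfSense keeps it, so check it on your own install.

```python
import re

# Matches blocks in dhcpd.leases such as:
#   failover peer "dhcp_opt6" state {
#     my state normal at 2 2019/01/01 00:00:00;
#     partner state recover-wait at 2 2019/01/01 00:00:00;
#   }
BLOCK_RE = re.compile(
    r'failover peer "(?P<name>[^"]+)" state \{(?P<body>.*?)\}', re.DOTALL
)
STATE_RE = re.compile(r"(?P<who>my|partner) state (?P<state>[\w-]+) at")


def failover_states(leases_text: str) -> dict:
    """Return {peer_name: {"my": state, "partner": state}}.

    The lease file is written as a log, so when a peer appears more than
    once, the last block in the file is the one that is kept.
    """
    result = {}
    for block in BLOCK_RE.finditer(leases_text):
        states = {
            m.group("who"): m.group("state")
            for m in STATE_RE.finditer(block.group("body"))
        }
        result[block.group("name")] = states  # later blocks overwrite earlier ones
    return result


# Example (path is an assumption, adjust as needed):
#   with open("/var/dhcpd/var/db/dhcpd.leases") as f:
#       print(failover_states(f.read()))
```

Running this on both nodes every few minutes should show each failover group moving from "recover" or "recover-wait" towards "normal" if the resync is progressing.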