DHCP Stuck in Recover

Salmon

The DHCP server on both nodes is stuck in the 'recover' state. I searched first and have tried all the recommended stopping and restarting, but nothing seems to work.

Following https://doc.pfsense.org/index.php/Configuring_pfSense_Hardware_Redundancy_(CARP) I have set up HA between two pfSense VMs running within VirtualBox and have a Lubuntu VM within their network.

The network is set up as follows:

WAN is NAT
LAN is Internal Network 'pfsense'
OPT1 is Internal Network 'pfsenseCARP'

Version for both nodes:

Version 2.2-RELEASE (amd64)
built on Thu Jan 22 14:03:54 CST 2015
FreeBSD 10.1-RELEASE-p4

pfsense01.local: /var/dhcpd/etc/dhcpd.conf

option domain-name "local";
option ldap-server code 95 = text;
option domain-search-list code 119 = text;
option arch code 93 = unsigned integer 16; # RFC4578

default-lease-time 7200;
max-lease-time 86400;
log-facility local7;
one-lease-per-client true;
deny duplicates;
ping-check true;
update-conflict-detection false;
authoritative;
failover peer "dhcp_lan" {
primary;
address 192.168.1.1;
port 519;
peer address 192.168.1.2;
peer port 520;
max-response-delay 10;
max-unacked-updates 10;
split 128;
mclt 600;

load balance max seconds 3;
}

subnet 192.168.1.0 netmask 255.255.255.0 {
pool {
option domain-name-servers 192.168.1.10;
deny dynamic bootp clients;
failover peer "dhcp_lan";
range 192.168.1.100 192.168.1.245;
}

option routers 192.168.1.10;
option domain-name-servers 192.168.1.10;

}

pfsense02.local: /var/dhcpd/etc/dhcpd.conf

option domain-name "local";
option ldap-server code 95 = text;
option domain-search-list code 119 = text;
option arch code 93 = unsigned integer 16; # RFC4578

default-lease-time 7200;
max-lease-time 86400;
log-facility local7;
one-lease-per-client true;
deny duplicates;
ping-check true;
update-conflict-detection false;
authoritative;
failover peer "dhcp_lan" {
secondary;
address 192.168.1.2;
port 520;
peer address 192.168.1.1;
peer port 519;
max-response-delay 10;
max-unacked-updates 10;

load balance max seconds 3;
}

subnet 192.168.1.0 netmask 255.255.255.0 {
pool {
option domain-name-servers 192.168.1.10;
deny dynamic bootp clients;
failover peer "dhcp_lan";
range 192.168.1.100 192.168.1.245;
}

option routers 192.168.1.10;
option domain-name-servers 192.168.1.10;

}
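
For anyone comparing their own pair of configs, here is a minimal Python sketch that checks whether two failover declarations mirror each other the way the ones above do. The function names (parse_failover, check_pair) are made up for illustration, and the rule that split and mclt belong on the primary only is taken from the ISC dhcpd.conf documentation, not from anything pfSense-specific.

```python
import re

PRIMARY = '''failover peer "dhcp_lan" {
primary;
address 192.168.1.1;
port 519;
peer address 192.168.1.2;
peer port 520;
split 128;
mclt 600;
}'''

SECONDARY = '''failover peer "dhcp_lan" {
secondary;
address 192.168.1.2;
port 520;
peer address 192.168.1.1;
peer port 519;
}'''


def parse_failover(conf_text):
    """Return the settings of the first 'failover peer' block as a dict."""
    block = re.search(r'failover peer "([^"]+)"\s*\{(.*?)\}', conf_text, re.S)
    if block is None:
        raise ValueError("no failover peer declaration found")
    settings = {"name": block.group(1)}
    for statement in block.group(2).split(";"):
        words = statement.split()
        if not words:
            continue
        if words[0] in ("primary", "secondary"):
            settings["role"] = words[0]
        else:
            # Key is everything except the last word, e.g. "peer address".
            settings[" ".join(words[:-1])] = words[-1]
    return settings


def check_pair(primary, secondary):
    """Return a list of human-readable problems (empty if the pair mirrors)."""
    problems = []
    if primary.get("role") != "primary" or secondary.get("role") != "secondary":
        problems.append("roles should be primary / secondary")
    if primary["name"] != secondary["name"]:
        problems.append("failover peer names differ")
    for mine, theirs in (("address", "peer address"), ("port", "peer port")):
        if primary.get(mine) != secondary.get(theirs):
            problems.append(f"primary '{mine}' != secondary '{theirs}'")
        if secondary.get(mine) != primary.get(theirs):
            problems.append(f"secondary '{mine}' != primary '{theirs}'")
    for key in ("split", "mclt"):
        if key not in primary:
            problems.append(f"primary is missing '{key}'")
    return problems


print(check_pair(parse_failover(PRIMARY), parse_failover(SECONDARY)))
```

An empty list means the addresses, ports, roles, and peer name line up, which only rules out a simple mismatch as the cause of the 'recover' state.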

pfsense01.local: /cf/conf/config.xml (VirtualIP Section)

<virtualip><vip><mode>carp</mode>
<interface>lan</interface>
<vhid>1</vhid>
<advskew>0</advskew>
<advbase>1</advbase>
<password>pf</password>
<descr><type>single</type>
<subnet_bits>24</subnet_bits>
<subnet>192.168.1.10</subnet></descr></vip></virtualip>

pfsense02.local: /cf/conf/config.xml (VirtualIP Section)

<virtualip><vip><mode>carp</mode>
<interface>lan</interface>
<vhid>1</vhid>
<advskew>100</advskew>
<advbase>1</advbase>
<password>pf</password>
<descr><type>single</type>
<subnet_bits>24</subnet_bits>
<subnet>192.168.1.10</subnet></descr></vip></virtualip>

The only thread I found that talks directly about this issue is from 6 years ago. It says the problem was resolved, but it seems to be a different issue from the one I'm having. https://forum.pfsense.org/index.php?topic=18285.0

EDIT: One thing I've noticed that seems off is a lot of entries like this in the firewall log:

block/1000107060 Feb 8 12:01:19 lo0 192.168.1.1:519 192.168.1.10:59293 TCP:SA
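
To read that entry field by field, here is a small Python sketch. It assumes the whitespace-separated layout of the line as pasted above (it is not an official pfSense log parser), and the function names are hypothetical. It only shows that the blocked packet is a TCP SYN-ACK (flags SA) sent from the primary's failover port 519 to the CARP address on lo0.

```python
# Failover ports taken from the dhcpd.conf files posted above.
FAILOVER_PORTS = {"519", "520"}

LINE = "block/1000107060 Feb 8 12:01:19 lo0 192.168.1.1:519 192.168.1.10:59293 TCP:SA"


def parse_entry(line):
    """Split one pasted firewall log line into named fields."""
    fields = line.split()
    action, _, rule = fields[0].partition("/")
    src_ip, _, src_port = fields[-3].rpartition(":")
    dst_ip, _, dst_port = fields[-2].rpartition(":")
    proto, _, flags = fields[-1].partition(":")
    return {
        "action": action,
        "rule": rule,
        "interface": fields[-4],
        "src_ip": src_ip,
        "src_port": src_port,
        "dst_ip": dst_ip,
        "dst_port": dst_port,
        "proto": proto,
        "flags": flags,
    }


def looks_like_failover_reply(entry):
    """True for a TCP SYN-ACK whose source port is a dhcpd failover port."""
    return (
        entry["proto"] == "TCP"
        and entry["flags"] == "SA"
        and entry["src_port"] in FAILOVER_PORTS
    )


entry = parse_entry(LINE)
print(entry["interface"], entry["src_ip"], entry["dst_ip"],
      looks_like_failover_reply(entry))
```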

Salmon

I don't know exactly what fixed this issue but I did this:

Disabled the firewall on both nodes (pfctl -d)
Turned off the DHCP service on both
Turned on the DHCP service on node1
Waited a long time (forgot about it so was probably around 10 minutes)
Turned on the DHCP service on node2
Waited about 2 minutes
Enabled firewall (pfctl -e)

Now the DHCP service is reporting normal operation and getting DHCP leases seems to work after failover.
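
The sequence above can be written down as an ordered plan. This is only a sketch: the node names, the 600 and 120 second waits (my rough "about 10 minutes" and "about 2 minutes"), and the execute callable are assumptions for illustration. How you actually stop and start the DHCP service (GUI or otherwise) is left to whatever execute does, and re-enabling the firewall on both nodes is assumed.

```python
import time


def recovery_plan(long_wait=600, short_wait=120):
    """Ordered steps as (node, action, seconds to wait afterwards)."""
    return [
        ("node1", "pfctl -d", 0),              # disable the firewall
        ("node2", "pfctl -d", 0),
        ("node1", "stop DHCP service", 0),
        ("node2", "stop DHCP service", 0),
        ("node1", "start DHCP service", long_wait),
        ("node2", "start DHCP service", short_wait),
        ("node1", "pfctl -e", 0),              # re-enable the firewall
        ("node2", "pfctl -e", 0),
    ]


def run(plan, execute, sleep=time.sleep):
    """Run each step through the caller-supplied execute(node, action)."""
    for node, action, wait in plan:
        execute(node, action)
        if wait:
            sleep(wait)


if __name__ == "__main__":
    # Dry run: print the steps instead of touching any firewall.
    run(recovery_plan(), lambda node, action: print(node, "->", action),
        sleep=lambda seconds: print("  wait", seconds, "s"))
```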

cthomas

I have issues with this occasionally as well. Generally speaking, shutting down the dhcpd service on both firewalls and bringing them back up one at a time, about 5-10 seconds apart, seems to do the trick.

ljorgensen

It seems that DHCP failover does not work properly when a large number of leases are in use.

Nico37

Another important point to check when using DHCP failover, because it can affect the recover/normal state, is the CARP advskew (advertising skew) setting.
As mentioned in the GUI:

Ensure one machine's advskew<20 (and the other is >20).

On the CARP virtual IP, I would check whether the primary firewall respects this.
I previously had issues with the DHCP service going into recover mode because of this; since I set all the CARP VIPs on the primary node to skew 0, everything has been stable.
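
A quick way to check this against the VirtualIP sections posted earlier in the thread is a few lines of Python. This is a minimal sketch: the function names are made up, the XML samples are trimmed copies of the posted sections, and the threshold of 20 comes from the GUI note quoted above.

```python
import xml.etree.ElementTree as ET

NODE1 = ("<virtualip><vip><mode>carp</mode><interface>lan</interface>"
         "<vhid>1</vhid><advskew>0</advskew><advbase>1</advbase>"
         "</vip></virtualip>")
NODE2 = ("<virtualip><vip><mode>carp</mode><interface>lan</interface>"
         "<vhid>1</vhid><advskew>100</advskew><advbase>1</advbase>"
         "</vip></virtualip>")


def carp_advskews(xml_text):
    """Map (interface, vhid) -> advskew for every CARP VIP in the XML."""
    skews = {}
    for vip in ET.fromstring(xml_text).iter("vip"):
        if vip.findtext("mode") != "carp":
            continue
        key = (vip.findtext("interface"), vip.findtext("vhid"))
        skews[key] = int(vip.findtext("advskew"))
    return skews


def skew_ok(primary_xml, secondary_xml, threshold=20):
    """True if every CARP VIP is below the threshold on the primary
    and above it on the secondary."""
    primary = carp_advskews(primary_xml)
    secondary = carp_advskews(secondary_xml)
    return all(
        skew < threshold and key in secondary and secondary[key] > threshold
        for key, skew in primary.items()
    )


print(skew_ok(NODE1, NODE2))
```

With the values from the original post (0 on pfsense01, 100 on pfsense02) the check passes, so the advskew rule by itself does not explain that particular case.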