DHCP Stuck in Recover



  • The DHCP server on both nodes is stuck in the 'recover' state. I searched first and have tried all the recommended stop/start sequences, but nothing seems to work.

    Following https://doc.pfsense.org/index.php/Configuring_pfSense_Hardware_Redundancy_(CARP) I have set up HA between two pfSense VMs running within VirtualBox and have a Lubuntu VM within their network.

    The network is set up so that:

    • WAN is NAT

    • LAN is Internal Network 'pfsense'

    • OPT1 is Internal Network 'pfsenseCARP'

    Version for both nodes:

    Version 2.2-RELEASE (amd64)
    built on Thu Jan 22 14:03:54 CST 2015
    FreeBSD 10.1-RELEASE-p4

    pfsense01.local: /var/dhcpd/etc/dhcpd.conf

    option domain-name "local";
    option ldap-server code 95 = text;
    option domain-search-list code 119 = text;
    option arch code 93 = unsigned integer 16; # RFC4578

    default-lease-time 7200;
    max-lease-time 86400;
    log-facility local7;
    one-lease-per-client true;
    deny duplicates;
    ping-check true;
    update-conflict-detection false;
    authoritative;
    failover peer "dhcp_lan" {
      primary;
      address 192.168.1.1;
      port 519;
      peer address 192.168.1.2;
      peer port 520;
      max-response-delay 10;
      max-unacked-updates 10;
      split 128;
      mclt 600;

      load balance max seconds 3;
    }

    subnet 192.168.1.0 netmask 255.255.255.0 {
    pool {
    option domain-name-servers 192.168.1.10;
    deny dynamic bootp clients;
    failover peer "dhcp_lan";
    range 192.168.1.100 192.168.1.245;
    }

    option routers 192.168.1.10;
    option domain-name-servers 192.168.1.10;

    }

    pfsense02.local: /var/dhcpd/etc/dhcpd.conf

    option domain-name "local";
    option ldap-server code 95 = text;
    option domain-search-list code 119 = text;
    option arch code 93 = unsigned integer 16; # RFC4578

    default-lease-time 7200;
    max-lease-time 86400;
    log-facility local7;
    one-lease-per-client true;
    deny duplicates;
    ping-check true;
    update-conflict-detection false;
    authoritative;
    failover peer "dhcp_lan" {
      secondary;
      address 192.168.1.2;
      port 520;
      peer address 192.168.1.1;
      peer port 519;
      max-response-delay 10;
      max-unacked-updates 10;
     
      load balance max seconds 3;
    }

    subnet 192.168.1.0 netmask 255.255.255.0 {
    pool {
    option domain-name-servers 192.168.1.10;
    deny dynamic bootp clients;
    failover peer "dhcp_lan";
    range 192.168.1.100 192.168.1.245;
    }

    option routers 192.168.1.10;
    option domain-name-servers 192.168.1.10;

    }
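
One quick sanity check on a pair of configs like the two above is that each node's `address`/`port` matches the other node's `peer address`/`peer port`, and that exactly one side is `primary`. The following is only a minimal sketch of that check: the helper names `parse_failover` and `mirror_problems` are made up for illustration, and the embedded strings are trimmed copies of the failover declarations posted above. For the posted values it finds no mismatch.

```python
import re

# Trimmed copies of the failover declarations from the two dhcpd.conf files above.
PRIMARY_CONF = '''
failover peer "dhcp_lan" {
  primary;
  address 192.168.1.1;
  port 519;
  peer address 192.168.1.2;
  peer port 520;
}
'''

SECONDARY_CONF = '''
failover peer "dhcp_lan" {
  secondary;
  address 192.168.1.2;
  port 520;
  peer address 192.168.1.1;
  peer port 519;
}
'''


def parse_failover(conf_text):
    """Extract role, addresses and ports from the first failover peer declaration."""
    block = re.search(r'failover peer "([^"]+)"\s*\{(.*?)\}', conf_text, re.S)
    if block is None:
        raise ValueError("no failover peer declaration found")
    name, body = block.group(1), block.group(2)
    info = {"name": name}
    role = re.search(r'^\s*(primary|secondary)\s*;', body, re.M)
    info["role"] = role.group(1) if role else None
    for key in ("address", "port", "peer address", "peer port"):
        # Anchoring at line start keeps "address" from matching "peer address".
        m = re.search(r'^\s*' + re.escape(key) + r'\s+(\S+)\s*;', body, re.M)
        info[key] = m.group(1) if m else None
    return info


def mirror_problems(a, b):
    """Return a list of human-readable mismatches between two parsed declarations."""
    problems = []
    if a["name"] != b["name"]:
        problems.append("peer names differ")
    if {a["role"], b["role"]} != {"primary", "secondary"}:
        problems.append("need exactly one primary and one secondary")
    if a["address"] != b["peer address"] or b["address"] != a["peer address"]:
        problems.append("addresses are not mirrored")
    if a["port"] != b["peer port"] or b["port"] != a["peer port"]:
        problems.append("ports are not mirrored")
    return problems


primary = parse_failover(PRIMARY_CONF)
secondary = parse_failover(SECONDARY_CONF)
problems = mirror_problems(primary, secondary)
print(problems or "failover declarations mirror each other")
```

To run it against real files, replace the two strings with the contents of /var/dhcpd/etc/dhcpd.conf from each node.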

    pfsense01.local: /cf/conf/config.xml (VirtualIP Section)

    <virtualip><vip><mode>carp</mode>
    <interface>lan</interface>
    <vhid>1</vhid>
    <advskew>0</advskew>
    <advbase>1</advbase>
    <password>pf</password>
    <descr><type>single</type>
    <subnet_bits>24</subnet_bits>
    <subnet>192.168.1.10</subnet></descr></vip></virtualip>

    pfsense02.local: /cf/conf/config.xml (VirtualIP Section)

    <virtualip><vip><mode>carp</mode>
    <interface>lan</interface>
    <vhid>1</vhid>
    <advskew>100</advskew>
    <advbase>1</advbase>
    <password>pf</password>
    <descr><type>single</type>
    <subnet_bits>24</subnet_bits>
    <subnet>192.168.1.10</subnet></descr></vip></virtualip>

    The only thread I found that talks directly about this issue is from 6 years ago. It says the problem was resolved, but mine seems to be a different issue: https://forum.pfsense.org/index.php?topic=18285.0

    EDIT: One thing I've noticed that seems off is a lot of entries like this in the firewall log:

    block/1000107060 Feb 8 12:01:19 lo0 192.168.1.1:519 192.168.1.10:59293 TCP:SA
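
To make that log entry easier to read, here is a minimal sketch that splits it into named fields. The field layout is inferred only from the single row pasted above, not from a documented log format, and `LogRow` and `parse_row` are hypothetical names. Read literally, the row is a blocked SYN-ACK on lo0, sent from the primary's failover port (192.168.1.1:519) to the CARP VIP (192.168.1.10).

```python
from collections import namedtuple

# Hypothetical record type; the field order follows the pasted row.
LogRow = namedtuple(
    "LogRow",
    "action rule_id timestamp interface src_ip src_port dst_ip dst_port proto flags",
)

ROW = "block/1000107060 Feb 8 12:01:19 lo0 192.168.1.1:519 192.168.1.10:59293 TCP:SA"

# Values taken from the dhcpd.conf and config.xml excerpts in the post.
FAILOVER_PORTS = {519, 520}
CARP_VIP = "192.168.1.10"


def parse_row(row):
    """Split one pasted firewall log row into its fields."""
    parts = row.split()
    # Expected pieces: action/rule, month, day, time, interface, src, dst, proto:flags
    if len(parts) != 8:
        raise ValueError("unexpected row layout: %r" % row)
    action, rule_id = parts[0].split("/", 1)
    timestamp = " ".join(parts[1:4])
    src_ip, src_port = parts[5].rsplit(":", 1)
    dst_ip, dst_port = parts[6].rsplit(":", 1)
    proto, _, flags = parts[7].partition(":")
    return LogRow(action, rule_id, timestamp, parts[4],
                  src_ip, int(src_port), dst_ip, int(dst_port), proto, flags)


row = parse_row(ROW)
is_failover_reply = (
    row.proto == "TCP"
    and row.flags == "SA"                 # SYN+ACK: a reply to a connection attempt
    and row.src_port in FAILOVER_PORTS    # sent from a DHCP failover listener
)
to_carp_vip = row.dst_ip == CARP_VIP

print(row)
print("failover SYN-ACK:", is_failover_reply, "| addressed to CARP VIP:", to_carp_vip)
```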



  • I don't know exactly what fixed this issue but I did this:

    Disabled the firewall on both nodes (pfctl -d)
    Turned off the DHCP service on both
    Turned on the DHCP service on node1
    Waited a long time (forgot about it so was probably around 10 minutes)
    Turned on the DHCP service on node2
    Waited about 2 minutes
    Enabled firewall (pfctl -e)

    Now the DHCP service is reporting normal operation and getting DHCP leases seems to work after failover.
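
Besides the status page, the failover state can also be read from the leases file, where ISC dhcpd records `failover peer ... state` blocks; the last block for a peer is the current one. Below is a minimal sketch of that lookup. The sample text is made up to illustrate the format (it is not output from these nodes), `latest_failover_state` is a hypothetical helper, and the path /var/dhcpd/var/db/dhcpd.leases is an assumption about where pfSense keeps the file.

```python
import re

# Made-up sample in the dhcpd.leases format: an older "recover" record
# followed by a newer "normal" record for the same peer.
SAMPLE_LEASES = '''
failover peer "dhcp_lan" state {
  my state recover at 0 2015/02/08 12:01:19;
  partner state recover at 0 2015/02/08 12:01:19;
}

failover peer "dhcp_lan" state {
  my state normal at 0 2015/02/08 12:20:00;
  partner state normal at 0 2015/02/08 12:20:00;
}
'''

STATE_RE = re.compile(
    r'failover peer "(?P<peer>[^"]+)" state \{\s*'
    r'my state (?P<mine>[\w-]+) at [^;]*;\s*'
    r'partner state (?P<partner>[\w-]+) at [^;]*;'
)


def latest_failover_state(leases_text):
    """Return {peer name: (my state, partner state)} using the last record per peer."""
    latest = {}
    for m in STATE_RE.finditer(leases_text):
        # Later records overwrite earlier ones, so the newest state wins.
        latest[m.group("peer")] = (m.group("mine"), m.group("partner"))
    return latest


states = latest_failover_state(SAMPLE_LEASES)
for peer, (mine, partner) in states.items():
    print("%s: my state=%s, partner state=%s" % (peer, mine, partner))

# On a node, something like this would read the real file (path is an assumption):
# with open("/var/dhcpd/var/db/dhcpd.leases") as fh:
#     print(latest_failover_state(fh.read()))
```

Running it on both nodes after each step of a restart sequence would show whether either side is still in recover before the other is started.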



  • I have issues with this occasionally as well. Generally speaking, shutting down the dhcpd service on both firewalls and bringing them back up one at a time, about 5-10 seconds apart, seems to do the trick.



  • It seems that DHCP failover does not work properly when a large number of leases are in use.



  • Another important point to check when using DHCP failover, which can affect the recover/normal state, is the CARP advskew (advertisement skew) setting.
    As mentioned in the GUI:

    Ensure one machine's advskew<20 (and the other is >20).

    I would check that the CARP virtual IPs on the primary firewall respect this.
    I previously had issues with the DHCP service going into recover mode because of this. Since I set all the CARP VIPs on the primary node to skew 0, everything has been stable.
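
The advskew rule quoted above can be checked directly against the `<virtualip>` section of each node's config.xml. This is only a rough sketch: `carp_skews` and `skew_problems` are hypothetical helpers, and the XML strings are shortened versions of the excerpts in the first post (which already satisfy the rule with 0 and 100).

```python
import xml.etree.ElementTree as ET

# Shortened versions of the VirtualIP sections from the first post.
PRIMARY_XML = """<virtualip><vip><mode>carp</mode>
<interface>lan</interface>
<vhid>1</vhid>
<advskew>0</advskew>
<subnet>192.168.1.10</subnet></vip></virtualip>"""

SECONDARY_XML = """<virtualip><vip><mode>carp</mode>
<interface>lan</interface>
<vhid>1</vhid>
<advskew>100</advskew>
<subnet>192.168.1.10</subnet></vip></virtualip>"""


def carp_skews(xml_text):
    """Map (interface, vhid) to advskew for every CARP VIP in the fragment."""
    root = ET.fromstring(xml_text)
    skews = {}
    for vip in root.iter("vip"):
        if vip.findtext("mode") != "carp":
            continue
        key = (vip.findtext("interface"), vip.findtext("vhid"))
        skews[key] = int(vip.findtext("advskew"))
    return skews


def skew_problems(primary, secondary, threshold=20):
    """List VIPs that break the rule: primary advskew < 20, secondary advskew > 20."""
    problems = []
    for key, skew in primary.items():
        if skew >= threshold:
            problems.append("%s/vhid %s: primary advskew %d is not below %d"
                            % (key[0], key[1], skew, threshold))
        other = secondary.get(key)
        if other is None:
            problems.append("%s/vhid %s: missing on secondary" % key)
        elif other <= threshold:
            problems.append("%s/vhid %s: secondary advskew %d is not above %d"
                            % (key[0], key[1], other, threshold))
    return problems


primary_skews = carp_skews(PRIMARY_XML)
secondary_skews = carp_skews(SECONDARY_XML)
problems = skew_problems(primary_skews, secondary_skews)
print(problems or "advskew values follow the GUI rule")
```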

