DHCP failover not working

sef1414

I've just set up a second instance of pfSense for HA. The pair is syncing, but I'm running into issues with DHCP.

I've followed the guide for inputting the CARP VIP in the DNS Server / Gateway fields for the DHCP server, and inputting the peer failover IP.

On the DHCP Pool Status page, interfaces reflect "recover" in the "My State" column, and "unknown-state" in the "Peer State" column.

I've gone through this troubleshooting list and performed each step without any improvement:

https://docs.netgate.com/pfsense/en/latest/troubleshooting/ha-dhcp-failover.html

DHCP logs indicate an issue but don't provide much detail:

Jan 25 06:38:08	dhcpd	95441	failover peer dhcp_opt2: I move from startup to recover
Jan 25 06:37:53	dhcpd	95441	failover peer dhcp_opt2: host down
Jan 25 06:37:53	dhcpd	95441	failover peer dhcp_opt2: I move from recover to startup

I am running the same development snapshot on both nodes (2.7.0-DEVELOPMENT (amd64), built on Fri Jan 20 03:01:02 UTC 2023), as I could not get the newer host with newer hardware working on 2.6.

Any suggestions would be appreciated.

jimp

DHCP failover works on 2.7 in general, it's running fine on a pair of systems in my lab.

You likely have some part of the configuration that isn't resulting in the correct/expected config parameters being in the right places.

Something to remember is that DHCP communicates with its peer on each interface separately, so make sure the nodes can freely communicate on each local interface involved in DHCP failover, between TCP port 519 (primary) and port 520 (secondary).
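A quick way to test that reachability is to try opening a connection to the peer's failover port from each node. This is only a minimal sketch: it assumes the failover channel is TCP (which is how the ISC dhcpd `port` statement is documented), and the function name `can_connect` is made up for illustration. A failed connection can also just mean dhcpd is not listening on the peer, so treat the result as a hint rather than proof of a firewall problem.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Usage: python3 check_peer.py <peer-interface-address> <port>
    if len(sys.argv) == 3:
        host, port = sys.argv[1], int(sys.argv[2])
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print("%s:%d is %s" % (host, port, state))
```

From the primary you would point it at the secondary's interface address and port 520; from the secondary, at the primary's interface address and port 519. Repeat for each interface that has failover enabled.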

If that doesn't help, post the /var/dhcpd/etc/dhcpd.conf from both nodes. You can sanitize the addresses a bit but please try to keep the last octet intact (e.g. replace 192.168.1.1 with x.x.x.1).

sef1414

@jimp

So I was going to post the .conf files here, but before doing that I wanted to triple-check the HA recipe steps and troubleshooting steps and make sure everything was set up properly. It does indeed appear that it is. However, along the way I noticed that states do not appear to be syncing either.

There is a line in the system logs:

carp: demoted by 0 to 0 (pfsync bulk fail)

So, I am not sure if this is a related or separate issue to the DHCP syncing.

I am fairly sure I read on a forum that it was now OK to use different NICs in HA pairs, but that's no guarantee the info was accurate.

Does your statement from this old forum post still hold?

"The usual reason on 2.2.x for states to not sync is that the interfaces are mismatched. States in 2.2.x are interface-bound, meaning the interface is a part of the state. For example if the primary node has igb(4) NICs and the secondary has em(4), the states can't sync."

If so, I suppose that is causing the sync issues. If pfsync is not working, should I expect the culprit to be the same as for DHCP syncing not working?

sef1414

@jimp

Here are the configs just in case:

Master:


option domain-name "localnet";
option ldap-server code 95 = text;
option domain-search-list code 119 = text;
option arch code 93 = unsigned integer 16; # RFC4578

default-lease-time 7200;
max-lease-time 86400;
log-facility local7;
one-lease-per-client true;
deny duplicates;
update-conflict-detection false;
authoritative;
failover peer "dhcp_opt14" {
  primary;
  address 192.168.91.3;
  port 519;
  peer address 192.168.91.2;
  peer port 520;
  max-response-delay 10;
  max-unacked-updates 10;
  split 128;
  mclt 600;

  load balance max seconds 3;
}

failover peer "dhcp_opt15" {
  primary;
  address 192.168.35.3;
  port 519;
  peer address 192.168.35.2;
  peer port 520;
  max-response-delay 10;
  max-unacked-updates 10;
  split 128;
  mclt 600;

  load balance max seconds 3;
}

failover peer "dhcp_opt16" {
  primary;
  address 10.0.66.103;
  port 519;
  peer address 10.0.66.102;
  peer port 520;
  max-response-delay 10;
  max-unacked-updates 10;
  split 128;
  mclt 600;

  load balance max seconds 3;
}

failover peer "dhcp_opt17" {
  primary;
  address 192.168.56.103;
  port 519;
  peer address 192.168.56.102;
  peer port 520;
  max-response-delay 10;
  max-unacked-updates 10;
  split 128;
  mclt 600;

  load balance max seconds 3;
}

failover peer "dhcp_opt18" {
  primary;
  address 192.168.76.3;
  port 519;
  peer address 192.168.76.2;
  peer port 520;
  max-response-delay 10;
  max-unacked-updates 10;
  split 128;
  mclt 600;

  load balance max seconds 3;
}

Secondary:


option domain-name "localnet";
option ldap-server code 95 = text;
option domain-search-list code 119 = text;
option arch code 93 = unsigned integer 16; # RFC4578

default-lease-time 7200;
max-lease-time 86400;
log-facility local7;
one-lease-per-client true;
deny duplicates;
update-conflict-detection false;
authoritative;
failover peer "dhcp_lan" {
  secondary;
  address 192.168.1.2;
  port 520;
  peer address 192.168.1.3;
  peer port 519;
  max-response-delay 10;
  max-unacked-updates 10;
  
  load balance max seconds 3;
}

failover peer "dhcp_opt2" {
  secondary;
  address 192.168.20.1;
  port 520;
  peer address 192.168.20.3;
  peer port 519;
  max-response-delay 10;
  max-unacked-updates 10;
  
  load balance max seconds 3;
}

failover peer "dhcp_opt14" {
  secondary;
  address 192.168.91.1;
  port 520;
  peer address 192.168.91.3;
  peer port 519;
  max-response-delay 10;
  max-unacked-updates 10;
  
  load balance max seconds 3;
}

failover peer "dhcp_opt15" {
  secondary;
  address 192.168.35.1;
  port 520;
  peer address 192.168.35.3;
  peer port 519;
  max-response-delay 10;
  max-unacked-updates 10;
  
  load balance max seconds 3;
}

failover peer "dhcp_opt16" {
  secondary;
  address 10.0.66.1;
  port 520;
  peer address 10.0.66.103;
  peer port 519;
  max-response-delay 10;
  max-unacked-updates 10;
  
  load balance max seconds 3;
}

failover peer "dhcp_opt17" {
  secondary;
  address 192.168.56.1;
  port 520;
  peer address 192.168.56.103;
  peer port 519;
  max-response-delay 10;
  max-unacked-updates 10;
  
  load balance max seconds 3;
}

failover peer "dhcp_opt18" {
  secondary;
  address 192.168.76.1;
  port 520;
  peer address 192.168.76.3;
  peer port 519;
  max-response-delay 10;
  max-unacked-updates 10;
  
  load balance max seconds 3;
}

Didn't include all the static mappings with hostnames, but I can if those are needed.

jimp

At the very least you have some mismatches in the config that are likely causing you problems. The config for the secondary has pools for dhcp_lan and dhcp_opt2 which are not on the primary.

Also, in some of these pools the secondary has its own address as .1 but the primary has the peer address as .2. Normally you'd see it be .2/.3 and .3/.2, as they should be using their own interface addresses in these cases, not VIPs.
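The pool-name mismatch is easy to spot mechanically. The following is a rough sketch, not anything pfSense ships: it pulls the `failover peer "..." {` declarations out of each node's /var/dhcpd/etc/dhcpd.conf text and reports names that appear on only one side. The helper names `peer_names` and `compare_peers` are hypothetical, and the demo uses only the peer names from the configs posted above with the block bodies elided.

```python
import re

# Matches only declarations such as: failover peer "dhcp_lan" {
# References inside pool blocks end in ';' rather than '{' and are ignored.
PEER_DECL = re.compile(r'failover\s+peer\s+"([^"]+)"\s*\{')


def peer_names(conf_text):
    """Return the set of failover peer names declared in a dhcpd.conf text."""
    return set(PEER_DECL.findall(conf_text))


def compare_peers(primary_conf, secondary_conf):
    """Return (only_on_primary, only_on_secondary) as sorted lists."""
    primary = peer_names(primary_conf)
    secondary = peer_names(secondary_conf)
    return sorted(primary - secondary), sorted(secondary - primary)


def fake_conf(names, role):
    """Build a stub config containing one empty-ish block per peer name."""
    return "\n".join('failover peer "%s" {\n  %s;\n}' % (n, role) for n in names)


if __name__ == "__main__":
    shared = ["dhcp_opt14", "dhcp_opt15", "dhcp_opt16", "dhcp_opt17", "dhcp_opt18"]
    primary_conf = fake_conf(shared, "primary")
    secondary_conf = fake_conf(["dhcp_lan", "dhcp_opt2"] + shared, "secondary")
    only_primary, only_secondary = compare_peers(primary_conf, secondary_conf)
    print("only on primary:  ", only_primary)
    print("only on secondary:", only_secondary)
```

In practice you would read the two real files into strings instead of building stub configs.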

sef1414

@jimp

Thanks for the follow-up, much appreciated. I think the mismatch was due to some ill-timed testing where I had tried re-adding interfaces without re-enabling CARP.

I went ahead and stripped it down to just one interface for simplicity, and ran through the troubleshooting steps again.

Master:

option domain-name "localnet";
option ldap-server code 95 = text;
option domain-search-list code 119 = text;
option arch code 93 = unsigned integer 16; # RFC4578

default-lease-time 7200;
max-lease-time 86400;
log-facility local7;
one-lease-per-client true;
deny duplicates;
update-conflict-detection false;
authoritative;
failover peer "dhcp_opt15" {
  primary;
  address 192.168.35.3;
  port 519;
  peer address 192.168.35.2;
  peer port 520;
  max-response-delay 10;
  max-unacked-updates 10;
  split 128;
  mclt 600;

  load balance max seconds 3;
}

Secondary:

option domain-name "localnet";
option ldap-server code 95 = text;
option domain-search-list code 119 = text;
option arch code 93 = unsigned integer 16; # RFC4578

default-lease-time 7200;
max-lease-time 86400;
log-facility local7;
one-lease-per-client true;
deny duplicates;
update-conflict-detection false;
authoritative;
failover peer "dhcp_opt15" {
  secondary;
  address 192.168.35.1;
  port 520;
  peer address 192.168.35.3;
  peer port 519;
  max-response-delay 10;
  max-unacked-updates 10;
  
  load balance max seconds 3;
}

I believe the "address" on the master should be the CARP VIP, but maybe I'm mistaken. I did follow the guide for setting up the DHCP server.

Here are my settings for the interface / DHCP / CARP VIP on the master:

Here are the logs after starting the DHCP daemons:

Master:

Jan 26 14:29:44	dhcpleases	251	Sending HUP signal to dns daemon(86952)
Jan 26 14:29:44	dhcpd	84618	failover peer dhcp_opt15: I move from startup to recover
Jan 26 14:29:29	dhcpleases	251	Sending HUP signal to dns daemon(86952)
Jan 26 14:29:29	dhcpd	84618	Server starting service.
Jan 26 14:29:29	dhcpd	84618	failover peer dhcp_opt15: I move from recover to startup
Jan 26 14:29:29	dhcpd	84618	Sending on Socket/fallback/fallback-net

Secondary:

Jan 26 14:30:02	dhcpd	41294	failover peer dhcp_opt15: I move from startup to recover
Jan 26 14:29:47	dhcpleases	5555	Sending HUP signal to dns daemon(32665)
Jan 26 14:29:47	dhcpleases	5555	Sending HUP signal to dns daemon(32665)
Jan 26 14:29:47	dhcpd	41294	Server starting service.
Jan 26 14:29:47	dhcpd	41294	failover peer dhcp_opt15: host unreachable
Jan 26 14:29:47	dhcpd	41294	failover peer dhcp_opt15: I move from recover to startup
Jan 26 14:29:47	dhcpd	41294	Sending on Socket/fallback/fallback-net
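Since these log excerpts are pasted newest-first, the state changes are easier to follow once reordered. A small sketch, assuming the message text always has the form `failover peer NAME: I move from X to Y` as in the lines above; the function name `transitions` is invented for illustration.

```python
import re

MOVE_RE = re.compile(r"failover peer (\S+): I move from (\S+) to (\S+)")


def transitions(log_text):
    """Return (peer, old_state, new_state) tuples, oldest first.

    Assumes the input is ordered newest-first, as in the pasted excerpts.
    """
    found = [m.groups() for m in map(MOVE_RE.search, log_text.splitlines()) if m]
    found.reverse()
    return found


if __name__ == "__main__":
    # Lines copied from the secondary's log above.
    secondary_log = (
        "Jan 26 14:30:02\tdhcpd\t41294\tfailover peer dhcp_opt15: I move from startup to recover\n"
        "Jan 26 14:29:47\tdhcpd\t41294\tServer starting service.\n"
        "Jan 26 14:29:47\tdhcpd\t41294\tfailover peer dhcp_opt15: host unreachable\n"
        "Jan 26 14:29:47\tdhcpd\t41294\tfailover peer dhcp_opt15: I move from recover to startup\n"
    )
    for peer, old, new in transitions(secondary_log):
        print("%s: %s -> %s" % (peer, old, new))
```

On both nodes this shows the same pattern: recover to startup, then back to recover, with no transition to a normal state in between.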

jimp

Something still isn't right there. If the VIP is .1 then neither of them should be using that as their "address" in the subnet for DHCP.

The config should show address <-> peer in both directions, like this:

Primary:

failover peer "dhcp_lan" {
  primary;
  address 10.11.0.2;
  port 519;
  peer address 10.11.0.3;
  peer port 520;
  max-response-delay 10;
  max-unacked-updates 10;
  split 128;
  mclt 600;

  load balance max seconds 3;
}

Secondary:

failover peer "dhcp_lan" {
  secondary;
  address 10.11.0.3;
  port 520;
  peer address 10.11.0.2;
  peer port 519;
  max-response-delay 10;
  max-unacked-updates 10;

  load balance max seconds 3;
}

Note that it's 10.11.0.2:519 <-> 10.11.0.3:520 both ways.

I'm not sure how that secondary is pulling the VIP for its own address there.
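The mirror rule can be checked with a few lines of code. This is only a sketch under stated assumptions: the helpers `parse_peers` and `mirror_problems` are hypothetical, the parser handles just the flat `failover peer` blocks shown in this thread, and the VIP value 192.168.35.1 is assumed from the discussion above rather than confirmed.

```python
import re

BLOCK_RE = re.compile(r'failover\s+peer\s+"([^"]+)"\s*\{(.*?)\}', re.DOTALL)
FIELD_RE = re.compile(
    r'^\s*(peer address|peer port|address|port)\s+([^;\s]+)\s*;', re.MULTILINE
)


def parse_peers(conf_text):
    """Map each failover peer name to its address/port settings."""
    return {
        name: dict(FIELD_RE.findall(body))
        for name, body in BLOCK_RE.findall(conf_text)
    }


def mirror_problems(primary, secondary, vip=None):
    """List the ways two failover blocks fail to mirror each other."""
    problems = []
    pairs = [
        ("address", "peer address"),
        ("peer address", "address"),
        ("port", "peer port"),
        ("peer port", "port"),
    ]
    for p_key, s_key in pairs:
        if primary.get(p_key) != secondary.get(s_key):
            problems.append(
                "primary %s %s != secondary %s %s"
                % (p_key, primary.get(p_key), s_key, secondary.get(s_key))
            )
    for role, block in (("primary", primary), ("secondary", secondary)):
        if vip is not None and block.get("address") == vip:
            problems.append("%s uses the VIP %s as its own address" % (role, vip))
    return problems


if __name__ == "__main__":
    # Address and port lines from the single-interface configs posted earlier.
    primary_conf = (
        'failover peer "dhcp_opt15" {\n  primary;\n  address 192.168.35.3;\n'
        '  port 519;\n  peer address 192.168.35.2;\n  peer port 520;\n}\n'
    )
    secondary_conf = (
        'failover peer "dhcp_opt15" {\n  secondary;\n  address 192.168.35.1;\n'
        '  port 520;\n  peer address 192.168.35.3;\n  peer port 519;\n}\n'
    )
    p = parse_peers(primary_conf)["dhcp_opt15"]
    s = parse_peers(secondary_conf)["dhcp_opt15"]
    for problem in mirror_problems(p, s, vip="192.168.35.1"):  # VIP is assumed
        print(problem)
```

Run against the 10.11.0.2/10.11.0.3 example, the same check reports nothing, which is what a healthy pair should look like.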

sef1414

@jimp

Hmm ok. Changing it manually just gets overwritten, as I expected. Any thoughts on where to go from here?

For the "host unreachable" messages, do I need some explicit firewall rule to pass the traffic on that interface? I wouldn't think so, since it's not mentioned in the docs and the traffic is on the same interface.

jimp

Not sure why it's picking the VIP there, but it might be related to https://redmine.pfsense.org/issues/11545 -- I don't think I've ever seen that be triggered by a CARP VIP though, especially not that reliably.

I'd take a closer look at the interface and VIP settings and see if anything stands out there.

sef1414

@jimp

Alright. No joy on re-saving interface / VIP. Pretty sure I have everything configured correctly, have run through too many times to count.

Any shot it's caused by different physical NIC models?

jimp

No, the NIC models only affect state sync, not DHCP sync. And even then the state sync isn't affected anymore since we moved back away from interface-bound states.