DHCP Servers in HA Recovery State

splinegear

Hello, I had a working system in HA, but after some trivial address range changes to one of the DHCP servers, all of them are now stuck in recovery.

I have 5 networks and 3 DHCP servers. All DHCP servers are on VLANs, but the untagged management network they are carried on is NOT offering DHCP.
CARP is working well and I can shift Master behaviour back and forth between two units.

I've read all the troubleshooting for DHCP issues with HA, restarted each side, stood on my head and wiped the leases file.

After capturing traffic on ports 519/520, I found that each side keeps sending SYNs to the other, with a number of TCP retransmissions (flagged when decoding in Wireshark, but not clearly shown here).
I don't have a known-good capture to compare against, but this looks suspect.

23:58:27.196467 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.20.0.2.5069 > 10.20.0.3.519: Flags [S], cksum 0x145b (incorrect -> 0xe5bb), seq 4108732373, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 4215210363 ecr 0], length 0
23:58:43.114475 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.20.0.2.7616 > 10.20.0.3.519: Flags [S], cksum 0x145b (incorrect -> 0x3ef4), seq 450303092, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 275027612 ecr 0], length 0
23:58:44.115396 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.20.0.2.7616 > 10.20.0.3.519: Flags [S], cksum 0x145b (incorrect -> 0x3b0b), seq 450303092, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 275028613 ecr 0], length 0
23:58:45.059483 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.20.0.3.32726 > 10.20.0.2.519: Flags [S], cksum 0x2e4a (correct), seq 34146524, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 772295666 ecr 0], length 0
23:58:46.327090 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.20.0.2.7616 > 10.20.0.3.519: Flags [S], cksum 0x145b (incorrect -> 0x3268), seq 450303092, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 275030824 ecr 0], length 0
23:58:50.527367 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.20.0.2.7616 > 10.20.0.3.519: Flags [S], cksum 0x145b (incorrect -> 0x2200), seq 450303092, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 275035024 ecr 0], length 0
23:58:58.755973 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    10.20.0.2.7616 > 10.20.0.3.519: Flags [S], cksum 0x145b (incorrect -> 0x01db), seq 450303092, win 65228, options [mss 1460,nop,wscale 7,sackOK,TS val 275043253 ecr 0], length 0
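
A capture like this is easier to read once the SYNs are tallied by direction and destination port. Below is a minimal sketch in Python, assuming the text comes from verbose tcpdump output in the same format as above; the function name count_syns and the shortened sample are illustrative only.

```python
import re
from collections import Counter

# Matches "src_ip.src_port > dst_ip.dst_port: Flags [S]" (pure SYN, no ACK).
SYN_LINE = re.compile(
    r"(\d+(?:\.\d+){3})\.(\d+) > (\d+(?:\.\d+){3})\.(\d+): Flags \[S\]"
)

def count_syns(capture_text):
    """Count SYN packets per (source IP, destination IP, destination port)."""
    counts = Counter()
    for match in SYN_LINE.finditer(capture_text):
        src, _sport, dst, dport = match.groups()
        counts[(src, dst, int(dport))] += 1
    return counts

# Shortened sample: address lines only, options trimmed for readability.
SAMPLE = """\
    10.20.0.2.5069 > 10.20.0.3.519: Flags [S], seq 4108732373, length 0
    10.20.0.2.7616 > 10.20.0.3.519: Flags [S], seq 450303092, length 0
    10.20.0.3.32726 > 10.20.0.2.519: Flags [S], seq 34146524, length 0
"""

for (src, dst, dport), n in sorted(count_syns(SAMPLE).items()):
    print(f"{src} -> {dst}:{dport}  SYNs: {n}")
```

If every row shows the same destination port in both directions, that is worth comparing against the failover configuration on each unit.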

Any suggestions for how to debug this? (other than disable/re-enable)

Thanks in advance,
Daryl

splinegear

Solved. (Although I still have temporary outages after each DHCP configuration change.)

TL;DR: I had the skew on my VIP addresses set to 100 and 200 instead of 0 and 100. Unfortunately I hadn't noticed this, since DHCP failover was working as expected before the config change. The notes on troubleshooting HA DHCP failover are worth careful study.

The smoking gun in my post above is that all the communication is on port 519 and not a mix of 519 and 520. This was caused by both of my pfSense units believing they were secondary (since the skews were high). I found this by reading /var/dhcpd/etc/dhcpd.conf and looking at the clause marked "failover peer".
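
For anyone who wants to check this without reading the file by eye, here is a rough sketch in Python that pulls the role and ports out of each "failover peer" clause. It assumes the generated file uses the usual ISC dhcpd failover syntax (a primary; or secondary; statement plus port and peer port); the function name failover_roles and the sample text are made up for illustration and are not copied from my units.

```python
import re

# A failover declaration looks like: failover peer "name" { ... }
# References such as: failover peer "name";  (inside pools) are not matched.
FAILOVER_BLOCK = re.compile(r'failover\s+peer\s+"([^"]+)"\s*\{(.*?)\}', re.DOTALL)

def failover_roles(conf_text):
    """Return {peer_name: {"role", "port", "peer_port"}} for each failover clause."""
    conf_text = re.sub(r"#.*", "", conf_text)  # drop comments
    result = {}
    for name, body in FAILOVER_BLOCK.findall(conf_text):
        info = {"role": "unknown", "port": None, "peer_port": None}
        for raw in body.split(";"):
            stmt = " ".join(raw.split())  # normalise whitespace
            if stmt in ("primary", "secondary"):
                info["role"] = stmt
            elif stmt.startswith("peer port "):
                info["peer_port"] = int(stmt.split()[-1])
            elif stmt.startswith("port "):
                info["port"] = int(stmt.split()[-1])
        result[name] = info
    return result

# Illustrative sample only (hypothetical peer name and values).
SAMPLE = """
failover peer "dhcp_lan" {
  secondary;
  address 10.20.0.2;
  port 520;
  peer address 10.20.0.3;
  peer port 519;
}
"""

for name, info in failover_roles(SAMPLE).items():
    print(name, info)
```

Run it against the file from each unit (for example by passing open(path).read() to failover_roles). If both units report the same role, the failover pair cannot form and the CARP skews are the first thing to check.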

Thanks! Daryl