DHCP Failover with CARP - Both in Recover, Peer Unknown State

acherman

Hey everyone, I have moved over to pfSense from m0n0wall to take advantage of failover (and a couple of other items). I have been using a good CARP setup (2 boxes, failover, no load balancing, many VLANs) on some temp hardware for a bit, and I'm getting ready to deploy permanent hardware. Of course I am doing a ton of testing beforehand, and I am still fighting with getting the DHCP server to fail over.

I am running 1.2.3RC1 right now, but saw the same thing with 1.2.2. Again, my CARP config is good. The failover works perfectly every time. My DHCP config is relatively good - each server hands out leases on its own when set up. But once I try to enable peer failover, no more leases are handed out. In the DHCP status I see both servers show their own state as recover and their peer state as unknown.

I have been reading several posts regarding this same issue, including this one: http://forum.pfsense.org/index.php/topic,7986.0.html. I have read through the CARP document here: http://doc.pfsense.org/index.php/Configuring_pfSense_Hardware_Redundancy_(CARP)#Setting_DHCP_Server_to_use_CARP_LAN_IP_Address and I have watched and stepped slowly through the demo here: http://files.chi.pfsense.org/mirror/tutorials/carp/carp-cluster-new.htm. I still have this issue.

As others have said, I do not have short lease times and generally don't have outages/failures for any length of time, so I can live with having to manually enable DHCP on the standby machine if needed, but it sure would be nice to have full failover and true backup hardware. Any help is appreciated on this.

Thanks in advance.

dotdash

I've got this running on a couple of CARP setups. I do remember having to fiddle a bit to get it to calm down and sync. I don't recall exactly, but try stopping both DHCP servers, then start the primary, wait a minute, start the backup dhcp service. Just for reference, here is the basic configuration off of a cluster where this is working. It's a very simple config. The cluster is on 1.2.2.

LAN CARP IP: 192.168.1.1
Master-
LAN IP 192.168.1.2
DHCP config
Range 100-150
DNS server: 192.168.1.1
Gateway:192.168.1.1
Failover Peer: 192.168.1.3

Backup-
LAN IP 192.168.1.3
DHCP config
Range 100-150
DNS server: 192.168.1.1
Gateway:192.168.1.1
Failover Peer: 192.168.1.2

acherman

Thanks for the reply. When you say stop and start the DHCP service, do you mean to just disable/enable it via the webpage config? I agree the config is (should be) simple. Your info matches what I have done. I have some cables to build but I will work with it again in a bit.

dotdash

I just went into status, services and stopped DHCP from there.

acherman

Gotcha. I will try that in a bit and let you know.

Thanks again.

acherman

Ummmm... I'm not even able to stop the service. haha I go to Status -> Services and see the dhcpd service listed with the restart and stop buttons beside it. I click the stop button, and a couple of seconds later the top of the window says dhcpd has been stopped, but the status beside the service does not change - see screenshot. I thought this might be an issue with 1.2.3RC1, so I downgraded both boxes to 1.2.2 and the same thing happens. I tried IE7 and Firefox with the same results (trying all the little things). Do you know if/how I can try the same thing from the command line?

Thanks.

[Attached screenshot: 1.jpg]

dotdash

Looking at your screenshot with NTP stopped makes me wonder if you are working on your cluster off-line. If so, be aware that there is a gotcha that breaks your DHCP sync if the time on the two boxes is not synchronized. You can verify this by trying to start dhcpd from a console/ssh prompt. Something like:

/usr/local/sbin/dhcpd -cf /var/dhcpd/etc/dhcpd.conf

should tell you what's happening. Add -f to that to run in the foreground.

acherman

You are right, they are offline on the test bench. I will set up a test NTP server for them to sync to.

JoshW

Apparently the DHCP status interface is somewhat flaky.

The DHCP leases status appears to update slowly/is incorrect. Look at the logs instead, as those appear to be correct.

The service status of dhcpd is also incorrect. The stop and start buttons DO work, but the dhcpd service is always shown as started. Use top from a shell and show user dhcpd to see this behavior.

I had to reboot the firewalls after changing dhcpd settings to get things to work correctly. Also, I noticed that the failover IP address must be on the same subnet as the pool or the code will set both servers to secondary.

Check your configs in /var/dhcpd/etc/dhcpd.conf

acherman

Thanks for your input Josh. I have not been able to get this working. I do have the time between these machines synced now, and as you say the dhcpd status never changes. haha I have rebooted these boxes a number of times - even made the changes to dhcp on each box and rebooted the primary, then the secondary a few seconds later (reboots take the same time) to make the primary come online first. Same issue right after boot.

My addressing is okay - boxes are 10.61.32.250/24 and 251, CARP address is 254 (for this testing interface), pool is from 10 to 20. dhcpd.conf snip:

failover peer "dhcp0" {
  primary;
  address 10.61.32.250;
  port 519;
  peer address 10.61.32.251;
  peer port 520;
  max-response-delay 10;
  max-unacked-updates 10;
  split 128;
  mclt 600;
|
|
|
|
subnet 10.61.32.0 netmask 255.255.252.0 {
	pool {
		option domain-name-servers 10.61.32.254;
		deny dynamic bootp clients;
		failover peer "dhcp0";
		range 10.61.32.10 10.61.32.20;
	}
	option routers 10.61.32.254;
	option domain-name-servers 10.61.32.254;
}
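
For anyone who wants to double-check the "same subnet" condition JoshW mentioned, here is a minimal Python sketch using the addresses from the snippet above. It uses the netmask from the dhcpd.conf excerpt (255.255.252.0) rather than the /24 quoted in the text; the function name is made up for illustration and is not part of pfSense or dhcpd.

```python
import ipaddress

def addresses_in_subnet(subnet, netmask, addresses):
    """Map each address to True/False depending on whether it lies in subnet/netmask."""
    network = ipaddress.ip_network(f"{subnet}/{netmask}")
    return {addr: ipaddress.ip_address(addr) in network for addr in addresses}

# Values copied from the dhcpd.conf snippet in this post:
# failover address, peer address, and both ends of the pool range.
result = addresses_in_subnet(
    "10.61.32.0",
    "255.255.252.0",
    ["10.61.32.250", "10.61.32.251", "10.61.32.10", "10.61.32.20"],
)
for addr, inside in result.items():
    print(addr, "is inside" if inside else "is OUTSIDE", "the subnet")
```

If any address is reported as outside the subnet, that would match the condition JoshW described; if all are inside, the addressing is probably not the cause.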

JoshW

Check the dhcpd log files on both ends to see if dhcpd is complaining about anything.

acherman

The only bad log entries I see are like this:

dhcpd: failover peer dhcp0: I move from recover to startup
dhcpd: failover peer dhcp0: I move from startup to recover

And when a request is made I see entries like this:

dhcpd: DHCPREQUEST for 10.61.32.20 from 00:0b:db:7e:8e:5d via em1: not responding (recovering)
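
A rough Python sketch for pulling these state changes out of a log dump, in case it helps spot a server that keeps bouncing between recover and startup. The regular expression is based only on the two message formats quoted above, so other dhcpd versions may word things differently, and the helper names are illustrative.

```python
import re

# Matches lines such as:
#   dhcpd: failover peer dhcp0: I move from recover to startup
TRANSITION = re.compile(r"failover peer ([^:\s]+): I move from (\S+) to (\S+)")

def failover_transitions(log_lines):
    """Return (peer, old_state, new_state) for every state change found."""
    found = []
    for line in log_lines:
        match = TRANSITION.search(line)
        if match:
            found.append(match.groups())
    return found

def never_reaches_normal(transitions):
    """True if none of the logged transitions ends in the 'normal' state."""
    return all(new_state != "normal" for _, _, new_state in transitions)

sample_log = [
    "dhcpd: failover peer dhcp0: I move from recover to startup",
    "dhcpd: failover peer dhcp0: I move from startup to recover",
    "dhcpd: DHCPREQUEST for 10.61.32.20 from 00:0b:db:7e:8e:5d via em1: not responding (recovering)",
]

transitions = failover_transitions(sample_log)
print(transitions)
print("stuck outside normal:", never_reaches_normal(transitions))
```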

richardsc

Just bumping this thread back up, as I've been facing the same issues noted here.

My setup:

2x pfsense boxes doing CARP on 4 separate vlans.
DHCP configured on each VLAN

Enabling failover DHCP, I would just get the same log messages as posted by acherman … and then dhcpd will not hand out leases while it's in the recover state!

What I ended up finding:

Check your dhcpd.conf file (/var/dhcpd/etc/dhcpd.conf) on your secondary pfSense server. I found that it was not properly receiving the "secondary" designation in the failover section.

It seems this designation is assigned when the service is started / config is generated by the file /etc/inc/services.inc in the section beginning at line 139.

I've not yet analyzed the code to try and figure out if there's a bug here … I think that there may be an issue with how the $skew value is being determined.

I needed to get this working ASAP, so on my secondary firewall, I've just forced it to always be a secondary by modifying line 156 to be: $type = "secondary"; (it was always being set to primary, even though it shouldn't be…)

Finally, I had to manually kill dhcpd on each box, remove the dhcpd.leases file on both, and then start dhcpd on the primary, then the secondary. After about 5 minutes, DHCP leases status was "normal", and now they've been running fine for several hours, after doling out nearly 100 leases.

dotdash

Question for richardsc- do you have any 'other' type VIPs? I had an issue like this ages back, and it was due to the other VIPs throwing off the master/backup check. I also used a cheap hack to fix the issue. The problem went away when I only had CARP VIPs.

richardsc

@dotdash:

Question for richardsc- do you have any 'other' type VIPs? I had an issue like this ages back, and it was due to the other VIPs throwing off the master/backup check. I also used a cheap hack to fix the issue. The problem went away when I only had CARP VIPs.

Nope. I only have CARP VIPs.

If I can find time this week, I'm going to try and investigate further to find the root cause of the problem.

acherman

Just for fun I upgraded both boxes to 1.2.3 RC3 today and tried this again. I still cannot get it to work properly. I may resort to the mod mentioned above to get this working.

acherman

:( Still no go. I cannot get DHCP failover working. I have accepted the fact that it is broken and I will have to manually start DHCP on the backup unit during a failure. :'(

dotdash

Check the dhcpd.conf on both boxes and verify the main is set to primary and the backup is set to secondary.
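
If you would rather script that check than eyeball the file, a minimal Python sketch could look like the one below. It assumes the failover section has the same shape as the snippet posted earlier in this thread; `failover_role` is a made-up helper name, not something shipped with pfSense.

```python
import re

def failover_role(conf_text):
    """Return 'primary', 'secondary', or None if no failover block declares a role."""
    # Only a declaration followed by "{" counts; the 'failover peer "dhcp0";'
    # reference inside a pool ends in ";" and is skipped.
    block = re.search(r'failover\s+peer\s+"[^"]+"\s*\{(.*?)\}', conf_text, re.DOTALL)
    if block is None:
        return None
    role = re.search(r"^\s*(primary|secondary)\s*;", block.group(1), re.MULTILINE)
    return role.group(1) if role else None

# Sample text modelled on the snippet posted earlier in the thread.
sample_conf = """
failover peer "dhcp0" {
  primary;
  address 10.61.32.250;
  port 519;
  peer address 10.61.32.251;
  peer port 520;
}
subnet 10.61.32.0 netmask 255.255.252.0 {
	pool {
		failover peer "dhcp0";
		range 10.61.32.10 10.61.32.20;
	}
}
"""

print(failover_role(sample_conf))
```

On a real box you would pass in the contents of /var/dhcpd/etc/dhcpd.conf from each firewall; per this thread, the main box should report primary and the backup should report secondary.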

Eugene

Today I tried to set it up and hit the same problem. A quick tcpdump showed how it can be fixed: I added a rule allowing TCP ports 519 and 520 from LAN net to the LAN interface (this rule will be replicated to the passive box), restarted dhcpd on the active box, and that was it. It is working properly.
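
A quick way to see whether the failover ports are reachable at all is a plain TCP connect from one box (or a LAN host) to the other. This is only a sketch: `tcp_port_open` is an illustrative helper, and a successful connect shows that something is listening and the firewall lets the connection through, not that the failover handshake itself is healthy.

```python
import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

With the addresses acherman posted earlier, the primary would call something like tcp_port_open("10.61.32.251", 520), and the secondary tcp_port_open("10.61.32.250", 519), assuming the secondary's config mirrors the primary's ports. A timeout rather than an immediate refusal would be consistent with a firewall rule silently dropping the traffic.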

blackb1rd

I also had problems getting this to work with the pfSense 2.0 snapshots from May 9th and May 11th. After changing the line in services.inc (and removing another one) as mentioned by richardsc, it worked for me. Somehow the skew counter isn't working correctly. I'm not sure exactly how it works, but I know both routers have the exact same time and timezone set. It seems to me there is some kind of bug.