DHCP Failover with CARP - Both in Recover, Peer Unknown State

acherman

Thanks for the reply. When you say stop and start the DHCP service, do you mean to just disable/enable it via the webpage config? I agree the config is (should be) simple. Your info matches what I have done. I have some cables to build, but I will work with it again in a bit.

dotdash

I just went into status, services and stopped DHCP from there.

acherman

Gotcha. I will try that in a bit and let you know.

Thanks again.

acherman

Ummmm… I'm not even able to stop the service, haha. I go to Status -> Services and see the dhcpd service listed with the restart and stop buttons beside it. I click the stop button, and a couple of seconds later the top of the window says dhcpd has been stopped, but the status beside the service does not change (see screenshot). I thought it was perhaps an issue with 1.2.3RC1, so I downgraded both boxes to 1.2.2, and the same thing happens. I tried IE7 and Firefox with the same results (trying all the little things). Do you know if/how I can try the same thing from the command line?

Thanks.

[Attached screenshot: 1.jpg]

dotdash

Looking at your screenshot with NTP stopped makes me wonder if you are working on your cluster off-line. If so, be aware that there is a gotcha that breaks your DHCP sync if the time on the two boxes is not synchronized. You can verify this by trying to start dhcpd from a console/ssh prompt. Something like:

/usr/local/sbin/dhcpd -cf /var/dhcpd/etc/dhcpd.conf

should tell you what's happening. Add -f to that to run in the foreground.

acherman

You are right, they are offline on the test bench. I will set up a test NTP server for them to sync to.

JoshW

Apparently the dhcp status interface is somewhat flaky.

The DHCP leases status appears to update slowly/is incorrect. Look at the logs instead, as those appear to be correct.

The service status of dhcpd is also incorrect. The stop and start buttons DO work, but the dhcpd service is always shown as started. Use top from a shell and show user dhcpd to see this behavior.

I had to reboot the firewalls after changing dhcpd settings to get things to work correctly. Also, I noticed that the failover IP address must be on the same subnet as the pool or the code will set both servers to secondary.

Check your configs in /var/dhcpd/etc/dhcpd.conf

acherman

Thanks for your input, Josh. I have not been able to get this working. I do have the time between these machines synced now, and as you say, the dhcpd status never changes, haha. I have rebooted these boxes a number of times. I even made the changes to dhcp on each box and rebooted the primary, then the secondary a few seconds later (reboots take the same time), to make the primary come online first. Same issue right after boot.

My addressing is okay - boxes are 10.61.32.250/24 and 251, CARP address is 254 (for this testing interface), pool is from 10 to 20. dhcpd.conf snip:

failover peer "dhcp0" {
  primary;
  address 10.61.32.250;
  port 519;
  peer address 10.61.32.251;
  peer port 520;
  max-response-delay 10;
  max-unacked-updates 10;
  split 128;
  mclt 600;
|
|
|
|
subnet 10.61.32.0 netmask 255.255.252.0 {
	pool {
		option domain-name-servers 10.61.32.254;
		deny dynamic bootp clients;
		failover peer "dhcp0";
		range 10.61.32.10 10.61.32.20;
	}
	option routers 10.61.32.254;
	option domain-name-servers 10.61.32.254;
}
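For anyone wanting to double-check addressing like this, here is a minimal Python sketch (not part of pfSense; the helper name outside_subnet is made up for illustration). It uses the standard ipaddress module to confirm that the failover addresses, the CARP address, and the pool range all sit inside the subnet declared in the snippet above.

```python
import ipaddress

def outside_subnet(subnet, addresses):
    """Return the addresses that do NOT fall inside the given subnet."""
    net = ipaddress.ip_network(subnet)
    return [a for a in addresses if ipaddress.ip_address(a) not in net]

# Values copied from the dhcpd.conf snippet and the post above.
SUBNET = "10.61.32.0/255.255.252.0"
ADDRESSES = [
    "10.61.32.250",  # failover "address" (primary box)
    "10.61.32.251",  # failover "peer address" (secondary box)
    "10.61.32.254",  # CARP address
    "10.61.32.10",   # start of the pool range
    "10.61.32.20",   # end of the pool range
]

# An empty list means every address is inside the declared subnet.
print(outside_subnet(SUBNET, ADDRESSES))
# Prefix length implied by the netmask in the subnet declaration.
print(ipaddress.ip_network(SUBNET).prefixlen)
```

Note that the netmask 255.255.252.0 in the snippet works out to a /22, while the boxes are quoted as /24 above. All the addresses still fall inside the declared subnet; whether that difference matters for failover is not established in this thread.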

JoshW

Check the dhcpd log files on both ends to see if dhcpd is complaining about anything.

acherman

The only bad log entries I see are like this:

dhcpd: failover peer dhcp0: I move from recover to startup
dhcpd: failover peer dhcp0: I move from startup to recover

And when a request is made I see entries like this:

dhcpd: DHCPREQUEST for 10.61.32.20 from 00:0b:db:7e:8e:5d via em1: not responding (recovering)
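If the log is long, a small script can pull out the state changes. This is only a sketch that assumes the log lines look exactly like the two quoted above; the function names are hypothetical and nothing here comes from dhcpd or pfSense itself.

```python
import re

# Matches lines such as:
#   dhcpd: failover peer dhcp0: I move from recover to startup
TRANSITION = re.compile(r"failover peer (\S+): I move from (\S+) to (\S+)")

def transitions(log_lines):
    """Return (peer, old_state, new_state) tuples found in the log lines."""
    found = []
    for line in log_lines:
        m = TRANSITION.search(line)
        if m:
            found.append(m.groups())
    return found

def is_flapping(log_lines, states=("recover", "startup")):
    """True if there are transitions and all of them stay within `states`."""
    moves = transitions(log_lines)
    return bool(moves) and all(
        old in states and new in states for _, old, new in moves
    )

# The two entries quoted in the post above.
SAMPLE_LOG = [
    "dhcpd: failover peer dhcp0: I move from recover to startup",
    "dhcpd: failover peer dhcp0: I move from startup to recover",
]

print(transitions(SAMPLE_LOG))
print(is_flapping(SAMPLE_LOG))
```

A peer that only ever bounces between recover and startup, as here, never reaches a state in which it answers requests, which matches the "not responding (recovering)" entries.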

richardsc

Just to bump this thread back up, as I've been facing the same issues as noted here in this thread.

My setup:

2x pfsense boxes doing CARP on 4 separate vlans.
DHCP configured on each VLAN

Enabling failover DHCP, I would just get the same log messages as posted by acherman … and then dhcpd will not hand out leases while it's in the recover state!

What I ended up finding:

Check your dhcpd.conf file (/var/dhcpd/etc/dhcpd.conf) on your secondary pfsense server. I found that it was not properly receiving the "secondary" designation in the failover section.

It seems this designation is assigned when the service is started / config is generated by the file /etc/inc/services.inc in the section beginning at line 139.

I've not yet analyzed the code to try and figure out if there's a bug here … I think that there may be an issue with how the $skew value is being determined.

I needed to get this working ASAP, so on my secondary firewall, I've just forced it to always be a secondary by modifying line 156 to be: $type = "secondary"; (it was always being set to primary, even though it shouldn't be…)

Finally, I had to manually kill dhcpd on each box, remove the dhcpd.leases file on both, and then start dhcpd on the primary, then the secondary. After about 5 minutes, DHCP leases status was "normal", and now they've been running fine for several hours, after doling out nearly 100 leases.

dotdash

Question for richardsc- do you have any 'other' type VIPs? I had an issue like this ages back, and it was due to the other VIPs throwing off the master/backup check. I also used a cheap hack to fix the issue. The problem went away when I only had CARP VIPs.

richardsc

@dotdash:

Question for richardsc- do you have any 'other' type VIPs? I had an issue like this ages back, and it was due to the other VIPs throwing off the master/backup check. I also used a cheap hack to fix the issue. The problem went away when I only had CARP VIPs.

Nope. I only have CARP VIPs.

If I can find time this week, I'm going to try and investigate further to find the root cause of the problem.

acherman

Just for fun I upgraded both boxes to 1.2.3 RC3 today and tried this again. I still can not get it to work properly. I may resort to the mod mentioned above to get this working.

acherman

:( Still no go. I cannot get dhcp failover working. I have accepted the fact that it is broken and I will have to manually start dhcp on the backup unit during a failure. :'(

dotdash

Check the dhcpd.conf on both boxes and verify the main is set to primary and the backup is set to secondary.
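A rough way to do that check without reading the whole file by eye is sketched below. It is a simple regex-based helper, not a real dhcpd.conf parser; the function name and the trimmed-down sample text are made up for illustration, and it assumes the failover block is closed with a brace as in a normal dhcpd.conf.

```python
import re

# A failover declaration block: failover peer "name" { ... }
# (The pool's one-line reference, failover peer "name";, has no brace
# and is therefore not matched.)
FAILOVER_BLOCK = re.compile(r'failover\s+peer\s+"([^"]+)"\s*\{(.*?)\}', re.DOTALL)
ROLE = re.compile(r"^\s*(primary|secondary)\s*;", re.MULTILINE)

def failover_roles(conf_text):
    """Map each failover peer name to 'primary', 'secondary', or None."""
    roles = {}
    for name, body in FAILOVER_BLOCK.findall(conf_text):
        m = ROLE.search(body)
        roles[name] = m.group(1) if m else None
    return roles

# Trimmed-down sample for illustration only.
SAMPLE_CONF = '''
failover peer "dhcp0" {
  primary;
  address 10.61.32.250;
  port 519;
}
subnet 10.61.32.0 netmask 255.255.252.0 {
  pool {
    failover peer "dhcp0";
    range 10.61.32.10 10.61.32.20;
  }
}
'''

print(failover_roles(SAMPLE_CONF))
```

On each box you would pass in the contents of /var/dhcpd/etc/dhcpd.conf instead of the sample, for example with open("/var/dhcpd/etc/dhcpd.conf").read(). If both boxes report the same role, that is the symptom richardsc described.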

Eugene

Today I tried to set it up and hit the same problem. A quick tcpdump showed how it can be fixed: I enabled TCP ports 519 and 520 from LAN net to the LAN interface (this rule will be replicated to the passive one) and restarted dhcpd on the active one, and that was it. It is working properly.
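To see whether the failover ports are reachable between the boxes without running tcpdump, something like the following Python sketch could be used from any machine with Python on that LAN. The function name is hypothetical, and it only tests that a TCP connection can be opened; it says nothing about whether the failover protocol itself is healthy.

```python
import socket

def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, with the addresses and ports from acherman's config earlier in the thread (an assumption; substitute your own), tcp_port_open("10.61.32.250", 519) would test the primary and tcp_port_open("10.61.32.251", 520) the secondary. A False result while dhcpd is running on the target would be consistent with a firewall rule blocking the port.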

blackb1rd

I also had problems getting this to work with the pfSense 2.0 snapshots from May 9th and May 11th. After changing the line in services.inc (and removing another one) as mentioned by richardsc, it worked for me. Somehow the skew counter isn't working correctly. I'm not sure exactly how it works, but I know both routers have the exact same time and timezone set. Seems to me there is some kind of bug.

itsmorefun

Same issue with "2.0-BETA4 built on Mon Aug 2 21:49:34 EDT 2010 FreeBSD 8.1-RELEASE"

Does anyone have dhcp-failover working?

Thanks

jimp

It works fine if you have valid configurations. The problem is that certain invalid configurations can trick the logic and make it not work.

The usual reason is that someone is using Proxy ARP VIPs which sync to the secondary as empty, which triggers a bug in the dhcp server logic that makes it think it's primary when it's not. I thought I committed a fix for that a week or two ago.

If you still have the bug, I need copies of /var/dhcpd/etc/dhcpd.conf from the primary and secondary, along with at least the <virtualip> section of the primary and secondary config.xml files.

The "skew" on the VIPs is used to trigger the logic for slave, so if you have manually set the skew on the secondary to less than 20, that would also break it.
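To make the skew rule concrete, here is a purely illustrative Python sketch of the decision as described in this post. It is not the actual services.inc code (which is PHP); the function name, the default threshold of 20, and the treatment of a skew of exactly 20 are assumptions taken from the "less than 20" remark above.

```python
def role_from_skew(advskew, threshold=20):
    """Illustrative only: pick the DHCP failover role from a CARP VIP skew.

    A skew below the (assumed) threshold is treated as primary,
    anything at or above it as secondary.
    """
    return "secondary" if advskew >= threshold else "primary"

# A backup box left with a high skew is treated as secondary.
print(role_from_skew(100))
# A backup box whose skew was manually lowered below the threshold
# would be treated as primary, so both boxes end up claiming primary.
print(role_from_skew(10))
# The master normally has a skew of 0.
print(role_from_skew(0))
```

Under this reading, lowering the skew on the backup makes it look like a primary, which would produce the same "both sides primary" dhcpd.conf that was reported earlier in the thread.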