DHCP Failover with CARP - Both in Recover, Peer Unknown State



  • Hey everyone, I have moved over to pfSense from m0n0wall to take advantage of failover (and a couple of other items).  I have been using a good CARP setup (2 boxes, failover, no load balancing, many VLANs) on some temp hardware for a bit, and I'm getting ready to deploy permanent hardware.  Of course I am doing a ton of testing beforehand, and I am still fighting to get the DHCP server to fail over.

    I am running 1.2.3RC1 right now, but saw the same thing with 1.2.2.  Again, my CARP config is good.  The failover works perfectly every time.  My DHCP config is relatively good - each server hands out leases on its own when set up.  But once I try to enable peer failover, no more leases are handed out.  In the DHCP status I see both servers show their state as recover and their peer state as unknown.

    I have been reading several posts regarding this same issue, including this one: http://forum.pfsense.org/index.php/topic,7986.0.html.  I have read through the CARP document here http://doc.pfsense.org/index.php/Configuring_pfSense_Hardware_Redundancy_(CARP)#Setting_DHCP_Server_to_use_CARP_LAN_IP_Address and I have watched and stepped slowly through the demo here http://files.chi.pfsense.org/mirror/tutorials/carp/carp-cluster-new.htm.  I still have this issue.

    As others have said, I do not have short lease times and generally don't have outages/failures for any length of time, so I can live with having to manually enable DHCP on the standby machine if needed, but it sure would be nice to have full failover and true backup hardware.  Any help is appreciated on this.

    Thanks in advance.



  • I've got this running on a couple of CARP setups. I do remember having to fiddle a bit to get it to calm down and sync. I don't recall exactly, but try stopping both DHCP servers, then start the primary, wait a minute, start the backup dhcp service. Just for reference, here is the basic configuration off of a cluster where this is working. It's a very simple config. The cluster is on 1.2.2.

    LAN CARP IP: 192.168.1.1
    Master-
    LAN IP 192.168.1.2
    DHCP config
    Range 100-150
    DNS server: 192.168.1.1
    Gateway:192.168.1.1
    Failover Peer: 192.168.1.3

    Backup-
    LAN IP 192.168.1.3
    DHCP config
    Range 100-150
    DNS server: 192.168.1.1
    Gateway:192.168.1.1
    Failover Peer: 192.168.1.2



  • Thanks for the reply.  When you say stop and start the DHCP service, do you mean to just disable/enable via the webpage config?  I agree the config is (should be) simple.  Your info matches what I have done.  I have some cables to build but I will work with it again in a bit.



  • I just went into status, services and stopped DHCP from there.



  • Gotcha.  I will try that in a bit and let you know.

    Thanks again.



  • Ummmm… I'm not even able to stop the service.  haha  I go to Status -> Services and see the dhcpd service listed with the restart and stop buttons beside it.  I click the stop button, and a couple of seconds later the top of the window says dhcpd has been stopped, but the status beside the service does not change - see screenshot.  I thought perhaps it was an issue with 1.2.3RC1, so I downgraded both boxes to 1.2.2 and the same thing happens.  I tried IE7 and Firefox with the same results (trying all the little things).  Do you know if/how I can try the same thing from the command line?

    Thanks.




  • Looking at your screenshot with NTP stopped makes me wonder if you are working on your cluster off-line. If so, be aware that there is a gotcha that breaks your DHCP sync if the time on the two boxes is not synchronized. You can verify this by trying to start dhcpd from a console/ssh prompt. Something like:

    /usr/local/sbin/dhcpd -cf /var/dhcpd/etc/dhcpd.conf
    

    should tell you what's happening. Add -f to that to run in the foreground.



  • You are right, they are offline on the test bench.  I will set up a test NTP server for them to sync to.



  • Apparently the DHCP status interface is somewhat flaky.

    The DHCP leases status appears to update slowly/is incorrect.  Look at the logs instead, as those appear to be correct.

    The service status of dhcpd is also incorrect.  The stop and start buttons DO work, but the dhcpd service is always shown as started.  Use top from a shell and show user dhcpd to see this behavior.

    I had to reboot the firewalls after changing dhcpd settings to get things to work correctly.  Also, I noticed that the failover IP address must be on the same subnet as the pool or the code will set both servers to secondary.

    Check your configs in /var/dhcpd/etc/dhcpd.conf
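
    The same-subnet point above can be sanity-checked away from the firewall. This is only a rough sketch using Python's standard ipaddress module; the function name is made up for illustration, and the sample addresses are the ones from the working 192.168.1.0/24 cluster described earlier in the thread.

```python
import ipaddress


def peer_in_subnet(peer_ip: str, network: str, netmask: str) -> bool:
    """Return True if the failover peer address lies inside the DHCP subnet."""
    # strict=True raises ValueError if `network` has host bits set for this mask,
    # which is itself a hint that the subnet declaration is wrong.
    net = ipaddress.ip_network(f"{network}/{netmask}", strict=True)
    return ipaddress.ip_address(peer_ip) in net


if __name__ == "__main__":
    # Peer on the same subnet as the pool: expected to be fine.
    print(peer_in_subnet("192.168.1.3", "192.168.1.0", "255.255.255.0"))
    # Peer on an unrelated subnet: the case described above as causing trouble.
    print(peer_in_subnet("10.0.0.3", "192.168.1.0", "255.255.255.0"))
```

    Feed it the "peer address" from the failover section and the subnet/netmask from the matching subnet declaration in dhcpd.conf.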



  • Thanks for your input Josh.  I have not been able to get this working.  I do have the time between these machines synced now, and as you say the dhcpd status never changes.  haha  I have rebooted these boxes a number of times - even made the changes to dhcp on each box and rebooted the primary, then the secondary a few seconds later (reboots take the same time) to make the primary come online first.  Same issue right after boot.

    My addressing is okay - boxes are 10.61.32.250/24 and 251, CARP address is 254 (for this testing interface), pool is from 10 to 20.  dhcpd.conf snip:

    failover peer "dhcp0" {
      primary;
      address 10.61.32.250;
      port 519;
      peer address 10.61.32.251;
      peer port 520;
      max-response-delay 10;
      max-unacked-updates 10;
      split 128;
      mclt 600;
    |
    |
    |
    |
    subnet 10.61.32.0 netmask 255.255.252.0 {
    	pool {
    		option domain-name-servers 10.61.32.254;
    		deny dynamic bootp clients;
    		failover peer "dhcp0";
    		range 10.61.32.10 10.61.32.20;
    	}
    	option routers 10.61.32.254;
    	option domain-name-servers 10.61.32.254;
    }
    


  • Check the dhcpd log files on both ends to see if dhcpd is complaining about anything.



  • The only bad log entries I see are like this:

    dhcpd: failover peer dhcp0: I move from recover to startup
    dhcpd: failover peer dhcp0: I move from startup to recover

    And when a request is made I see entries like this:

    dhcpd: DHCPREQUEST for 10.61.32.20 from 00:0b:db:7e:8e:5d via em1: not responding (recovering)



  • Bumping this thread, as I've been facing the same issues noted here.

    My setup:

    • 2x pfsense boxes doing CARP on 4 separate vlans.
    • DHCP configured on each VLAN

    Enabling failover DHCP, I would just get the same log messages as posted by acherman … and then dhcpd will not hand out leases while it's in the recover state!

    What I ended up finding:

    • Check your dhcpd.conf file (/var/dhcpd/etc/dhcpd.conf) on your secondary pfSense server. I found that it was not properly receiving the "secondary" designation in the failover section.

    It seems this designation is assigned when the service is started / config is generated by the file /etc/inc/services.inc in the section beginning at line 139.

    I've not yet analyzed the code to try and figure out if there's a bug here … I think that there may be an issue with how the $skew value is being determined.

    I needed to get this working ASAP, so on my secondary firewall, I've just forced it to always be a secondary by modifying line 156 to be: $type = "secondary"; (it was always being set to primary, even though it shouldn't be…)

    Finally, I had to manually kill dhcpd on each box, remove the dhcpd.leases file on both, and then start dhcpd on the primary, then the secondary. After about 5 minutes, DHCP leases status was "normal", and now they've been running fine for several hours, after doling out nearly 100 leases.



  • Question for richardsc- do you have any 'other' type VIPs? I had an issue like this ages back, and it was due to the other VIPs throwing off the master/backup check. I also used a cheap hack to fix the issue. The problem went away when I only had CARP VIPs.



  • @dotdash:

    Question for richardsc- do you have any 'other' type VIPs? I had an issue like this ages back, and it was due to the other VIPs throwing off the master/backup check. I also used a cheap hack to fix the issue. The problem went away when I only had CARP VIPs.

    Nope. I only have CARP VIPs.

    If I can find time this week, I'm going to try and investigate further to find the root cause of the problem.



  • Just for fun I upgraded both boxes to 1.2.3 RC3 today and tried this again.  I still cannot get it to work properly.  I may resort to the mod mentioned above to get this working.



  • :(  Still no go.  I cannot get DHCP failover working.  I have accepted the fact that it is broken and I will have to manually start DHCP on the backup unit during a failure.  :'(



  • Check the dhcpd.conf on both boxes and verify the main is set to primary and the backup is set to secondary.
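
    If you copy both dhcpd.conf files to one machine, the primary/secondary check can be scripted. A minimal sketch, assuming the failover sections look like the ones posted in this thread (a `failover peer "name" { ... }` block with no nested braces); the function names are hypothetical and the regex is not a full dhcpd.conf parser.

```python
import re

# Matches a failover declaration block, e.g.  failover peer "dhcp0" { ... }
# The pool-level reference  failover peer "dhcp0";  has no brace, so it is skipped.
FAILOVER_RE = re.compile(r'failover\s+peer\s+"([^"]+)"\s*\{(.*?)\}', re.DOTALL)


def failover_roles(conf_text: str) -> dict:
    """Map each failover peer name to its declared role, or None if none is declared."""
    roles = {}
    for name, body in FAILOVER_RE.findall(conf_text):
        match = re.search(r'^\s*(primary|secondary)\s*;', body, re.MULTILINE)
        roles[name] = match.group(1) if match else None
    return roles


def role_conflicts(conf_a: str, conf_b: str) -> list:
    """Return peer names where the two boxes are not exactly one primary and one secondary."""
    a, b = failover_roles(conf_a), failover_roles(conf_b)
    return sorted(
        name for name in set(a) | set(b)
        if {a.get(name), b.get(name)} != {"primary", "secondary"}
    )
```

    Read each file's contents into a string and pass both to role_conflicts(); an empty list means every failover peer has one primary and one secondary, and any name it returns is worth inspecting by hand.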



  • Today I tried to set it up and hit the same problem. A quick tcpdump showed how it can be fixed: I just allowed TCP ports 519 and 520 from LAN net to the LAN interface (this rule will be replicated to the passive box), restarted dhcpd on the active box, and that was it. It is working properly.
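
    To confirm that a rule like this actually lets the failover traffic through, a plain TCP connect test is enough. A small sketch using only Python's standard socket module; the function name is illustrative, and it is meant to be run from any host on the LAN that has Python, not necessarily on the firewall itself.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

    Going by the dhcpd.conf excerpts posted in this thread, the primary listens on 519 and the secondary on 520, so tcp_port_open("<primary LAN IP>", 519) and tcp_port_open("<secondary LAN IP>", 520) are the two checks of interest. A False result while dhcpd is running suggests the firewall rule is missing or not matching.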



  • I also had problems getting this to work with the pfSense 2.0 snapshots from May 9th and May 11th. After changing the line in services.inc (and removing another one) as mentioned by richard, it worked for me. Somehow the skew counter isn't working correctly. I'm not sure exactly how it works, but I know both routers have the exact same time and timezone set. It seems to me there is some kind of bug.



  • Same issue with "2.0-BETA4 built on Mon Aug 2 21:49:34 EDT 2010 FreeBSD 8.1-RELEASE"

    Does anyone have DHCP failover working?

    Thanks


  • It works fine if you have valid configurations, the problem is that certain invalid configurations can trick the logic to make it not work.

    The usual reason is that someone is using Proxy ARP VIPs which sync to the secondary as empty, which triggers a bug in the dhcp server logic that makes it think it's primary when it's not. I thought I committed a fix for that a week or two ago.

    If you still have the bug, I need copies of /var/dhcpd/etc/dhcpd.conf from the primary and secondary, along with at least the <virtualip> section of the primary and secondary config.xml files.

    The "skew" on the VIPs is used to trigger the logic for slave, so if you have manually set the skew on the secondary to less than 20, that would also break it.
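
    The skew remark above can be turned into a quick check on the secondary's config. This is a sketch only: the threshold of 20 is taken from that remark rather than from the pfSense source, the function name is hypothetical, and the element names (vip, mode, interface, advskew) are the ones that appear in the config.xml excerpts posted in this thread. It expects a well-formed <virtualip> section as input.

```python
import xml.etree.ElementTree as ET

# Assumption: threshold taken from the "less than 20" remark above,
# not verified against the pfSense code.
SKEW_THRESHOLD = 20


def low_skew_vips(virtualip_xml: str, threshold: int = SKEW_THRESHOLD) -> list:
    """List (interface, advskew) for CARP VIPs whose advskew is below the threshold."""
    root = ET.fromstring(virtualip_xml)
    flagged = []
    for vip in root.iter("vip"):
        if vip.findtext("mode") != "carp":
            continue  # only CARP VIPs carry a meaningful advskew
        # Treat a missing or empty <advskew> as 0.
        skew = int((vip.findtext("advskew") or "0").strip())
        if skew < threshold:
            flagged.append((vip.findtext("interface"), skew))
    return flagged
```

    Run against the secondary's <virtualip> section, an empty list is what you would hope for; any entry it returns is a CARP VIP whose skew might make that box look like the primary.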



  • Hello,
    @jimp:

    If you still have the bug, I need copies of /var/dhcpd/etc/dhcpd.conf from the primary and secondary, along with at least the <virtualip> section of the primary and secondary config.xml files.

    pfSense LEFT dhcpd.conf:

    option domain-name "localdomain";
    option ldap-server code 95 = text;
    option domain-search-list code 119 = text;

    default-lease-time 7200;
    max-lease-time 86400;
    log-facility local7;
    ddns-update-style none;
    one-lease-per-client true;
    deny duplicates;
    ping-check true;
    authoritative;
    failover peer "dhcp0" {
     primary;
     address 192.168.3.1;
     port 519;
     peer address 192.168.3.2;
     peer port 520;
     max-response-delay 10;
     max-unacked-updates 10;
     split 128;
     mclt 600;

    load balance max seconds 3;
    }
    authoritative;
    failover peer "dhcp1" {
     primary;
     address 192.168.4.1;
     port 519;
     peer address 192.168.4.2;
     peer port 520;
     max-response-delay 10;
     max-unacked-updates 10;
     split 128;
     mclt 600;

    load balance max seconds 3;
    }
    subnet 192.168.3.0 netmask 255.255.255.0 {
    pool {
    option domain-name-servers 192.168.3.10;
    deny dynamic bootp clients;
    failover peer "dhcp0";
    range 192.168.3.100 192.168.3.199;
    }
    option routers 192.168.3.10;
    option domain-name-servers 192.168.3.10;

    }
    subnet 192.168.4.0 netmask 255.255.255.0 {
    pool {
    option domain-name-servers 192.168.4.10;
    deny dynamic bootp clients;
    failover peer "dhcp1";
    range 192.168.4.100 192.168.4.199;
    }
    option routers 192.168.4.10;
    option domain-name-servers 192.168.4.10;

    }

    pfSense RIGHT dhcpd.conf:

    option domain-name "localdomain";
    option ldap-server code 95 = text;
    option domain-search-list code 119 = text;

    default-lease-time 7200;
    max-lease-time 86400;
    log-facility local7;
    ddns-update-style none;
    one-lease-per-client true;
    deny duplicates;
    ping-check true;
    authoritative;
    failover peer "dhcp0" {
     secondary;
     address 192.168.3.2;
     port 520;
     peer address 192.168.3.1;
     peer port 519;
     max-response-delay 10;
     max-unacked-updates 10;
     mclt 600;

    load balance max seconds 3;
    }
    authoritative;
    failover peer "dhcp1" {
     secondary;
     address 192.168.4.2;
     port 520;
     peer address 192.168.4.1;
     peer port 519;
     max-response-delay 10;
     max-unacked-updates 10;
     mclt 600;

    load balance max seconds 3;
    }
    subnet 192.168.3.0 netmask 255.255.255.0 {
    pool {
    option domain-name-servers 192.168.3.10;
    deny dynamic bootp clients;
    failover peer "dhcp0";
    range 192.168.3.100 192.168.3.199;
    }
    option routers 192.168.3.10;
    option domain-name-servers 192.168.3.10;

    }
    subnet 192.168.4.0 netmask 255.255.255.0 {
    pool {
    option domain-name-servers 192.168.4.10;
    deny dynamic bootp clients;
    failover peer "dhcp1";
    range 192.168.4.100 192.168.4.199;
    }
    option routers 192.168.4.10;
    option domain-name-servers 192.168.4.10;

    }

    pfSense LEFT config.xml:

    <virtualip>
      <vip>
        <mode>carp</mode>
        <interface>wan</interface>
        <vhid>1</vhid>
        <advskew>0</advskew>
        <password>wanpass</password>
        <descr></descr>
        <type>single</type>
        <subnet_bits>24</subnet_bits>
        <subnet>192.168.1.50</subnet>
      </vip>
      <vip>
        <mode>carp</mode>
        <interface>lan</interface>
        <vhid>2</vhid>
        <advskew>0</advskew>
        <password>lanpass</password>
        <descr></descr>
        <type>single</type>
        <subnet_bits>24</subnet_bits>
        <subnet>192.168.3.10</subnet>
      </vip>
      <vip>
        <mode>carp</mode>
        <interface>opt2</interface>
        <vhid>3</vhid>
        <advskew>0</advskew>
        <password>wifipass</password>
        <descr></descr>
        <type>single</type>
        <subnet_bits>24</subnet_bits>
        <subnet>192.168.4.10</subnet>
      </vip>
    </virtualip>

    pfSense RIGHT config.xml:

    <virtualip>
      <vip>
        <mode>carp</mode>
        <interface>wan</interface>
        <vhid>1</vhid>
        <advskew>100</advskew>
        <password>wanpass</password>
        <descr></descr>
        <type>single</type>
        <subnet_bits>24</subnet_bits>
        <subnet>192.168.1.50</subnet>
      </vip>
      <vip>
        <mode>carp</mode>
        <interface>lan</interface>
        <vhid>2</vhid>
        <advskew>100</advskew>
        <password>lanpass</password>
        <descr></descr>
        <type>single</type>
        <subnet_bits>24</subnet_bits>
        <subnet>192.168.3.10</subnet>
      </vip>
      <vip>
        <mode>carp</mode>
        <interface>opt2</interface>
        <vhid>3</vhid>
        <advskew>100</advskew>
        <password>wifipass</password>
        <descr></descr>
        <type>single</type>
        <subnet_bits>24</subnet_bits>
        <subnet>192.168.4.10</subnet>
      </vip>
    </virtualip>


  • @itsmorefun:

    editing…

    Those came through in e-mail before you edited them out, and it looks like you might have hit a bug that I fixed the other day that made them both show up as secondary instead of primary, but that shouldn't have made them in recover/peer-known state, but both in communications-interrupted state. Should be OK in current snapshots though.



  • @jimp:

    @itsmorefun:

    editing…

    Those came through in e-mail before you edited them out, and it looks like you might have hit a bug that I fixed the other day that made them both show up as secondary instead of primary, but that shouldn't have made them in recover/peer-known state, but both in communications-interrupted state. Should be OK in current snapshots though.

    Ok,

    Sorry my pfsense crashed… I am retesting :-).



  • Everything works now.

    Thanks

