DHCP load-balancing (failover enabled) causing hostnames to be unavailable for resolution by Unbound

  • I have a moderately large lab environment (>100 hosts) that I recently transitioned to using an HA pfsense (2.4.4_1) cluster for DHCP & local name resolution services.

    In dhcpd.conf, I see that the failover peers default to a load-balancing arrangement (via the "split 128;" setting). This rather surprised me, as all the other routing features in an HA cluster are very much an active-standby relationship.

    Thus, with each dhcpd peer only responding to half of the local hosts, Unbound on each node is stuck with only half the picture and cannot "authoritatively" resolve all the local host names.

    My preference was to have the local hosts simply use the cluster's LAN VIP as their DNS to keep things simple and reliable via cluster failover.

    Am I missing something stupid, or is this an oversight in the behind-the-scenes failover dhcpd configuration?

    Any thoughts on the best way to work-around this?
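The "split 128;" behaviour described in this post can be sketched roughly as follows. The ISC dhcpd.conf documentation describes hashing the client identification to a value from 0 to 255 and using the split value to decide how many of those values the primary answers. This is only an illustration under assumptions: the function names are hypothetical, and SHA-256 stands in for the real load-balancing hash, so the bucket a given MAC lands in here will not match what dhcpd computes.

```python
import hashlib


def bucket(client_id: str) -> int:
    """Map a client identifier to a bucket in 0..255.

    Assumption: SHA-256 is a stand-in here; dhcpd uses its own hash.
    """
    return hashlib.sha256(client_id.lower().encode()).digest()[0]


def responsible_peer(client_id: str, split: int = 128) -> str:
    """Return which failover peer would answer this client.

    Buckets below `split` go to the primary, the rest to the secondary,
    so split=128 gives roughly half the clients to each peer.
    """
    if not 0 <= split <= 256:
        raise ValueError("split must be between 0 and 256")
    return "primary" if bucket(client_id) < split else "secondary"


if __name__ == "__main__":
    # Made-up MAC addresses, purely for illustration.
    macs = [f"00:11:22:33:44:{i:02x}" for i in range(100)]
    share = sum(responsible_peer(m) == "primary" for m in macs)
    print(f"primary answers {share} of {len(macs)} clients at split 128")
```

With a split of 128 each peer only learns the hostnames of the clients it answers itself, which is consistent with each Unbound instance resolving only about half of the hosts.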

  • I'm battling this same issue on 2.4.4 and it is driving me crazy! All the other HA features work great for me.

    I have 2 ESXi nodes with a pfSense node on each with a CARP VIP for each VLAN.

    Both my DHCP servers are in the 'normal' state. Each DHCP server is assigning IPs to half of my VMs. Each Unbound instance can only resolve half the hosts; for the rest, nslookup returns:
    ** server can't find <hostname>: NXDOMAIN

    I'm assuming the issue is caused by DHCP only registering hosts to the local Unbound instance, and not the other one. I'm also assuming there is no DNS zone replication in unbound :(

    My thinking at this point is to either write a replication script of my own, or move DNS+DHCP to a pair of Windows servers. What were you using prior to pfSense?

  • Rebel Alliance Developer Netgate

    Unfortunately this is 100% on the ISC DHCP daemon. The failover mechanism doesn't (always?) share hostnames between peers, so neither node knows about all the hostnames, and neither can publish them all in DNS.

    The best workaround is to have a real BIND or other suitable DNS server and then have DHCP DNS registration set up that way, rather than relying on the firewall to handle it.

  • LAYER 8 Global Moderator

    My take on resolving DHCP clients, if you want or need to resolve them: you might as well set up a reservation for the client so you always know what its IP is and it will not change. Then you can just set up an actual DNS entry for it, rather than having to deal with DHCP registering anything.

    Just another way to skin the cat ;)

    If you're running a lab with over 100 devices, and you want to be able to resolve their names, it's probably time to run real authoritative name services and have either dhcpd register the names, or even the clients themselves.

  • After a lot of testing I think I fixed my resolutions. For me it was NTP!

    One of my nodes was out by a good margin. I know this causes issues with dhcpd. I think this put the cluster in a weird state where they both handed out IPs but would not exchange hostname data. I fixed this yesterday but it wasn't until today when I restarted VMs en masse that resolutions started working.

    @foobert maybe worth checking the time on each pfsense node?

    Thanks for the informative replies @jimp and @johnpoz . I understand the mechanism a lot better now!

  • @jimp -- Thanks for the confirmation on what I'm seeing. I suppose I should follow up with ISC.

    @johnpoz I completely respect that point of view on reservations. It's just not realistic when I have a dozen worker bees setting up/tearing down stuff every day. They need autonomy w/o getting me involved constantly.

    At this point, I'm strongly considering going back to dnsmasq; it worked flawlessly for this. I may absorb the headache of running BIND, but I'm not sure it's really worth the HA benefit that prompted the change in the first place. "Don't fix what isn't broken" ¯\_(ツ)_/¯