DHCP relay stopped working, started blocking--should I report?
Recently I've been fiddling with DHCP servers, looking for ways to keep DHCP available if the hypervisor cluster stops for some reason--it's a small cluster, so there isn't much redundancy.
I settled on a bare-metal domain controller, now in charge of DHCP, with pfSense as the relay and DHCP snooping enabled on the core switches; a couple of media players had been randomly becoming DHCP servers. IDK what triggers it yet.
On Windows Server I set up DHCP scopes without the fancy settings like Policies and stuff, and without the DHCP relay agent found in RRAS. As far as I understand, that function is already covered by the switches (or pfSense), since they do the same job; plus Windows Server is only in one VLAN while pfSense has access to all of them…and then some.
That's the setup, super basic.
This was working fine, though. I think it was yesterday (I haven't slept) that all clients except those directly on the server's network lost their addresses and couldn't get a new one. I went to the Windows Server first and there were no apparent issues and nothing in the Event Viewer; to be sure, I reset the network settings [netsh winsock reset]. Next I tried the switches; they weren't running at L3, so the most I could do was disable the ports of the messy clients and disable DHCP snooping. Still nothing. Nothing in the logs, nothing in the log server.
Next I tried pfSense, and in the live filter there were tons of blocked entries from the DHCP server to 255.255.255.255:68/UDP. I thought maybe it was some cache or something out of sync, so I rebooted the firewall, but that didn't fix it. I tried adding a rule to accept the broadcast traffic but it wasn't having it.
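For anyone trying to make sense of blocked entries like those: below is a minimal sketch of the reply-addressing rules as I read them in RFC 2131 (section 4.1), i.e. where a DHCP server is expected to send an OFFER/ACK/NAK depending on the fields in the request. The function name and the sample addresses are made up for illustration; this is not what Windows Server or pfSense actually runs, just a way to compare what you see in a capture against what the RFC describes.

```python
from ipaddress import IPv4Address

ZERO = IPv4Address("0.0.0.0")
BROADCAST = IPv4Address("255.255.255.255")

def expected_reply_destination(giaddr, ciaddr, yiaddr, broadcast_flag, is_nak=False):
    """Return (ip, udp_port) a server should reply to, per RFC 2131 sec. 4.1.

    Hypothetical helper for reading captures; all arguments are the
    corresponding BOOTP header fields from the client's request / server reply.
    """
    giaddr, ciaddr, yiaddr = (IPv4Address(a) for a in (giaddr, ciaddr, yiaddr))
    if giaddr != ZERO:
        # Relayed request: reply goes to the relay agent on the server port.
        return str(giaddr), 67
    if is_nak:
        # With no relay involved, DHCPNAK is always broadcast.
        return str(BROADCAST), 68
    if ciaddr != ZERO:
        # Client already has an address (e.g. renewing): unicast to it.
        return str(ciaddr), 68
    if broadcast_flag:
        # Client asked for a broadcast reply.
        return str(BROADCAST), 68
    # Otherwise unicast to the address being offered.
    return str(yiaddr), 68
```

Read that way, a reply to a relayed request would be expected to head for the relay's address on port 67, so replies going to 255.255.255.255:68 are at least worth a second look in the capture (assuming giaddr was actually set in the request, which I can't confirm without the capture).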
I launched Wireshark and selected my two interfaces, both using DHCP. One already had an IP address because it was on the server's network; the other didn't. I turned the second interface off and back on and watched the traffic: I could actually see the request being relayed to the other VLAN, and the server was trying desperately to respond but it was being blocked. I forgot to save the capture, my bad.
Before enabling routing on a switch, I went back to pfSense to try my luck toggling things on and off; sometimes that resets things. It didn't, but I did stumble on the fix when I disabled "Append circuit ID and agent ID to requests".
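As far as I can tell, that setting corresponds to DHCP option 82 (Relay Agent Information, RFC 3046), where sub-option 1 is the circuit ID and sub-option 2 is the remote (agent) ID. If you want to check whether a captured packet carries it, here is a rough sketch that walks the options of a raw DHCP payload (the UDP payload, not the whole frame). The function names and the sample values `vlan20` / `pfsense` are placeholders I made up, not anything pfSense actually sends.

```python
MAGIC_COOKIE = bytes([99, 130, 83, 99])  # marks the start of DHCP options
BOOTP_HEADER_LEN = 236                   # fixed BOOTP header before the cookie

def parse_dhcp_options(payload: bytes) -> dict:
    """Map option code -> raw value bytes (a repeated code overwrites the earlier one)."""
    start = BOOTP_HEADER_LEN
    if payload[start:start + 4] != MAGIC_COOKIE:
        raise ValueError("not a DHCP payload (magic cookie missing)")
    options = {}
    i = start + 4
    while i < len(payload):
        code = payload[i]
        if code == 255:          # End option
            break
        if code == 0:            # Pad option, single byte
            i += 1
            continue
        if i + 1 >= len(payload):
            break                # truncated option, stop quietly
        length = payload[i + 1]
        options[code] = payload[i + 2:i + 2 + length]
        i += 2 + length
    return options

def parse_relay_agent_info(data: bytes) -> dict:
    """Split the value of option 82 into its sub-options."""
    names = {1: "circuit_id", 2: "remote_id"}
    subs = {}
    i = 0
    while i + 2 <= len(data):
        sub, length = data[i], data[i + 1]
        subs[names.get(sub, sub)] = data[i + 2:i + 2 + length]
        i += 2 + length
    return subs

def sample_payload() -> bytes:
    """Hypothetical DHCPDISCOVER with option 82 appended, for trying the parser."""
    circuit, remote = b"vlan20", b"pfsense"  # made-up IDs
    opt82 = bytes([1, len(circuit)]) + circuit + bytes([2, len(remote)]) + remote
    return (bytes(BOOTP_HEADER_LEN) + MAGIC_COOKIE
            + bytes([53, 1, 1])              # option 53: message type = DISCOVER
            + bytes([82, len(opt82)]) + opt82
            + bytes([255]))
```

To try it: `opts = parse_dhcp_options(sample_payload())`, then `parse_relay_agent_info(opts[82])` if `82 in opts`. Comparing a relayed request with the option on against one with it off would show exactly what the relay adds.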
I understand relaying DHCP over VPN tunnels or RADIUS better than on L2, ironically where it belongs. I spent some time on the Microsoft docs site, but it's so ambiguous that it made things worse.
Was it misconfigured? Why did it work before, and why does it still work (option 82, circuit/agent ID) on the switches? Was it merely tolerating my ineptitude until it got fed up? Or is it really a bug? AD aside, Windows Server sucks at any of its features, including DHCP, but this time I think it was for once working correctly. I should've saved the capture, but if anyone tells me it's not supposed to do that, I'll try reproducing it so I can upload it.
I nearly forgot--IDK if it's related, given the complexity (read: mess): days earlier DHCP relay had also gone out, but that time I was testing OSPF (on a switch, not pfSense). I was about to remove most interfaces from this pfSense instance and leave it in the network just for HAProxy, while another instance was already taking its place, using the same subnets but with a different IP address in each subnet--i.e., competing firewalls on each VLAN. DHCP only pointed to the newer one, though, and the new firewall wasn't broadcasting DHCP. This was way more complicated, so I wasn't that surprised when I saw in the pfSense system logs that DHCP relay had crashed and wouldn't start.