DHCP weirdness after 23.09 upgrade

mathiasringhof

I've created a Redmine as a couple of people on reddit are seeing similar behaviour so it might be something a bit more common and looking into (i.e. not some stupidity with my config, at least hopefully):
https://redmine.pfsense.org/issues/15011

SteveITS

@mathiasringhof I don’t know another non-business setting I can look at with pfSense and multiple IOT type devices. It’s quite possible they just ignore the lease time and do whatever they want. And Kea might log that differently. Windows Server we set as the DHCP server or a static IP so none there.

Looking at ISC logs before an upgrade to 23.09 would be informative I think, to compare. Otherwise we (I) may be assuming the logging is different.

mathiasringhof

@SteveITS Luckily while the IoT device is the one losing the IP address the biggest offender right now is a Windows Server 2019 so it's not just a "IoT device has stupid implementation" although that might be part of it.

I've added a comparison log to the red mine where one can see that with 23.05 the same server just issues a couple of messages in 24h vs thousands.

In addition, the screenshot in the first post should also give a good indicator that as soon as you upgrade my installation to 23.09 there is a significant jump in log volume.

SteveITS

@mathiasringhof in my head I guess I was thinking, is it a logging issue or a device issue but just logging wouldn’t explain the devices losing leases as described.

A better way to say it is the 5-10 minute lease renewal time I saw (vs 1 day) may be a red herring but the repeated requests every 10m is the behavior change, right?

We usually set up with a RAM disk anyway, but otherwise a busy network could have a lot of extra logging I/O which isn’t ideal for eMMC.

mathiasringhof

@SteveITS depending on the device it's more than every 10 minutes. Look at an excerpt from the Gist linked in the Redmine issue:

2023-11-17T11:55:22.733+01:00	DHCPREQUEST for 192.168.50.31 from ce:3c:e8:92:d2:fa via igb3
2023-11-17T11:55:22.734+01:00	DHCPACK on 192.168.50.31 to ce:3c:e8:92:d2:fa via igb3
2023-11-17T11:55:23.734+01:00	DHCPACK on 192.168.50.31 to ce:3c:e8:92:d2:fa via igb3
2023-11-17T11:55:23.734+01:00	DHCPREQUEST for 192.168.50.31 from ce:3c:e8:92:d2:fa via igb3
2023-11-17T11:55:25.734+01:00	DHCPACK on 192.168.50.31 to ce:3c:e8:92:d2:fa via igb3
2023-11-17T11:55:25.734+01:00	DHCPREQUEST for 192.168.50.31 from ce:3c:e8:92:d2:fa via igb3
2023-11-17T11:55:42.735+01:00	DHCPACK on 192.168.50.31 to ce:3c:e8:92:d2:fa via igb3
2023-11-17T11:55:42.735+01:00	DHCPREQUEST for 192.168.50.31 from ce:3c:e8:92:d2:fa via igb3
2023-11-17T11:55:43.735+01:00	DHCPACK on 192.168.50.31 to ce:3c:e8:92:d2:fa via igb3
2023-11-17T11:55:43.735+01:00	DHCPREQUEST for 192.168.50.31 from ce:3c:e8:92:d2:fa via igb3
2023-11-17T11:55:44.735+01:00	DHCPACK on 192.168.50.31 to ce:3c:e8:92:d2:fa via igb3

vs this is the whole day under 23.05.1 and expected behaviour with the default lease time (7200 seconds = 2h = every 1h the client asks):
Link

SteveITS

@mathiasringhof I think we’re talking about the same thing…the device requests the renewal at some point, be it a few minutes or several hours, but there’s a cycle of several to many, many retries each time it does.

IOW I doubt it’s the device getting a 2 second lease. On your Server ipconfig /all will show you the lease expiration IIRC.

mathiasringhof

Looks like we found something, see the Redmine issue. The binary shipped in 23.09 causes an issue. Running assumption is that it responds on random ports to DHCPREQUEST which seems to cause issues with some clients if they reject anything not coming from port 67.

Replacing the binary with the one from 23.05.1 resolved the issue for me.

johnpoz

@mathiasringhof yeah just did a quick sniff.. and see this

15:39:55.151294 IP 192.168.4.81.68 > 192.168.4.253.67: UDP, length 304
15:39:55.151592 IP 192.168.4.253.18184 > 192.168.4.81.68: UDP, length 300

That doesn't seem right, the reply should be from 67.. not some random port 18184..

The client sure doesn't seem to like it

15:39:55.151294 IP 192.168.4.81.68 > 192.168.4.253.67: UDP, length 304
15:39:55.151592 IP 192.168.4.253.18184 > 192.168.4.81.68: UDP, length 300
15:40:58.682252 IP 192.168.4.81.68 > 192.168.4.253.67: UDP, length 304
15:40:58.682557 IP 192.168.4.253.18184 > 192.168.4.81.68: UDP, length 300
15:41:04.499847 IP 192.168.4.65.68 > 192.168.4.253.67: UDP, length 308
15:41:04.500249 IP 192.168.4.253.18184 > 192.168.4.65.68: UDP, length 300
15:42:03.221195 IP 192.168.4.81.68 > 192.168.4.253.67: UDP, length 304
15:42:03.221489 IP 192.168.4.253.18184 > 192.168.4.81.68: UDP, length 300
15:42:04.507527 IP 192.168.4.65.68 > 192.168.4.253.67: UDP, length 308
15:42:04.507775 IP 192.168.4.253.18184 > 192.168.4.65.68: UDP, length 300

SteveITS

Tagging @stephenw10.

The extra logging isn’t really a breaking change but clients not getting a lease is.

Most people, if they even notice this, can use Kea, but not everyone since it’s not feature complete/ready yet.

johnpoz

@SteveITS I just switched to binary from 23.05.1 just waiting for something to renew.. But log already looks better.. 3 clients renewed before I got the sniff going..

2 of the problem childred I was seeing from 4.81 and 4.63 have now renewed it seems

Nov 19 10:12:37 	dhcpd 	36201 	DHCPACK on 192.168.4.63 to 68:57:2d:98:28:22 via igb2.4
Nov 19 10:12:37 	dhcpd 	36201 	DHCPREQUEST for 192.168.4.63 from 68:57:2d:98:28:22 via igb2.4
Nov 19 10:12:03 	dhcpd 	36201 	DHCPACK on 192.168.4.81 to a8:1b:6a:24:ec:26 via igb2.4
Nov 19 10:12:03 	dhcpd 	36201 	DHCPREQUEST for 192.168.4.81 from a8:1b:6a:24:ec:26 via igb2.4

edit: ok couple of devices renewed while I was sniffing - this looks more normal

16:34:01.236228 IP 192.168.4.73.68 > 192.168.4.253.67: UDP, length 308
16:34:01.236487 IP 192.168.4.253.67 > 192.168.4.73.68: UDP, length 300
16:34:53.173058 IP 192.168.4.49.68 > 192.168.4.253.67: UDP, length 302
16:34:53.173367 IP 192.168.4.253.67 > 192.168.4.49.68: UDP, length 300

Phizix

@johnpoz & @mathiasringhof,

How does one replace the binary with the one from 23.05.1?

Phizix

mathiasringhof

@Phizix The file is located here:

/usr/local/sbin/dhcpd

Boot into a 23.05.1 environment and backup that file to a local share (e.g. scp or an SFTP client), then boot into 23.09, backup the old file just in case and replace.

You need to have SSH enabled (see here) for that to work.

Phizix

@mathiasringhof,

I assume this file is the same across most of the various hardware.

I have a SG-1100 backup that is currently on 23.05.1, If the file is the same I could pull it from it. If it is hardware specific, then I will have to jump back to 23.05.1 on my SG-5100 to pull it.

Do you happen to know?

Phizix

SteveITS

@Phizix the 1100 and 2100 and 3100 are Arm CPUs so that won’t work. If you had reinstalled at some point and have ZFS you could revert to an old boot environment temporarily.

johnpoz

@Phizix said in DHCP weirdness after 23.09 upgrade:

How does one replace the binary with the one from 23.05.1?

For me I just booted the zfs 23.05.1 on my VM, grabbed the file via sftp.. Then stopped dhcpd on my sg4860, copied the current dhcpd to dhcpd.old and then via sftp put the dhcpd from my vm on my 4860 and then restarted it..

I am going to have to do a bit of research - but I could understand why some clients might not like an answer from some random port when it sent to 67.. Have to fire up the rfc and see if client is just not up to date on what is should do.. It is odd, and if doesn't like it renew why does it take it on discover. Or is the discover coming from 67? Very interesting situation for sure.

edit:
It was actually a nice test of booting to an old zfs.. which worked out great.

Phizix

@SteveITS,

My SG-5100 is on ZFS, so I should be able to boot to the previous 23.05.1 environment and grab it.

I did not know if all the files were hardware specific or not.

I don't think it will be too hard, but if it had not been hardware specific then it would have almost been trivial to grab it.

Thanks for the response!

Phizix

michmoor

@johnpoz said in DHCP weirdness after 23.09 upgrade:

It was actually a nice test of booting to an old zfs.. which worked out great.

The boot environment feature is so clutch. Just awesome

beatvjiking

I'm seeing similar problems while applying "DHCP enforcement" through my wifi infrastructure - the controllers aren't recognizing the DHCP transactions and are kicking clients off even after they receive DHCP leases. Switching to Kea isn't an option for me yet since I'm running redundant pairs. The behavior is reliably reproducible on 23.09 and doesn't exist on 23.05.1.

Phizix

I may have to do this soon. It appears my garage door opener is not getting an IP. DOH!!

beatvjiking

If anyone has a non-ZFS-based way to patch/work around the issue at scale (not all my affected boxes have been moved to ZFS yet, and there are a lot, so a binary swap would be challenging) I'd be super amped to hear about it.