DHCPD does not start, gives error

Rhicinus

Hello,
we have a pfSense Cluster with 2 pfSense VMs, Primary + Secondary, using them as Captive Portal. DHCP is configured as failover pool with the Secondary as Failover Peer IP. They have both pfsense software 2.4.4_p3

There is the problem, that at certain times the dhcpd server on the secondary dies while it does a restart.

dhcpd log:

Oct 2 17:29:56 	dhcpd 		exiting.
Oct 2 17:29:56 	dhcpd 		process and the information we find helpful for debugging.
Oct 2 17:29:56 	dhcpd 		before submitting a bug. These pages explain the proper
Oct 2 17:29:56 	dhcpd 		bugs on either our web page at www.isc.org or in the README file
Oct 2 17:29:56 	dhcpd 		than a configuration issue please read the section on submitting
Oct 2 17:29:56 	dhcpd 		If you think you have received this message due to a bug rather
Oct 2 17:29:56 	dhcpd 		Can't register I/O handle for vmx1.2688: out of range
Oct 2 17:29:56 	dhcpd 		Unable to register fd with library out of range
Oct 2 17:29:56 	dhcpd 		Sending on BPF/vmx1.2688/00:50:56:86:b9:8f/172.16.64.0/21
Oct 2 17:29:56 	dhcpd 		Listening on BPF/vmx1.2688/00:50:56:86:b9:8f/172.16.64.0/21
Oct 2 17:29:56 	dhcpd 		Sending on BPF/vmx1.2690/00:50:56:86:b9:8f/172.16.80.0/21
Oct 2 17:29:56 	dhcpd 		Listening on BPF/vmx1.2690/00:50:56:86:b9:8f/172.16.80.0/21
Oct 2 17:29:56 	dhcpd 		Sending on BPF/vmx1.2687/00:50:56:86:b9:8f/172.16.56.0/21
Oct 2 17:29:56 	dhcpd 		Listening on BPF/vmx1.2687/00:50:56:86:b9:8f/172.16.56.0/21
Oct 2 17:29:56 	dhcpd 		Sending on BPF/vmx1.2691/00:50:56:86:b9:8f/172.16.88.0/21
Oct 2 17:29:56 	dhcpd 		Listening on BPF/vmx1.2691/00:50:56:86:b9:8f/172.16.88.0/21
Oct 2 17:29:56 	dhcpd 		Sending on BPF/vmx1.2693/00:50:56:86:b9:8f/172.16.104.0/21
Oct 2 17:29:56 	dhcpd 		Listening on BPF/vmx1.2693/00:50:56:86:b9:8f/172.16.104.0/21
Oct 2 17:29:56 	dhcpd 		Sending on BPF/vmx1.2696/00:50:56:86:b9:8f/172.16.128.0/21
Oct 2 17:29:56 	dhcpd 		Listening on BPF/vmx1.2696/00:50:56:86:b9:8f/172.16.128.0/21
Oct 2 17:29:56 	dhcpd 		Sending on BPF/vmx1.2692/00:50:56:86:b9:8f/172.16.96.0/21
Oct 2 17:29:56 	dhcpd 		Listening on BPF/vmx1.2692/00:50:56:86:b9:8f/172.16.96.0/21
Oct 2 17:29:56 	dhcpd 		Sending on BPF/vmx1.2689/00:50:56:86:b9:8f/172.16.72.0/21
Oct 2 17:29:56 	dhcpd 		Listening on BPF/vmx1.2689/00:50:56:86:b9:8f/172.16.72.0/21
Oct 2 17:29:56 	dhcpd 		Wrote 11661 leases to leases file.
Oct 2 17:29:56 	dhcpd 		For info, please visit https://www.isc.org/software/dhcp/
Oct 2 17:29:56 	dhcpd 		All rights reserved.
Oct 2 17:29:56 	dhcpd 		Copyright 2004-2018 Internet Systems Consortium.
Oct 2 17:29:56 	dhcpd 		Internet Systems Consortium DHCP Server 4.3.6-P1
Oct 2 17:29:56 	dhcpd 		PID file: /var/run/dhcpd.pid
Oct 2 17:29:56 	dhcpd 		Database file: /var/db/dhcpd.leases
Oct 2 17:29:56 	dhcpd 		Config file: /etc/dhcpd.conf
Oct 2 17:29:56 	dhcpd 		For info, please visit https://www.isc.org/software/dhcp/
Oct 2 17:29:56 	dhcpd 		All rights reserved.
Oct 2 17:29:56 	dhcpd 		Copyright 2004-2018 Internet Systems Consortium.
Oct 2 17:29:56 	dhcpd 		Internet Systems Consortium DHCP Server 4.3.6-P1
Oct 2 17:29:52 	dhcpd 		DHCPACK on 172.16.89.60 to 54:99:63:bc:fb:10 (Marcos-iPhone) via vmx1.2691

I can then hit the start button for the dhcpd service, and after 2-10 tries with error, the service eventually starts without error

Can't register I/O handle for vmx1.2688: out of range
Oct 2 17:29:56 dhcpd Unable to register fd with library out of range

Strange thing is, when disabling dhcp server on that interface, the issue persists, then complaining about another interface.

I don't know exactly what the dhcpd causes to restart, but the general log shows at exactly the same time:

Oct 2 17:29:57 	minicron 		(/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated)
Oct 2 17:29:57 	minicron 		(/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated)
Oct 2 17:29:57 	minicron 		(/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated)
Oct 2 17:29:57 	minicron 		(/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated)
Oct 2 17:29:56 	minicron 		(/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated)
Oct 2 17:29:56 	minicron 		(/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated)
Oct 2 17:29:56 	minicron 		(/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated)
Oct 2 17:29:56 	minicron 		(/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated)
Oct 2 17:29:56 	minicron 		(/etc/rc.prunecaptiveportal) terminated by signal 15 (Terminated)
Oct 2 17:29:53 	check_reload_status 		Reloading filter
Oct 2 17:29:52 	check_reload_status 		Carp backup event
Oct 2 17:29:52 	kernel 		vmx1.2696: promiscuous mode disabled
Oct 2 17:29:52 	kernel 		carp: 8@vmx1.2696: BACKUP -> INIT (hardware interface up)
Oct 2 17:29:52 	kernel 		ifa_maintain_loopback_route: deletion failed for interface vmx1.2696: 3
Oct 2 17:29:52 	kernel 		ifa_maintain_loopback_route: deletion failed for interface vmx1.2696: 3
Oct 2 17:29:52 	kernel 		ifa_maintain_loopback_route: deletion failed for interface vmx1.2696: 3
Oct 2 17:29:52 	kernel 		carp: 8@vmx1.2696: INIT -> BACKUP (initialization complete)
Oct 2 17:29:52 	kernel 		vmx1.2696: promiscuous mode enabled
Oct 2 17:29:52 	check_reload_status 		Carp backup event

This happens seemingly randomly AND every day at 3:00AM, but not always causing the dhcpd to crash.

The dhcpd crash ONLY happens on the secondary.

Maybe it has to do with a config sync from the primary?

I hope someone can point me to the right direction what I may can check or improve.

best regards

mark-b

Did you ever sort this out? My router started with the same thing out of the blue today.

Rhicinus

Unfortunately not. I figured out that rebooting helps for a couple of days. We think about paid support to hopefully get rid of this issue.

regards

Luca De Andreis

I've exactly the same problem
Randomly DHCP server crash on the secondary node.

The cluster runs on version 2.4.4p3.
Does the new 2.4.5 solve this issues ?

Any news ?

Thanks

Rhicinus

Hi Luca,
we wait a couple of weeks before deploying 2.4.5.
I figured out that it is not randomly, but is triggered by a config save on the primary.
After deactivating the automatic config backup the crash is gone.

It crashes almost always when manually saving the config on the other node.

regards

Luca De Andreis

@Rhicinus

Thanks so much for the additional information.
So you think that the problem is related to the automatic saving of the configuration (Services -> Auto Config Backup)?
In fact, I use it on both nodes.
Have you deactivated the ACB (Active Config Backup) ? Only on the slave, only on primary on both nodes ?

Very thanks

Luca

Rhicinus

Well, for me it was as follows:

Whenever a config save is done on one node, be it automatic or manual, on the other node the dhcpd stops working. So ACB is now deactivated to keep the dhcpd running. Luckily I had it active on the primary only. Our problems would have been far bigger had I activated the ACB on the secondary, too (without knowing it causes the problem). I do not use it anymore. Maybe this is fixed in the new version. I will test it in our lab when I find the time.

Greetings

Luca De Andreis

@Rhicinus

Great !

Now I disable ACB on the primary (ACB was already disabled on the secondary) and I do some tests.

Thanks
Luca

Luca De Andreis

I've tried on a two node PfSense cluster on 2.4.5.
Seems works fine !!! DHCP is solid on the slave node.

Rhicinus

Thank You for this valuable information!