23.01 DHCP Failover Broken (work around included)
-
I pulled the trigger for my home lab pfSense HA pair and upgraded it to 23.01 from 22.05.
Amongst the other comments posted by others already (firmware branches missing, cannot uninstall packages from GUI but can do so using 'pkg' command on CLI) I have found another incident no one has mentioned here yet anyways.
The other issue I am seeing is on any internal interface where I have restricted rules, my DHCP failover is in a 'communication-interrrupted' state. ie. my IOT, Public-WiFi and CCTV interfaces are heavily restricted. Each of these DHCP pool status comes up as 'cummincations-interrupted'.
As soon as I added a rule to allow tcp/519 and tcp/520 from/to same subnet on each of these interfaces - my DHCP pool status recovers to 'normal' state. As it stands now things are working fine - and I will stay on 23.01 to test things further. Hopefully this will help others - until things are fixed.
-CARP and fail over is working fine.
-This all worked just fine prior to 23.01 upgrade
-Tried disabling and also uninstalling pfBlocker using CLI.!Here is an exmaple of ACL entry I added to each affected interface:
Screenshot from 2023-02-16 03-49-21.png-I see connections on this ACL entry between my two HA hosts (so I know ACL is being matched/used)
-DHCP_HA alias contains tcp/519 and tcp/520EDIT: Running each HA member as a VM on different ESXi hosts. Not running netgate hardware here if that matters. Been running this setup for 3+ years through numerous pfSense versions. Don't believe this is a VM issue.
-
-
There should be automatic rules to pass that traffic, but I was able to replicate this and it appears to be a one character typo in the line that fetches the failover IP address for a DHCP server interface when making the ruleset.
I opened https://redmine.pfsense.org/issues/13965 for this and I have committed a fix which will be in momentarily.
You can install the System Patches package and then create an entry for
2186435b5185ceb294cd6a4c1380db443e4dd218
to apply the fix once it shows up on Redmine/Github which will be any minute now. -
I have applied the patch and things are working again. I removed my ACL entries (mentioned above) before applying the patch. The DHCP pool status recovered after applying the patch provided.
*** Also of note, the ID provided above is wrong. I tried adding it with that and was getting failures. Following the regression link shows a different changeset --> 2186435b5185ceb294cd6a4c1380db443e4dd218. I'm guessing the code was possibly edited after this was posted. I created a System patch commit ID against this.
-
I must have grabbed the wrong ID when I copied/pasted, I edited in the correct one. The ID I had in there was the merge commit from when I merged that fix into the plus tree, not the publicly accessible one.
-
Just to add for anyone else coming across this issue.
Adding a vlan and therefore triggering a configuration reload and mini failover, caused exactly the same issue. Which was not fixable with restoring a configuration backup or even a restart of both firewalls.
Applying this patch:
Fix automatic firewall rules for HA DHCP server failover (Requires reboot or filter reload to activate, Redmine #13965)Fixed the issue with the DHCP server.
The issue showed in Status / DHCP Leases a permanent status of My State - 'Recover', as well as previously mentioned 'communication-interrrupted'
-
-