List of problems/bugs in HA/CARP setups

JeGr

Hi,

as the MaxMind stuff made us reconfigure pfBNG on multiple clusters again, I finally wanted to start this list to hopefully get pfB up to date on all the problems of an HA setup that are still open or should be addressed/noticed. That also includes the currently "wonky" sync/save process and the way auto-updates "spam" the audit log with useless configuration entries that push out the actual config changes you would need to roll back to. The list is incomplete and there are a few more minor things and would-be-nice features, but it's what we've condensed with customers over the last weeks and months as the biggest problems, to hopefully help development along towards a better CARP setup.

1. Not only an HA/CARP thing, but a really nerve-racking one: pfBlocker writes to the audit log (which translates to: creates a new backup configuration) every time it runs updates via cron. Meaning: your config history is spammed full of "(system): pfBlockerNG: saving DNSBL changes" entries that have no content change besides a new timestamp. That's a big deal: by default pfB runs every hour, and with around 30 history steps kept by default, the changes you actually made are pushed out by those auto-updates very quickly, leaving no chance to roll back to a working state.

   Furthermore, and that is the HA problem, every one of those useless config saves pushes an update to the CARP backup member, resulting in a full config push (e.g.: "(system)@192.168.168.1: Merged in config (staticroutes, gateways, virtualip, system, aliases, ca, cert, crl, dhcpd, dhcrelay, dnshaper, dnsmasq, filter, ipsec, nat, openvpn, schedules, shaper, wol sections) from XMLRPC client."). Even worse: as pfBlocker states that it only syncs the config but not the lists, the backup member has to run pfBlocker itself to really update them, which results in - again - another useless config backup entry with the above DNSBL message. So the standby gets two no-content entries per run and pushes out real changes even faster than the master does.

   Also, unless reconfigured, both nodes run the update at around the same time (on :00 every hour). As both nodes usually have the same hardware and run at roughly the same speed, they finish at about the same time, so the standby node gets hit with a useless config sync right after updating its own lists. As those PHP jobs can be memory and/or CPU intensive depending on the setup, that's quite a load spike you'll get every hour.

Just as a quick visualization, have a look at node01 (master):

[screenshot]

and node02 (standby):

[screenshot]

This shows the phenomenon described above: needless config backups, which in turn cause further useless backups and delays, e.g. when syncing them to the cloud via ACB or backing them up by other means.

2. Syncing only by cron or by "force run cron/update all": That method is completely different from every other part of the firewall and from other packages. If you configure anything else in pfSense that has an XMLRPC sync tab or sync method, it is synced when you save your work. Be it changes on a page/tab, DHCP settings, the Filer package etc. - everything just syncs after you hit save. Period. pfBNG using a non-standard method is very confusing to work with (I work a lot with various customers, and no one found it really intuitive) and is prone to errors or missing data, because you think it should work but it doesn't. ;)
3. Syncing (currently?) has a bug: you can't save the settings when selecting "Sync to system configured backup server", as the code wrongly validates the IP/port fields below instead of just silently using the system config values under System/High-Availability. The system values are used, but the form fields still have to be filled with (some) values that won't make the check fail. That check should only happen when selecting the third option for manually configured backup servers, where those fields are actually in use :)
4. DNSBL setup in CARP: we don't recommend the webserver setting on the DNSBL page to our customers. Null-blocking (or null block logging), like Pi-hole or other tools do it, is by far the preferred way to handle things, since displaying error pages only leads to HTTPS warning fatigue and normalizes those warnings for clients. As there is no way to disable the webserver configuration altogether when you don't need it, you have to configure an IP and use either an IP alias or CARP. That's not ideal in a CARP setup, as in most cases you'll want to set up additional VIPs in "Alias on CARP VIP" style to avoid unnecessarily increasing multicast traffic between the nodes. So the ideal case would be: add an enable switch for those who actually WANT the error page via the webserver configuration (we don't recommend it for the reasons above, but it's anyone's call) and, if it's enabled, add an Alias-on-CARP style VIP selection to the dropdown. Also remove the localhost interface lo0 from the selection for CARP, as it's not allowed: CARP needs both nodes to be reachable by each other, and localhost by definition can't do that :) The text below the option also suggests that using localhost would be correct, so people stumble over that and try to set up wrong CARP/cluster configurations ;)
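
To put a number on the config history spam described at the top of the list, a rough sketch like the following could be run on each node. It rests on assumptions: that every history entry is stored as a `config-*.xml` file under `/conf/backup`, and that the description shown in the Config History page is also contained in that file. The function name `count_marked_backups` is made up for this example.

```shell
#!/bin/sh
# Sketch: count config history files whose content mentions a given marker.
# ASSUMPTION: backups live as config-*.xml files in the given directory and
# contain the same description text that the Config History page displays.
count_marked_backups() {
    dir="${1:-/conf/backup}"
    marker="${2:-pfBlockerNG: saving DNSBL changes}"
    # find + "-exec ... {} +" avoids "argument list too long" errors
    # when the directory holds thousands of files.
    find "$dir" -type f -name 'config-*.xml' \
        -exec grep -lF -- "$marker" {} + 2>/dev/null | wc -l | tr -d ' '
}

# Example call (not executed here):
#   count_marked_backups /conf/backup
```

Comparing the result with the total number of files gives a feel for how much of the history is taken up by auto-update entries versus real changes.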

Those were the most pressing issues for our cluster setups. A few minor ones not related to HA popped up, too, but they are mostly notes/questions, like:

(minor) DNSBL category: Isn't Shallalist dead? The domain nowadays hosts a strange news outlet that I wouldn't really want to link to.
Abuse/ThreatFox has an IP list part, too. Why not include those SOC IPs as well?
DoH: unknown to many, the ACME package can need access to DoH IPs, so when blocking all DoH stuff, one can also block/hinder certificate processing via the ACME package. A small side note could help here: either exclude the one service used by acme.sh from the blocking, or set the check timer to 180+ seconds (from the 20s default), which acts as a "disable DoH" toggle in acme.sh. Again, not widely known, but many have stumbled over it by now.
It seems that the multiple changes pfBlocker triggers in the audit log (see #1) are also the culprit in breaking the mechanism that enforces the maximum number of config.xml copies kept in the archive. We have both nodes of our DC cluster set to 100 steps back to still have a chance of finding a real user config.xml among the pfBlocker non-changes. We've now had multiple occasions of admins checking the audit logs (Config History) and having to wait 10+ min for the page to load. Investigating, we found that the /backup dir held around 14000 versions of config.xml instead of the configured 100. After the page had finally loaded, checking again via
# ls -1 /conf/backup | wc -l
it was down to 102 again. Currently I have a lab machine that wasn't touched at all for months (!) that reports:

[24.03-RELEASE][admin@pfs-plus-2403.lab.test]/root: ls -1 /conf/backup/ | wc -l
    5637

The only thing that machine has running continuously is pfBlockerNG updating the blocklists. So no logins or config changes whatsoever, but it still accumulated configs without pfSense itself enforcing the backup count and rotating/deleting the old ones.
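
For anyone who wants to keep an eye on this without opening the Config History page, here is a minimal sketch of a check that compares the number of stored backups with the configured limit. The function name `check_backup_count`, the `config-*.xml` naming and the fallback limit of 30 are assumptions for illustration; pass the value you actually configured. Unlike the plain `ls | wc -l` above, it only counts `config-*.xml` files, so the numbers can differ slightly.

```shell
#!/bin/sh
# Sketch: warn when the backup directory holds more config copies than expected.
# ASSUMPTIONS: backups are named config-*.xml; the limit must be passed in,
# it is not read from the firewall configuration.
check_backup_count() {
    dir="${1:-/conf/backup}"
    limit="${2:-30}"
    count=$(find "$dir" -type f -name 'config-*.xml' 2>/dev/null | wc -l | tr -d ' ')
    if [ "$count" -gt "$limit" ]; then
        echo "WARN: $count backups in $dir, limit is $limit"
        return 1
    fi
    echo "OK: $count backups in $dir, limit is $limit"
}

# Example call (not executed here):
#   check_backup_count /conf/backup 100
```

The non-zero return code on a violation makes it easy to hook into whatever monitoring is already in place.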

Hope we can address the big ones above in some way. If needed, we can give access to a CARP test setup to play around with.

Cheers
\jens

SteveITS

@JeGr said in List of problems/bugs in HA/CARP setups:

Syncing only by Cron or by "force run cron/update all":

That's a known issue, https://redmine.pfsense.org/issues/14189#note-16 links to https://forum.netgate.com/topic/179060/pfblockerng-sync-not-working/55 which has a one-line workaround/fix.

@JeGr said in List of problems/bugs in HA/CARP setups:

can't save the settings when selecting "Sync to system configured backup server"

That has a redmine also: https://redmine.pfsense.org/issues/15159

Not sure about the others offhand.

SteveITS

@JeGr said in List of problems/bugs in HA/CARP setups:

the backup member itself also has to run pfBlocker to really update them

Let me throw out what I believe is another symptom...I've seen a few cases in the past year or so where the backup router sees an alias error during the sync/cron period:

06:45:24 There were error(s) loading the rules: /tmp/rules.debug:103: file "/var/db/aliastables/pfB_PRI1_v4.txt" contains bad data - The line in question reads [103]: table <pfB_PRI1_v4> persist file "/var/db/aliastables/pfB_PRI1_v4.txt"

Not every day but maybe every few months.
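
When that error shows up, it may help to look at what is actually in the table file at that moment. The following is only a loose sketch: it assumes the file is expected to hold one IPv4 address or CIDR per line (plus optional blank and `#` comment lines) and prints everything else with its line number. The function name `find_bad_table_lines` is made up, and the pattern does not check octet ranges.

```shell
#!/bin/sh
# Sketch: list lines in an alias table file that do not look like IPv4/CIDR.
# ASSUMPTION: valid content is one IPv4 address or IPv4/prefix per line;
# blank lines and lines starting with '#' are tolerated.
# Exit status follows grep: 0 if suspicious lines were found, 1 if none.
find_bad_table_lines() {
    file="$1"
    grep -nEv '^([0-9]{1,3}\.){3}[0-9]{1,3}(/[0-9]{1,2})?$|^[[:space:]]*(#.*)?$' "$file"
}

# Example call (not executed here):
#   find_bad_table_lines /var/db/aliastables/pfB_PRI1_v4.txt
```

If the output contains fragments of an HTML page or a half-written line, that would hint at a download or write that was interrupted while the rules were being reloaded.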

JeGr

@SteveITS said in List of problems/bugs in HA/CARP setups:

@JeGr said in List of problems/bugs in HA/CARP setups:

Syncing only by Cron or by "force run cron/update all":

That's a known issue, https://redmine.pfsense.org/issues/14189#note-16 links to https://forum.netgate.com/topic/179060/pfblockerng-sync-not-working/55 which has a one-line workaround/fix.

@JeGr said in List of problems/bugs in HA/CARP setups:

can't save the settings when selecting "Sync to system configured backup server"

That has a redmine also: https://redmine.pfsense.org/issues/15159

Not sure about the others offhand.

Sure :) I know some of the things listed are already known or, in the case of the sync, "working as documented" - the update handling is documented in the notices and package info. I just wanted to provide a list to tackle together, to make the package itself more stable and better to work with in a clustered environment. My main criticism is that it is the "standout" among all core and additional packages: every other package syncs normally when you save settings etc., and only pfBNG behaves differently. I'm all for reading documentation, release notes and changelogs :) But I'm really asking myself whether it has to be done that way at its core. Normally in UX/UI you'd advocate the route of least surprise, and this is a prime example. If every component does X and only one does Y, it's bound to cause friction :)

@SteveITS said in List of problems/bugs in HA/CARP setups:

Let me throw out what I believe is another symptom...I've seen a few cases in the past year or so where the backup router sees an alias error during the sync/cron period:

Yeah, me too. I couldn't pinpoint it to a specific occurrence yet, so I didn't list it, but sometimes "unknown alias" messages pop up on the standby node while pfB is running, updating and working fine, and if you check the state tables or aliases you find the pfB_aliases working just fine. So yes, there's another bug slightly hidden in the stack here :)

Cheers
\jens

JeGr

Hey there,

just wanted to follow up, as it's been a year and none of the mentioned issues are even being discussed or addressed as far as I'm aware.

Especially #1:

@JeGr said in List of problems/bugs in HA/CARP setups:

not only a HA/CARP thing but a real nerve-racking one: pfBlocker makes a change to the audit log (translates to: creates a new backup configuration) every time it runs updates via CRON. Meaning: your config history is spammed ...

and #2:

@JeGr said in List of problems/bugs in HA/CARP setups:

Syncing only by Cron or by "force run cron/update all": That method is completely different than every other part of the firewall, of other packages etc. If you configure anything else in pfSense that has a XMLRPC sync tab or Sync method, it is synced by saving your work.

are popping up in support cases often. pfBlocker being the one and only package that handles saving/syncing differently is not doing it justice. And the #1 problem is still the WHOPPING mass of useless and unnecessary backups triggered by the hourly cron, which clutter your audit log and backup history with nonsense backups that contain no changes at all. And as described: in a CARP setup, such a useless change gets doubly annoying, as it triggers another useless sync and replication to the standby system, which does nothing but set off a whole lot of activity (e.g. restarting services for nothing). The standby system also has to run its own pfB cron to download the lists, as those aren't synced, which adds another useless empty audit entry on the standby system and makes the backup history a dumpster fire of empty commits.

If there's something we can do besides coding (testing, giving access to VMs etc.), let us know - but in a business context it's really important to get pfB fixed up for HA/CARP operation sooner rather than later!

Cheers

JeGr

Still there, still wanting to help, still having big issues in clustered scenarios that are NOT nice to have. :/

Cheers

btspce

We suspect more and more that we are hitting #1 in this thread.
Since 24.11 and adding more threat lists, we are up from ~200000 to ~500000 blocked IPs and are having trouble with one or both 6100MAX randomly freezing during pfBlockerNG updates. It can take anywhere from 1 to 14 days before it happens (most often 1-7 days). Four out of five times it is the secondary firewall in our HA setup that freezes, but we have also seen the primary freeze with the backup taking over, and today, for the first time, both (!) firewalls froze during the update (blue LED on and not blinking on both).

"A communications error occurred while attempting XMLRPC sync." is the usual symptom.
pfBlockerNG 3.2.0_16

Disabling CIDR aggregation did not help.
We have now disabled syncing in pfBlockerNG and moved the secondary firewall's cron updates to 15 min after the primary firewall's, so they no longer update at the same time, to see if that solves the issue.
Can someone please look at these critical issues? If we can't use pfBlockerNG (or load threat lists in the rules directly like OPNsense does), we can't use pfSense.

JeGr

@btspce I'd add another bullet point to it, as it seems very much pfBlocker related:

It seems that the multiple changes pfBlocker triggers in the audit log (see #1) are also the culprit in breaking the mechanism that enforces the maximum number of config.xml copies kept in the archive. We have both nodes of our DC cluster set to 100 steps back to still have a chance of finding a real user config.xml among the pfBlocker non-changes. We've now had multiple occasions of admins checking the audit logs (Config History) and having to wait 10+ min for the page to load. Investigating, we found that the /backup dir held around 14000 versions of config.xml instead of the configured 100. After the page had finally loaded, checking again via

# ls -1 /conf/backup | wc -l

it was down to 102 again. Currently I have a lab machine that wasn't touched at all for months (!) that reports:

[24.03-RELEASE][admin@pfs-plus-2403.lab.test]/root: ls -1 /conf/backup/ | wc -l
    5637

The only thing that machine has running continuously is pfBlockerNG updating the blocklists. So no logins or config changes whatsoever, but it still accumulated configs without pfSense itself enforcing the backup count and rotating/deleting the old ones.

That very much seems to point at pfBlockerNG, as it's currently the only package that creates that many audit log entries on the side.

Not wanting to assign any blame here! Don't get me wrong. I just wanted to get as many details and as much info out as possible so we can squash those bugs :)

Cheers :)