List of problems/bugs in HA/CARP setups
-
Hi,
as the MaxMind stuff made us reconfigure pfBlockerNG on multiple clusters again, I finally wanted to start this list, in the hope of getting pfBlockerNG up to date on the HA problems that are still open or should be addressed or at least noted. That includes the currently "wonky" sync/save process and the way auto-updates "spam" the audit log with useless configuration entries that push out the actual config changes you would need to roll back to. The list is incomplete, and there are a few more minor things and would-be-nice features, but it's what we've condensed with customers over the last weeks and months as the biggest problems, to hopefully help development along toward a better CARP setup.
- Not only an HA/CARP thing, but a really nerve-racking one: pfBlockerNG makes a change to the audit log (which translates to: creates a new configuration backup) every time it runs updates via cron. Meaning: your config history is spammed full of
(system): pfBlockerNG: saving DNSBL changes
entries that have no content change besides a new timestamp. That's a big deal: pfBlockerNG runs every hour by default, and with around 30 history steps kept by default, the actual changes you made are pushed out by those auto-updates very quickly, leaving no chance to roll back to a working state. Furthermore, and this is the HA problem, each of those useless config saves pushes an update to the CARP backup member, resulting in a full config push, e.g.:
(system)@192.168.168.1: Merged in config (staticroutes, gateways, virtualip, system, aliases, ca, cert, crl, dhcpd, dhcrelay, dnshaper, dnsmasq, filter, ipsec, nat, openvpn, schedules, shaper, wol sections) from XMLRPC client.
But even worse: as pfBlockerNG states that it syncs only the config and not the lists, the backup member has to run pfBlockerNG itself to actually update them. That results in, again, another useless config backup entry with the above "DNSBL changes" message, so the standby now has two changes with no content per run, pushing out real changes even faster than on the master. Also, unless reconfigured, both runs happen at around the same time (at :00 every hour), which creates a stress/overload situation on the standby node: it is running the update itself, and since both nodes usually have the same hardware and run at roughly the same speed, they finish at about the same time, so the standby gets hit with a useless config sync right after having updated its lists. As those PHP jobs can be memory- and/or CPU-intensive depending on the setup, that's quite a load spike every hour.
Just as a quick visualization, have a look at node01 (master):
and node02 (standby):
showing the phenomenon described above and the needless config backups it creates, which in turn cause useless backups and delays when, e.g., syncing them to the cloud via ACB or backing them up by other means.
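To put rough numbers on how quickly a real change gets pushed out, here is a minimal sketch in Python. It only models the history as a fixed-size queue; the figures (about 30 history entries, one no-op entry per hour on the master, two on the standby) are taken from the description above, and the function name is made up for illustration. It is not how pfSense stores its history internally.

```python
from collections import deque


def hours_until_evicted(history_size: int, noop_entries_per_hour: int) -> int:
    """Return how many hourly runs it takes until a real change falls out
    of a fixed-size config history that only receives no-op entries."""
    history = deque(maxlen=history_size)  # oldest entries are dropped first
    history.append("real change")
    hours = 0
    while "real change" in history:
        hours += 1
        for _ in range(noop_entries_per_hour):
            history.append("pfBlockerNG: saving DNSBL changes")
    return hours


# Master: one no-op entry per hourly cron run.
print(hours_until_evicted(30, 1))
# Standby: its own cron run plus the config push from the master.
print(hours_until_evicted(30, 2))
```

Under those assumptions the last real change survives about 30 hours on the master and about 15 hours on the standby, provided nobody changes anything else in between.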
-
Syncing only by cron or by "force run cron/update all": that method is completely different from every other part of the firewall and from other packages. If you configure anything else in pfSense that has an XMLRPC sync tab or sync method, it is synced when you save your work. Changes on a page/tab, DHCP settings, the Filer package, etc.: everything just syncs after you hit save. Period. pfBlockerNG using a non-standard method is very confusing to work with (I work a lot with various customers, and no one found it really intuitive) and is prone to errors or missing data, as you think it should work but it doesn't. ;)
-
Syncing (currently?) has a bug: you can't save the settings when selecting "Sync to system configured backup server", because the code wrongly validates the fields below for a correct IP/port instead of just silently using the system config values under System/High-Availability. The system values are used, but the form fields still have to be filled out with (some) values that won't make the check fail. That check should only happen when selecting the third option for manually configured backup servers, where those fields are actually in use :)
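The expected behaviour can be sketched as follows. This is a minimal Python illustration of mode-dependent validation, not the package's actual PHP code; the mode constants, the function name and the shape of the target entries are all hypothetical.

```python
import ipaddress

# Hypothetical names for the three sync options on the Sync tab.
SYNC_DISABLED = "disabled"
SYNC_SYSTEM = "system"   # "Sync to system configured backup server"
SYNC_MANUAL = "manual"   # manually configured backup servers


def validate_sync_settings(mode, targets):
    """Return a list of error strings; only the manual mode checks the fields."""
    errors = []
    if mode != SYNC_MANUAL:
        # Disabled or system-configured sync: the form fields are not used,
        # so their contents must not block saving.
        return errors
    for index, target in enumerate(targets, start=1):
        try:
            ipaddress.ip_address(str(target.get("ip", "")))
        except ValueError:
            errors.append(f"target {index}: invalid IP address")
        port = target.get("port")
        if not (isinstance(port, int) and 1 <= port <= 65535):
            errors.append(f"target {index}: invalid port")
    return errors


empty_fields = [{"ip": "", "port": None}]
print(validate_sync_settings(SYNC_SYSTEM, empty_fields))  # nothing to complain about
print(validate_sync_settings(SYNC_MANUAL, empty_fields))  # both fields flagged
```

The point is only the early return: with the system-configured option selected, empty or stale form fields should never reach the IP/port check.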
-
DNSBL setup in CARP: we don't recommend the webserver setting on the DNSBL page to our customers. Null blocking (or null block logging) is by far the preferred way to handle things, as displaying error pages only leads to HTTPS warning fatigue and normalizes those warnings for clients, which is to be avoided. So null blocking, the way Pi-hole and other tools do it, is the way to go. As there is no way to disable the webserver configuration altogether when you don't need it, you have to configure an IP and use either an IP alias or CARP. That's not ideal in a CARP setup, as in most cases you'll want to set up additional VIPs in "Alias on CARP VIP" style to avoid increasing multicast traffic between the nodes unnecessarily. So the ideal case would be: add an enable switch for those who actually WANT to use the error page via the webserver configuration (we don't recommend it for the reasons above, but it's anyone's call) and, if it's enabled, add an Alias-on-CARP style VIP selection to the dropdown. Also disable the localhost interface lo0 as a selection for CARP, as it's not allowed: CARP needs both nodes to be reachable by each other, and localhost by definition can't do that :) Also, the text below suggests that using localhost would be correct, so people stumble upon that and try to set up wrong CARP/cluster configurations ;)
Those were the most pressing issues for our cluster setups. A few minor ones not related to HA popped up, too, but they are mostly notes/questions, like
-
(minor) DNSBL category: Isn't shallalist dead? The domain nowadays houses a strange news outlet, that I wouldn't really link to.
-
Abuse/ThreatFox have an IP list, too. Why not include those SOC IPs?
-
DoH: unknown to many, the ACME package may need access to DoH IPs, so when blocking all DoH stuff, you can also block/hinder certificate processing via the ACME package. A small side note may help here: either exclude the one service that acme.sh uses from the block, or set the check timer to 180+ seconds (from the 20s default), which acts as a disable-DoH "toggle" in acme.sh. Again, not that widely known, but many have stumbled over it by now.
Hope we can address the 4 big ones above in some way; if needed, I can give access to a CARP test setup to play around with.
Cheers
\jens
-
@JeGr said in List of problems/bugs in HA/CARP setups:
Syncing only by Cron or by "force run cron/update all":
That's a known issue, https://redmine.pfsense.org/issues/14189#note-16 links to https://forum.netgate.com/topic/179060/pfblockerng-sync-not-working/55 which has a one-line workaround/fix.
@JeGr said in List of problems/bugs in HA/CARP setups:
can't save the settings when selecting "Sync to system configured backup server"
That has a redmine also: https://redmine.pfsense.org/issues/15159
Not sure about the others offhand.
-
@JeGr said in List of problems/bugs in HA/CARP setups:
the backup member itself also has to run pfBlocker to really update them
Let me throw out what I believe is another symptom...I've seen a few cases in the past year or so where the backup router sees an alias error during the sync/cron period:
06:45:24 There were error(s) loading the rules: /tmp/rules.debug:103: file "/var/db/aliastables/pfB_PRI1_v4.txt" contains bad data - The line in question reads [103]: table <pfB_PRI1_v4> persist file "/var/db/aliastables/pfB_PRI1_v4.txt"
Not every day but maybe every few months.
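If it happens again, it may help to grab a copy of the table file and see which lines pf is unhappy about. Below is a minimal Python sketch for that. It assumes the file holds one IPv4/IPv6 address or CIDR network per line, with `#` comments allowed; pf itself accepts more than that, so a reported line is a hint rather than proof. The function name and the sample data are made up, and the thread does not establish what actually causes the bad data.

```python
import ipaddress


def find_bad_table_lines(text):
    """Return (line number, raw line) pairs that are not an IP or CIDR entry."""
    bad = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        entry = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not entry:
            continue  # blank or comment-only line
        try:
            # strict=False also accepts host addresses such as 10.0.0.1
            ipaddress.ip_network(entry, strict=False)
        except ValueError:
            bad.append((lineno, raw))
    return bad


# Made-up sample; in practice, read a copy of the file named in the error.
sample = "1.2.3.0/24\n# comment\n\n10.0.0.1\n<html>error</html>\n2001:db8::/32\n"
for lineno, raw in find_bad_table_lines(sample):
    print(f"line {lineno}: {raw!r}")
```

To check a real file, pass the contents of a copy of the file from the error message, e.g. `find_bad_table_lines(open(path).read())`.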
-
@SteveITS said in List of problems/bugs in HA/CARP setups:
@JeGr said in List of problems/bugs in HA/CARP setups:
Syncing only by Cron or by "force run cron/update all":
That's a known issue, https://redmine.pfsense.org/issues/14189#note-16 links to https://forum.netgate.com/topic/179060/pfblockerng-sync-not-working/55 which has a one-line workaround/fix.
@JeGr said in List of problems/bugs in HA/CARP setups:
can't save the settings when selecting "Sync to system configured backup server"
That has a redmine also: https://redmine.pfsense.org/issues/15159
Not sure about the others offhand.
Sure :) I know some of the things listed are already known or, in the case of the sync, are "working as documented": the update handling is documented in the notices and package info. I just wanted to provide a list to tackle together, to make the package itself more stable and better to work with in a clustered environment. My main criticism is that it is the "standout" among all core and additional packages, as every other package syncs normally when you save settings and only pfBlockerNG behaves differently. I'm all for reading documentation, release notes and changelogs :) But I'm really asking myself whether that has to be the way it's done at its core. Normally in UX/UI you'd advocate the route of least surprise, and this is a prime example. If every component does X and only one does Y, it's bound to cause friction :)
@SteveITS said in List of problems/bugs in HA/CARP setups:
Let me throw out what I believe is another symptom...I've seen a few cases in the past year or so where the backup router sees an alias error during the sync/cron period:
Yeah, me too. I couldn't pinpoint it to a specific occurrence yet, so I didn't list it, but sometimes you get the "unknown alias" messages popping up on the standby node while pfBlockerNG is running, updating and working fine, and if you check the state tables or aliases, you find the pfB_ aliases working just fine. So yes, there's another bug slightly hidden in the stack here :)
Cheers
\jens