Frequent Crashes of Secondary pfSense Node in HA Cluster After Large Sync Operations
-
Hello,
I'm posting here to find out whether anyone else is experiencing the same issue, or whether there is a solution somewhere that I haven't found yet.
Basically, I have four pfSense HA clusters, and we regularly see the secondary node crash, with the following message on the primary:
"A communications error occurred while attempting XMLRPC sync."
When we check, we find that we have completely lost access to the secondary node's web UI. It is entirely inaccessible and sometimes displays an "ERRORx50" PHP crash page, but no error reports are generated.
We then connect to the secondary node directly via the CLI, where everything appears fine: the firewall is reachable and responsive. I've already tried forcing a restart of the web UI, but nothing changes. Each time, I have to restart the secondary node manually.
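Since the CLI stays responsive while the web UI does not, it can help to probe the web UI from another host at the moment the sync error appears. The following is a minimal Python sketch using only the standard library. It assumes the web UI is served by nginx in front of PHP-FPM, so a dead PHP backend would tend to show up as an HTTP 502/503/504 rather than as a timeout; that mapping, the `classify`/`probe` helper names and the three state labels are illustrative assumptions, not anything pfSense itself defines. Certificate verification is turned off because HA nodes often use self-signed certificates.

```python
import ssl
import urllib.error
import urllib.request


def classify(status=None, error=None):
    """Map one HTTP probe result to a coarse web UI state (illustrative labels)."""
    if error is not None:
        # No HTTP response at all: connection refused, TLS failure or timeout.
        return "unreachable"
    if status in (502, 503, 504):
        # The front-end web server answered, but (assumed) its PHP backend did not.
        return "backend_down"
    if status is not None and 200 <= status < 400:
        return "ok"
    # Any other HTTP status, e.g. a 500 raised by the application itself.
    return "error"


def probe(url, timeout=10.0):
    """Fetch `url` once and classify the outcome."""
    # HA nodes often use self-signed certificates, so verification is disabled here.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return classify(status=resp.status)
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx; the status code is still available.
        return classify(status=exc.code)
    except OSError as exc:
        # URLError, socket timeouts and SSL errors are all OSError subclasses.
        return classify(error=exc)
```

For example, `probe("https://<secondary-node-address>/")` returning "backend_down" while the CLI still works would suggest the problem sits in the PHP side of the web UI rather than in the firewall itself.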
We have noticed that the error occurs after synchronizing a large number of firewall rules, creating an IPsec tunnel, or even setting up a new VPN. It seems that the issue arises when there is a large amount of data to synchronize.
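To put a number on "a large amount of data", one rough option is to count what the configuration contains before and after a change that triggers the error. The sketch below is a minimal Python example under stated assumptions: it assumes the configuration is stored in `/conf/config.xml` under a `<pfsense>` root element, with firewall rules in `filter/rule`, IPsec tunnels in `ipsec/phase1` and `ipsec/phase2`, and OpenVPN instances in `openvpn/openvpn-server` and `openvpn/openvpn-client`. Those paths, the default file location and the `summarize`/`summarize_file` helpers are assumptions for illustration, and it does not claim to reflect what the XMLRPC sync actually transfers.

```python
import xml.etree.ElementTree as ET

# Element paths relative to the <pfsense> root element. These paths are
# assumptions about the config.xml layout; adjust them if yours differ.
SECTIONS = {
    "firewall rules": "filter/rule",
    "IPsec phase 1": "ipsec/phase1",
    "IPsec phase 2": "ipsec/phase2",
    "OpenVPN servers": "openvpn/openvpn-server",
    "OpenVPN clients": "openvpn/openvpn-client",
}


def summarize(xml_bytes):
    """Count entries per section and report the raw size of the config."""
    root = ET.fromstring(xml_bytes)
    counts = {name: len(root.findall(path)) for name, path in SECTIONS.items()}
    counts["size in bytes"] = len(xml_bytes)
    return counts


def summarize_file(path="/conf/config.xml"):
    """Read a config file (assumed default location) and summarize it."""
    with open(path, "rb") as fh:
        return summarize(fh.read())
```

It can also be run elsewhere against a downloaded config backup, e.g. `summarize_file("config-backup.xml")` with a hypothetical file name, which avoids depending on Python being available on the firewall.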
We have switched production over to the secondary node in the past and never encountered any issues or disruptions.
We are running version 24.03 with several packages installed, including FreeRADIUS (for VPN user management), OpenVPN Client Export, pfBlockerNG, Suricata, Syslog-NG, System Patches, and Zabbix Agent.
This issue occurs on all four of my pfSense HA clusters, not just one.
Unfortunately, we do not understand the root cause of these "crashes". They may not even be actual crashes: no crash logs are available, and we have to force a reboot from the CLI.