XMLRPC Sync no longer performed after update to 2.5.2 (not even attempted) - but actually it broke earlier
Two pfSense machines (master/backup) configured with CARP and XMLRPC Sync, which reliably worked. Last Friday, I updated both machines to 2.5.2, prior to some changes. The changes included a new static DHCP mapping and changes in NAT and firewall rules (I installed a new SIP gateway). Also a new VLAN and associated interface (which I deleted again later on).
Today (Saturday), I noticed that the backup pfSense was still showing old firewall and NAT rules on the WebGUI.
The Sync Interface is showing both incoming and outgoing traffic on both machines. It appears that CARP works. Failover also works: rebooting the master machine causes the backup machine to become the new master. No issues there.
"Synchronize states" is enabled on both machines, the Synchronize Interface is correctly selected, and the respective other node's IP address is specified.
XMLRPC Sync is configured on the master, IP address, user name and password of the backup is specified, and every checkbox checked except "Synchronize admin". On the backup, only the checkboxes (except "Synchronize admin") are checked, no IP address, no user name, no password.
Note that I had not changed anything there; I had left everything as it was (which worked until the 2.5.2 update).
Of course I have checked the system logs (Status / System Logs / System / General) for error messages. A search for "XMLRPC" on the backup machine showed no matches. The same search for "XMLRPC" on the master only matches prior to the time of the 2.5.2 upgrade - all related to ACME, indicating "/usr/local/pkg/acme/acme_command.sh: XMLRPC reload data success with https://192.168.555.2:88/xmlrpc.php (pfsense.exec_php)." (IP address changed to protect the innocent; the real one is valid). Which is a lie, as the backup node still sports certificates from 2020 (I had to install ACME on the backup machine, with automatic updates disabled, so I could update the WebGUI certificate via ACME manually).
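The same search can be repeated from an SSH shell, which makes it easier to compare both nodes. This is only a minimal sketch: the function names are made up for illustration, and the default path assumes that the general system log on pfSense 2.5.x is the plain-text file /var/log/system.log.

```shell
#!/bin/sh
# Hypothetical helpers for searching a log file for XMLRPC entries.
# Assumption: the general system log is plain text at /var/log/system.log.

# Print every line that mentions XMLRPC (case-insensitive).
xmlrpc_log_lines() {
    logfile="${1:-/var/log/system.log}"
    grep -i 'xmlrpc' "$logfile"
}

# Print only the number of matching lines.
xmlrpc_log_count() {
    logfile="${1:-/var/log/system.log}"
    # grep -c prints 0 but exits with status 1 when nothing matches;
    # "|| true" keeps the printed 0 and hides the non-zero status.
    grep -ci 'xmlrpc' "$logfile" || true
}
```

Running xmlrpc_log_count on both nodes, and then xmlrpc_log_lines /var/log/system.log | tail -n 20 on the one that has matches, shows when the last sync attempt was logged.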
Consequently, I triggered a certificate renewal in ACME (successfully), but no XMLRPC message would appear in the system logs, only non-XMLRPC-related ACME log entries.
Looking into the configuration history of the backup machine revealed that the last XMLRPC config merge was about half a year ago. So XMLRPC sync apparently had stopped working not just yesterday, after the upgrade to 2.5.2, but half a year ago already (the last XMLRPC-related change on the backup machine corresponded to "(system): syslog-ng: Settings saved" on the master). But at least the master node had attempted to sync to the backup machine...at least as far as ACME was concerned.
So, where do I go from here? Debugging is a bit hard if I don't even get error messages!
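The configuration history can also be checked from a shell on the backup node. The sketch below rests on two assumptions that should be verified first: that the history is kept as config-<timestamp>.xml files in /cf/conf/backup, and that each file carries its change note in a <description> element. The function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list config backups whose change description
# mentions XMLRPC, in filename order.
# Assumptions: backups are /cf/conf/backup/config-<timestamp>.xml and the
# change note is stored on a single line in a <description> element.
xmlrpc_config_revisions() {
    backup_dir="${1:-/cf/conf/backup}"
    for f in "$backup_dir"/config-*.xml; do
        # Skip the unexpanded glob when the directory has no backups.
        [ -f "$f" ] || continue
        # Take the first description line that mentions XMLRPC, trimmed.
        desc=$(grep -i '<description>.*xmlrpc' "$f" | head -n 1 | sed -e 's/^[[:space:]]*//')
        if [ -n "$desc" ]; then
            printf '%s: %s\n' "$(basename "$f")" "$desc"
        fi
    done
    return 0
}
```

The last line of the output names the most recent backup that was written by an XMLRPC merge; if it is months old, the sync stopped long before the upgrade.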
@klaws Regarding notifications about pfSense activity: do you use Mail Reports? pfSense can notify you by email about issues related to XMLRPC Sync.
@sipriuspt Thank you for your answer!
I use regular email notifications, and these showed nothing suspicious (except that the certificate on the backup node had expired, but that was just a symptom). I installed Mail Reports, but since the log files also show no XMLRPC-related messages, it's not very helpful.
In any case, I solved the issue now. I updated the backup node from 2.5.2 to 2.6.0, everything smooth and fine, then entered CARP maintenance mode on the master, performed the same update there, and everything went sideways.
Apparently, something had been corrupted on the master node some time ago already, and now "everything was broken" (including the WebGUI). Well, SSH still worked (even though I was dropped right into a command prompt, no pfSense menu or anything):
pkg-static clean -ay; pkg-static install -fy pkg pfSense-repo pfSense-upgrade
pkg-static upgrade -f
shutdown -r +1
And this fixed both the botched upgrade as well as my XMLRPC issue!
@klaws Mail Reports simply lets you know in time about any issues that occur.
At least for me, it helps a lot when dealing with pfSense clusters,
for example when a CARP VIP changes state (master or backup):
17:51:09 HA cluster member "(firstname.lastname@example.org): (IXL3_VLAN13_IT_ADMINS)" has resumed CARP state "BACKUP" for vhid 12
when WANs went offline or online in gateway groups:
11:07:07 MONITOR: WAN_ROUTERA_WAN2_GW is available now, adding to routing group GW_GROUP x.x.x.225|172.16.2.2|WAN_ROUTERA_WAN2_GW|34.651ms|87.308ms|18%|online|loss
when services stop working and the Service Watchdog detects and handles the situation:
9:26:00 Service Watchdog detected service openvpn stopped. Restarting openvpn (OpenVPN server: Internal Devices)
when rules cannot load:
15:42:40 There were error(s) loading the rules: /tmp/rules.debug:51: cannot load "/var/db/aliastables/pfB_NAmerica_v6.txt": Invalid argument - The line in question reads : table <pfB_NAmerica_v6> persist file "/var/db/aliastables/pfB_NAmerica_v6.txt"
when XMLRPC communication fails:
17:29:59 A communications error occurred while attempting to call XMLRPC method restore_config_section:
16:43:28 Exception calling XMLRPC method restore_config_section # Impossible to encode value '' from type 'NULL'. No analogous type in XML_RPC.