HA Sync breaks after restoring configuration
We are upgrading from 2.2.6 to 2.3.4 on a new firewall, and for some reason the sync breaks between the two firewalls.
I backed up the existing firewall (2.2.6) and did a full restore onto the new installation of 2.3.4. The VLANs and interfaces were then set up. A backup was then made on the new 2.3.4 firewall and restored onto the second firewall, with adjustments made (e.g. interface IPs).
On the first firewall, pfsync and XMLRPC sync are set up with the second firewall's IP on an interface, using the admin account. On the second firewall, pfsync is filled in with the IP of the first firewall. The user name and password of the admin account are the same on both firewalls. On the interface used specifically for the sync, there is an Allow All rule on both firewalls. I can telnet into each machine from the other on the webconfigurator port (which is the same on both).
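For anyone repeating the telnet test from a script, here is a minimal sketch of the same TCP reachability check. The function name `port_open` and the 3 second timeout are illustrative choices, not anything pfSense provides. Note that a successful connect only shows the port is reachable; it does not prove that the web server answers an XMLRPC request in time.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection performs the same TCP handshake that telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example call (address and port taken from the log lines in this thread):
# port_open("192.168.251.250", 443)
```

Run it from each node against the other node's sync interface address to confirm reachability in both directions.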
It seems the initial sync works, as I see new users created on the second firewall; however, subsequent syncs fail with the "a communication error occurred while attempting xmlrpc sync" error. I should also add that I only have Users checked for syncing, as a test.
I've also deleted the sync interfaces and recreated them with new IPs, and the sync problem persists.
With a fresh installation of 2.3.4 and more or less the same setup, the sync works fine. However, I cannot do this as I need to keep the users, certificates, rules, etc.
What could be the problem?
Do you have a large number of users?
If you check the logs on the primary, what time difference is there between starting the sync and the failure entry?
Guessing without log entries to go on but you may be hitting this: https://redmine.pfsense.org/issues/7469
However that would affect 2.2.6 also.
The fact that it syncs the first time and then fails implies it's applying a change on the secondary that prevents subsequent syncing. Usually that would be a mismatch in the interfaces.
Check the config files from each node; the interfaces must appear with the same names and in the same order.
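A minimal sketch of that comparison, assuming each node's config.xml keeps its interfaces as child elements of a top-level `<interfaces>` section (`wan`, `lan`, `opt1`, ...). The function names and the two tiny sample configs are made up for illustration; in practice you would pass in the text of the config exported from each node.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest


def interface_names(config_xml: str) -> list:
    """Return the child tag names of <interfaces> (wan, lan, opt1, ...) in file order."""
    root = ET.fromstring(config_xml)
    section = root.find("interfaces")
    return [] if section is None else [child.tag for child in section]


def interface_mismatches(primary_xml: str, secondary_xml: str) -> list:
    """List (position, primary_name, secondary_name) wherever the two nodes differ."""
    pairs = zip_longest(interface_names(primary_xml), interface_names(secondary_xml))
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]


# Made-up configs standing in for the two real config.xml files.
PRIMARY = "<pfsense><interfaces><wan/><lan/><opt1/></interfaces></pfsense>"
SECONDARY = "<pfsense><interfaces><wan/><opt1/><lan/></interfaces></pfsense>"

for position, first, second in interface_mismatches(PRIMARY, SECONDARY):
    print(f"position {position}: primary={first} secondary={second}")
```

An empty result means the names and order line up. A missing interface on one side shows up as `None` in that column.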
We have no more than 20 users on this firewall.
Here is a portion of the logs (newest entries first); as you can see, it synced successfully once and failed the next time.
Sep 25 14:00:59 php-fpm 61144 /rc.filter_synchronize: New alert found: A communications error occurred while attempting Filter sync with username admin https://192.168.251.250:443.
Sep 25 14:00:59 php-fpm 61144 /rc.filter_synchronize: A communications error occurred while attempting Filter sync with username admin https://192.168.251.250:443.
Sep 25 14:00:59 php-fpm 61144 /rc.filter_synchronize: XML_RPC_Client: RPC server did not send response before timeout. 103
Sep 25 14:00:08 php-fpm 46082 /system_hasync.php: Configuring CARP settings finalize…
Sep 25 14:00:08 php-fpm 46082 /system_hasync.php: pfsync done in 30 seconds.
Sep 25 13:59:59 php-fpm 61144 /rc.filter_synchronize: XMLRPC sync successfully completed with https://192.168.251.250:443.
Sep 25 13:59:36 php-fpm 61144 /rc.filter_synchronize: Beginning XMLRPC sync to https://192.168.251.250:443.
Sep 25 13:59:36 php-fpm 46082 /system_hasync.php: waiting for pfsync...
Sep 25 13:59:35 check_reload_status Syncing firewall
On the second firewall, the System Logs don't have anything related except "check_reload_status Reloading filter".
The interfaces match, with the same names and in the correct order. The only difference, and I'm not sure if there is any significance to it, is that the sync interface shows "1000baseT <full-duplex,master>" on the first firewall and "1000baseT <full-duplex>" on the second firewall.
I'm not able to find much information or any leads on this error: "/rc.filter_synchronize: XML_RPC_Client: RPC server did not send response before timeout. 103".
I thought I'd also mention that a number of times, following a sync, the GUI on the second firewall became unresponsive (504 gateway timeout?), and restarting php-fpm restored functionality.
The maximum time allowed for the sync is 60s. If some part of that takes too long you will see that error. Too many users can cause that.
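The log excerpt earlier in the thread can be checked against that 60s figure. A rough sketch, using three of the posted log lines: it measures how long the successful sync took and how long after it the timeout was logged. The excerpt does not show when the failing attempt began, so reading the second gap as the full timeout assumes that attempt started right after the first one completed. The year 2000 is only a placeholder, because syslog timestamps carry no year.

```python
from datetime import datetime

# Three lines copied from the log excerpt posted above (newest first).
LOG = """\
Sep 25 14:00:59 php-fpm 61144 /rc.filter_synchronize: XML_RPC_Client: RPC server did not send response before timeout. 103
Sep 25 13:59:59 php-fpm 61144 /rc.filter_synchronize: XMLRPC sync successfully completed with https://192.168.251.250:443.
Sep 25 13:59:36 php-fpm 61144 /rc.filter_synchronize: Beginning XMLRPC sync to https://192.168.251.250:443.
"""


def stamp(line: str) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' syslog timestamp (placeholder year 2000)."""
    return datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S")


def first_stamp(lines: list, needle: str) -> datetime:
    """Timestamp of the first line containing needle."""
    for line in lines:
        if needle in line:
            return stamp(line)
    raise ValueError(f"no log line contains {needle!r}")


lines = LOG.splitlines()
began = first_stamp(lines, "Beginning XMLRPC sync")
completed = first_stamp(lines, "XMLRPC sync successfully completed")
timed_out = first_stamp(lines, "did not send response before timeout")

first_sync_seconds = (completed - began).total_seconds()
gap_to_timeout_seconds = (timed_out - completed).total_seconds()
print(f"first sync took {first_sync_seconds:.0f}s")
print(f"timeout logged {gap_to_timeout_seconds:.0f}s after the successful sync")
```

With the posted lines this gives 23s for the successful sync and 60s between that success and the timeout entry.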
If you are seeing 504 errors, that will also cause xmlrpc to fail; the web server needs to respond on the secondary.
Does it complete successfully after you have restarted php?
The second firewall already contains most (bar one or two) of the users, so not sure why it would take so long to sync one or two users. We also have pfSense (in HA) in another environment with a lot more users than 20 and this syncs without any issues.
Yes, but it seems I'm only seeing 504 on the second firewall as a result of trying to sync. Any ideas why this would crash the GUI?
If I recall correctly, it does usually sync after restarting php-fpm. In the process, it also removes the lock which suggests a sync is taking place and never finishes successfully.
EDIT: restarted php-fpm on the second firewall and the one remaining user on first firewall did not sync over.
Hmm, well I agree that 20 users is not that many and I wouldn't expect any issue there.
However, as a test, try disabling the user sync in the xmlrpc settings on the primary.
The actual issue there though is the time the secondary takes to re-build the users file from the config and that still applies I believe.
I only have Users checked for syncing. I disabled it and no longer see any errors relating to XMLRPC, but that's because there isn't anything to sync. That at least rules out authentication issues etc.
To test further, I checked only the Firewall Aliases, but I still get the "New alert found: A communications error occurred while attempting Filter sync with username admin" error.
I've also disabled the sync on both machines, changed the password for the admin account, and re-enabled the sync, which synced fine once and then failed again.
I'm out of ideas!
And you did not see 504/502 errors on the secondary GUI at that time?
The 504 error doesn't happen all the time. The sync fails even when the GUI is responding on the second firewall.
Hmm, it still looks like a timing issue to me from the initial logs though it's unclear what the cause is. Do you still see that same 1m delay on the primary? Nothing obviously logged as an error on the secondary?
In the end, I restored most of the existing config apart from the users. That seemed to work ok.
I also restored the DHCP section, which contains a lot of static mappings for a few interfaces. Once I restored this, sync broke, which I guess is because it's taking too long to sync. I removed all static mappings and syncing worked again!
Can I increase this default timeout period to something higher than 60 seconds?
There is no easy way to increase it though I believe it could be done. However you should not normally need to.
How many static mappings do you have? What size is your config file?
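If it helps to get those numbers, here is a minimal sketch that reports them from the text of a config.xml. It assumes static mappings are stored as `<staticmap>` elements under a per-interface child of `<dhcpd>` and that local users are `<user>` elements under `<system>`; treat both paths as assumptions and check them against your own file. The function name and the sample config are made up for illustration.

```python
import xml.etree.ElementTree as ET


def config_summary(config_xml: str) -> dict:
    """Report config size, user count and DHCP static mappings per interface."""
    root = ET.fromstring(config_xml)
    per_interface = {}
    dhcpd = root.find("dhcpd")
    if dhcpd is not None:
        for iface in dhcpd:
            # Assumed layout: <dhcpd><lan><staticmap>...</staticmap></lan></dhcpd>
            per_interface[iface.tag] = len(iface.findall("staticmap"))
    return {
        "size_bytes": len(config_xml.encode("utf-8")),
        "users": len(root.findall("./system/user")),  # assumed location of users
        "static_mappings": per_interface,
    }


# Made-up sample standing in for a real config.xml.
SAMPLE = (
    "<pfsense>"
    "<system><user><name>admin</name></user><user><name>alice</name></user></system>"
    "<dhcpd>"
    "<lan><staticmap><mac>00:00:00:00:00:01</mac></staticmap>"
    "<staticmap><mac>00:00:00:00:00:02</mac></staticmap></lan>"
    "<opt1/>"
    "</dhcpd>"
    "</pfsense>"
)

summary = config_summary(SAMPLE)
print(summary)
```

Running it on the exported config from both nodes also makes it easy to see whether the secondary ended up with a different number of entries after a partial sync.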
There are 186 mappings. The config XML file is 1.8MB.
I restored the DHCP mappings again and the sync works.
Where it breaks is very inconsistent, which makes it hard to troubleshoot. As of now, the config is complete (except for users and certificates).
Syncing a number of users can slow it down drastically. This is known and something we plan to address shortly: https://redmine.pfsense.org/issues/7469