Netgate Discussion Forum

    HA Sync breaks after restoring configuration

    Problems Installing or Upgrading pfSense Software
    15 Posts 3 Posters 1.6k Views
      redarmy123

      We are upgrading from 2.2.6 to 2.3.4 onto new firewalls, and for some reason the sync breaks between the two firewalls.

      I backed up the existing firewall (2.2.6) and did a full restore onto the new installation of 2.3.4. The VLANs and interfaces were then set up. A backup was then made on the new 2.3.4 firewall and restored onto the second firewall, with adjustments made (e.g. interface IPs).

      On the first firewall, pfsync and XMLRPC sync are set up with the second firewall's IP on an interface, using the admin account. On the second firewall, pfsync is filled in with the IP of the first firewall. The username and password of the admin account are the same on both firewalls. On the interface used specifically for the sync, there is an Allow All rule on both firewalls. I can telnet from each machine to the other on the webConfigurator port (which is the same on both).
      It seems the initial sync works, as I see new users created on the second firewall; however, subsequent syncs fail with the "a communication error occurred while attempting xmlrpc sync" error. I should also add that I only have Users checked for syncing, as a test.

      I've also deleted the sync interfaces and recreated them with new IPs, and the sync problem persists.

      With a fresh installation of 2.3.4 and more or less the same setup, the sync works fine; however, I cannot go that route as I need to keep the users, certificates, rules, etc.

      What could be the problem?

        stephenw10 Netgate Administrator

        Do you have a large number of users?

        If you check the logs on the primary, what time difference is there between starting the sync and the failure entry?

        I'm guessing without log entries to go on, but you may be hitting this: https://redmine.pfsense.org/issues/7469

        However, that would affect 2.2.6 also.

        The fact that it syncs the first time and then fails implies it's applying a change on the secondary that prevents subsequent syncing. Usually that would be a mismatch in the interfaces.

        Check the config files from each node: the interfaces must appear with the same names and in the same order.
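The comparison described above could be scripted roughly as follows. This is only a sketch: it assumes each node's config.xml has an `<interfaces>` element directly under the root, whose children (`wan`, `lan`, `opt1`, ...) each carry a `<descr>`. The function names are made up for illustration and are not part of pfSense.

```python
import xml.etree.ElementTree as ET


def interface_order(config_xml: str) -> list[tuple[str, str]]:
    """Return (internal name, description) pairs in the order they appear."""
    root = ET.fromstring(config_xml)
    interfaces = root.find("interfaces")  # assumed location of the interface list
    if interfaces is None:
        return []
    return [(child.tag, (child.findtext("descr") or "").strip()) for child in interfaces]


def interface_mismatches(primary_xml: str, secondary_xml: str) -> list[str]:
    """Describe every position where the two nodes' interface lists differ."""
    primary = interface_order(primary_xml)
    secondary = interface_order(secondary_xml)
    problems = []
    if len(primary) != len(secondary):
        problems.append(f"interface count differs: {len(primary)} vs {len(secondary)}")
    # Compare position by position so a reordering shows up, not just a missing name.
    for position, (a, b) in enumerate(zip(primary, secondary)):
        if a != b:
            problems.append(f"position {position}: primary {a} vs secondary {b}")
    return problems
```

Feed it the text of a config backup from each node; an empty list means names and order line up, at least as far as this check can tell.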

        Steve

          redarmy123

          We have no more than 20 users on this firewall.

          Here is a portion of the logs; as you can see, it synced successfully once and failed the next time.

          Sep 25 14:00:59 php-fpm 61144 /rc.filter_synchronize: New alert found: A communications error occurred while attempting Filter sync with username admin https://192.168.251.250:443.
          Sep 25 14:00:59 php-fpm 61144 /rc.filter_synchronize: A communications error occurred while attempting Filter sync with username admin https://192.168.251.250:443.
          Sep 25 14:00:59 php-fpm 61144 /rc.filter_synchronize: XML_RPC_Client: RPC server did not send response before timeout. 103
          Sep 25 14:00:08 php-fpm 46082 /system_hasync.php: Configuring CARP settings finalize…
          Sep 25 14:00:08 php-fpm 46082 /system_hasync.php: pfsync done in 30 seconds.
          Sep 25 13:59:59 php-fpm 61144 /rc.filter_synchronize: XMLRPC sync successfully completed with https://192.168.251.250:443.
          Sep 25 13:59:36 php-fpm 61144 /rc.filter_synchronize: Beginning XMLRPC sync to https://192.168.251.250:443.
          Sep 25 13:59:36 php-fpm 46082 /system_hasync.php: waiting for pfsync...
          Sep 25 13:59:35 check_reload_status Syncing firewall

          On the second firewall, the System Logs don't show anything related except "check_reload_status      Reloading filter".

          The interfaces match, with the same names and in the correct order. The only difference, and I'm not sure if there is any significance to it, is that on the sync interface the first firewall shows "1000baseT <full-duplex,master>" and the second firewall shows "1000baseT <full-duplex>".

          I'm not able to find much information or any leads on this error: "/rc.filter_synchronize: XML_RPC_Client: RPC server did not send response before timeout. 103".

          I thought I'd also mention that a number of times, after a sync attempt, the GUI on the second firewall became unresponsive (504 gateway timeout?), and restarting php-fpm restored functionality.

            stephenw10 Netgate Administrator

            The maximum time allowed for the sync is 60s. If some part of that takes too long you will see that error. Too many users can cause that.
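To put a number on the delay, the timestamps in the log posted earlier can be subtracted. This is a rough sketch and not a pfSense tool: the helper names are made up, the year is a placeholder because the log lines carry none, and `%b` assumes an English locale. The excerpt does not show when the second sync attempt started, so the second figure is measured from the end of the first successful sync, on the assumption that the next attempt began around then.

```python
from datetime import datetime


def log_time(line: str, year: int = 2000) -> datetime:
    """Parse the leading 'Sep 25 14:00:59' stamp of a system log line.

    The year is a placeholder: the log omits it and only differences matter here.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def seconds_between(earlier_line: str, later_line: str) -> float:
    """Elapsed seconds between two log lines from the same day range."""
    return (log_time(later_line) - log_time(earlier_line)).total_seconds()


# Entries copied from the log excerpt posted earlier in this thread.
begin = "Sep 25 13:59:36 php-fpm 61144 /rc.filter_synchronize: Beginning XMLRPC sync to https://192.168.251.250:443."
done = "Sep 25 13:59:59 php-fpm 61144 /rc.filter_synchronize: XMLRPC sync successfully completed with https://192.168.251.250:443."
timeout = "Sep 25 14:00:59 php-fpm 61144 /rc.filter_synchronize: XML_RPC_Client: RPC server did not send response before timeout. 103"

print(seconds_between(begin, done))    # 23.0
print(seconds_between(done, timeout))  # 60.0
```

The first sync took 23 seconds, and the timeout entry lands exactly 60 seconds after it completed, which is consistent with the 60s limit.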

            If you are seeing 504 errors, that will also cause xmlrpc to fail; the web server needs to respond on the secondary.

            Does it complete successfully after you have restarted php?

            Steve

              redarmy123

              The second firewall already contains most (bar one or two) of the users, so I'm not sure why it would take so long to sync one or two users. We also have pfSense (in HA) in another environment with a lot more than 20 users, and that syncs without any issues.

              Yes, but it seems I'm only seeing the 504 on the second firewall as a result of trying to sync. Any ideas why this would crash the GUI?

              If I recall correctly, it does usually sync after restarting php-fpm. Restarting also removes the lock, which suggests a sync is in progress and never finishes successfully.

              EDIT: I restarted php-fpm on the second firewall and the one remaining user on the first firewall did not sync over.

                stephenw10 Netgate Administrator

                Hmm, well I agree that 20 users is not that many and I wouldn't expect any issue there.

                However, as a test, try disabling the user sync from the xmlrpc settings on the primary.

                The actual issue there, though, is the time the secondary takes to rebuild the users file from the config, and I believe that still applies.

                Steve

                  redarmy123

                  I only have Users checked for syncing. I disabled it and no longer see any errors relating to XMLRPC, but that's because there isn't anything to sync. At least that rules out authentication issues etc.

                  To test further, I checked only Firewall Aliases, but I still get the "New alert found: A communications error occurred while attempting Filter sync with username admin" error.

                  I've also disabled the sync on both machines, changed the password for the admin account, and re-enabled the sync, which synced fine once and then failed again.

                  I'm out of ideas!

                    stephenw10 Netgate Administrator

                    And you did not see 504/502 errors on the secondary GUI at that time?

                    Steve

                      redarmy123

                      The 504 error doesn't happen all the time. The sync fails even when the GUI is responding on the second firewall.

                        stephenw10 Netgate Administrator

                        Hmm, it still looks like a timing issue to me from the initial logs, though it's unclear what the cause is. Do you still see that same 1m delay on the primary? Is there nothing obviously logged as an error on the secondary?

                        Steve

                          redarmy123

                          In the end, I restored most of the existing config apart from the users. That seemed to work ok.

                          I also restored the DHCP section, which contains a lot of static mappings for a few interfaces. Once I restored this, sync broke, which I guess means it's taking too long to sync. I removed all static mappings and syncing worked again!

                          Can I increase the default timeout to something higher than 60 seconds?

                            stephenw10 Netgate Administrator

                            There is no easy way to increase it, though I believe it could be done. However, you should not normally need to.

                            How many static mappings do you have? What size is your config file?
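Both figures can be pulled from a config backup with a short script. A minimal sketch, assuming the DHCP server settings live under a top-level `<dhcpd>` element with one child per interface and one `<staticmap>` entry per static mapping; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def static_mapping_counts(config_xml: str) -> dict[str, int]:
    """Count <staticmap> entries per interface under <dhcpd> (assumed layout)."""
    root = ET.fromstring(config_xml)
    dhcpd = root.find("dhcpd")
    if dhcpd is None:
        return {}
    return {iface.tag: len(iface.findall("staticmap")) for iface in dhcpd}


def config_size_mb(config_xml: str) -> float:
    """Size of the config text in megabytes (10**6 bytes), UTF-8 encoded."""
    return len(config_xml.encode("utf-8")) / 1_000_000
```

Summing the values of `static_mapping_counts(...)` gives the total number of mappings across all interfaces.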

                            Steve

                              redarmy123

                              There are 186 mappings. The config xml file is 1.8MB

                                redarmy123

                                I restored the DHCP mappings again and the sync works.

                                Where it breaks is very inconsistent, which makes it hard to troubleshoot. As of now, the config is complete (except for users and certificates).

                                  jimp Rebel Alliance Developer Netgate

                                  Syncing a number of users can slow it down drastically. This is known and something we plan to address shortly: https://redmine.pfsense.org/issues/7469

                                  Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.