Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    HA randomly BACKUP goes to MASTER state

    Scheduled Pinned Locked Moved HA/CARP/VIPs
    21 Posts 4 Posters 4.1k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • S
      SteveITS Galactic Empire @m4rek11
      last edited by

      @m4rek11 Have you already found:

      https://docs.netgate.com/pfsense/en/latest/troubleshooting/high-availability.html
      https://docs.netgate.com/pfsense/en/latest/troubleshooting/high-availability-virtual.html

      Pre-2.7.2/23.09: Only install packages for your version, or risk breaking it. Select your branch in System/Update/Update Settings.
      When upgrading, allow 10-15 minutes to restart, or more depending on packages and device speed.
      Upvote 👍 helpful posts!

      M 1 Reply Last reply Reply Quote 0
      • M
        m4rek11 @SteveITS
        last edited by

        @steveits thank you for your reply. Of course I saw link from you, but nothing form it match to my problem, so I decided to ask on forum. Maybe someone had the same problem and can share some tips for it.

        P 1 Reply Last reply Reply Quote 0
        • P
          Przemyslaw85 @m4rek11
          last edited by Przemyslaw85

          @m4rek11 I have the same problem. And it wasn't on version 2.5.x
          In my opinion, the new 2.6 release broke something.
          Router 1: No problems
          Router 2:

          May 12 09:33:16	kernel		carp: 130@em1.1030: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 160@em1.1060: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 11@em1.10: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 150@em1.1050: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 140@em1.1040: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 130@em1.1030: BACKUP -> MASTER (master timed out)
          May 12 09:33:16	kernel		carp: 120@em1.1020: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 160@em1.1060: BACKUP -> MASTER (master timed out)
          May 12 09:33:16	kernel		carp: 110@em1.1010: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 11@em1.10: BACKUP -> MASTER (master timed out)
          May 12 09:33:16	kernel		carp: 14@em1.100: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 150@em1.1050: BACKUP -> MASTER (master timed out)
          May 12 09:33:16	kernel		carp: 22@em1.500: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 140@em1.1040: BACKUP -> MASTER (master timed out)
          May 12 09:33:16	kernel		carp: 10@em0: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 196@bce0.881: MASTER -> BACKUP (more frequent advertisement received)
          May 12 09:33:16	kernel		carp: 120@em1.1020: BACKUP -> MASTER (master timed out)
          May 12 09:33:16	kernel		carp: 110@em1.1010: BACKUP -> MASTER (master timed out)
          May 12 09:33:16	kernel		carp: 14@em1.100: BACKUP -> MASTER (master timed out)
          May 12 09:33:16	kernel		carp: 196@bce0.881: BACKUP -> MASTER (master timed out)
          May 12 09:33:16	kernel		carp: 10@em0: BACKUP -> MASTER (master timed out)
          May 12 09:33:16	kernel		carp: 22@em1.500: BACKUP -> MASTER (master timed out)
          

          I have been looking for the cause for some time, unfortunately still the same.
          It is enough that on the main one I change something irrelevant for the second router, e.g. sending logs, Router 2 changes into a master for 1s.

          My pfSense box w HA:
          Master: HP DL360G8 1x E5-2670, 64GB ECC RAM, 8x NIC (17x VLan)
          Slave: HP DL360G5, 2x E5410, 64GB ECC RAM, 6x NIC (17x VLan)

          M 1 Reply Last reply Reply Quote 0
          • M
            m4rek11 @Przemyslaw85
            last edited by

            @przemyslaw85,

            My UTM01 not showing any log about CARP like doing it UTM02, so we have the same behavior, I think. Another example about that strange situation is if I change anything what require "reload" firewall on UTM01 (MASTER) and of course after reload it, randomly UTM02 (BACKUP) change to MASTER for little short of time. I think it is exacly what you described.

            One thing what I plan to test is to change network interface for UTMs in hipervior (Proxmox) from Virtio to e1000 and separate that VLAN that this interface is only for communication between HA.

            If anyone has any other ideas, please let me known.

            P.S. @przemyslaw85 Po Twoich postach widziałem, że też z Polski jesteś, ale że nie opisałem problemu w kategorii "Polska", więc po angielsku kontynuuje wątek. W każdym razie dziękuję za post z Twojej strony. Może uda się problem rozwiązać w końcu.

            P 1 Reply Last reply Reply Quote 0
            • P
              Przemyslaw85 @m4rek11
              last edited by Przemyslaw85

              @m4rek11 I have a separate LAN port for Router1-Router2 sync.
              It won't do much, but it is recommended that it be.
              You also have the gateway monitor reset also on Router2?

              PS. W naszym dziale wieje nudą. Mało jest nas co korzysta z tego oprogramowania. Wiec tutaj się podłączyłem.

              My pfSense box w HA:
              Master: HP DL360G8 1x E5-2670, 64GB ECC RAM, 8x NIC (17x VLan)
              Slave: HP DL360G5, 2x E5410, 64GB ECC RAM, 6x NIC (17x VLan)

              P 1 Reply Last reply Reply Quote 0
              • P
                Przemyslaw85 @Przemyslaw85
                last edited by

                In the statistics for interfaces I see Error IN errors for Lan Sync.
                Interestingly, the amount increases with these hops to Router 2.
                I've already changed the server, network cards and ports. The problem persists.

                My pfSense box w HA:
                Master: HP DL360G8 1x E5-2670, 64GB ECC RAM, 8x NIC (17x VLan)
                Slave: HP DL360G5, 2x E5410, 64GB ECC RAM, 6x NIC (17x VLan)

                M 1 Reply Last reply Reply Quote 0
                • M
                  m4rek11 @Przemyslaw85
                  last edited by

                  @przemyslaw85

                  What do you mean about gateway monitor reset?

                  In my switch I don't see any packet errors about that vlan. Of course I made NIC chagnes in Proxmox, like I diescribed earlier, but problem still persist.

                  I'm very confused, why any changes that requires firewall reload causing (not always, but ofen) change MASTER to BACKUP and BACKUP to MASTER for a few seconds.

                  Maybe it is a pfSense bug?

                  B P 2 Replies Last reply Reply Quote 0
                  • B
                    B_IT @m4rek11
                    last edited by B_IT

                    @m4rek11 In earlier version 2.4.5 I saw sometimes (not always) that saving the rules from the firewall caused the change of one of several CARP interfaces from MASTER to BACKUP but it lasted a fraction of a second and did not cause any additional problems. This was only one, small issue with CARP I got with this older version.
                    This Saturday I installed version 2.6.0 CE and I noticed that just restarting one of the firewalls causes several minutes of "agreeing" who is MASTER and who is BACKUP - got similar logs as you and @Przemyslaw85
                    At first, I thought it is happening only when restarting one of 2 firewalls, because firewalls were working fine together for more than a day, but during the half of production day it happened again and do know why.

                    Our firewall is two physical machines connected via two switches

                    1 Reply Last reply Reply Quote 0
                    • P
                      Przemyslaw85 @m4rek11
                      last edited by Przemyslaw85

                      @m4rek11 I have internet connection monitoring configured. As soon as the roles of Router Backup are changed, the link status is lost.

                      @b_it Restarting the backup router also causes such problems for me.

                      My pfSense box w HA:
                      Master: HP DL360G8 1x E5-2670, 64GB ECC RAM, 8x NIC (17x VLan)
                      Slave: HP DL360G5, 2x E5410, 64GB ECC RAM, 6x NIC (17x VLan)

                      B 1 Reply Last reply Reply Quote 0
                      • B
                        B_IT @Przemyslaw85
                        last edited by

                        @przemyslaw85 I also have Internet connection monitoring enabled. I found this discussion on https://redmine.pfsense.org/issues/12961 regarding CARP storm. I will try to play with this solution this saturday and see what will happen.

                        M 1 Reply Last reply Reply Quote 0
                        • M
                          m4rek11 @B_IT
                          last edited by

                          @Przemyslaw85, Im using monitoring system too and when the problem occuring lots of hosts (vlans) are not visible for a while within it.

                          @b_it, thank you for link, please, let us to known about your test.

                          B 1 Reply Last reply Reply Quote 0
                          • B
                            B_IT @m4rek11
                            last edited by

                            @m4rek11 @Przemyslaw85 Sure I will. I hope that I will back with good news. Stay tuned

                            B 1 Reply Last reply Reply Quote 0
                            • B
                              B_IT @B_IT
                              last edited by

                              On last Saturday I added the two patches I mentioned in the previous comment and so far it looks much better. I don't see too many unnecessary messages, both firewalls are stable after these few days. Here are all the patches I added directly to mitigate CARP issue:

                              Fix CARP event storm when leaving persistent CARP maintenance mode 1/2
                              https://github.com/pfsense/pfsense/commit/8a906fba5e42d391227dfc39311d02b570576d50.patch

                              Fix CARP event storm when leaving persistent CARP maintenance mode 2/2
                              https://github.com/pfsense/pfsense/commit/3c15b353c6968801cfffb7d3b30a7069d2330a3e.patch

                              during patching Saturday I also manually added this one:
                              Fix Clicking Save & Force Update on a Dynamic DNS entry results in a GUI timeout
                              https://github.com/pfsense/pfsense/commit/bdffb77d1aa21770b23ef408ad9fba79d0825ec5.patch

                              and I applied this three patches from recommended section:
                              Disable pf counter data preservation to temporarily work around latency when reloading large rulesets (Redmine #12827)

                              Fix Captive Portal handling of non-TCP traffic after login (Redmine #12834)

                              Fix OpenVPN dashboard widget client termination (Redmine #12817)

                              to sum up: for now I will stay with 2.6.0 version with patches

                              P 1 Reply Last reply Reply Quote 0
                              • P
                                Przemyslaw85 @B_IT
                                last edited by

                                @b_it I understand I have made changes for mode 1/2 and mode 2/2.
                                For mode 1/2 I have to do steps for server 1 or both.

                                My pfSense box w HA:
                                Master: HP DL360G8 1x E5-2670, 64GB ECC RAM, 8x NIC (17x VLan)
                                Slave: HP DL360G5, 2x E5410, 64GB ECC RAM, 6x NIC (17x VLan)

                                B 1 Reply Last reply Reply Quote 0
                                • B
                                  B_IT @Przemyslaw85
                                  last edited by

                                  @przemyslaw85 I think that every node should have the same set of patches. So I patched first node, and than the second node.

                                  this name is just my own convention name:

                                  Fix CARP event storm when leaving persistent CARP maintenance mode 1/2
                                  Fix CARP event storm when leaving persistent CARP maintenance mode 2/2

                                  For CARP issue the second patch is not going to apply without the first one. This the view from one node (the second has the same set o patches)
                                  4bc2fdee-0f66-45d5-a658-dfb4ca325c88-obraz.png

                                  P 1 Reply Last reply Reply Quote 0
                                  • P
                                    Przemyslaw85 @B_IT
                                    last edited by

                                    @b_it I confirm the operation of the patches.
                                    Yesterday I made a few changes to the original files using the file editor. I didn't know there was such a module as patches. I had to revert to the original changes from a copy made before editing.
                                    As I added 1/2 2/2 patches and Dynamin DNS I did not notice any improvement. Only after I added patches # 12827, # 12834, # 12816 and # 12817 I can say that now the system works as it should.

                                    My pfSense box w HA:
                                    Master: HP DL360G8 1x E5-2670, 64GB ECC RAM, 8x NIC (17x VLan)
                                    Slave: HP DL360G5, 2x E5410, 64GB ECC RAM, 6x NIC (17x VLan)

                                    B 1 Reply Last reply Reply Quote 0
                                    • B
                                      B_IT @Przemyslaw85
                                      last edited by

                                      @przemyslaw85 Seems to me that when I started to patch (CARP) I saw that firewall is more responsive making later changes (patching) but I didn't wait too long - just rebooted both nodes to be sure that all selected patches are fully applied.
                                      I have to admit that I started to make more thorough tests after I rebooting FWs (with mentioned patch set), so I can't be sure what really helped and how much.
                                      BTW; The patching mechanism was introduced around version 2.5, and I've already learned from his beginning that I have to be careful selecting patches.

                                      M 1 Reply Last reply Reply Quote 0
                                      • M
                                        m4rek11 @B_IT
                                        last edited by

                                        @Przemyslaw85, @B_IT, after that changes did you have carp storm in logs and that MASTER -> BACKUP, BACKUP ->MASTER change for little time?

                                        B P 2 Replies Last reply Reply Quote 0
                                        • B
                                          B_IT @m4rek11
                                          last edited by

                                          @m4rek11 I am looking into logs I see that during applying patch there are some entries, but after patching I see only a few, and they all looks as they should (at least for me) and they have reason (eg. rebooted node). I wouldn't call them storm and definitely I don't see flipping MASTER - BACKUP entries now.

                                          1 Reply Last reply Reply Quote 0
                                          • P
                                            Przemyslaw85 @m4rek11
                                            last edited by Przemyslaw85

                                            @m4rek11 After applying the patches, I did not notice that the routers changed the roles of Master-> Backup, Backup-> Master.
                                            All the problems went with those when I made any changes to the rules, dns or DHCP.

                                            I found my configuration error early. For unknown reason, for 2 different networks I sent the same vhid for Virtual IP. But the problems were still there. After applying the patches, the problem was gone.

                                            My pfSense box w HA:
                                            Master: HP DL360G8 1x E5-2670, 64GB ECC RAM, 8x NIC (17x VLan)
                                            Slave: HP DL360G5, 2x E5410, 64GB ECC RAM, 6x NIC (17x VLan)

                                            1 Reply Last reply Reply Quote 0
                                            • First post
                                              Last post
                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.