Netgate Discussion Forum

    CARP sporadically flopping to BACKUP and then back to MASTER

    HA/CARP/VIPs
    • KOM

      A better question might be: why is there so much contention on your network that the heartbeats are delayed to the point where failover is triggered?

      • five0va

        There shouldn't be… it's a direct cable. Just had it happen again, about 90% of VIPs failed over.

        • KOM

          You might want to capture the traffic on that link and see what's going on.
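A capture like this can be limited to CARP itself. The sketch below only builds a tcpdump command line; it assumes tcpdump is available on the firewall, and the interface name `em1` and the output file name are placeholders, not values from this thread. The filter `ip proto 112` matches IPv4 CARP advertisements only.

```python
import shlex


def carp_capture_command(interface: str, output_file: str) -> list[str]:
    """Build a tcpdump argv that records only IPv4 CARP traffic on one interface."""
    return [
        "tcpdump",
        "-n",               # do not resolve addresses to names
        "-i", interface,    # capture on this interface
        "-w", output_file,  # write raw packets to a file that Wireshark can open
        "ip proto 112",     # CARP uses IP protocol number 112 (shared with VRRP)
    ]


# "em1" and "carp_link.pcap" are hypothetical placeholders.
print(shlex.join(carp_capture_command("em1", "carp_link.pcap")))
```

The printed line can be pasted into a shell on the firewall; stop the capture with Ctrl-C once a failover has been caught.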

          • awebster

            I too initially had that same issue of VIP flip-flopping.  I chalked it up to the lack of decent timer resolution when running my pfSense instances on ESXi.
            What worked for me in the end:
            MASTER: BASE=1
            BACKUP: BASE=10
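To put numbers on those two settings, here is a rough sketch of how the base and skew combine into an advertisement interval, assuming the usual CARP rule of base seconds plus skew/256 seconds. The skew values used (0 and 100) are assumptions for illustration; the post above only gives the base values.

```python
def advertisement_interval(advbase: int, advskew: int) -> float:
    """Seconds between CARP advertisements: advbase whole seconds plus advskew/256."""
    return advbase + advskew / 256


# Base values from the post above; the skew values are assumed, not from the thread.
master = advertisement_interval(advbase=1, advskew=0)
backup = advertisement_interval(advbase=10, advskew=100)

print(f"MASTER would advertise every {master} s")
print(f"BACKUP would advertise every {backup} s")
```

The node that advertises more often wins the election, so a larger base on the BACKUP makes it less eager to take over.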

            Be sure to verify that no VHID is duplicated on any of your failover interfaces, that no switch HSRP/VRRP group uses the same ID number there, that the failover interfaces can never see each other's traffic, and that IPv4 and IPv6 each use a different VHID on the same interface.
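The VHID part of that check is easy to script once the VIPs are written down. This is a minimal sketch over a hand-made list; the interface names, VHIDs and addresses are invented for illustration, and it does not read the pfSense configuration or look at switch HSRP/VRRP groups.

```python
from collections import defaultdict


def duplicate_vhids(vips):
    """Return {(interface, vhid): [addresses]} for each VHID used more than once on an interface."""
    seen = defaultdict(list)
    for interface, vhid, address in vips:
        seen[(interface, vhid)].append(address)
    return {key: addrs for key, addrs in seen.items() if len(addrs) > 1}


# Hypothetical VIP list: (interface, VHID, address).
vips = [
    ("lan", 1, "192.0.2.1"),
    ("lan", 1, "2001:db8::1"),   # IPv6 VIP reusing the IPv4 VHID on the same interface
    ("wan", 2, "198.51.100.1"),
]

for (interface, vhid), addrs in duplicate_vhids(vips).items():
    print(f"VHID {vhid} on {interface} is shared by: {', '.join(addrs)}")
```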

            –A.

            • awebster

              I forgot one critical detail.

              You must uncheck Synchronize Virtual IPs under System -> High Avail. Sync, otherwise the MASTER will keep overwriting the ADVBASE value.
              This also means you must manually configure the VIP address on each box, initially check Synchronize Virtual IPs when you do the setup, then uncheck it to go into production, and never check it again.

              I guess this is a bug, because ADVBASE and SKEW should not be overwritten on sync.

              –A.

              • ljorgensen

                @ljorgensen:

                @jimp:

                Try increasing the advbase, rather than the skew. The skew only adjusts in increments of 1/256th of a second; base adds whole seconds.

                That seems to have fixed the problem. Thank you!

                That was premature, unfortunately. The problem persisted, only less often due to the fewer advertisements being sent. I hooked up a wireshark probe on the network today and was able to have a capture running when it happened.

                I see CARP packets once a second when things are working normally. When the Master reverts to backup, I see the same CARP packet repeated thousands of times (actually more than 20,000 times in one second in the capture).

                I'm not sure this is related to pfSense at all. I'm leaning more toward something in the network grossly misbehaving. The problem is that the error occurs in only one VLAN out of seven VLANs on the same interface. The other six behave just fine.

                I include the wireshark dump in this post in the hope that someone will devote a few minutes to look at it and tell me where my next point of attention should be. The master->backup switchover happens 348 seconds in (and is pretty obvious!).

                Lars

                CARP_VLAN6_multicast_storm.pcapng.gz

                • awebster

                  @ljorgensen:

                  …
                  I see CARP packets once a second when things are working normally. When the Master reverts to backup, I see the same CARP packet repeated thousands of times (actually more than 20,000 times in one second in the capture).

                  I'm not sure this is related to pfSense at all. I'm leaning more toward something in the network grossly misbehaving. The problem is that the error occurs in only one VLAN out of seven VLANs on the same interface. The other six behave just fine.
                  ...

                  Looks like there might be a loop in the network: the same packet is seen repeating over and over again as of packet 348, at a rate of around 20,000 pps.  That should be setting off some alarms if you have broadcast / multicast storm control set up on your switches.
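Spotting a burst like this does not require scrolling through the capture. The sketch below buckets packet timestamps by second and flags any second above a threshold. It works on a plain list of capture-relative timestamps (for example exported from Wireshark); the data here is synthetic, shaped after the figures reported in this thread, and the threshold of 100 packets per second is an arbitrary assumption.

```python
from collections import Counter


def packets_per_second(timestamps):
    """Count packets in each whole-second bucket of capture-relative time."""
    return Counter(int(ts) for ts in timestamps)


def storm_seconds(timestamps, threshold=100):
    """Return the sorted seconds in which the packet count exceeds the threshold."""
    counts = packets_per_second(timestamps)
    return sorted(second for second, count in counts.items() if count > threshold)


# Synthetic data: one advertisement per second, then 20,000 packets within second 348.
normal = [float(second) for second in range(348)]
burst = [348 + i / 20000 for i in range(20000)]

print(storm_seconds(normal + burst))
```

With real data, replace the synthetic list with the timestamp column of the capture.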

                  –A.

                  • ljorgensen

                    @awebster:

                    Looks like there might be a loop in the network: the same packet is seen repeating over and over again as of packet 348, at a rate of around 20,000 pps.  That should be setting off some alarms if you have broadcast / multicast storm control set up on your switches.

                    I don't have storm control set up on the switches, and it's not enough traffic to disrupt anything, so I won't bother. I'm guessing something in the network is making those packets loop around for a few seconds.

                    I've tried adjusting the Advertising Frequency Base to 30 seconds and that seems to have solved the problem. I haven't seen anything for a few days now, and it used to happen once every 10 to 15 minutes. It means I'll have a slower failover in the event of a network outage on the master, but that's not a problem compared to the previous situation, where the master spontaneously became backup numerous times throughout the day.
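For a rough idea of how much slower, here is a small estimate. It assumes the backup declares the master down after three times the base plus skew/256 seconds without an advertisement, which is how I read the FreeBSD CARP implementation; treat the rule and the skew of 0 as assumptions, not as something confirmed in this thread.

```python
def master_down_timeout(advbase: int, advskew: int) -> float:
    """Assumed seconds of silence before a backup takes over: 3 * advbase + advskew / 256."""
    return 3 * advbase + advskew / 256


# Base of 1 as a typical starting point versus the base of 30 chosen above; skew 0 is assumed.
for base in (1, 30):
    print(f"advbase={base}: takeover after about {master_down_timeout(base, 0)} s of silence")
```

Under those assumptions, a base of 30 means roughly a minute and a half before the backup reacts to a dead master.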

                    Lars

                    • cmb

                      @awebster:

                      I guess this is a bug, because ADVBASE and SKEW should not be overwritten on sync.

                      No, you need those to match.

                      • mrn

                        @awebster

                        You must uncheck Synchronize Virtual IPs under System -> High Avail. Sync, otherwise the MASTER will keep overwriting the ADVBASE value.
                        This also means you must manually configure the VIP address on each box, initially check Synchronize Virtual IPs when you do the setup, then uncheck it to go into production, and never check it again.

                        In 2025 I found this comment, and it resolved my problem!

                        Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.