Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    CARP triggered on the BACKUP only without obvious reason

    Scheduled Pinned Locked Moved HA/CARP/VIPs
    13 Posts 2 Posters 4.2k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • B Offline
      bennyc
      last edited by

      Hi,

      encountered strange CARP half failover today? Why I say half; because it only happened on the BACKUP, not one carp log entry on the MASTER?
      System = SG4860 Dual, with dedicated link for SYNC (OPT4)
      Below other events are present in the System log on the BACKUP.

      Does anyone have a clue why this is triggered or where to look?

      
      Nov 20 08:10:06	php-fpm[79129]: /rc.carpbackup: Carp cluster member " 2_VLAN200 CARP IP (200@igb2_vlan200)" has resumed the state "BACKUP" for vhid 200@igb2_vlan200
      Nov 20 08:10:06	php-fpm[79129]: /rc.carpbackup: Message sent to @....com OK
      Nov 20 08:10:06	php-fpm[65854]: /rc.carpbackup: Carp cluster member "  CARP IP (4@igb2)" has resumed the state "BACKUP" for vhid 4@igb2
      Nov 20 08:10:06	php-fpm[65854]: /rc.carpbackup: Message sent to @....com OK
      Nov 20 08:10:06	php-fpm[55620]: /rc.carpbackup: Carp cluster member " VLAN33 CARP IP (33@igb0_vlan33)" has resumed the state "BACKUP" for vhid 33@igb0_vlan33
      Nov 20 08:10:06	php-fpm[55620]: /rc.carpbackup: Message sent to @....com OK
      Nov 20 08:10:05	php-fpm[79129]: /rc.carpbackup: Carp cluster member " 1_LAN CARP IP (1@igb0)" has resumed the state "BACKUP" for vhid 1@igb0
      Nov 20 08:10:05	php-fpm[79129]: /rc.carpbackup: Message sent to @....com OK
      Nov 20 08:10:05	php-fpm[65854]: /rc.carpbackup: Carp cluster member " VLAN222 CARP IP (160@igb0_vlan222)" has resumed the state "BACKUP" for vhid 160@igb0_vlan222
      Nov 20 08:10:05	php-fpm[65854]: /rc.carpbackup: Message sent to @....com OK
      Nov 20 08:10:05	php-fpm[55620]: /rc.carpbackup: Carp cluster member " VLAN153 CARP IP (153@igb0_vlan3)" has resumed the state "BACKUP" for vhid 153@igb0_vlan3
      Nov 20 08:10:05	php-fpm[55620]: /rc.carpbackup: Message sent to @....com OK
      Nov 20 08:10:05	php-fpm[79129]: /rc.carpbackup: Carp cluster member " VLAN169 CARP IP (169@igb0_vlan169)" has resumed the state "BACKUP" for vhid 169@igb0_vlan169
      Nov 20 08:10:05	php-fpm[79129]: /rc.carpbackup: Message sent to @....com OK
      Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Carp cluster member " 2_VLAN200 CARP IP (200@igb2_vlan200)" has resumed the state "MASTER" for vhid 200@igb2_vlan200
      Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Message sent to @....com OK
      Nov 20 08:10:05	php-fpm[79129]: /rc.carpmaster: Carp cluster member " 1_LAN CARP IP (1@igb0)" has resumed the state "MASTER" for vhid 1@igb0
      Nov 20 08:10:05	php-fpm[79129]: /rc.carpmaster: Message sent to @....com OK
      Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Carp cluster member " VLAN33 CARP IP (33@igb0_vlan33)" has resumed the state "MASTER" for vhid 33@igb0_vlan33
      Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Message sent to @....com OK
      Nov 20 08:10:05	php-fpm[79129]: /rc.carpmaster: Carp cluster member " VLAN169 CARP IP (169@igb0_vlan169)" has resumed the state "MASTER" for vhid 169@igb0_vlan169
      Nov 20 08:10:05	php-fpm[79129]: /rc.carpmaster: Message sent to @....com OK
      Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Carp cluster member " VLAN222 CARP IP (160@igb0_vlan222)" has resumed the state "MASTER" for vhid 160@igb0_vlan222
      Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Message sent to @....com OK
      Nov 20 08:10:04	php-fpm[79129]: /rc.carpmaster: Carp cluster member " VLAN153 CARP IP (153@igb0_vlan3)" has resumed the state "MASTER" for vhid 153@igb0_vlan3
      Nov 20 08:10:04	php-fpm[79129]: /rc.carpmaster: Message sent to @....com OK
      Nov 20 08:10:04	php-fpm[79129]: /rc.carpmaster: Carp cluster member "  CARP IP (4@igb2)" has resumed the state "MASTER" for vhid 4@igb2
      Nov 20 08:10:04	php-fpm[79129]: /rc.carpmaster: Message sent to @....com OK
      Nov 20 08:10:03	check_reload_status: Carp backup event
      Nov 20 08:10:03	check_reload_status: Carp backup event
      Nov 20 08:10:03	check_reload_status: Carp backup event
      Nov 20 08:10:03	check_reload_status: Carp backup event
      Nov 20 08:10:03	check_reload_status: Carp backup event
      Nov 20 08:10:03	check_reload_status: Carp backup event
      Nov 20 08:10:03	check_reload_status: Carp backup event
      Nov 20 08:10:03	check_reload_status: Carp master event
      Nov 20 08:10:03	check_reload_status: Carp master event
      Nov 20 08:10:03	check_reload_status: Carp master event
      Nov 20 08:10:03	check_reload_status: Carp master event
      Nov 20 08:10:03	check_reload_status: Carp master event
      Nov 20 08:10:03	check_reload_status: Carp master event
      Nov 20 08:10:03	kernel: carp: VHID 200@igb2_vlan200: MASTER -> BACKUP (more frequent advertisement received)
      Nov 20 08:10:03	kernel: carp: VHID 4@igb2: MASTER -> BACKUP (more frequent advertisement received)
      Nov 20 08:10:03	kernel: carp: VHID 33@igb0_vlan33: MASTER -> BACKUP (more frequent advertisement received)
      Nov 20 08:10:03	kernel: carp: VHID 1@igb0: MASTER -> BACKUP (more frequent advertisement received)
      Nov 20 08:10:03	kernel: carp: VHID 160@igb0_vlan222: MASTER -> BACKUP (more frequent advertisement received)
      Nov 20 08:10:03	kernel: carp: VHID 153@igb0_vlan3: MASTER -> BACKUP (more frequent advertisement received)
      Nov 20 08:10:03	kernel: carp: VHID 169@igb0_vlan169: MASTER -> BACKUP (more frequent advertisement received)
      Nov 20 08:10:03	check_reload_status: Carp master event
      Nov 20 08:10:03	kernel: carp: VHID 200@igb2_vlan200: BACKUP -> MASTER (master down)
      Nov 20 08:10:03	kernel: carp: VHID 1@igb0: BACKUP -> MASTER (master down)
      Nov 20 08:10:03	kernel: carp: VHID 33@igb0_vlan33: BACKUP -> MASTER (master down)
      Nov 20 08:10:03	kernel: carp: VHID 169@igb0_vlan169: BACKUP -> MASTER (master down)
      Nov 20 08:10:03	kernel: carp: VHID 160@igb0_vlan222: BACKUP -> MASTER (master down)
      Nov 20 08:10:03	kernel: carp: VHID 153@igb0_vlan3: BACKUP -> MASTER (master down)
      Nov 20 08:10:03	kernel: carp: VHID 4@igb2: BACKUP -> MASTER (master down)
      
      

      #edit: version = v2.2.5

      4x XG-7100 (2xHA), 1x SG-4860, 1x SG-2100
      1x PC Engines APU2C4, 1x PC Engines APU1C4

      1 Reply Last reply Reply Quote 0
      • awebsterA Offline
        awebster
        last edited by

        What are the ADVBASE and SKEW values for each of the CARP VIPs on each box?

        –A.

        1 Reply Last reply Reply Quote 0
        • B Offline
          bennyc
          last edited by

          they are left default for all VIP's.

          -> MASTER = Base 1 Skew 0
          -> BACKUP = Base 1 Skew 100 (set automatically)

          4x XG-7100 (2xHA), 1x SG-4860, 1x SG-2100
          1x PC Engines APU2C4, 1x PC Engines APU1C4

          1 Reply Last reply Reply Quote 0
          • awebsterA Offline
            awebster
            last edited by

            In this thread https://forum.pfsense.org/index.php?topic=90758.0, Jimp gives some good advice, which I had also stumbled upon by trial and error.

            Set ADVBASE higher on the backup.

            Infact, there is a clue to this in the DHCP server Failover peer configuration

            Leave blank to disable. Enter the interface IP address of the other machine. 
            Machines must be using CARP. Interface's advskew determines whether the DHCPd process is Primary or Secondary. 
            Ensure one machine's advskew<20 (and the other is >20).
            

            –A.

            1 Reply Last reply Reply Quote 0
            • B Offline
              bennyc
              last edited by

              hmm, interesting tip. When I increase in the BACKUP for a certain CARP the Base setting to anything higher (like "22"), after a while (I assume a next sync?) it returns to value "1".
              And this leads also to the issue that my dhcpd failover states are in error. Until I change it back. And then after a while we go back to "1". (manual loop  ::)

              So either I'm doing it wrong, or there is something wrong?

              4x XG-7100 (2xHA), 1x SG-4860, 1x SG-2100
              1x PC Engines APU2C4, 1x PC Engines APU1C4

              1 Reply Last reply Reply Quote 0
              • awebsterA Offline
                awebster
                last edited by

                @bennyc:

                hmm, interesting tip. When I increase in the BACKUP for a certain CARP the Base setting to anything higher (like "22"), after a while (I assume a next sync?) it returns to value "1".
                And this leads also to the issue that my dhcpd failover states are in error. Until I change it back. And then after a while we go back to "1". (manual loop  ::)

                So either I'm doing it wrong, or there is something wrong?

                Sorry, I forgot one critical detail.

                You must uncheck Synchronize Vitual IPs in the System -> High Avail. Sync, otherwise the MASTER will keep overwriting the ADVBASE value.
                This also means you must manually configure the VIP address on each box, initially check the Synchronize Vitual IPs when you do the setup, then uncheck it to go into production, and never check it again.

                I guess this is a bug, because ADVBASE, and SKEW should not be overwritten on sync.

                –A.

                1 Reply Last reply Reply Quote 0
                • B Offline
                  bennyc
                  last edited by

                  First of all; thanks for your hints here!

                  Ok, that makes sense. In meantime, this is driving me nuts.
                  I disabled the sync for dhcpd. Then modified each CARP IP on the backup with a Adv. of 22. Then I had one failover group, that for no obvious reason remained in unknown state. Reboot, went back to settings, compaired dhcpd.conf from each member, they seemed ok.

                  Restarted the dhcpd from the backup, all groups on my Master reported in recover, and on my backup they are all (but one) normal  :(
                  Then restarted the dhcpd on the master, seems all groups on my Master are stuck in recover  >:(

                  sigh… ::)

                  4x XG-7100 (2xHA), 1x SG-4860, 1x SG-2100
                  1x PC Engines APU2C4, 1x PC Engines APU1C4

                  1 Reply Last reply Reply Quote 0
                  • awebsterA Offline
                    awebster
                    last edited by

                    You are certain there are no duplicate CARP IDs or HSRP/VRRP sessions using the same CARP IDs?
                    From shell, use tcpdump -s 0 -n -i <interface>-T carp proto 112
                    Check each interface individually to be sure.</interface>

                    –A.

                    1 Reply Last reply Reply Quote 0
                    • B Offline
                      bennyc
                      last edited by

                      Went back to my initial state (full sync incl VIPs), removed even the failover ip's from dhcp's.

                      Removed sync for CARP again. Then I updated the VIP CARP Adv. to 22 on the backup.

                      Started the packet capture as requested. Set the FailOver IP again on the problematic interface (only here). Immediately the dhcp failover group shows on the backup, but also immediately in state "recover" and peer-state "unknown-state"

                      I went over my config, I only have 7 CARP IP's, and they each have a unique hvid. Did a capture on both nodes, on the IF which is giving me troubles, and I only see CARPv2-advertise 36 for the configured vhid:
                      22:50:58.137661 IP 10.100.200.251 > 224.0.0.18: CARPv2-advertise 36: vhid=200 advbase=1 advskew=0 authlen=7 counter=10858211938054085703

                      Changed the vhid just to be sure, repeated the test (remove failover, add failover), same result.

                      I must add, it justed dawned to me this IF happens to be the one connected on a Cisco stack with hsrp configured on it. It seems now a good idea for me to review it's config…

                      4x XG-7100 (2xHA), 1x SG-4860, 1x SG-2100
                      1x PC Engines APU2C4, 1x PC Engines APU1C4

                      1 Reply Last reply Reply Quote 0
                      • B Offline
                        bennyc
                        last edited by

                        @awebster

                        Tnx for all hints here, solved the issue in meantime.
                        It required a reboot of the Master to get things running again, and a config error from my part for that last dhcp group.

                        4x XG-7100 (2xHA), 1x SG-4860, 1x SG-2100
                        1x PC Engines APU2C4, 1x PC Engines APU1C4

                        1 Reply Last reply Reply Quote 0
                        • B Offline
                          bennyc
                          last edited by

                          @awebster:

                          I guess this is a bug, because ADVBASE, and SKEW should not be overwritten on sync.

                          Perhaps silly question, but did you report this? Seem indeed unwanted behavior that could be fixed as a bug…

                          4x XG-7100 (2xHA), 1x SG-4860, 1x SG-2100
                          1x PC Engines APU2C4, 1x PC Engines APU1C4

                          1 Reply Last reply Reply Quote 0
                          • awebsterA Offline
                            awebster
                            last edited by

                            Thanks, here it is: https://redmine.pfsense.org/issues/5528

                            –A.

                            1 Reply Last reply Reply Quote 0
                            • B Offline
                              bennyc
                              last edited by

                              Great! (& tnx!) I would have done it otherwise, but credits go to you…

                              4x XG-7100 (2xHA), 1x SG-4860, 1x SG-2100
                              1x PC Engines APU2C4, 1x PC Engines APU1C4

                              1 Reply Last reply Reply Quote 0
                              • First post
                                Last post
                              Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.