CARP triggered on the BACKUP only without obvious reason



  • Hi,

    encountered strange CARP half failover today? Why I say half; because it only happened on the BACKUP, not one carp log entry on the MASTER?
    System = SG4860 Dual, with dedicated link for SYNC (OPT4)
    Below other events are present in the System log on the BACKUP.

    Does anyone have a clue why this is triggered or where to look?

    
    Nov 20 08:10:06	php-fpm[79129]: /rc.carpbackup: Carp cluster member " 2_VLAN200 CARP IP (200@igb2_vlan200)" has resumed the state "BACKUP" for vhid 200@igb2_vlan200
    Nov 20 08:10:06	php-fpm[79129]: /rc.carpbackup: Message sent to @....com OK
    Nov 20 08:10:06	php-fpm[65854]: /rc.carpbackup: Carp cluster member "  CARP IP (4@igb2)" has resumed the state "BACKUP" for vhid 4@igb2
    Nov 20 08:10:06	php-fpm[65854]: /rc.carpbackup: Message sent to @....com OK
    Nov 20 08:10:06	php-fpm[55620]: /rc.carpbackup: Carp cluster member " VLAN33 CARP IP (33@igb0_vlan33)" has resumed the state "BACKUP" for vhid 33@igb0_vlan33
    Nov 20 08:10:06	php-fpm[55620]: /rc.carpbackup: Message sent to @....com OK
    Nov 20 08:10:05	php-fpm[79129]: /rc.carpbackup: Carp cluster member " 1_LAN CARP IP (1@igb0)" has resumed the state "BACKUP" for vhid 1@igb0
    Nov 20 08:10:05	php-fpm[79129]: /rc.carpbackup: Message sent to @....com OK
    Nov 20 08:10:05	php-fpm[65854]: /rc.carpbackup: Carp cluster member " VLAN222 CARP IP (160@igb0_vlan222)" has resumed the state "BACKUP" for vhid 160@igb0_vlan222
    Nov 20 08:10:05	php-fpm[65854]: /rc.carpbackup: Message sent to @....com OK
    Nov 20 08:10:05	php-fpm[55620]: /rc.carpbackup: Carp cluster member " VLAN153 CARP IP (153@igb0_vlan3)" has resumed the state "BACKUP" for vhid 153@igb0_vlan3
    Nov 20 08:10:05	php-fpm[55620]: /rc.carpbackup: Message sent to @....com OK
    Nov 20 08:10:05	php-fpm[79129]: /rc.carpbackup: Carp cluster member " VLAN169 CARP IP (169@igb0_vlan169)" has resumed the state "BACKUP" for vhid 169@igb0_vlan169
    Nov 20 08:10:05	php-fpm[79129]: /rc.carpbackup: Message sent to @....com OK
    Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Carp cluster member " 2_VLAN200 CARP IP (200@igb2_vlan200)" has resumed the state "MASTER" for vhid 200@igb2_vlan200
    Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Message sent to @....com OK
    Nov 20 08:10:05	php-fpm[79129]: /rc.carpmaster: Carp cluster member " 1_LAN CARP IP (1@igb0)" has resumed the state "MASTER" for vhid 1@igb0
    Nov 20 08:10:05	php-fpm[79129]: /rc.carpmaster: Message sent to @....com OK
    Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Carp cluster member " VLAN33 CARP IP (33@igb0_vlan33)" has resumed the state "MASTER" for vhid 33@igb0_vlan33
    Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Message sent to @....com OK
    Nov 20 08:10:05	php-fpm[79129]: /rc.carpmaster: Carp cluster member " VLAN169 CARP IP (169@igb0_vlan169)" has resumed the state "MASTER" for vhid 169@igb0_vlan169
    Nov 20 08:10:05	php-fpm[79129]: /rc.carpmaster: Message sent to @....com OK
    Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Carp cluster member " VLAN222 CARP IP (160@igb0_vlan222)" has resumed the state "MASTER" for vhid 160@igb0_vlan222
    Nov 20 08:10:05	php-fpm[55620]: /rc.carpmaster: Message sent to @....com OK
    Nov 20 08:10:04	php-fpm[79129]: /rc.carpmaster: Carp cluster member " VLAN153 CARP IP (153@igb0_vlan3)" has resumed the state "MASTER" for vhid 153@igb0_vlan3
    Nov 20 08:10:04	php-fpm[79129]: /rc.carpmaster: Message sent to @....com OK
    Nov 20 08:10:04	php-fpm[79129]: /rc.carpmaster: Carp cluster member "  CARP IP (4@igb2)" has resumed the state "MASTER" for vhid 4@igb2
    Nov 20 08:10:04	php-fpm[79129]: /rc.carpmaster: Message sent to @....com OK
    Nov 20 08:10:03	check_reload_status: Carp backup event
    Nov 20 08:10:03	check_reload_status: Carp backup event
    Nov 20 08:10:03	check_reload_status: Carp backup event
    Nov 20 08:10:03	check_reload_status: Carp backup event
    Nov 20 08:10:03	check_reload_status: Carp backup event
    Nov 20 08:10:03	check_reload_status: Carp backup event
    Nov 20 08:10:03	check_reload_status: Carp backup event
    Nov 20 08:10:03	check_reload_status: Carp master event
    Nov 20 08:10:03	check_reload_status: Carp master event
    Nov 20 08:10:03	check_reload_status: Carp master event
    Nov 20 08:10:03	check_reload_status: Carp master event
    Nov 20 08:10:03	check_reload_status: Carp master event
    Nov 20 08:10:03	check_reload_status: Carp master event
    Nov 20 08:10:03	kernel: carp: VHID 200@igb2_vlan200: MASTER -> BACKUP (more frequent advertisement received)
    Nov 20 08:10:03	kernel: carp: VHID 4@igb2: MASTER -> BACKUP (more frequent advertisement received)
    Nov 20 08:10:03	kernel: carp: VHID 33@igb0_vlan33: MASTER -> BACKUP (more frequent advertisement received)
    Nov 20 08:10:03	kernel: carp: VHID 1@igb0: MASTER -> BACKUP (more frequent advertisement received)
    Nov 20 08:10:03	kernel: carp: VHID 160@igb0_vlan222: MASTER -> BACKUP (more frequent advertisement received)
    Nov 20 08:10:03	kernel: carp: VHID 153@igb0_vlan3: MASTER -> BACKUP (more frequent advertisement received)
    Nov 20 08:10:03	kernel: carp: VHID 169@igb0_vlan169: MASTER -> BACKUP (more frequent advertisement received)
    Nov 20 08:10:03	check_reload_status: Carp master event
    Nov 20 08:10:03	kernel: carp: VHID 200@igb2_vlan200: BACKUP -> MASTER (master down)
    Nov 20 08:10:03	kernel: carp: VHID 1@igb0: BACKUP -> MASTER (master down)
    Nov 20 08:10:03	kernel: carp: VHID 33@igb0_vlan33: BACKUP -> MASTER (master down)
    Nov 20 08:10:03	kernel: carp: VHID 169@igb0_vlan169: BACKUP -> MASTER (master down)
    Nov 20 08:10:03	kernel: carp: VHID 160@igb0_vlan222: BACKUP -> MASTER (master down)
    Nov 20 08:10:03	kernel: carp: VHID 153@igb0_vlan3: BACKUP -> MASTER (master down)
    Nov 20 08:10:03	kernel: carp: VHID 4@igb2: BACKUP -> MASTER (master down)
    
    

    #edit: version = v2.2.5



  • What are the ADVBASE and SKEW values for each of the CARP VIPs on each box?



  • they are left default for all VIP's.

    -> MASTER = Base 1 Skew 0
    -> BACKUP = Base 1 Skew 100 (set automatically)



  • In this thread https://forum.pfsense.org/index.php?topic=90758.0, Jimp gives some good advice, which I had also stumbled upon by trial and error.

    Set ADVBASE higher on the backup.

    Infact, there is a clue to this in the DHCP server Failover peer configuration

    Leave blank to disable. Enter the interface IP address of the other machine. 
    Machines must be using CARP. Interface's advskew determines whether the DHCPd process is Primary or Secondary. 
    Ensure one machine's advskew<20 (and the other is >20).
    


  • hmm, interesting tip. When I increase in the BACKUP for a certain CARP the Base setting to anything higher (like "22"), after a while (I assume a next sync?) it returns to value "1".
    And this leads also to the issue that my dhcpd failover states are in error. Until I change it back. And then after a while we go back to "1". (manual loop  ::)

    So either I'm doing it wrong, or there is something wrong?



  • @bennyc:

    hmm, interesting tip. When I increase in the BACKUP for a certain CARP the Base setting to anything higher (like "22"), after a while (I assume a next sync?) it returns to value "1".
    And this leads also to the issue that my dhcpd failover states are in error. Until I change it back. And then after a while we go back to "1". (manual loop  ::)

    So either I'm doing it wrong, or there is something wrong?

    Sorry, I forgot one critical detail.

    You must uncheck Synchronize Vitual IPs in the System -> High Avail. Sync, otherwise the MASTER will keep overwriting the ADVBASE value.
    This also means you must manually configure the VIP address on each box, initially check the Synchronize Vitual IPs when you do the setup, then uncheck it to go into production, and never check it again.

    I guess this is a bug, because ADVBASE, and SKEW should not be overwritten on sync.



  • First of all; thanks for your hints here!

    Ok, that makes sense. In meantime, this is driving me nuts.
    I disabled the sync for dhcpd. Then modified each CARP IP on the backup with a Adv. of 22. Then I had one failover group, that for no obvious reason remained in unknown state. Reboot, went back to settings, compaired dhcpd.conf from each member, they seemed ok.

    Restarted the dhcpd from the backup, all groups on my Master reported in recover, and on my backup they are all (but one) normal  :(
    Then restarted the dhcpd on the master, seems all groups on my Master are stuck in recover  >:(

    sigh… ::)



  • You are certain there are no duplicate CARP IDs or HSRP/VRRP sessions using the same CARP IDs?
    From shell, use tcpdump -s 0 -n -i <interface>-T carp proto 112
    Check each interface individually to be sure.</interface>



  • Went back to my initial state (full sync incl VIPs), removed even the failover ip's from dhcp's.

    Removed sync for CARP again. Then I updated the VIP CARP Adv. to 22 on the backup.

    Started the packet capture as requested. Set the FailOver IP again on the problematic interface (only here). Immediately the dhcp failover group shows on the backup, but also immediately in state "recover" and peer-state "unknown-state"

    I went over my config, I only have 7 CARP IP's, and they each have a unique hvid. Did a capture on both nodes, on the IF which is giving me troubles, and I only see CARPv2-advertise 36 for the configured vhid:
    22:50:58.137661 IP 10.100.200.251 > 224.0.0.18: CARPv2-advertise 36: vhid=200 advbase=1 advskew=0 authlen=7 counter=10858211938054085703

    Changed the vhid just to be sure, repeated the test (remove failover, add failover), same result.

    I must add, it justed dawned to me this IF happens to be the one connected on a Cisco stack with hsrp configured on it. It seems now a good idea for me to review it's config…



  • @awebster

    Tnx for all hints here, solved the issue in meantime.
    It required a reboot of the Master to get things running again, and a config error from my part for that last dhcp group.



  • @awebster:

    I guess this is a bug, because ADVBASE, and SKEW should not be overwritten on sync.

    Perhaps silly question, but did you report this? Seem indeed unwanted behavior that could be fixed as a bug…





  • Great! (& tnx!) I would have done it otherwise, but credits go to you…


Log in to reply