Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    CARP fails over both interfaces, then one sometimes fails back

    Scheduled Pinned Locked Moved HA/CARP/VIPs
    5 Posts 2 Posters 1.0k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • M
      MarkMpn
      last edited by

      I have a simple test setup with two physical servers running pfSense 2.4.3. Each server has three interfaces:

      • WAN
      • LAN
      • SYNC

      WAN and LAN interfaces have associated VIP addresses.

      If I disconnect either the WAN or LAN connection, most of the time both connections fail over to the backup server and connectivity is restored within a second or so.

      Occasionally however, it appears only one of the VIPs has failed over, e.g. if I disconnect the WAN connection, the WAN VIP will fail over to the backup but the LAN VIP remains on the master.

      The system log seems to show that both VIPs did fail over, but one then quickly failed back. This seems to happen roughly 30% of the time in my tests.

      When this happens, I see the following log entry on the original master:
      carp: 2@bge5: BACKUP -> MASTER (preempting a slower master)
      and on the backup I see the entry:
      carp: 2@bge5: MASTER -> BACKUP (more frequent advertisement received)

      As I understand it, it should not be possible for CARP to result in failing over only one VIP, yet I can regularly reproduce it.

      Is there some configuration setting I'm missing, or is there something else I should be looking at?

      1 Reply Last reply Reply Quote 0
      • DerelictD
        Derelict LAYER 8 Netgate
        last edited by

        When you down a CARP interface, the advskew on all CARP VIPs on that node get demoted (advskews increased).

        The first thing I would look at is a packet capture of the CARP traffic on the secondary node, paying attention to the advbase/advskew values in the advertisements that are being received during one of these tests.

        If you do a bunch of yanking of interfaces it is possible that the demotion counter got confused. Are there any messages on Status > CARP?

        Does it all fail over normally if you enter CARP Maintenance Mode instead of pulling interfaces?

        Chattanooga, Tennessee, USA
        A comprehensive network diagram is worth 10,000 words and 15 conference calls.
        DO NOT set a source address/port in a port forward or firewall rule unless you KNOW you need it!
        Do Not Chat For Help! NO_WAN_EGRESS(TM)

        1 Reply Last reply Reply Quote 0
        • M
          MarkMpn
          last edited by MarkMpn

          There aren't any messages in Status > Carp. It does fail over as expected when I use Maintenance Mode.

          I've run a packet capture on the LAN interface on the secondary node (10.1.1.2), then unplugged the WAN connection on the primary (10.1.1.1). That gives the following log:

          13:08:01.211091 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:02.215013 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:03.219572 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:04.235793 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:05.237226 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:06.248646 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:07.251671 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:07.814455 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=240 authlen=7 counter=13725484716165676517
          13:08:07.814467 IP 10.1.1.2 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=100 authlen=7 counter=16795567171629106047
          13:08:09.237432 IP 10.1.1.2 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=100 authlen=7 counter=16795567171629106047
          13:08:09.237741 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:10.260303 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:11.276073 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:12.338015 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:13.400586 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:14.463607 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:15.525578 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:16.558297 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          13:08:17.563721 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
          

          The first few entries cover the initial state where everything is plugged in as normal, and the primary node is advertising itself with advbase 1 and advskew 0.

          At 13:08:07 the primary changes to advbase 1 and advskew 240. The secondary replies with advbase 1 and advskew 100, which I guess means it has taken over than LAN VIP as expected.

          At 13:08:09 the primary starts advertising with advbase 1 and advskew 0 again, which seems to coincide with the LAN VIP failing over back to the primary while the WAN VIP remains on the secondary.

          1 Reply Last reply Reply Quote 0
          • M
            MarkMpn
            last edited by

            I've repeated the test and got the system log from the primary:

            Aug 16 13:18:11	kernel		carp: 10@bge4: MASTER -> INIT (hardware interface down)
            Aug 16 13:18:11	kernel		carp: demoted by 240 to 240 (interface down)
            Aug 16 13:18:11	kernel		bge4: link state changed to DOWN
            Aug 16 13:18:11	kernel		carp: 11@bge5: MASTER -> BACKUP (more frequent advertisement received)
            Aug 16 13:18:11	kernel		ifa_maintain_loopback_route: deletion failed for interface bge5: 3
            Aug 16 13:18:11	check_reload_status		Carp backup event
            Aug 16 13:18:11	check_reload_status		Linkup starting bge4
            Aug 16 13:18:11	check_reload_status		Carp backup event
            Aug 16 13:18:12	php-fpm	17778	/rc.linkup: DEVD Ethernet detached event for wan
            Aug 16 13:18:12	php-fpm	318	/rc.carpbackup: HA cluster member "(193.187.72.4@bge4): (WAN)" has resumed CARP state "BACKUP" for vhid 10
            Aug 16 13:18:12	php-fpm	317	/rc.carpbackup: HA cluster member "(10.1.1.3@bge5): (LAN)" has resumed CARP state "BACKUP" for vhid 11
            Aug 16 13:18:12	kernel		ifa_maintain_loopback_route: deletion failed for interface bge4: 3
            Aug 16 13:18:12	kernel		ifa_maintain_loopback_route: deletion failed for interface bge4: 3
            Aug 16 13:18:12	kernel		carp: demoted by -240 to 0 (vhid removed)
            Aug 16 13:18:12	kernel		bge4: promiscuous mode disabled
            Aug 16 13:18:13	kernel		carp: 11@bge5: BACKUP -> MASTER (preempting a slower master)
            Aug 16 13:18:13	check_reload_status		Carp master event
            Aug 16 13:18:13	kernel		arpresolve: can't allocate llinfo for 193.187.72.2 on bge4
            Aug 16 13:18:13	check_reload_status		Reloading filter
            Aug 16 13:18:14	php-fpm	318	/rc.carpmaster: HA cluster member "(10.1.1.3@bge5): (LAN)" has resumed CARP state "MASTER" for vhid 11
            

            I guess my problem stems from the line carp: demoted by -240 to 0 (vhid removed), but I don't understand what would be triggering this.

            1 Reply Last reply Reply Quote 0
            • M
              MarkMpn
              last edited by

              Does anyone have any ideas what could be causing this, or where I could look next to get any more useful debugging info?

              1 Reply Last reply Reply Quote 0
              • First post
                Last post
              Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.