Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    CARP failover group worked - Single interface failed back = Outage

    Scheduled Pinned Locked Moved HA/CARP/VIPs
    6 Posts 2 Posters 2.9k Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • C
      cmorrow
      last edited by

      Greetings,

      I just ran into a problem that caused an outage for me.  Two PFSense firewalls are configured as a Sync group with multiple CARP interfaces with increasing VHIDs.

      Last night the second firewall detected a "slow responding master" and failed over all the VIPs.  This worked fine.  One minute later, firewall2 dropped vip2 (WAN) due to "a more frequent advertisement" from firewall1.  This in turn caused an outage since all the LAN vips were on firewall2 and the WAN vip was on firewall1.

      Any idea why the vips didn't all failover as a group from the "more frequent advertisement" (a failback event)?  Am I missing some configuration piece?

      <5>1 2012-06-28T20:03:14.237674-06:00 app-fw02 kernel: - kernel: vip2: link state changed to UP
      –------snip------- : VIPs 3-14 also failed over to fw02 ---------snip--------
      <6>1 2012-06-28T20:04:10.929419-06:00 app-fw02 kernel: - kernel: vip2: MASTER -> BACKUP (more frequent advertisement received)
      <5>1 2012-06-28T20:04:10.929419-06:00 app-fw02 kernel: - kernel: vip2: link state changed to DOWN
      -------outage-------

      Any help appreciated.
      Calvin

      1 Reply Last reply Reply Quote 0
      • C
        cmb
        last edited by

        Any idea what lead to the "slow responding master"? Also what was the exact actual log there? That within quotes isn't in the source code.

        The source of that is likely the source of the other problem.

        1 Reply Last reply Reply Quote 0
        • C
          cmorrow
          last edited by

          You are correct.  The exact statement was "Preempting a slower master."  I'm not sure what caused the failover event unless its hidden in the logs below.

          
          <5>1 2012-06-28T20:03:12.247885-06:00 app-fw02 kernel: - kernel: vip15: link state changed to UP
          <6>1 2012-06-28T20:03:12.247959-06:00 app-fw02 kernel: - kernel: vip14: BACKUP -> MASTER (preempting a slower master)
          <6>1 2012-06-28T20:03:12.247758-06:00 app-fw02 kernel: - kernel: vip15: BACKUP -> MASTER (preempting a slower master)
          <6>1 2012-06-28T20:03:12.248123-06:00 app-fw02 kernel: - kernel: vip13: BACKUP -> MASTER (preempting a slower master)
          <5>1 2012-06-28T20:03:12.248867-06:00 app-fw02 kernel: - kernel: vip9: link state changed to UP
          <6>1 2012-06-28T20:03:12.248867-06:00 app-fw02 kernel: - kernel: vip9: BACKUP -> MASTER (preempting a slower master)
          
          

          –--etc.-----

          There really isn't anything significant in the log immediately before that except:

          
          <78>1 2012-06-28T20:00:08.772907-06:00 app-fw02 /usr/sbin/cron[35929]: - : (root) CMD (/usr/bin/nice -n20 newsyslog)
          <78>1 2012-06-28T20:00:08.772748-06:00 app-fw02 /usr/sbin/cron[35585]: - : (root) CMD (/usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 virusprot)
          <78>1 2012-06-28T20:00:08.772831-06:00 app-fw02 /usr/sbin/cron[35617]: - : (root) CMD (/usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 sshlockout)
          
          

          From the other side I see:

          
          Jun 28 20:03:03 app-fw01 check_reload_status: Linkup starting em2
          Jun 28 20:03:03 app-fw01 kernel: vip15: MASTER -> BACKUP (more frequent advertisement received)
          Jun 28 20:03:03 app-fw01 kernel: em2: link state changed to DOWN
          Jun 28 20:03:03 app-fw01 kernel: vip2: link state changed to DOWN
          Jun 28 20:03:03 app-fw01 kernel: vip15: link state changed to DOWN
          Jun 28 20:03:03 app-fw01 kernel: vip14: MASTER -> BACKUP (more frequent advertisement received)
          Jun 28 20:03:03 app-fw01 kernel: vip14: link state changed to DOWN
          Jun 28 20:03:03 app-fw01 kernel: vip13: MASTER -> BACKUP (more frequent advertisement received)
          Jun 28 20:03:03 app-fw01 kernel: vip13: link state changed to DOWN
          
          

          –--etc.------

          The "Jun 28 20:03:03 app-fw01 check_reload_status: Linkup starting em2" part has me curious.  Is that signifying that em2 went link down at some point?

          A minute later (when it takes vip2 back over) (IPs removed)

          
          Jun 28 20:03:28 app-fw01 check_reload_status: Reloading filter
          Jun 28 20:03:28 app-fw01 apinger: Starting Alarm Pinger, apinger(25057)
          Jun 28 20:03:31 app-fw01 php: : Hotplug event detected for wan but ignoring since interface is configured with static IP (WWW.XXX.YYY.ZZZ)
          Jun 28 20:03:31 app-fw01 check_reload_status: rc.newwanip starting em2
          Jun 28 20:03:31 app-fw01 check_reload_status: Linkup starting em2
          Jun 28 20:03:31 app-fw01 kernel: em2: link state changed to DOWN
          Jun 28 20:03:31 app-fw01 kernel: vip2: link state changed to DOWN
          Jun 28 20:03:34 app-fw01 check_reload_status: Linkup starting em2
          Jun 28 20:03:34 app-fw01 kernel: vip2: INIT -> BACKUP
          Jun 28 20:03:34 app-fw01 kernel: em2: link state changed to UP
          Jun 28 20:03:34 app-fw01 kernel: vip2: link state changed to DOWN
          Jun 28 20:03:36 app-fw01 check_reload_status: Linkup starting em2
          Jun 28 20:03:36 app-fw01 kernel: em2: link state changed to DOWN
          Jun 28 20:03:36 app-fw01 kernel: vip2: link state changed to DOWN
          Jun 28 20:03:36 app-fw01 php: : rc.newwanip: Informational is starting em2.
          Jun 28 20:03:36 app-fw01 php: : rc.newwanip: on (IP address: WWW.XXX.YYY.ZZZ) (interface: wan) (real interface: em2).
          Jun 28 20:03:36 app-fw01 php: : ROUTING: setting default route to WWW.XXX.YYY.ZZZ
          Jun 28 20:03:36 app-fw01 apinger: Exiting on signal 15.
          Jun 28 20:03:37 app-fw01 php: : Hotplug event detected for wan but ignoring since interface is configured with static IP (WWW.XXX.YYY.ZZZ)
          Jun 28 20:03:37 app-fw01 check_reload_status: Reloading filter
          Jun 28 20:03:37 app-fw01 apinger: Starting Alarm Pinger, apinger(13460)
          Jun 28 20:03:38 app-fw01 check_reload_status: Linkup starting em2
          Jun 28 20:03:38 app-fw01 kernel: vip2: INIT -> BACKUP
          Jun 28 20:03:38 app-fw01 kernel: em2: link state changed to UP
          Jun 28 20:03:38 app-fw01 kernel: vip2: link state changed to DOWN
          Jun 28 20:03:39 app-fw01 php: : Hotplug event detected for wan but ignoring since interface is configured with static IP (WWW.XXX.YYY.ZZZ)
          Jun 28 20:03:39 app-fw01 check_reload_status: rc.newwanip starting em2
          Jun 28 20:03:41 app-fw01 kernel: vip2: link state changed to UP
          Jun 28 20:03:44 app-fw01 php: : rc.newwanip: Informational is starting em2.
          Jun 28 20:03:44 app-fw01 php: : rc.newwanip: on (IP address: WWW.XXX.YYY.ZZZ) (interface: wan) (real interface: em2).
          Jun 28 20:03:44 app-fw01 php: : ROUTING: setting default route to WWW.XXX.YYY.ZZZ
          Jun 28 20:03:44 app-fw01 apinger: Exiting on signal 15.
          Jun 28 20:03:45 app-fw01 check_reload_status: Reloading filter
          Jun 28 20:03:45 app-fw01 apinger: Starting Alarm Pinger, apinger(41165)
          
          

          Calvin

          1 Reply Last reply Reply Quote 0
          • C
            cmb
            last edited by

            Yeah looks like you had some general network issue there. That NIC lost and gained link multiple times. On maybe a couple occasions I've seen CARP switch over half way like that when there was some general chaos going on with the network, but in both instances I was never able to find out exactly what happened to cause the network issues, and have never been able to find a means of replicating it (I've seen and induced all kinds of network chaos on CARP, it's always handled it fine for me).

            1 Reply Last reply Reply Quote 0
            • C
              cmorrow
              last edited by

              I believe there's a backup process that kicks off at 20:00 so there's a good amount of traffic flowing across that interface.  That probably counts as network chaos  ;D

              I guess my next question is/should be, is there any way to allow traffic to survive CARP only failing over some (but not all) vhids?  Does fw state syncing allow this in any way?

              1 Reply Last reply Reply Quote 0
              • C
                cmb
                last edited by

                I'm referring to real chaos (switches flaking out, other extreme network flakiness), not high load. CARP preemption is enabled, it should always switch all IPs over.

                1 Reply Last reply Reply Quote 0
                • First post
                  Last post
                Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.