CARP failover group worked - Single interface failed back = Outage

cmorrow

Greetings,

I just ran into a problem that caused an outage for me. Two PFSense firewalls are configured as a Sync group with multiple CARP interfaces with increasing VHIDs.

Last night the second firewall detected a "slow responding master" and failed over all the VIPs. This worked fine. One minute later, firewall2 dropped vip2 (WAN) due to "a more frequent advertisement" from firewall1. This in turn caused an outage since all the LAN vips were on firewall2 and the WAN vip was on firewall1.

Any idea why the vips didn't all failover as a group from the "more frequent advertisement" (a failback event)? Am I missing some configuration piece?

<5>1 2012-06-28T20:03:14.237674-06:00 app-fw02 kernel: - kernel: vip2: link state changed to UP
–------snip------- : VIPs 3-14 also failed over to fw02 ---------snip--------
<6>1 2012-06-28T20:04:10.929419-06:00 app-fw02 kernel: - kernel: vip2: MASTER -> BACKUP (more frequent advertisement received)
<5>1 2012-06-28T20:04:10.929419-06:00 app-fw02 kernel: - kernel: vip2: link state changed to DOWN
-------outage-------

Any help appreciated.
Calvin

cmb

Any idea what lead to the "slow responding master"? Also what was the exact actual log there? That within quotes isn't in the source code.

The source of that is likely the source of the other problem.

cmorrow

You are correct. The exact statement was "Preempting a slower master." I'm not sure what caused the failover event unless its hidden in the logs below.


<5>1 2012-06-28T20:03:12.247885-06:00 app-fw02 kernel: - kernel: vip15: link state changed to UP
<6>1 2012-06-28T20:03:12.247959-06:00 app-fw02 kernel: - kernel: vip14: BACKUP -> MASTER (preempting a slower master)
<6>1 2012-06-28T20:03:12.247758-06:00 app-fw02 kernel: - kernel: vip15: BACKUP -> MASTER (preempting a slower master)
<6>1 2012-06-28T20:03:12.248123-06:00 app-fw02 kernel: - kernel: vip13: BACKUP -> MASTER (preempting a slower master)
<5>1 2012-06-28T20:03:12.248867-06:00 app-fw02 kernel: - kernel: vip9: link state changed to UP
<6>1 2012-06-28T20:03:12.248867-06:00 app-fw02 kernel: - kernel: vip9: BACKUP -> MASTER (preempting a slower master)

–--etc.-----

There really isn't anything significant in the log immediately before that except:


<78>1 2012-06-28T20:00:08.772907-06:00 app-fw02 /usr/sbin/cron[35929]: - : (root) CMD (/usr/bin/nice -n20 newsyslog)
<78>1 2012-06-28T20:00:08.772748-06:00 app-fw02 /usr/sbin/cron[35585]: - : (root) CMD (/usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 virusprot)
<78>1 2012-06-28T20:00:08.772831-06:00 app-fw02 /usr/sbin/cron[35617]: - : (root) CMD (/usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 sshlockout)

From the other side I see:


Jun 28 20:03:03 app-fw01 check_reload_status: Linkup starting em2
Jun 28 20:03:03 app-fw01 kernel: vip15: MASTER -> BACKUP (more frequent advertisement received)
Jun 28 20:03:03 app-fw01 kernel: em2: link state changed to DOWN
Jun 28 20:03:03 app-fw01 kernel: vip2: link state changed to DOWN
Jun 28 20:03:03 app-fw01 kernel: vip15: link state changed to DOWN
Jun 28 20:03:03 app-fw01 kernel: vip14: MASTER -> BACKUP (more frequent advertisement received)
Jun 28 20:03:03 app-fw01 kernel: vip14: link state changed to DOWN
Jun 28 20:03:03 app-fw01 kernel: vip13: MASTER -> BACKUP (more frequent advertisement received)
Jun 28 20:03:03 app-fw01 kernel: vip13: link state changed to DOWN

–--etc.------

The "Jun 28 20:03:03 app-fw01 check_reload_status: Linkup starting em2" part has me curious. Is that signifying that em2 went link down at some point?

A minute later (when it takes vip2 back over) (IPs removed)


Jun 28 20:03:28 app-fw01 check_reload_status: Reloading filter
Jun 28 20:03:28 app-fw01 apinger: Starting Alarm Pinger, apinger(25057)
Jun 28 20:03:31 app-fw01 php: : Hotplug event detected for wan but ignoring since interface is configured with static IP (WWW.XXX.YYY.ZZZ)
Jun 28 20:03:31 app-fw01 check_reload_status: rc.newwanip starting em2
Jun 28 20:03:31 app-fw01 check_reload_status: Linkup starting em2
Jun 28 20:03:31 app-fw01 kernel: em2: link state changed to DOWN
Jun 28 20:03:31 app-fw01 kernel: vip2: link state changed to DOWN
Jun 28 20:03:34 app-fw01 check_reload_status: Linkup starting em2
Jun 28 20:03:34 app-fw01 kernel: vip2: INIT -> BACKUP
Jun 28 20:03:34 app-fw01 kernel: em2: link state changed to UP
Jun 28 20:03:34 app-fw01 kernel: vip2: link state changed to DOWN
Jun 28 20:03:36 app-fw01 check_reload_status: Linkup starting em2
Jun 28 20:03:36 app-fw01 kernel: em2: link state changed to DOWN
Jun 28 20:03:36 app-fw01 kernel: vip2: link state changed to DOWN
Jun 28 20:03:36 app-fw01 php: : rc.newwanip: Informational is starting em2.
Jun 28 20:03:36 app-fw01 php: : rc.newwanip: on (IP address: WWW.XXX.YYY.ZZZ) (interface: wan) (real interface: em2).
Jun 28 20:03:36 app-fw01 php: : ROUTING: setting default route to WWW.XXX.YYY.ZZZ
Jun 28 20:03:36 app-fw01 apinger: Exiting on signal 15.
Jun 28 20:03:37 app-fw01 php: : Hotplug event detected for wan but ignoring since interface is configured with static IP (WWW.XXX.YYY.ZZZ)
Jun 28 20:03:37 app-fw01 check_reload_status: Reloading filter
Jun 28 20:03:37 app-fw01 apinger: Starting Alarm Pinger, apinger(13460)
Jun 28 20:03:38 app-fw01 check_reload_status: Linkup starting em2
Jun 28 20:03:38 app-fw01 kernel: vip2: INIT -> BACKUP
Jun 28 20:03:38 app-fw01 kernel: em2: link state changed to UP
Jun 28 20:03:38 app-fw01 kernel: vip2: link state changed to DOWN
Jun 28 20:03:39 app-fw01 php: : Hotplug event detected for wan but ignoring since interface is configured with static IP (WWW.XXX.YYY.ZZZ)
Jun 28 20:03:39 app-fw01 check_reload_status: rc.newwanip starting em2
Jun 28 20:03:41 app-fw01 kernel: vip2: link state changed to UP
Jun 28 20:03:44 app-fw01 php: : rc.newwanip: Informational is starting em2.
Jun 28 20:03:44 app-fw01 php: : rc.newwanip: on (IP address: WWW.XXX.YYY.ZZZ) (interface: wan) (real interface: em2).
Jun 28 20:03:44 app-fw01 php: : ROUTING: setting default route to WWW.XXX.YYY.ZZZ
Jun 28 20:03:44 app-fw01 apinger: Exiting on signal 15.
Jun 28 20:03:45 app-fw01 check_reload_status: Reloading filter
Jun 28 20:03:45 app-fw01 apinger: Starting Alarm Pinger, apinger(41165)

Calvin

cmb

Yeah looks like you had some general network issue there. That NIC lost and gained link multiple times. On maybe a couple occasions I've seen CARP switch over half way like that when there was some general chaos going on with the network, but in both instances I was never able to find out exactly what happened to cause the network issues, and have never been able to find a means of replicating it (I've seen and induced all kinds of network chaos on CARP, it's always handled it fine for me).

cmorrow

I believe there's a backup process that kicks off at 20:00 so there's a good amount of traffic flowing across that interface. That probably counts as network chaos ;D

I guess my next question is/should be, is there any way to allow traffic to survive CARP only failing over some (but not all) vhids? Does fw state syncing allow this in any way?

cmb

I'm referring to real chaos (switches flaking out, other extreme network flakiness), not high load. CARP preemption is enabled, it should always switch all IPs over.