CARP fails over both interfaces, then one sometimes fails back



  • I have a simple test setup with two physical servers running pfSense 2.4.3. Each server has three interfaces:

    • WAN
    • LAN
    • SYNC

    WAN and LAN interfaces have associated VIP addresses.

    If I disconnect either the WAN or LAN connection, most of the time both connections fail over to the backup server and connectivity is restored within a second or so.

    Occasionally however, it appears only one of the VIPs has failed over, e.g. if I disconnect the WAN connection, the WAN VIP will fail over to the backup but the LAN VIP remains on the master.

    The system log seems to show that both VIPs did fail over, but one then quickly failed back. This seems to happen roughly 30% of the time in my tests.

    When this happens, I see the following log entry on the original master:
    carp: 2@bge5: BACKUP -> MASTER (preempting a slower master)
    and on the backup I see the entry:
    carp: 2@bge5: MASTER -> BACKUP (more frequent advertisement received)

    As I understand it, it should not be possible for CARP to result in failing over only one VIP, yet I can regularly reproduce it.

    Is there some configuration setting I'm missing, or is there something else I should be looking at?


  • Netgate

    When you down a CARP interface, the advskew on all CARP VIPs on that node get demoted (advskews increased).

    The first thing I would look at is a packet capture of the CARP traffic on the secondary node, paying attention to the advbase/advskew values in the advertisements that are being received during one of these tests.

    If you do a bunch of yanking of interfaces it is possible that the demotion counter got confused. Are there any messages on Status > CARP?

    Does it all fail over normally if you enter CARP Maintenance Mode instead of pulling interfaces?



  • There aren't any messages in Status > Carp. It does fail over as expected when I use Maintenance Mode.

    I've run a packet capture on the LAN interface on the secondary node (10.1.1.2), then unplugged the WAN connection on the primary (10.1.1.1). That gives the following log:

    13:08:01.211091 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:02.215013 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:03.219572 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:04.235793 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:05.237226 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:06.248646 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:07.251671 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:07.814455 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=240 authlen=7 counter=13725484716165676517
    13:08:07.814467 IP 10.1.1.2 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=100 authlen=7 counter=16795567171629106047
    13:08:09.237432 IP 10.1.1.2 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=100 authlen=7 counter=16795567171629106047
    13:08:09.237741 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:10.260303 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:11.276073 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:12.338015 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:13.400586 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:14.463607 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:15.525578 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:16.558297 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    13:08:17.563721 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
    

    The first few entries cover the initial state where everything is plugged in as normal, and the primary node is advertising itself with advbase 1 and advskew 0.

    At 13:08:07 the primary changes to advbase 1 and advskew 240. The secondary replies with advbase 1 and advskew 100, which I guess means it has taken over than LAN VIP as expected.

    At 13:08:09 the primary starts advertising with advbase 1 and advskew 0 again, which seems to coincide with the LAN VIP failing over back to the primary while the WAN VIP remains on the secondary.



  • I've repeated the test and got the system log from the primary:

    Aug 16 13:18:11	kernel		carp: 10@bge4: MASTER -> INIT (hardware interface down)
    Aug 16 13:18:11	kernel		carp: demoted by 240 to 240 (interface down)
    Aug 16 13:18:11	kernel		bge4: link state changed to DOWN
    Aug 16 13:18:11	kernel		carp: 11@bge5: MASTER -> BACKUP (more frequent advertisement received)
    Aug 16 13:18:11	kernel		ifa_maintain_loopback_route: deletion failed for interface bge5: 3
    Aug 16 13:18:11	check_reload_status		Carp backup event
    Aug 16 13:18:11	check_reload_status		Linkup starting bge4
    Aug 16 13:18:11	check_reload_status		Carp backup event
    Aug 16 13:18:12	php-fpm	17778	/rc.linkup: DEVD Ethernet detached event for wan
    Aug 16 13:18:12	php-fpm	318	/rc.carpbackup: HA cluster member "(193.187.72.4@bge4): (WAN)" has resumed CARP state "BACKUP" for vhid 10
    Aug 16 13:18:12	php-fpm	317	/rc.carpbackup: HA cluster member "(10.1.1.3@bge5): (LAN)" has resumed CARP state "BACKUP" for vhid 11
    Aug 16 13:18:12	kernel		ifa_maintain_loopback_route: deletion failed for interface bge4: 3
    Aug 16 13:18:12	kernel		ifa_maintain_loopback_route: deletion failed for interface bge4: 3
    Aug 16 13:18:12	kernel		carp: demoted by -240 to 0 (vhid removed)
    Aug 16 13:18:12	kernel		bge4: promiscuous mode disabled
    Aug 16 13:18:13	kernel		carp: 11@bge5: BACKUP -> MASTER (preempting a slower master)
    Aug 16 13:18:13	check_reload_status		Carp master event
    Aug 16 13:18:13	kernel		arpresolve: can't allocate llinfo for 193.187.72.2 on bge4
    Aug 16 13:18:13	check_reload_status		Reloading filter
    Aug 16 13:18:14	php-fpm	318	/rc.carpmaster: HA cluster member "(10.1.1.3@bge5): (LAN)" has resumed CARP state "MASTER" for vhid 11
    

    I guess my problem stems from the line carp: demoted by -240 to 0 (vhid removed), but I don't understand what would be triggering this.



  • Does anyone have any ideas what could be causing this, or where I could look next to get any more useful debugging info?