CARP fails over both interfaces, then one sometimes fails back
-
I have a simple test setup with two physical servers running pfSense 2.4.3. Each server has three interfaces:
- WAN
- LAN
- SYNC
WAN and LAN interfaces have associated VIP addresses.
If I disconnect either the WAN or LAN connection, most of the time both connections fail over to the backup server and connectivity is restored within a second or so.
Occasionally however, it appears only one of the VIPs has failed over, e.g. if I disconnect the WAN connection, the WAN VIP will fail over to the backup but the LAN VIP remains on the master.
The system log seems to show that both VIPs did fail over, but one then quickly failed back. This seems to happen roughly 30% of the time in my tests.
When this happens, I see the following log entry on the original master:
carp: 2@bge5: BACKUP -> MASTER (preempting a slower master)
and on the backup I see the entry:
carp: 2@bge5: MASTER -> BACKUP (more frequent advertisement received)
As I understand it, it should not be possible for CARP to result in failing over only one VIP, yet I can regularly reproduce it.
Is there some configuration setting I'm missing, or is there something else I should be looking at?
-
When you down a CARP interface, the advskew on all CARP VIPs on that node get demoted (advskews increased).
The first thing I would look at is a packet capture of the CARP traffic on the secondary node, paying attention to the advbase/advskew values in the advertisements that are being received during one of these tests.
If you do a bunch of yanking of interfaces it is possible that the demotion counter got confused. Are there any messages on Status > CARP?
Does it all fail over normally if you enter CARP Maintenance Mode instead of pulling interfaces?
-
There aren't any messages in Status > Carp. It does fail over as expected when I use Maintenance Mode.
I've run a packet capture on the LAN interface on the secondary node (10.1.1.2), then unplugged the WAN connection on the primary (10.1.1.1). That gives the following log:
13:08:01.211091 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:02.215013 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:03.219572 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:04.235793 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:05.237226 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:06.248646 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:07.251671 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:07.814455 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=240 authlen=7 counter=13725484716165676517 13:08:07.814467 IP 10.1.1.2 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=100 authlen=7 counter=16795567171629106047 13:08:09.237432 IP 10.1.1.2 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=100 authlen=7 counter=16795567171629106047 13:08:09.237741 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:10.260303 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:11.276073 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:12.338015 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:13.400586 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:14.463607 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:15.525578 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:16.558297 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517 13:08:17.563721 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
The first few entries cover the initial state where everything is plugged in as normal, and the primary node is advertising itself with advbase 1 and advskew 0.
At 13:08:07 the primary changes to advbase 1 and advskew 240. The secondary replies with advbase 1 and advskew 100, which I guess means it has taken over than LAN VIP as expected.
At 13:08:09 the primary starts advertising with advbase 1 and advskew 0 again, which seems to coincide with the LAN VIP failing over back to the primary while the WAN VIP remains on the secondary.
-
I've repeated the test and got the system log from the primary:
Aug 16 13:18:11 kernel carp: 10@bge4: MASTER -> INIT (hardware interface down) Aug 16 13:18:11 kernel carp: demoted by 240 to 240 (interface down) Aug 16 13:18:11 kernel bge4: link state changed to DOWN Aug 16 13:18:11 kernel carp: 11@bge5: MASTER -> BACKUP (more frequent advertisement received) Aug 16 13:18:11 kernel ifa_maintain_loopback_route: deletion failed for interface bge5: 3 Aug 16 13:18:11 check_reload_status Carp backup event Aug 16 13:18:11 check_reload_status Linkup starting bge4 Aug 16 13:18:11 check_reload_status Carp backup event Aug 16 13:18:12 php-fpm 17778 /rc.linkup: DEVD Ethernet detached event for wan Aug 16 13:18:12 php-fpm 318 /rc.carpbackup: HA cluster member "(193.187.72.4@bge4): (WAN)" has resumed CARP state "BACKUP" for vhid 10 Aug 16 13:18:12 php-fpm 317 /rc.carpbackup: HA cluster member "(10.1.1.3@bge5): (LAN)" has resumed CARP state "BACKUP" for vhid 11 Aug 16 13:18:12 kernel ifa_maintain_loopback_route: deletion failed for interface bge4: 3 Aug 16 13:18:12 kernel ifa_maintain_loopback_route: deletion failed for interface bge4: 3 Aug 16 13:18:12 kernel carp: demoted by -240 to 0 (vhid removed) Aug 16 13:18:12 kernel bge4: promiscuous mode disabled Aug 16 13:18:13 kernel carp: 11@bge5: BACKUP -> MASTER (preempting a slower master) Aug 16 13:18:13 check_reload_status Carp master event Aug 16 13:18:13 kernel arpresolve: can't allocate llinfo for 193.187.72.2 on bge4 Aug 16 13:18:13 check_reload_status Reloading filter Aug 16 13:18:14 php-fpm 318 /rc.carpmaster: HA cluster member "(10.1.1.3@bge5): (LAN)" has resumed CARP state "MASTER" for vhid 11
I guess my problem stems from the line
carp: demoted by -240 to 0 (vhid removed)
, but I don't understand what would be triggering this. -
Does anyone have any ideas what could be causing this, or where I could look next to get any more useful debugging info?