Strange CARP Behavior

wheelz

I see lots of people posting about various CARP issues, and I have read through a bunch of them. Those have already helped me solve most of my issues, but now I am seeing something in my test environment that I can't explain at all.

I have 2 VMs (VMware ESXi 5.0u1) as master and slave. I have already followed the howto about using a VDS to isolate my promiscuous traffic, and I also enabled Net.ReversePathFwdCheckPromisc on my hosts to get rid of the redundant adapter echoes. What I can't explain is this:

If I disconnect all vNICs on my master, the second pfSense becomes the master, so that part is working perfectly:

Jan 13 11:31:28 kernel: vip1: link state changed to UP

Then I re-enable all the vNICs on my master and as expected, it fails back (though I'm not sure why it repeats so much in the log):

Jan 13 11:33:58 kernel: vip1: link state changed to DOWN
Jan 13 11:33:58 kernel: vip1: 2 link states coalesced
Jan 13 11:33:58 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:33:54 kernel: vip1: link state changed to DOWN
Jan 13 11:33:54 kernel: vip1: 2 link states coalesced
Jan 13 11:33:54 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:33:51 kernel: vip1: link state changed to DOWN
Jan 13 11:33:51 kernel: vip1: 2 link states coalesced
Jan 13 11:33:51 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:33:48 kernel: vip1: link state changed to DOWN
Jan 13 11:33:48 kernel: vip1: 2 link states coalesced
Jan 13 11:33:48 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:33:03 kernel: vip1: link state changed to DOWN
Jan 13 11:33:03 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)

However if I halt the master (which should effectively be the same as completely disconnecting the network as I did above), I get this:

Jan 13 11:34:15 kernel: vip1: link state changed to DOWN
Jan 13 11:34:15 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:34:15 kernel: vip1: link state changed to UP
Jan 13 11:34:11 kernel: vip1: link state changed to DOWN
Jan 13 11:34:11 kernel: vip1: 2 link states coalesced
Jan 13 11:34:11 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:34:08 kernel: vip1: link state changed to DOWN
Jan 13 11:34:08 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:34:08 kernel: vip1: link state changed to UP
Jan 13 11:34:05 kernel: vip1: link state changed to DOWN
Jan 13 11:34:05 kernel: vip1: 2 link states coalesced
Jan 13 11:34:05 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:34:01 kernel: vip1: link state changed to DOWN
Jan 13 11:34:01 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:34:01 kernel: vip1: link state changed to UP

And it just repeats. That really has me puzzled, because I would think both scenarios should behave the same. Any ideas?
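For anyone who wants to compare excerpts like these, here is a rough Python sketch that pulls the MASTER -> BACKUP messages out of the halt log above and measures the gap between them. The function names, the regular expression, and the fixed year are all assumptions made for illustration (the log lines carry no year); nothing here is part of pfSense or CARP itself.

```python
import re
from datetime import datetime

# Matches kernel lines of the form quoted in this post, e.g.
# "Jan 13 11:34:01 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}) kernel: "
    r"(?P<vip>\w+): (?P<msg>.+)$"
)

# Assumption: the log has no year, so a fixed one is supplied purely
# so that timestamps from the same day can be subtracted.
ASSUMED_YEAR = 2000

SAMPLE = """\
Jan 13 11:34:15 kernel: vip1: link state changed to DOWN
Jan 13 11:34:15 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:34:15 kernel: vip1: link state changed to UP
Jan 13 11:34:11 kernel: vip1: link state changed to DOWN
Jan 13 11:34:11 kernel: vip1: 2 link states coalesced
Jan 13 11:34:11 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:34:08 kernel: vip1: link state changed to DOWN
Jan 13 11:34:08 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:34:08 kernel: vip1: link state changed to UP
Jan 13 11:34:05 kernel: vip1: link state changed to DOWN
Jan 13 11:34:05 kernel: vip1: 2 link states coalesced
Jan 13 11:34:05 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:34:01 kernel: vip1: link state changed to DOWN
Jan 13 11:34:01 kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
Jan 13 11:34:01 kernel: vip1: link state changed to UP
"""


def parse(lines):
    """Return (time, vip, message) tuples, oldest first."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that is not a "kernel: <vip>: ..." line
        when = datetime.strptime(
            f"{ASSUMED_YEAR} {m['stamp']}", "%Y %b %d %H:%M:%S"
        )
        events.append((when, m["vip"], m["msg"]))
    # The log is shown newest first; sort by time only.
    return sorted(events, key=lambda e: e[0])


def demotion_gaps(events):
    """Seconds between consecutive MASTER -> BACKUP messages, per VIP."""
    last_seen = {}
    gaps = {}
    for when, vip, msg in events:
        if not msg.startswith("MASTER -> BACKUP"):
            continue
        if vip in last_seen:
            delta = int((when - last_seen[vip]).total_seconds())
            gaps.setdefault(vip, []).append(delta)
        last_seen[vip] = when
    return gaps


print(demotion_gaps(parse(SAMPLE.splitlines())))
# prints {'vip1': [4, 3, 3, 4]}
```

On the excerpt above, that shows vip1 dropping back to BACKUP every 3 to 4 seconds for as long as the master is halted, which is the repeating pattern I can't explain.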

wheelz

I think I figured it out. In case anyone else sees strange behavior like this, keep in mind that you have to reboot your VMware hosts before advanced settings like Net.ReversePathFwdCheckPromisc take effect. Duh, me…

cmb

Hm, you shouldn't have to; I don't recall having to do that on any of the many setups I've done that require it. When I was looking at this yesterday, it sort of sounded like what happens when Net.ReversePathFwdCheckPromisc isn't set, but that would cause it to never fail over correctly. In that regard, there's no difference between disconnecting NICs on the primary, disabling CARP manually on the primary, and shutting down the VM of the primary. Maybe it's some new VMware bug specific to something in your environment that a host reboot fixes; they seem to introduce bugs that impact CARP-related networking on a routine basis.

wheelz

Yeah… it just happened again, so I guess that wasn't it. I had a power blip that caused the VMware host and a switch to go down and come back up. This time I just tried rebooting the switch, and that seems to have stopped it...

cmb

Multicast traffic being looped back into the same interface it was sent out of is the cause of that. I guess there must be something on the physical network that's looping it back.

wheelz

I had a couple of physical boxes I was able to try this on, and it worked OK. So I am guessing it is an issue with the way VMware is configured… I have 2 hosts, each with 2 physical NICs for VM guest networking going to 2 physical switches (trunked between them). I'm using a VDS with a separate port group that has promiscuous mode enabled on the VLAN that carries my CARP VIP, and I have configured the hosts with Net.ReversePathFwdCheckPromisc = 1. Is there anything else I am missing?