CARP on 2.2.1, VMWare 5.5 with dvS

hphan082

hi everyone,
I'm running into this old issue where both firewalls are stuck in Backup status.
I followed https://doc.pfsense.org/index.php/CARP_Configuration_Troubleshooting#VMware_ESX.2FESXi_Users, and here: https://forum.pfsense.org/index.php?topic=64022.0

We use dvS, and we have already enabled the advanced setting. I have promiscuous mode enabled on the LAN and WAN port-groups, but not on the sync port-group.

Both of my firewalls are still stuck in Backup.

Below is a short capture of the log on the master.
Mar 26 17:17:42 php-fpm[58491]: /rc.carpbackup: Carp cluster member "10.22.2.254 - daas.management.vip (2@em0_vlan2202)" has resumed the state "BACKUP" for vhid 2@em0_vlan2202
Mar 26 17:17:42 php-fpm[57375]: /rc.carpbackup: Carp cluster member "10.22.5.254 - daas.dmz.vip (3@em0_vlan2205)" has resumed the state "BACKUP" for vhid 3@em0_vlan2205
Mar 26 17:17:44 check_reload_status: Carp master event
Mar 26 17:17:44 kernel: carp: VHID 1@em1: BACKUP -> MASTER (master down)
Mar 26 17:17:44 kernel: carp: VHID 1@em1: MASTER -> BACKUP (more frequent advertisement received)
Mar 26 17:17:44 check_reload_status: Carp backup event
Mar 26 17:17:44 check_reload_status: Carp master event
Mar 26 17:17:44 kernel: carp: VHID 3@em0_vlan2205: BACKUP -> MASTER (master down)
Mar 26 17:17:44 kernel: carp: VHID 2@em0_vlan2202: BACKUP -> MASTER (master down)
Mar 26 17:17:44 kernel: carp: VHID 3@em0_vlan2205: MASTER -> BACKUP (more frequent advertisement received)
Mar 26 17:17:44 kernel: carp: VHID 2@em0_vlan2202: MASTER -> BACKUP (more frequent advertisement received)
Mar 26 17:17:44 check_reload_status: Carp master event
Mar 26 17:17:44 check_reload_status: Carp backup event
Mar 26 17:17:44 check_reload_status: Carp backup event
Mar 26 17:17:45 php-fpm[58491]: /rc.carpmaster: Carp cluster member "198.51.168.254 - daas.pub.vip (1@em1)" has resumed the state "MASTER" for vhid 1@em1
Mar 26 17:17:45 php-fpm[58491]: /rc.carpbackup: Carp cluster member "198.51.168.254 - daas.pub.vip (1@em1)" has resumed the state "BACKUP" for vhid 1@em1
Mar 26 17:17:45 php-fpm[58491]: /rc.carpmaster: Carp cluster member "10.22.5.254 - daas.dmz.vip (3@em0_vlan2205)" has resumed the state "MASTER" for vhid 3@em0_vlan2205
Mar 26 17:17:45 php-fpm[58491]: /rc.carpmaster: Carp cluster member "10.22.2.254 - daas.management.vip (2@em0_vlan2202)" has resumed the state "MASTER" for vhid 2@em0_vlan2202
Mar 26 17:17:45 php-fpm[58491]: /rc.carpbackup: Carp cluster member "10.22.5.254 - daas.dmz.vip (3@em0_vlan2205)" has resumed the state "BACKUP" for vhid 3@em0_vlan2205
Mar 26 17:17:45 php-fpm[58491]: /rc.carpbackup: Carp cluster member "10.22.2.254 - daas.management.vip (2@em0_vlan2202)" has resumed the state "BACKUP" for vhid 2@em0_vlan2202
Mar 26 17:17:47 check_reload_status: Carp master event
Mar 26 17:17:47 kernel: carp: VHID 1@em1: BACKUP -> MASTER (master down)
Mar 26 17:17:47 kernel: carp: VHID 1@em1: MASTER -> BACKUP (more frequent advertisement received)
Mar 26 17:17:47 check_reload_status: Carp backup event
Mar 26 17:17:47 check_reload_status: Carp master event
Mar 26 17:17:47 kernel: carp: VHID 2@em0_vlan2202: BACKUP -> MASTER (master down)
Mar 26 17:17:47 kernel: carp: VHID 3@em0_vlan2205: BACKUP -> MASTER (master down)
Mar 26 17:17:47 kernel: carp: VHID 2@em0_vlan2202: MASTER -> BACKUP (more frequent advertisement received)
Mar 26 17:17:47 kernel: carp: VHID 3@em0_vlan2205: MASTER -> BACKUP (more frequent advertisement received)
Mar 26 17:17:47 check_reload_status: Carp master event
Mar 26 17:17:47 check_reload_status: Carp backup event
Mar 26 17:17:47 check_reload_status: Carp backup event

cmb

Those are the symptoms of the VMware looping multicast issue.
https://doc.pfsense.org/index.php/CARP_Configuration_Troubleshooting#Changing_Net.ReversePathFwdCheckPromisc
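
The pattern in the log above (a VHID goes BACKUP -> MASTER on "master down" and is immediately pushed back to BACKUP by "more frequent advertisement received") can be picked out of a longer syslog with a small script. What follows is a minimal Python sketch, not anything shipped with pfSense: the function names and the regular expression are assumptions based only on the kernel line format shown in this thread.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   kernel: carp: VHID 1@em1: BACKUP -> MASTER (master down)
# The pattern is an assumption derived from the log lines in this thread.
CARP_RE = re.compile(
    r"carp: VHID (?P<vhid>\S+): (?P<old>\w+) -> (?P<new>\w+) \((?P<reason>[^)]*)\)"
)

def tally_transitions(lines):
    """Count (vhid, old_state, new_state, reason) transitions in syslog lines."""
    counts = Counter()
    for line in lines:
        m = CARP_RE.search(line)
        if m:
            counts[(m["vhid"], m["old"], m["new"], m["reason"])] += 1
    return counts

def flapping_vhids(counts):
    """VHIDs that were promoted to MASTER and also demoted to BACKUP
    because a more frequent advertisement was received."""
    promoted = {vhid for (vhid, _old, new, _reason) in counts if new == "MASTER"}
    demoted = {
        vhid
        for (vhid, _old, new, reason) in counts
        if new == "BACKUP" and "more frequent advertisement" in reason
    }
    return sorted(promoted & demoted)

# A few lines copied from the log posted above.
SAMPLE = [
    "Mar 26 17:17:44 kernel: carp: VHID 1@em1: BACKUP -> MASTER (master down)",
    "Mar 26 17:17:44 kernel: carp: VHID 1@em1: MASTER -> BACKUP (more frequent advertisement received)",
    "Mar 26 17:17:44 check_reload_status: Carp backup event",
    "Mar 26 17:17:44 kernel: carp: VHID 2@em0_vlan2202: BACKUP -> MASTER (master down)",
    "Mar 26 17:17:44 kernel: carp: VHID 2@em0_vlan2202: MASTER -> BACKUP (more frequent advertisement received)",
]

if __name__ == "__main__":
    print(flapping_vhids(tally_transitions(SAMPLE)))  # ['1@em1', '2@em0_vlan2202']
```

A VHID that shows up in this list on both firewalls at the same time matches the symptom described here. The script only summarizes the log and does not by itself say where the loop is.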

rickbaran

You might also check which build of ESXi 5.5 you are on. We had some other issues with 5.5 build 1331820 before we upgraded to 1623387.

hphan082

hi CMB,
I followed that document, but it doesn't work.

Rick, we are running 5.5 1892794. :) I'll talk to our VMWare team to see if they have a newer version to upgrade these hosts to.

KOM

Current build for 5.5 is 1993072 I believe.

cmb

@hphan082:

hi CMB,
I followed that document, but it doesn't work.

That most definitely fixes the problem you're seeing. It has to be set on every host that has a promiscuous port group, so that none of them loop multicast. I've done that on many, many ESX hosts across a variety of versions and it has always worked immediately, with one odd exception: one ESX host in particular just wouldn't obey that config setting until ESX was rebooted. Most of the time, though, when that doesn't work it's because it wasn't set on all the hosts and some host is still looping the multicast.

The other possibility is there is something else on your network that's looping multicast traffic, but that's unlikely.

hphan082

hi cmb,
I have seriously tried everything I can to get this to work.
I will be away for a 10-day bootcamp. I'll ask our virtualization manager to reboot both hosts while I am gone and will try again when I'm back.

hphan082

hi cmb,
We did the entire thing one more time and rebooted both hosts. I still get nothing. I attached a few screenshots here for your review, including the host advanced settings, the dvS port-group settings, and the firewall log.

daas_ext.JPG_thumb

daas_int.JPG_thumb

daas_pfsense_sync.JPG_thumb
![Host Advanced Settings.JPG](/public/imported_attachments/1/Host Advanced Settings.JPG)
![firewall log.PNG](/public/imported_attachments/1/firewall log.PNG)

cmb

That all looks correct. You can verify the looping multicast with a packet capture. From an SSH command prompt:

tcpdump -nei em0 vrrp

The system will send one advertisement per second. You'll see output similar to the following.

22:00:59.909437 00:00:5e:00:01:0a > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 70: 10.0.0.2 > 224.0.0.18: VRRPv2, Advertisement, vrid 10, prio 0, authtype none, intvl 1s, length 36
22:01:00.910396 00:00:5e:00:01:0a > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 70: 10.0.0.2 > 224.0.0.18: VRRPv2, Advertisement, vrid 10, prio 0, authtype none, intvl 1s, length 36

You want to see only one per CARP IP on that interface per second. More than likely, you'll see duplicates there.
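
If the capture is long, the "one per CARP IP per second" check can be done by counting instead of by eye. This is a rough Python sketch under stated assumptions: it expects text in the `tcpdump -ne` format shown above, the regular expression and function names are made up for illustration, and bucketing by wall-clock second is coarse, so an occasional count of 2 from timing jitter is possible even on a healthy network.

```python
import re
from collections import Counter

# Assumed line format (taken from the tcpdump output quoted above):
#   HH:MM:SS.ffffff <src mac> > <dst mac>, ... length N: <src ip> > 224.0.0.18: VRRPv2, ... vrid N, ...
ADVERT_RE = re.compile(
    r"^(?P<sec>\d{2}:\d{2}:\d{2})\.\d+ .*?: "
    r"(?P<src>\d+\.\d+\.\d+\.\d+) > 224\.0\.0\.18: .*?vrid (?P<vrid>\d+)"
)

def adverts_per_second(lines):
    """Count advertisements per (second, source IP, vrid)."""
    counts = Counter()
    for line in lines:
        m = ADVERT_RE.search(line)
        if m:
            counts[(m["sec"], m["src"], int(m["vrid"]))] += 1
    return counts

def over_rate(counts, limit=1):
    """Return only the buckets that saw more than `limit` advertisements."""
    return {key: n for key, n in counts.items() if n > limit}

# The two lines quoted above: one advertisement in each second.
SAMPLE = [
    "22:00:59.909437 00:00:5e:00:01:0a > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), "
    "length 70: 10.0.0.2 > 224.0.0.18: VRRPv2, Advertisement, vrid 10, prio 0, "
    "authtype none, intvl 1s, length 36",
    "22:01:00.910396 00:00:5e:00:01:0a > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), "
    "length 70: 10.0.0.2 > 224.0.0.18: VRRPv2, Advertisement, vrid 10, prio 0, "
    "authtype none, intvl 1s, length 36",
]

if __name__ == "__main__":
    print(over_rate(adverts_per_second(SAMPLE)))  # {}
```

Both functions accept any iterable of lines, so tcpdump output saved to a file can be passed in directly. An empty result means no bucket exceeded one advertisement per second.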

hphan082

hi cmb,
I ran tcpdump on both firewalls, and below is a screenshot of what I see. Every second, I see 2 messages from 198.51.168.252 (VRRP) to 224.0.0.18, so it looks like I get 2 packets every second.

So this should confirm that we are hitting a bug with VMWare again?

Capture.PNG_thumb

cmb

Yes, look at the timestamps on the colored lines there: that's the same packet only 0.0001 seconds later. Something is looping that system's multicast traffic back to it. VMware is the most likely candidate because it generally doesn't happen on physical switches, but it's possible that you have the ESX hosts configured fine and some other device is looping the traffic. That does 100% confirm the issue is looping multicast at least.
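
The timestamp comparison described here can also be scripted: flag any advertisement that arrives from the same source and vrid within a tiny fraction of a second of the previous one. This is only a sketch under assumptions. The 10 ms window is an arbitrary choice, the names are hypothetical, and the sample lines are made up in the style of the tcpdump output earlier in the thread, since the screenshot contents are not reproduced here.

```python
import re

# Assumed `tcpdump -ne` line format, as in the output quoted earlier in the thread.
ADVERT_RE = re.compile(
    r"^(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2}\.\d+) .*?: "
    r"(?P<src>\d+\.\d+\.\d+\.\d+) > 224\.0\.0\.18: .*?vrid (?P<vrid>\d+)"
)

def parse(line):
    """Return (seconds_since_midnight, src_ip, vrid), or None if no match."""
    m = ADVERT_RE.search(line)
    if not m:
        return None
    t = int(m["h"]) * 3600 + int(m["m"]) * 60 + float(m["s"])
    return t, m["src"], int(m["vrid"])

def looped_pairs(lines, window=0.01):
    """Find advertisements repeated within `window` seconds for the same
    (source IP, vrid). With a 1 s interval, such pairs suggest a loop."""
    last_seen = {}
    hits = []
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        t, src, vrid = parsed
        prev = last_seen.get((src, vrid))
        if prev is not None and 0 <= t - prev < window:
            hits.append((src, vrid, round(t - prev, 6)))
        last_seen[(src, vrid)] = t
    return hits

# Made-up sample lines (MACs, vrid, and times are illustrative, not from the thread).
SAMPLE = [
    "17:17:44.000100 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), "
    "length 70: 198.51.168.252 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, "
    "authtype none, intvl 1s, length 36",
    "17:17:44.000200 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), "
    "length 70: 198.51.168.252 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, "
    "authtype none, intvl 1s, length 36",
    "17:17:45.000150 00:00:5e:00:01:01 > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), "
    "length 70: 198.51.168.252 > 224.0.0.18: VRRPv2, Advertisement, vrid 1, prio 0, "
    "authtype none, intvl 1s, length 36",
]

if __name__ == "__main__":
    for src, vrid, delta in looped_pairs(SAMPLE):
        print(f"{src} vrid {vrid}: repeated after {delta:.6f} s")
```

Keying on source IP and vrid keeps advertisements for different CARP VIPs on the same interface from being mistaken for each other. The script shows that a packet came back, not which device sent it back.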

hphan082

Thanks CMB. I will work with the VMWare team to look into this.