CARP on 2.2.1, VMware 5.5 with dvS



  • hi everyone,
    I'm running into this old issue where both firewalls are stuck in Backup status.
    I followed https://doc.pfsense.org/index.php/CARP_Configuration_Troubleshooting#VMware_ESX.2FESXi_Users and this thread: https://forum.pfsense.org/index.php?topic=64022.0

    We use a dvS, and we have already enabled the advanced setting. I have promiscuous mode enabled on the LAN and WAN port-groups, but not on the sync port-group.

    Both of my firewalls are still stuck in Backup.

    Below is a short capture of the log on the master.
    Mar 26 17:17:42 php-fpm[58491]: /rc.carpbackup: Carp cluster member "10.22.2.254 - daas.management.vip (2@em0_vlan2202)" has resumed the state "BACKUP" for vhid 2@em0_vlan2202
    Mar 26 17:17:42 php-fpm[57375]: /rc.carpbackup: Carp cluster member "10.22.5.254 - daas.dmz.vip (3@em0_vlan2205)" has resumed the state "BACKUP" for vhid 3@em0_vlan2205
    Mar 26 17:17:44 check_reload_status: Carp master event
    Mar 26 17:17:44 kernel: carp: VHID 1@em1: BACKUP -> MASTER (master down)
    Mar 26 17:17:44 kernel: carp: VHID 1@em1: MASTER -> BACKUP (more frequent advertisement received)
    Mar 26 17:17:44 check_reload_status: Carp backup event
    Mar 26 17:17:44 check_reload_status: Carp master event
    Mar 26 17:17:44 kernel: carp: VHID 3@em0_vlan2205: BACKUP -> MASTER (master down)
    Mar 26 17:17:44 kernel: carp: VHID 2@em0_vlan2202: BACKUP -> MASTER (master down)
    Mar 26 17:17:44 kernel: carp: VHID 3@em0_vlan2205: MASTER -> BACKUP (more frequent advertisement received)
    Mar 26 17:17:44 kernel: carp: VHID 2@em0_vlan2202: MASTER -> BACKUP (more frequent advertisement received)
    Mar 26 17:17:44 check_reload_status: Carp master event
    Mar 26 17:17:44 check_reload_status: Carp backup event
    Mar 26 17:17:44 check_reload_status: Carp backup event
    Mar 26 17:17:45 php-fpm[58491]: /rc.carpmaster: Carp cluster member "198.51.168.254 - daas.pub.vip (1@em1)" has resumed the state "MASTER" for vhid 1@em1
    Mar 26 17:17:45 php-fpm[58491]: /rc.carpbackup: Carp cluster member "198.51.168.254 - daas.pub.vip (1@em1)" has resumed the state "BACKUP" for vhid 1@em1
    Mar 26 17:17:45 php-fpm[58491]: /rc.carpmaster: Carp cluster member "10.22.5.254 - daas.dmz.vip (3@em0_vlan2205)" has resumed the state "MASTER" for vhid 3@em0_vlan2205
    Mar 26 17:17:45 php-fpm[58491]: /rc.carpmaster: Carp cluster member "10.22.2.254 - daas.management.vip (2@em0_vlan2202)" has resumed the state "MASTER" for vhid 2@em0_vlan2202
    Mar 26 17:17:45 php-fpm[58491]: /rc.carpbackup: Carp cluster member "10.22.5.254 - daas.dmz.vip (3@em0_vlan2205)" has resumed the state "BACKUP" for vhid 3@em0_vlan2205
    Mar 26 17:17:45 php-fpm[58491]: /rc.carpbackup: Carp cluster member "10.22.2.254 - daas.management.vip (2@em0_vlan2202)" has resumed the state "BACKUP" for vhid 2@em0_vlan2202
    Mar 26 17:17:47 check_reload_status: Carp master event
    Mar 26 17:17:47 kernel: carp: VHID 1@em1: BACKUP -> MASTER (master down)
    Mar 26 17:17:47 kernel: carp: VHID 1@em1: MASTER -> BACKUP (more frequent advertisement received)
    Mar 26 17:17:47 check_reload_status: Carp backup event
    Mar 26 17:17:47 check_reload_status: Carp master event
    Mar 26 17:17:47 kernel: carp: VHID 2@em0_vlan2202: BACKUP -> MASTER (master down)
    Mar 26 17:17:47 kernel: carp: VHID 3@em0_vlan2205: BACKUP -> MASTER (master down)
    Mar 26 17:17:47 kernel: carp: VHID 2@em0_vlan2202: MASTER -> BACKUP (more frequent advertisement received)
    Mar 26 17:17:47 kernel: carp: VHID 3@em0_vlan2205: MASTER -> BACKUP (more frequent advertisement received)
    Mar 26 17:17:47 check_reload_status: Carp master event
    Mar 26 17:17:47 check_reload_status: Carp backup event
    Mar 26 17:17:47 check_reload_status: Carp backup event
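
The pattern in the log above, where each VHID is promoted to MASTER and demoted again within the same second, can be counted from a saved system log. Below is a minimal Python sketch, not part of pfSense. `count_flaps` and the regular expression are illustrative names written against the kernel log format shown in this post, and the sample lines are copied from it.

```python
import re
from collections import Counter

# Matches kernel CARP transition lines as they appear in the log above, e.g.
# "Mar 26 17:17:44 kernel: carp: VHID 1@em1: BACKUP -> MASTER (master down)"
TRANSITION_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) kernel: carp: "
    r"VHID (?P<vhid>\S+): (?P<old>\w+) -> (?P<new>\w+) \((?P<reason>[^)]*)\)"
)


def count_flaps(lines):
    """Count, per VHID, promotions to MASTER that are undone in the same second."""
    pending = {}       # vhid -> timestamp of the most recent promotion to MASTER
    flaps = Counter()  # vhid -> number of same-second MASTER/BACKUP round trips
    for line in lines:
        match = TRANSITION_RE.search(line)
        if not match:
            continue  # php-fpm and check_reload_status lines are ignored
        vhid, stamp = match["vhid"], match["stamp"]
        if match["new"] == "MASTER":
            pending[vhid] = stamp
        elif match["new"] == "BACKUP" and pending.pop(vhid, None) == stamp:
            flaps[vhid] += 1
    return flaps


# Sample lines copied from the log in this post.
SAMPLE = [
    "Mar 26 17:17:44 check_reload_status: Carp master event",
    "Mar 26 17:17:44 kernel: carp: VHID 1@em1: BACKUP -> MASTER (master down)",
    "Mar 26 17:17:44 kernel: carp: VHID 1@em1: MASTER -> BACKUP (more frequent advertisement received)",
    "Mar 26 17:17:44 kernel: carp: VHID 3@em0_vlan2205: BACKUP -> MASTER (master down)",
    "Mar 26 17:17:44 kernel: carp: VHID 2@em0_vlan2202: BACKUP -> MASTER (master down)",
    "Mar 26 17:17:44 kernel: carp: VHID 3@em0_vlan2205: MASTER -> BACKUP (more frequent advertisement received)",
    "Mar 26 17:17:44 kernel: carp: VHID 2@em0_vlan2202: MASTER -> BACKUP (more frequent advertisement received)",
]

flaps = count_flaps(SAMPLE)
for vhid, count in sorted(flaps.items()):
    print(f"{vhid}: {count} same-second flap(s)")
```

A VHID that keeps showing up in this count is being promoted and immediately demoted, which is the symptom described in this thread. The script only summarizes the log; it does not say what is sending the extra advertisements.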





  • You might also check which build of ESXi 5.5 you are on. We had some other issues on 5.5 build 1331820 before we upgraded to 1623387.



  • hi CMB,
    I followed that document, but it doesn't work.

    Rick, we are running 5.5 build 1892794. :) I'll talk to our VMware team to see if they have a newer version to upgrade these hosts to.



  • Current build for 5.5 is 1993072 I believe.



  • @hphan082:

    hi CMB,
    I followed that document, but it doesn't work.

    That most definitely fixes the problem you're seeing. It has to be set on every host that has a promiscuous port group so none of them loop multicast. I've done that on many, many ESX hosts from a variety of versions and it's always immediately worked with one odd exception - one ESX host in particular just wouldn't obey that config setting until rebooting ESX. Most of the time though, when that doesn't work it's because it wasn't set on all the hosts and some host is still looping the multicast.

    The other possibility is there is something else on your network that's looping multicast traffic, but that's unlikely.



  • hi cmb,
    I have seriously tried everything I can to get this to work.
    I will be away for a 10-day bootcamp. I'll ask our virtualization manager to reboot both hosts while I am gone and will try again when I'm back.



  • hi cmb,
    we did the entire thing one more time and rebooted both hosts. I still get nothing. I attached a few screenshots here for your review, including the host advanced settings, the dvS port-group setting, and the firewall log.







    ![Host Advanced Settings.JPG](/public/imported_attachments/1/Host Advanced Settings.JPG)
    ![firewall log.PNG](/public/imported_attachments/1/firewall log.PNG)



  • That all looks correct. You can verify the looping multicast with a packet capture. Via SSH command prompt:

    tcpdump -nei em0 vrrp
    

    The system will send one advertisement per second. You'll see something similar to the following.

    22:00:59.909437 00:00:5e:00:01:0a > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 70: 10.0.0.2 > 224.0.0.18: VRRPv2, Advertisement, vrid 10, prio 0, authtype none, intvl 1s, length 36
    22:01:00.910396 00:00:5e:00:01:0a > 01:00:5e:00:00:12, ethertype IPv4 (0x0800), length 70: 10.0.0.2 > 224.0.0.18: VRRPv2, Advertisement, vrid 10, prio 0, authtype none, intvl 1s, length 36
    
    

    You want to see only one per CARP IP on that interface per second. You'll see duplicates there more than likely.
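
As a rough way to check a saved capture for the duplicates described above, here is a minimal Python sketch. It assumes the text output format of `tcpdump -nei em0 vrrp` shown in this post. `find_duplicates`, the 10 ms `window`, and the second sample line (the first sample line with its timestamp shifted by 0.0001 s to imitate a looped copy) are assumptions for illustration, not real capture data.

```python
import re

# Pulls the timestamp, source IP and vrid out of one tcpdump text line.
LINE_RE = re.compile(
    r"(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) .* "
    r"(?P<src>\d+\.\d+\.\d+\.\d+) > 224\.0\.0\.18: .*\bvrid (?P<vrid>\d+)"
)


def to_seconds(ts):
    """Convert HH:MM:SS.ffffff to seconds since midnight (no midnight wrap handling)."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def find_duplicates(lines, window=0.01):
    """Return ((src, vrid), delta) for advertisements repeated within `window` seconds.

    Lines are assumed to be in capture order. The window is a hypothetical
    threshold, far below the 1 second advertisement interval.
    """
    last_seen = {}   # (src, vrid) -> time of the previous advertisement
    duplicates = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        key = (match["src"], int(match["vrid"]))
        now = to_seconds(match["ts"])
        previous = last_seen.get(key)
        if previous is not None and 0 <= now - previous < window:
            duplicates.append((key, now - previous))
        last_seen[key] = now
    return duplicates


TAIL = (", ethertype IPv4 (0x0800), length 70: 10.0.0.2 > 224.0.0.18: VRRPv2, "
        "Advertisement, vrid 10, prio 0, authtype none, intvl 1s, length 36")
SAMPLE = [
    "22:00:59.909437 00:00:5e:00:01:0a > 01:00:5e:00:00:12" + TAIL,
    # Illustrative looped copy: same packet, timestamp shifted by 0.0001 s.
    "22:00:59.909537 00:00:5e:00:01:0a > 01:00:5e:00:00:12" + TAIL,
    "22:01:00.910396 00:00:5e:00:01:0a > 01:00:5e:00:00:12" + TAIL,
]

duplicates = find_duplicates(SAMPLE)
for (src, vrid), delta in duplicates:
    print(f"{src} vrid {vrid}: repeated after {delta:.6f} s")
```

Keying on source IP plus vrid keeps the two firewalls apart, since each one advertises from its own interface address. Advertisements from the same node that arrive a full second apart are normal and are not reported.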



  • hi cmb,
    I ran tcpdump on both firewalls, and below are screenshots of what I see. Every second, I see 2 messages from 198.51.168.252 (VRRP) to 224.0.0.18, so it looks like I get 2 packets every second.

    So this should confirm that we are hitting a bug with VMware again?




  • Yes, look at the timestamps on the colored lines there: that's the same packet only 0.0001 seconds later. Something is looping that system's multicast traffic back to it. VMware is the most likely candidate because it generally doesn't happen on physical switches, but it's possible that you have the ESX hosts configured fine and some other device is looping the traffic. That does 100% confirm the issue is looping multicast at least.



  • Thanks CMB. I will work with the VMware team to look into this.

