Primary node is not willing to become master. Keeps falling back to backup state



  • I have two pfSense 2.2.6 boxes with IPs on the LAN interface and a VIP defined with type CARP.
    The backup node (10.0.0.3) is working as expected… it works as backup.
    The primary node (10.0.0.2) however is not willing to become master of the VIP (10.0.0.1).

    In the system.log I find the following messages :

    Feb  5 15:49:12 pfsense-box1 kernel: carp: VHID 3@vmx1: BACKUP -> MASTER (master down)
    Feb  5 15:49:12 pfsense-box1 kernel: carp: VHID 3@vmx1: MASTER -> BACKUP (more frequent advertisement received)
    Feb  5 15:49:12 pfsense-box1 check_reload_status: Carp master event
    Feb  5 15:49:12 pfsense-box1 check_reload_status: Carp backup event
    Feb  5 15:49:13 pfsense-box1 php-fpm[81635]: /rc.carpmaster: Carp cluster member "10.0.0.1 -  (3@vmx1)" has resumed the state "MASTER" for vhid 3@vmx1
    Feb  5 15:49:13 pfsense-box1 php-fpm[81635]: /rc.carpbackup: Carp cluster member "10.0.0.1 -  (3@vmx1)" has resumed the state "BACKUP" for vhid 3@vmx1
    

    From what I understand of this log, the node decides the master is down and promotes itself to master.
    But in the same second it logs "more frequent advertisement received", which sends the node straight back to the backup state.
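
    For reference, the rule behind that log message can be sketched roughly as follows. This is a simplified Python model, not the kernel code: each CARP node advertises every advbase + advskew/256 seconds, and a master that hears advertisements at a shorter interval than its own steps down. The names `Advert` and `master_yields` are made up for illustration, and treating an equal interval as "yield" is an assumption based on my reading of the FreeBSD CARP code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Advert:
    """Advertisement parameters of one CARP node (hypothetical helper type)."""
    advbase: int  # base interval in whole seconds
    advskew: int  # extra delay in units of 1/256 second (0..254)

    def interval_ticks(self) -> int:
        # Interval expressed in 1/256-second ticks to avoid float comparisons.
        return self.advbase * 256 + self.advskew


def master_yields(own: Advert, received: Advert) -> bool:
    """Return True if a MASTER with `own` settings would drop to BACKUP
    after hearing `received` (assumption: equal intervals also yield)."""
    return received.interval_ticks() <= own.interval_ticks()


# Typical intended setup: primary skew 0, backup skew 100.
primary = Advert(advbase=1, advskew=0)
backup = Advert(advbase=1, advskew=100)

print(master_yields(primary, backup))  # primary hears the backup
print(master_yields(backup, primary))  # backup hears the primary
```

    If this model is right, a primary that keeps logging "more frequent advertisement received" is hearing advertisements with an interval no longer than its own, so comparing the effective advbase/advskew on both nodes would be a reasonable next check.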

    I have already isolated both nodes so that the LAN interfaces of the two pfSense boxes are the only members of the LAN.
    Also, both boxes are running as virtual machines on a VMware server, and the options "Promiscuous Mode", "MAC Address Changes" and "Forged Transmits" are set to "Accept".
    When doing a tcpdump, I can see the multicast message from the primary node arrive at the secondary node.
    The trace might give some clues about the issue, however, and I am hoping someone will have an "aha" moment on this…
    What I see in the trace :

    | Source | Dest | Proto | Info |
    | 10.0.0.2 | 224.0.0.18 | VRRP | Announcement (Current master has stopped participating in VRRP) |
    | Vmware_mac_prim | Broadcast | ARP | Gratuitous ARP for 10.0.0.1 (Request) |
    | VMware_mac_back | IETF-VRRP-VRID_03 | ARP | Gratuitous ARP for 10.0.0.1 (Reply) (duplicate use of 10.0.0.1 detected!) |

    As you can see, there is a VRRP announcement from the primary box (10.0.0.2).
    Then the primary box sends a gratuitous ARP request for 10.0.0.1, which the backup node answers with an ARP reply that is flagged as duplicate use of the VIP 10.0.0.1. Maybe this is the root cause of the issue, but the question is why.

    So the main question is: how do I get the primary to become master of the VIP?



  • A couple of questions.
    Is the VM host using load balancing over multiple NICs? If so, make sure it is set to IP Hash with the switch configured accordingly.
    Have you created separate port groups on the virtual switch, with promiscuous mode enabled only on the group that carries the VRRP traffic? Port groups are probably the way forward.

    CARP, VRRP, etc. are notoriously idiosyncratic on VMware.

