Random multiple master



  • Hello Board,

    In our CARP setup, the backup system sometimes suddenly decides it has to become master. This happens at random, and we have not found a way to reproduce it.

    The system log on the master shows the following:
    Mar 13 17:36:56 pfSense2 kernel: opt3_vip7: link state changed to UP
    Mar 13 17:36:56 pfSense2 kernel: opt3_vip6: link state changed to UP
    Mar 13 17:36:56 pfSense2 kernel: lan_vip3: MASTER -> BACKUP (more frequent advertisement received)
    Mar 13 17:36:57 pfSense2 kernel: opt3_vip9: MASTER -> BACKUP (more frequent advertisement received)
    Mar 13 17:36:57 pfSense2 kernel: opt3_vip7: MASTER -> BACKUP (more frequent advertisement received)
    Mar 13 17:36:57 pfSense2 kernel: opt3_vip6: MASTER -> BACKUP (more frequent advertisement received)
    Mar 13 17:36:59 pfSense2 kernel: lan_vip3: link state changed to DOWN
    Mar 13 17:36:59 pfSense2 kernel: opt3_vip9: link state changed to DOWN
    Mar 13 17:36:59 pfSense2 kernel: opt3_vip7: link state changed to DOWN
    Mar 13 17:36:59 pfSense2 kernel: opt3_vip6: link state changed to DOWN
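To see at a glance which VIPs flap and why, the state transitions can be pulled out of a log excerpt like the one above. This is only a rough sketch: the regular expression is written against the line format shown in this post, and the names `TRANSITION`, `carp_transitions` and `SAMPLE` are made up for illustration, not part of pfSense.

```python
import re
from collections import defaultdict

# Matches kernel CARP state changes in the format shown in this thread, e.g.
#   Mar 13 17:36:57 pfSense2 kernel: opt3_vip9: MASTER -> BACKUP (more frequent advertisement received)
# "link state changed" lines do not match because they contain no "X -> Y" part.
TRANSITION = re.compile(
    r"^\s*(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<vip>\S+):\s+(?P<old>[A-Z]+)\s+->\s+(?P<new>[A-Z]+)"
    r"(?:\s+\((?P<reason>[^)]*)\))?"
)


def carp_transitions(lines):
    """Group CARP state transitions by VIP name.

    Returns {vip: [(timestamp, old_state, new_state, reason), ...]}.
    """
    events = defaultdict(list)
    for line in lines:
        m = TRANSITION.match(line)
        if m:
            events[m["vip"]].append((m["ts"], m["old"], m["new"], m["reason"]))
    return dict(events)


# Lines copied from the log excerpt in this post.
SAMPLE = """\
Mar 13 17:36:56 pfSense2 kernel: opt3_vip7: link state changed to UP
Mar 13 17:36:56 pfSense2 kernel: lan_vip3: MASTER -> BACKUP (more frequent advertisement received)
Mar 13 17:36:57 pfSense2 kernel: opt3_vip9: MASTER -> BACKUP (more frequent advertisement received)
Mar 13 17:36:57 pfSense2 kernel: opt3_vip7: MASTER -> BACKUP (more frequent advertisement received)
Mar 13 17:36:57 pfSense2 kernel: opt3_vip6: MASTER -> BACKUP (more frequent advertisement received)
Mar 13 17:36:59 pfSense2 kernel: lan_vip3: link state changed to DOWN
"""

if __name__ == "__main__":
    for vip, changes in sorted(carp_transitions(SAMPLE.splitlines()).items()):
        for ts, old, new, reason in changes:
            print(f"{ts}  {vip}: {old} -> {new} ({reason})")
```

Feeding it a longer stretch of the system log should make it easier to see whether the same interfaces are always involved.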

    We also see the console being flooded with "interrupt storm detected" messages.

    Searching the forum gives multiple suggestions for resolving this, but so far nothing has helped.

    • We tried setting these sysctl tunables:

    net.inet.carp.preempt=1
    net.inet.carp.allow=1
    net.inet.carp.log=1
    net.inet.carp.drop_echoed=1
    hw.intr_storm_threshold=10000

    • When checking which device is causing the storm (vmstat -i), we see it is one particular internet connection. We moved this connection from the onboard NIC to our more powerful quad-port Intel NIC. This did not help.

    Our setup is as follows:

    pf1
    lan 172.17.7.253/16
    carp 10.155.0.1/24
    wan1
    wan2
    wan3
    dmz

    pf2
    lan 172.17.7.252/16
    carp 10.155.0.2/24
    wan1
    wan2
    wan3
    dmz

    Both share 12 virtual IPs of type CARP. The carp interface is a dedicated physical NIC port.

    When both systems are up and running, I can reboot the master and the backup takes over. When the master comes back, it takes over its functions again and becomes master, and the second system shows as backup again in the CARP status.

    The setup follows this guide: http://www.howtoforge.com/how-to-configure-a-pfsense-2.0-cluster-using-carp

    Does someone have a pointer on where to start troubleshooting this?

    Thank you in advance. If more information is needed (more IP information, for example), I'm glad to provide it.

    Regards,

    Mayk



  • I had a similar issue here. The WAN VIPs on the backup pfSense also became master while the primary was still master.

    For me it helped to turn off flow control on the WAN interface by adding "hw.<if_name>.fc_setting=0".
    Maybe give it a try.

    Regards,

    Richard


  • Rebel Alliance Developer Netgate

    The problem isn't with pfSense, but at layer 2.

    If anything interferes with the multicast heartbeats, you'll get multiple masters if the secondary can't see the heartbeats from the master.

    So either the two are not visible in the same subnet, something (not pfSense) is blocking their CARP heartbeats, or there could be a VHID conflict with some other CARP/VRRP/HSRP device on the same layer 2.
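One reason a VHID clash matters: CARP derives the virtual MAC address from the VHID using the same 00:00:5e:00:01:xx prefix that VRRP uses for IPv4, so two unrelated redundancy groups with the same ID on one layer 2 segment end up sharing a MAC. A minimal sketch of a conflict check follows. The inventory format, the segment names and the VHID numbers are hypothetical and have to be filled in by hand from your own configuration.

```python
from collections import defaultdict


def carp_virtual_mac(vhid):
    """Return the virtual MAC that CARP (and IPv4 VRRP) uses for a given ID."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be between 1 and 255")
    return "00:00:5e:00:01:%02x" % vhid


def find_vhid_conflicts(groups):
    """Find IDs used by more than one redundancy group on the same segment.

    groups: iterable of (segment, vhid, group_name), one entry per CARP VIP
    or VRRP group (not one entry per firewall node).
    Returns {(segment, vhid): [group_name, ...]} for conflicting entries only.
    """
    seen = defaultdict(list)
    for segment, vhid, name in groups:
        seen[(segment, vhid)].append(name)
    return {key: names for key, names in seen.items() if len(names) > 1}


# Hypothetical inventory: segment names and VHID numbers are invented.
INVENTORY = [
    ("lan", 3, "lan_vip3"),
    ("opt3", 6, "opt3_vip6"),
    ("opt3", 7, "opt3_vip7"),
    ("opt3", 7, "isp-router-vrrp"),  # some other device reusing ID 7
]

if __name__ == "__main__":
    for (segment, vhid), names in find_vhid_conflicts(INVENTORY).items():
        print(f"{segment}: ID {vhid} ({carp_virtual_mac(vhid)}) used by {names}")
```

The same VHID on two different segments is not flagged, since the groups never see each other's advertisements.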



  • @jimp:

    The problem isn't with pfSense, but at layer 2.

    If anything interferes with the multicast heartbeats, you'll get multiple masters if the secondary can't see the heartbeats from the master.

    So either the two are not visible in the same subnet, something (not pfSense) is blocking their CARP heartbeats, or there could be a VHID conflict with some other CARP/VRRP/HSRP device on the same layer 2.

    Hello. Thank you for the answer. My guess has always been that something is interfering, but I can't pinpoint it. Regarding the carp interface, the two machines are directly attached with a crossover cable. That would rule out the theory of interference from a switch.
    What bothers me is that I also see "more frequent advertisement received" being logged on the LAN and other interfaces. I will go and double-check the VHID settings.

    Thank you again for the reply. Are there other things to check, or data I can provide to analyse this further?

    @viragoman, thanks for the tip. I will check this out too.

    Regards,
    Mayk


  • Rebel Alliance Developer Netgate

    @mayk:

    Hello. Thank you for the answer. My guess has always been that something is interfering, but I can't pinpoint it. Regarding the carp interface, the two machines are directly attached with a crossover cable. That would rule out the theory of interference from a switch.

    That's incorrect. CARP heartbeats are sent on every interface that has a CARP VIP. They are not sent over the sync/crossover cable.



  • Thank you for clearing that up. That is very useful information. All the connections are on separate VLANs. I think it is wise to check them too.



  • A quick question. When the heartbeat happens on all interfaces with a VIP, is there a way to monitor this? Logs? For example, am I correct to assume that if a heartbeat does not arrive on time on the slave, it will assume there is a failure and become master for that VIP? Is there an option to adjust the timeout settings for the heartbeat, and is it wise to do so?


  • Rebel Alliance Developer Netgate

    The way to monitor it: If the heartbeats stop being seen by the slave, it takes over as master. It's logged in the system log.

    If you want to decrease the sensitivity, increase the advbase on the VIPs. A higher base means that it will be less sensitive to a problem but it also takes longer to detect an outage.
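To get a feel for the trade-off, here is a rough sketch of the timing arithmetic. It assumes the behaviour of the FreeBSD CARP code as far as I understand it: advertisements go out every advbase + advskew/256 seconds, and a backup waits about 3 × advbase + advskew/256 seconds without hearing the master before it takes over. The function names and the example values are made up for illustration.

```python
def advertisement_interval(advbase, advskew):
    """Seconds between CARP advertisements for a VIP (assumed formula)."""
    return advbase + advskew / 256.0


def master_down_timeout(advbase, advskew):
    """Seconds a backup waits without advertisements before taking over
    (assumed formula, based on my reading of the FreeBSD implementation)."""
    return 3 * advbase + advskew / 256.0


if __name__ == "__main__":
    # Example values only: compare advbase 1 with advbase 3 on a backup
    # node whose advskew is 100.
    for advbase in (1, 3):
        print(
            f"advbase={advbase} advskew=100: "
            f"advertise every {advertisement_interval(advbase, 100):.2f}s, "
            f"take over after about {master_down_timeout(advbase, 100):.2f}s"
        )
```

Under these assumptions, raising advbase from 1 to 3 roughly triples how long heartbeats may go missing before the backup promotes itself, which also means a real outage takes that much longer to be noticed.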

