CARP sporadically flopping to BACKUP and then back to MASTER


  • I've been running pfSense with CARP on 11 interfaces (three physical and a bunch of VLANs) for a few months now with no issues at all. Starting today I'm getting these randomly throughout the day. Always on the same VLAN:

    
    Mar 18 14:18:05	check_reload_status: Carp backup event
    Mar 18 14:18:05	kernel: carp: VHID 3@bce0_vlan6: MASTER -> BACKUP (more frequent advertisement received)
    Mar 18 14:18:06	php-fpm[7974]: /rc.carpbackup: Carp cluster member "10.6.0.6 - ADM CARP (3@bce0_vlan6)" has resumed the state "BACKUP" for vhid 3@bce0_vlan6
    Mar 18 14:18:08	check_reload_status: Carp master event
    Mar 18 14:18:08	kernel: carp: VHID 3@bce0_vlan6: BACKUP -> MASTER (master down)
    Mar 18 14:18:09	php-fpm[7974]: /rc.carpmaster: Carp cluster member "10.6.0.6 - ADM CARP (3@bce0_vlan6)" has resumed the state "MASTER" for vhid 3@bce0_vlan6
    
    

    I'm guessing there's some noise on that vlan, but I can't catch it with Wireshark. Can I tune something to make the failover threshold a bit more lenient?


  • This is still an issue on 2.2.4. I have tried adjusting skew on the backup node, but that didn't fix anything. Any pointers?

  • Rebel Alliance Developer Netgate

    Usually it's because something on the switch is preventing the heartbeats from arriving in a timely manner.

    Try increasing the advbase rather than the skew. The skew only adjusts in increments of 1/256th of a second; the base adds whole seconds.
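
To make the base/skew arithmetic concrete, here is a minimal sketch. It assumes the timing used by FreeBSD's CARP implementation as commonly documented: advertisements go out every advbase + advskew/256 seconds, and a backup waits roughly 3 × advbase + advskew/256 seconds without hearing the master before taking over. The function names are hypothetical and the model ignores scheduling jitter.

```python
def advert_interval(advbase: int, advskew: int) -> float:
    """Seconds between advertisements: whole seconds from the base,
    plus the skew in 1/256ths of a second."""
    return advbase + advskew / 256


def master_down_timeout(advbase: int, advskew: int) -> float:
    """Assumed time a backup waits without hearing the master before it
    promotes itself: three bases plus the skew fraction."""
    return 3 * advbase + advskew / 256


if __name__ == "__main__":
    # Skew 100 is only an illustrative value for a backup node.
    for base in (1, 2, 10, 30):
        print(base, advert_interval(base, 100), master_down_timeout(base, 100))
```

Under these assumptions, even the largest skew adds less than one second, while each step of the base adds about three seconds to the time a backup tolerates silence. That is why raising the base makes the cluster more forgiving of delayed heartbeats, at the cost of slower failover.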


  • @jimp:

    Try increasing the advbase rather than the skew. The skew only adjusts in increments of 1/256th of a second; the base adds whole seconds.

    That seems to have fixed the problem. Thank you!


  • Excellent suggestion, jimp, thank you! Yesterday I raised the base to 2 on the systems I set up, but they were still having issues this morning (it seemed to last longer), so I changed the base to 10 when I got in. So far so good. Is there a general rule of thumb for how far to raise the base? Thank you in advance!


  • A better question might be why is there so much contention on your network that the heartbeats are delayed to the point where failover is triggered?


  • There shouldn't be… it's a direct cable. Just had it happen again, about 90% of VIPs failed over.


  • You might want to capture the traffic on that link and see what's going on.


  • I too initially had the same VIP flip-flopping issue. I chalked it up to the lack of decent timer resolution when running my pfSense instances on ESXi.
    What worked for me in the end:
    MASTER: BASE=1
    BACKUP: BASE=10

    Be sure to validate that you don't have any duplicate VHIDs, or switch HSRP/VRRP groups using the same ID number, on any of your failover interfaces; that those interfaces can never see each other's traffic; and that IPv4 and IPv6 each use a different VHID on the same interface.
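
The duplicate-VHID check above can be sketched as a small script. This is only an illustration: the input format (a list of dicts with `interface`, `vhid`, and `address` keys), the function name, and the IPv6 address in the sample data are hypothetical, and the VIP list would have to be gathered by hand from each firewall.

```python
from collections import defaultdict


def find_vhid_conflicts(vips):
    """Return {(interface, vhid): [addresses]} for every VHID that is
    used by more than one VIP on the same interface."""
    seen = defaultdict(list)
    for vip in vips:
        seen[(vip["interface"], vip["vhid"])].append(vip["address"])
    return {key: addrs for key, addrs in seen.items() if len(addrs) > 1}


if __name__ == "__main__":
    # Sample data: the first entry matches the log excerpt in this thread;
    # the second is a made-up IPv6 VIP that reuses VHID 3.
    vips = [
        {"interface": "bce0_vlan6", "vhid": 3, "address": "10.6.0.6"},
        {"interface": "bce0_vlan6", "vhid": 3, "address": "fd00:6::6"},
    ]
    for (iface, vhid), addrs in find_vhid_conflicts(vips).items():
        print(f"VHID {vhid} on {iface} is shared by: {', '.join(addrs)}")
```

HSRP/VRRP group IDs configured on the switches could be added to the same list as extra entries, so that a clash with a CARP VHID on the same segment shows up as well.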


  • I forgot one critical detail.

    You must uncheck Synchronize Virtual IPs under System -> High Avail. Sync; otherwise the MASTER will keep overwriting the ADVBASE value.
    This also means you must manually configure the VIP address on each box, or check Synchronize Virtual IPs during the initial setup, then uncheck it before going into production and never check it again.

    I guess this is a bug, because ADVBASE and SKEW should not be overwritten on sync.


  • @ljorgensen:

    @jimp:

    Try increasing the advbase rather than the skew. The skew only adjusts in increments of 1/256th of a second; the base adds whole seconds.

    That seems to have fixed the problem. Thank you!

    That was premature, unfortunately. The problem persisted, only less often because fewer advertisements were being sent. I hooked up a Wireshark probe on the network today and was able to have a capture running when it happened.

    I see CARP packets once a second when things are working normally. When the Master reverts to backup, I see the same CARP packet repeated thousands of times (actually more than 20,000 times in one second in the capture).

    I'm not sure this is related to pfSense at all. I'm leaning more toward something in the network grossly misbehaving. The problem is that the error occurs in only one of the seven VLANs on the same interface. The other six behave just fine.

    I'm attaching the Wireshark dump to this post in the hope that someone will devote a few minutes to looking at it and tell me where to focus next. The master->backup switchover happens 348 seconds in (and is pretty obvious!).

    Lars

    CARP_VLAN6_multicast_storm.pcapng.gz


  • @ljorgensen:


    I see CARP packets once a second when things are working normally. When the Master reverts to backup, I see the same CARP packet repeated thousands of times (actually more than 20,000 times in one second in the capture).

    I'm not sure this is related to pfSense at all. I'm leaning more toward something in the network grossly misbehaving. The problem is that the error occurs in only one of the seven VLANs on the same interface. The other six behave just fine.
    ...

    Looks like there might be a loop in the network: the same packet is seen repeating over and over again as of packet 348, at a rate of around 20,000 pps. That should be setting off some alarms if you have broadcast/multicast storm control set up on your switches.
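
A burst like this can also be located without scrolling through the capture by counting packets per second. The sketch below assumes the packet timestamps have already been exported from the capture as seconds relative to its start (for example from Wireshark's time column); the function names and the threshold of 100 packets per second are arbitrary choices for illustration.

```python
from collections import Counter


def packets_per_second(timestamps):
    """Count packets in each whole-second bucket of the capture.
    Timestamps are assumed to be non-negative seconds since capture start."""
    return Counter(int(ts) for ts in timestamps)


def find_bursts(timestamps, threshold=100):
    """Return sorted (second, count) pairs whose count exceeds threshold."""
    counts = packets_per_second(timestamps)
    return sorted((sec, n) for sec, n in counts.items() if n > threshold)


if __name__ == "__main__":
    # Synthetic data shaped like the capture described in this thread:
    # one advertisement per second, then 20,000 packets within one second.
    normal = [float(s) for s in range(348)]
    storm = [348 + i / 20000 for i in range(20000)]
    for second, count in find_bursts(normal + storm):
        print(f"second {second}: {count} packets")
```

Normal CARP traffic at one advertisement per second stays far below any reasonable threshold, so any second this flags is worth inspecting for repeated copies of the same frame.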


  • @awebster:

    Looks like there might be a loop in the network: the same packet is seen repeating over and over again as of packet 348, at a rate of around 20,000 pps. That should be setting off some alarms if you have broadcast/multicast storm control set up on your switches.

    I don't have storm control set up on the switches, and it's not enough traffic to disrupt anything, so I won't bother. I'm guessing something in the network is making those packets loop around for a few seconds.

    I've tried adjusting the Advertising Frequency Base to 30 seconds, and that seems to have solved the problem. I haven't seen anything for a few days now, and it used to happen once every 10 to 15 minutes. It means I'll have a slower failover in the event of a network outage on the master, but that's not a problem compared to the previous situation, where the master spontaneously became backup numerous times throughout the day.

    Lars


  • @awebster:

    I guess this is a bug, because ADVBASE and SKEW should not be overwritten on sync.

    No, you need those to match.