Question about CARP VIP skew bug



  • http://redmine.pfsense.org/issues/2012

    I filed this 6 months ago, and haven't heard anything about it. That's understandable since it affects almost no one (you have to have a CARP cluster with 4+ members).

    I would like to find a fix and submit a patch, but I've trawled through the files trying to figure out where this value gets set and I haven't been able to find it.

    I'm also confused about when this value gets set (besides on first creation). I've had it reset after an upgrade, and I've had it reset on other occasions seemingly at random (it doesn't reset on reboot, at least not in my testing). Can anyone help with this?


  • Rebel Alliance Developer Netgate

    It's set by the system during config sync.

    If you want that many systems you can always drop the automatic VIP sync and manage the VIPs manually with custom skews.



  • @jimp:

    It's set by the system during config sync.

    If you want that many systems you can always drop the automatic VIP sync and manage the VIPs manually with custom skews.

    How would I go about doing it manually?

    I did find an area in the code where it seemed to be set during config sync, but what confused me is that I couldn't see where it adds 100. I also make configuration changes all the time and the skew doesn't get reset (maybe because the value isn't changing? I don't remember now). If you could point out exactly where I should be looking and how I could change it, it would be a huge help to me.

    To me it seems like a really easy fix to add a value significantly lower than 100 (maybe 10?), which would allow for a much higher number of members. Am I missing some ill effect that this could cause?

    Even if no one else is interested and it never makes it into pfSense proper, I can write a quick patch/package that stays on my local repo and that would be great for my environment.


  • Rebel Alliance Developer Netgate

    In /etc/rc.filter_synchronize:

    
    /*
     *  backup_vip_config_section($section): returns as an xml file string of
     *                                   the configuration section
     */
    function backup_vip_config_section() {
            global $config;
    
            if (!is_array($config['virtualip']['vip']))
                    return;
            $temp = array();
            $temp['vip'] = array();
            foreach($config['virtualip']['vip'] as $section) {
                    if(($section['mode'] == "proxyarp" || $section['mode'] == "ipalias") && !strstr($section['interface'], "_vip"))
                            continue;
                    if($section['advskew'] <> "") {
                            $section_val = intval($section['advskew']);
                            $section_val=$section_val+100;
                            if($section_val > 255)
                                    $section_val = 255;
                            $section['advskew'] = $section_val;
                    }
                    if($section['advbase'] <> "") {
                            $section_val = intval($section['advbase']);
                            if($section_val > 255)
                                    $section_val = 255;
                            $section['advbase'] = $section_val;
                    }
                    $temp['vip'][] = $section;
            }
            return $temp;
    }
    
    

    Specifically:

    $section_val=$section_val+100;
    

    IIRC in the CARP spec, the default value for a backup unit is 100.

    Perhaps a simple test there, if >= 100, then only add 10 or 20.
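
    A minimal sketch of that test, outside of pfSense. The helper name next_backup_skew(), the 20-point step, and the 254 ceiling (the largest advskew that ifconfig accepts) are assumptions for illustration; this is not the code in rc.filter_synchronize. It also assumes each member syncs to the next one in a chain.

```php
<?php
// Hypothetical helper, not pfSense code: picks the advskew to send to the
// next sync target. $small_step and $max_skew are illustrative defaults.
function next_backup_skew(int $skew, int $small_step = 20, int $max_skew = 254): int
{
    // Keep the stock +100 jump for the first backup; use the smaller
    // step once the incoming skew is already 100 or more.
    $next = ($skew >= 100) ? $skew + $small_step : $skew + 100;

    // Clamp to the assumed ceiling.
    return min($next, $max_skew);
}

// Walk a chain of sync targets, starting from a master at skew 0.
$skew = 0;
for ($member = 1; $member <= 4; $member++) {
    echo "member $member: advskew $skew\n";
    $skew = next_backup_skew($skew);
}
```

    Keeping the first jump at 100 leaves a two-node cluster behaving the way it does today; only the third and later members see the smaller step.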

    The problem you might run into is timing. The skew is an actual clock skew on the advertisements. Too close together and the VIPs could flap up/down from normal network latency.

    As for the automated sync setting, just go to Firewall > Virtual IPs on the CARP settings tab and untick the box to sync VIPs. Then you can manage them independently.



  • Thanks Jim! That's really good information. I didn't realize that it was an actual clock skew; that's pretty important. I imagine it's in milliseconds?

    Also, I see the reason why I have the "init" issue: when section_val > 255 it sets it to 255 but the maximum value is actually 254 (or at least that's what it is in the drop down in the GUI). So at the very least I can fix that and it will work for exactly 4 members.
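
    To convince myself, here is a rough standalone sketch of the skews a chain of members ends up with if the clamp is 254 instead of 255. The function name chained_skews() is made up for this post, and it assumes each member syncs to the next one and adds 100 along the way.

```php
<?php
// Hypothetical illustration, not pfSense code: the advskew each member in a
// sync chain would receive with the stock +100 step and a 254 ceiling.
function chained_skews(int $members, int $step = 100, int $max_skew = 254): array
{
    $skews = [];
    $skew = 0; // the master starts at skew 0
    for ($i = 0; $i < $members; $i++) {
        $skews[] = $skew;
        // Each hop adds the step, clamped to the ceiling.
        $skew = min($skew + $step, $max_skew);
    }
    return $skews;
}

echo implode(', ', chained_skews(4)), "\n"; // 0, 100, 200, 254
echo implode(', ', chained_skews(5)), "\n"; // 0, 100, 200, 254, 254
```

    With five members the last two tie at 254, which is why this fix alone only gets me to exactly 4.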



  • Okay I looked around a bit, and this is the best explanation I could find about the exact timing used by CARP:
    http://kerneltrap.org/node/5607

    With this in mind, I figure I could use skew increments of 35. Since the skew is counted in 1/256ths of a second, this would space the advertisements about 137 ms apart (35/256 s) and allow for 8 total members (skews 0 through 245). In the coming weeks I'll be doing some testing with this.
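
    Here is the arithmetic as a small sketch. It assumes the advertisement interval is advbase + advskew/256 seconds, an advbase of 1, and a maximum skew of 254; carp_schedule() is just a name I made up for this, not anything in pfSense.

```php
<?php
// Sketch only: lists the skews that fit under the ceiling for a given step
// and the advertisement interval each one implies.
// Assumes interval = advbase + advskew / 256 seconds, with advbase = 1.
function carp_schedule(int $step, int $advbase = 1, int $max_skew = 254): array
{
    if ($step < 1) {
        throw new InvalidArgumentException('step must be at least 1');
    }
    $schedule = [];
    for ($skew = 0; $skew <= $max_skew; $skew += $step) {
        // Key: advskew value; value: advertisement interval in seconds.
        $schedule[$skew] = $advbase + $skew / 256;
    }
    return $schedule;
}

$schedule = carp_schedule(35);
foreach ($schedule as $skew => $seconds) {
    printf("advskew %3d -> advertises every %.3f s\n", $skew, $seconds);
}
printf("%d members, %.1f ms apart\n", count($schedule), 35 / 256 * 1000);
```

    A step of 35 yields eight skews before the next one (280) would exceed the ceiling. Whether that spacing is wide enough to avoid flapping is exactly what the testing needs to show.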



  • What is the target of this?



  • The target, as in the version? I am working with 2.0.1 right now. I don't have the time to test out 2.1 snapshots; I'll probably mess with it once it hits RC (especially to get acquainted with the new package system). Hope this answers your question ermal; if not could you elaborate?



  • The question is why you need 8 failover machines?



  • @ermal:

    The question is why you need 8 failover machines?

    I see, I responded to this question when you posted it on the redmine bug entry. We have a VMware environment with a single vCenter that spans two datacenters that we have on site (upper and lower). I have two cluster members pinned to the upper and two to the lower, and in each site I have affinity rules that require each of the two to be on separate hosts. We want the quick failover of CARP and we want it to cover a variety of situations without having to wait for vSphere's HA to spin up the VM on another host in the event of failure.

    So right now I am only using 4. I may add more later (at least one more) since we will soon have two separate VMware environments. I may add some more on a host's local storage to keep us working in the event of some wider SAN issue.

    We aren't using these as firewalls or for routing; we're using HAProxy on them to load balance web/TCP applications and Microsoft Exchange (probably the most user sensitive application in our environment).



  • Well, this needs a lot of testing and probably a bit more thought than this.
    There are some complications, and probably some development is needed to allow this multilayer failover.

    Can you sketch out your architecture so I can understand it better?



  • Essentially from the pfSense side, none of the VMware stuff matters. It's just standard CARP; nothing special there. It works fine now with 3 members, and the 4th works too as long as I manually set its skew to something under 255. As Jim pointed out, I could simply not sync them and then manage the skews myself.

    I'm thinking you misunderstood what I meant because I don't believe that there is any multilayer (?) failover here, unless you mean having more than 2 members, but again that's something that CARP seems like it was designed to do.



  • For my purposes, I will set my skew values manually by not syncing VIPs automatically. However, the 255 value is a legitimate bug, so I've updated the bug report and opened a pull request on GitHub with the fix:
    https://github.com/bsdperimeter/pfsense/pull/127
    http://redmine.pfsense.org/issues/2012

