Adding CARP VIPs causes Pair to start Crashing



  • I currently manage four HA pairs of pfSense firewalls, plus a number of standalone firewalls.  I've been trying to get a pair set up at one site for a few months now, and I just can't get it working right.

    Basically, as soon as I start adding CARP VIPs to the pair, it becomes unstable and starts crashing.

    For example, about an hour ago I added a single CARP VIP to vlan229 on lagg1, which replicated to the secondary without issues.  To test stability I disabled CARP on the primary fw to force a failover, which worked; then I re-enabled CARP on the primary and the VIPs failed back with no problem.  I then rebooted the secondary fw and it never came back up.  After ten minutes I remotely power-cycled the server, and five minutes later the secondary fw was up and running.  I then successfully rebooted the primary, and successfully rebooted the secondary.  It seemed stable.  Then I added another CARP VIP to vlan3 on lagg0, and as soon as the VIP replicated to the secondary fw, the secondary crashed.  Approximately five minutes later, when the secondary finished booting, the primary crashed; as soon as the primary finished booting, the secondary crashed.

    This is technically the 3rd pair of physical servers I've tried to set this up on.  One of them actually had a flaky NIC, which caused quite a few problems while I was trying to implement this config.  So I think I've ruled out hardware issues on the FWs; pfSense just doesn't like something I'm trying to do.  I've submitted a crap-ton of crash reports, and I can provide the hostnames of these fws to the devs if they PM me.

    A high-level overview of the config is as follows:

    2x Dell R410s, each with a 6-core Intel CPU, 16GB RAM, 500GB HDD.
    2x Onboard Broadcom NICs, 2x Intel Quad-Port Adapters
    pfSense v2.1.4 w/ CARP+IPAlias patch applied

    /boot/loader.conf.local
      kern.ipc.nmbclusters="131072"
      hw.bce.tso_enable="0"
      hw.pci.enable_msix="0"

    igb7 = WAN
    igb6 = lagg1
    igb5 = lagg1
    igb4 = Network1
    igb3 = lagg0
    igb2 = lagg0
    igb1 = lagg0
    igb0 = lagg0
    bce0 = Network2
    bce1 = pfSync

    lagg0 is LACP, and has 10 vlans plus a network directly assigned to lagg0 (untagged traffic)
        lagg0_untagged >----------< There are a few Windows NLBs on this subnet
        lagg0_vlan3
        lagg0_vlan12
        lagg0_vlan16 >----------< Another FW Pair using CARP VHID 34
        lagg0_vlan22
        lagg0_vlan32
        lagg0_vlan185
        lagg0_vlan186
        lagg0_vlan228
        lagg0_vlan230 >----------< There are a few Windows NLBs, and a Pair of Barracudas on this subnet
        lagg0_vlan320
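
    Since vlan16 already carries another pair's CARP traffic on VHID 34, one thing worth ruling out is a VHID collision.  CARP derives its virtual MAC from the VHID (00:00:5e:00:01:XX, the same scheme VRRP uses), so two pairs using the same VHID on one broadcast domain would both claim the same MAC.  Below is a minimal Python sketch of that check.  Only the "other-pair" entry comes from this post; the VHIDs listed for the new pair are made-up placeholders, since the real ones are not given here.

```python
from collections import defaultdict


def carp_virtual_mac(vhid: int) -> str:
    """Return the virtual MAC CARP uses for a VHID (00:00:5e:00:01:<vhid in hex>)."""
    if not 1 <= vhid <= 255:
        raise ValueError(f"VHID must be 1-255, got {vhid}")
    return f"00:00:5e:00:01:{vhid:02x}"


def find_vhid_collisions(vips):
    """Report VHIDs claimed by more than one owner on the same segment.

    `vips` is an iterable of (segment, vhid, owner) tuples.
    """
    owners = defaultdict(set)
    for segment, vhid, owner in vips:
        owners[(segment, vhid)].add(owner)
    return {key: sorted(names) for key, names in owners.items() if len(names) > 1}


# Only the first entry is taken from the post; the "new-pair" VHIDs are hypothetical.
vips = [
    ("lagg0_vlan16", 34, "other-pair"),
    ("lagg0_vlan16", 34, "new-pair"),    # hypothetical clash, for illustration
    ("lagg0_vlan3", 3, "new-pair"),      # hypothetical
    ("lagg1_vlan229", 229, "new-pair"),  # hypothetical
]

for (segment, vhid), names in find_vhid_collisions(vips).items():
    print(f"{segment}: VHID {vhid} ({carp_virtual_mac(vhid)}) used by {', '.join(names)}")
```

    The same VHID on two different VLANs is not flagged, because the check is keyed on the segment as well as the VHID.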

    lagg1 is LACP, and has 7 vlans plus a network directly assigned to lagg1 (untagged traffic)
        lagg1_untagged
        lagg1_vlan4
        lagg1_vlan5
        lagg1_vlan6
        lagg1_vlan14
        lagg1_vlan229
        lagg1_vlan261
        lagg1_vlan262

    lagg0 is connected to a pair of Netgear GS728TS Switches (v1h1 B5.2.0.2 V5.3.0.17)
      fw1 is connected to sw1 g1/g2/g3/g4
      fw2 is connected to sw2 g1/g2/g3/g4

    lagg1 is connected to a pair of Netgear GS752TS Switches (H00.00.01 B1.0.2.0 V5.1.0.2)
      fw1 is connected to sw1/g1 and sw2/g1
      fw2 is connected to sw1/g2 and sw2/g2
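
    The cabling above differs between the two laggs: each firewall's lagg0 lands on a single switch, while each lagg1 is split across both switches.  An LACP bundle split across two switches only negotiates cleanly if those switches present themselves as one logical LACP partner (a stack, for example); whether that is the case for the GS752TS pair is not stated here.  A small Python sketch that flags which laggs span switches, using the port lists above (the "gs728"/"gs752" prefixes are just labels added to tell the two switch pairs apart):

```python
# Switch/port connections per (firewall, lagg), copied from the cabling list above.
lagg_links = {
    ("fw1", "lagg0"): [("gs728-sw1", p) for p in ("g1", "g2", "g3", "g4")],
    ("fw2", "lagg0"): [("gs728-sw2", p) for p in ("g1", "g2", "g3", "g4")],
    ("fw1", "lagg1"): [("gs752-sw1", "g1"), ("gs752-sw2", "g1")],
    ("fw2", "lagg1"): [("gs752-sw1", "g2"), ("gs752-sw2", "g2")],
}


def laggs_spanning_switches(links):
    """Return the (firewall, lagg) keys whose member ports sit on more than one switch."""
    return sorted(
        key for key, ports in links.items()
        if len({switch for switch, _port in ports}) > 1
    )


for firewall, lagg in laggs_spanning_switches(lagg_links):
    print(f"{firewall} {lagg} spans multiple switches")
```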

    ... This pair is supposed to replace an aging pfSense fw at a data center (I inherited this stuff).  The old FW is starting to become unstable.  It is set up very similarly to the above pair, except that it does not have an HA interface, there is only a single port in lagg1, and all of the VIPs are Proxy ARP.

    Questions and suggestions are welcome.  I would really like to get this pair up and running before my old FW finally craps out on me.

    -ct



  • While these two firewalls were going up and down from the crashes, I started getting reports that folks couldn't access a website that uses a Windows NLB and resides on vlan230.  There were three separate incidents where I happened to have these firewalls up and running with active CARP VIPs and this website became inaccessible.

    I don't understand it, because I only added CARP VIPs to lagg0_vlan3 and lagg1_vlan229, not to vlan230.  But I definitely think that the two bouncing firewalls caused the issue.  During the last incident, I immediately powered off the two firewalls, and the issue went away.

    The resource(s) sitting behind the Barracuda NLBs on the same vlan do not appear to have been affected.

    -ct