Adding CARP VIPs causes Pair to start Crashing
-
I currently manage 4 sets of pfSense firewalls in HA, and a number of standalone firewalls. I've been trying to get a pair set up at one site for a few months now and I just can't get it working right.
Basically, as soon as I start adding CARP VIPs to the pair, it becomes unstable and starts crashing.
For example, about an hour ago I added a single CARP VIP to vlan229 on lagg1, which replicated to the secondary without issues. To test stability I disabled CARP on the primary fw to force a failover, which succeeded; then I re-enabled CARP on the primary and the VIPs failed back with no problem. I then rebooted the secondary fw and it never came back up. After ten minutes I remotely power-cycled the server, and five minutes later the secondary fw was up and running. I then successfully rebooted the primary, followed by the secondary. Everything seemed stable.
Then I added another CARP VIP to vlan3 on lagg0, and as soon as the VIP replicated to the secondary fw, the secondary crashed. Approximately five minutes later, when the secondary finished booting, the primary crashed; as soon as the primary finished booting, the secondary crashed again.
This is technically the 3rd pair of physical servers I've tried to set this up on; one of them actually had a flaky NIC, which caused quite a few problems while trying to implement this config. So I think I've ruled out hardware issues on the FWs. pfSense just doesn't like something I'm trying to do. I've submitted a crap-ton of crash reports, and I can provide the hostnames of these fws to the devs if they PM me.
A high-level overview of the config is as follows:
2x Dell R410s with a 6-core Intel CPU, 16GB RAM, 500GB HDD.
2x Onboard Broadcom NICs, 2x Intel Quad-Port Adapters
pfSense v2.1.4 w/ CARP+IPAlias patch applied

/boot/loader.conf.local
kern.ipc.nmbclusters="131072"
hw.bce.tso_enable="0"
hw.pci.enable_msix="0"

igb7 = WAN
igb6 = lagg1
igb5 = lagg1
igb4 = Network1
igb3 = lagg0
igb2 = lagg0
igb1 = lagg0
igb0 = lagg0
bce0 = Network2
bce1 = pfSync

lagg0 is LACP, and has 10 vlans plus a network directly assigned to lagg0 (untagged traffic)
lagg0_untagged >----------< There are a few Windows NLBs on this subnet
lagg0_vlan3
lagg0_vlan12
lagg0_vlan16 >----------< Another FW Pair using CARP VHID 34
lagg0_vlan22
lagg0_vlan32
lagg0_vlan185
lagg0_vlan186
lagg0_vlan228
lagg0_vlan230 >----------< There are a few Windows NLBs, and a Pair of Barracudas on this subnet
lagg0_vlan320

lagg1 is LACP, and has 7 vlans plus a network directly assigned to lagg1 (untagged traffic)
lagg1_untagged
lagg1_vlan4
lagg1_vlan5
lagg1_vlan6
lagg1_vlan14
lagg1_vlan229
lagg1_vlan261
lagg1_vlan262

lagg0 is connected to a pair of Netgear GS728TS Switches (v1h1 B5.2.0.2 V5.3.0.17)
fw1 is connected to sw1 g1/g2/g3/g4
fw2 is connected to sw2 g1/g2/g3/g4

lagg1 is connected to a pair of Netgear GS752TS Switches (H00.00.01 B1.0.2.0 V5.1.0.2)
fw1 is connected to sw1/g1 and sw2/g1
fw2 is connected to sw1/g2 and sw2/g2
...
This pair is supposed to replace an aging pfSense fw at a data center (I inherited this stuff). The old FW is starting to become unstable. It is set up very similarly to the above pair, except that it does not have an HA interface, there is only a single port in lagg1, and all of the VIPs are ProxyARPs.
Questions and suggestions are welcome; I would really like to get this pair up and running before my old FW finally craps out on me.
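One thing that may be worth ruling out with a layout like the one above, where vlan16 already carries another pair on CARP VHID 34, is a VHID collision on a shared segment. CARP derives its virtual MAC from the VHID (00:00:5e:00:01:XX, the same range IPv4 VRRP uses), so two devices using the same VHID on one broadcast domain end up advertising the same MAC. Below is a minimal Python sketch of that check. Only the vlan16 / VHID 34 entry comes from this post; the other VHIDs, the owner labels, and the helper names are hypothetical placeholders, not the real values on these firewalls.

```python
from collections import defaultdict


def carp_virtual_mac(vhid: int) -> str:
    """Return the virtual MAC that CARP derives from a VHID (00:00:5e:00:01:XX)."""
    if not 1 <= vhid <= 255:
        raise ValueError(f"VHID must be in 1..255, got {vhid}")
    return f"00:00:5e:00:01:{vhid:02x}"


def find_vhid_collisions(entries):
    """Group (segment, owner, vhid) tuples and return the duplicated (segment, vhid) keys.

    The result maps (segment, vhid) -> list of owners, for keys with more than one owner.
    """
    seen = defaultdict(list)
    for segment, owner, vhid in entries:
        seen[(segment, vhid)].append(owner)
    return {key: owners for key, owners in seen.items() if len(owners) > 1}


# vlan16 / VHID 34 is taken from the layout in this post.
# Every other entry is a hypothetical placeholder for illustration.
vips = [
    ("lagg0_vlan16", "other-fw-pair", 34),
    ("lagg0_vlan16", "new-pair", 34),     # hypothetical clash
    ("lagg0_vlan3", "new-pair", 3),       # hypothetical
    ("lagg1_vlan229", "new-pair", 229),   # hypothetical
]

for (segment, vhid), owners in find_vhid_collisions(vips).items():
    print(f"{segment}: VHID {vhid} ({carp_virtual_mac(vhid)}) used by {', '.join(owners)}")
```

Filling the list with the real VHIDs from both pairs (and any VRRP IDs used by other gear on the same VLANs) would show whether any segment has two owners for one virtual MAC. This only checks for overlap; it does not by itself explain the crashes.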
-ct
-
While these two firewalls were going up and down from the crashes, I started getting reports that folks couldn't access a website that uses a Windows NLB and resides on vlan230. There were three separate incidents where I happened to have these firewalls up and running with active CARPs and this website became inaccessible.
I don't understand it, because I only added CARP VIPs to lagg0_vlan3 and lagg1_vlan229, not to vlan230. But I definitely think that the two bouncing firewalls caused the issue. During the last incident, I immediately powered off the two firewalls, and the issue went away.
The resource(s) sitting behind the Barracuda NLBs on the same vlan do not appear to have been affected.
-ct