Many CARPs on many VLANs
I have a pair of identical pfSense 2.5 firewall/routers, set up with pfsync and CARP on WAN and LAN. It works fine.
But in my final configuration the firewall's LAN interface is plugged into a trunk port with 250 configured VLANs. Since a CARP vhid is strictly limited to an L2 segment, I have set up one CARP per VLAN in order for the firewall pair to be fully redundant. I could not figure out another way (IP aliases won't cut it since we need to cross subnet boundaries).
First thing: it's a bit tedious, but with jaredhendrickson13/pfsense-api I could (more or less reliably) automate the creation of the 250 VLANs + CARP.
Second thing: I now have 250 CARP heartbeats/s on my LAN interface. It's a bonded 2x10G link, so I'm not really worried performance-wise. But the net.inet.carp.preempt feature implies that if a single heartbeat on a single VLAN is missed, failover is triggered. And sending 250 heartbeats/s over the SAME physical interface gives roughly 250x more chances for that failover event to happen. It fails over fast and cleanly (<0.5s of downtime), but it's a bit creepy.
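To put a number on the "250x" intuition, here is a minimal sketch. It assumes each heartbeat is lost independently with the same probability, which is doubtful on a shared physical link (losses there tend to be correlated), and the loss probability used is purely illustrative, not a measurement.

```python
def p_any_missed(p_single: float, n_heartbeats: int) -> float:
    """Probability that at least one of n independent heartbeats is lost."""
    return 1.0 - (1.0 - p_single) ** n_heartbeats


# Assumed per-heartbeat loss probability (illustrative only, not measured).
p = 1e-6

one_vlan = p_any_missed(p, 1)
all_vlans = p_any_missed(p, 250)

print(f"1 VLAN:    {one_vlan:.3e} per second")
print(f"250 VLANs: {all_vlans:.3e} per second")
print(f"ratio:     {all_vlans / one_vlan:.1f}")
```

For small loss probabilities the ratio stays very close to 250, so the linear "250x" estimate holds under these assumptions.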
As a simple workaround I set the first VLAN's CARP to a 1s period and all the others to 5s, on the assumption that the first VLAN's heartbeat is sufficient to detect a physical failure of the LAN interface it sits on (all other VLANs will then be preempted). But I'm still roughly 50x more sensitive than I should be to a congested, dropped or late heartbeat packet.
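The "50x ballpark" follows from adding up the per-VLAN heartbeat rates. A small sketch of that arithmetic, where each list entry is one VLAN's heartbeat period in seconds (the function name is just for illustration):

```python
def heartbeats_per_second(periods_s):
    """Aggregate heartbeat rate for a list of per-VLAN periods (seconds)."""
    return sum(1.0 / period for period in periods_s)


uniform = [1] * 250        # every VLAN advertises once per second
mixed = [1] + [5] * 249    # first VLAN at 1s, the other 249 at 5s

print(f"uniform: {heartbeats_per_second(uniform):.1f} heartbeats/s")
print(f"mixed:   {heartbeats_per_second(mixed):.1f} heartbeats/s")
```

With the mixed scheme the total is 1 + 249/5, so the link still carries about 50 heartbeats per second instead of the single one that would be enough to watch the physical link.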
I could theoretically use a much slower heartbeat on the other 249 VLANs (I'm allowed up to 254s), but the advskew cannot be larger than 240/256 = 938ms, and I'm worried that the CARP heartbeat scheduling is not precise enough to guarantee that the timer drift between the two firewalls stays below 938/2 ≈ 470ms over 254s. Again, I sense we're heading toward a higher risk of missed heartbeats, which I don't like even if the final probability is still low.
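The timing worry above can be turned into a drift budget. This sketch only redoes the arithmetic from the paragraph (maximum advskew of 240, a 254s period, half of the skew as safety margin); it says nothing about how precise the CARP timers actually are.

```python
MAX_ADVSKEW = 240      # largest advskew value, as stated above
MAX_PERIOD_S = 254     # slowest heartbeat period, as stated above

# advskew is expressed in 1/256ths of a second.
skew_s = MAX_ADVSKEW / 256

# Keep half of the skew as margin between the two firewalls' timers.
budget_s = skew_s / 2

# Relative drift that would eat the whole margin within one period.
drift_ppm = budget_s / MAX_PERIOD_S * 1e6

print(f"max skew:     {skew_s * 1000:.1f} ms")
print(f"drift budget: {budget_s * 1000:.1f} ms per {MAX_PERIOD_S}s")
print(f"tolerance:    {drift_ppm:.0f} ppm")
```

Under these assumptions the two timers would have to diverge by well over a thousand parts per million within one period to use up the margin, so the remaining question is how much scheduling jitter (rather than clock drift) the heartbeat timers add.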
I'm also worried about the CARP heartbeat scheduling: are my 250 heartbeats/s evenly spread, or should I fear short bursts with a higher chance of brief congestion and dropped heartbeats? It's hard to analyse since I would have to tcpdump on 250 VLAN interfaces at once. If someone knows a trick... (can you put the trunk port in promiscuous mode and see the traffic of all VLANs together? PS: I don't have access to the switch management, so the solution has to be in pfSense)
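If heartbeat arrival timestamps can be collected somehow (for example from a capture on the parent interface, assuming it exposes the tagged frames, which is not verified here), the even-versus-bursty question comes down to counting heartbeats per short time window. A minimal sketch with synthetic timestamps in microseconds; the function name and the 10ms window are arbitrary choices for illustration:

```python
from collections import Counter


def peak_per_window(timestamps_us, window_us=10_000):
    """Return the largest number of heartbeats seen in any fixed window."""
    buckets = Counter(t // window_us for t in timestamps_us)
    return max(buckets.values()) if buckets else 0


# Synthetic data for one second of traffic (not a real capture):
# 250 heartbeats spread evenly, one every 4ms...
even = [i * 4_000 for i in range(250)]
# ...versus 250 heartbeats bunched into the first 25ms of the second.
bursty = [i * 100 for i in range(250)]

print("even peak:  ", peak_per_window(even))
print("bursty peak:", peak_per_window(bursty))
```

A peak close to the average (250 heartbeats/s is 2.5 per 10ms window) suggests an even spread, while a peak many times larger points to bursts.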
All things considered, I wonder if there's a better way to handle this scenario: the failure domain is the physical LAN link, but the failover state is spread across many VLANs. Can we bind one heartbeat to all of them (since VLANs on the same physical interface can only go down or up together)?
Thanks for any help or insight!
@zerodeux You could have a single transit link to a layer 3 switch and have it route your 250 VLANs.
All in all, an HA firewall with 250 interfaces is going to be work. It is also going to generate heartbeat traffic for all the first-hop redundancy VIPs. That is true for CARP, VRRP, or HSRP.