CARP issues in new setup - two MASTERs and dropped CARP packets.

Ashmodai

Hi there!

I was hoping someone could shed some light on a little problem I've run into while setting up my first set of pfSense firewalls. These are slated to replace a few bare-bones, no-GUI hardened Gentoo Linux boxes running iptables. I got them up and running with no problem and am looking to put them into production, but my confidence has been shaken by one small issue: unless I set up an interface-specific or floating firewall rule to explicitly allow CARP traffic, the BACKUP server goes to MASTER and stays there, so both nodes are MASTER at the same time.

When I originally set this up, everything seemed fine and worked great, but after a week of doing other things I have come back to these firewalls in an odd state.

CARP works correctly again if I add:
pass in quick on em1 inet proto carp from any to 224.0.0.0/8 keep state label "USER_RULE: CARP VRRP"
pass in quick on em0 inet proto carp from any to 224.0.0.0/8 keep state label "USER_RULE: CARP VRRP"
pass in quick on em2 inet proto carp from any to 224.0.0.0/8 keep state label "USER_RULE: CARP VRRP"
pass in quick on em3 inet proto carp from any to 224.0.0.0/8 keep state label "USER_RULE: CARP VRRP"

(This is a floating rule on all of the interfaces, but a single rule on the "em1" interface is enough to make it work for this particular test VIP - I created a floating rule in order to cover my bases since other interfaces will have VRRP)

As I mentioned earlier, though, when I first set these up and created the rulesets & a few virtual CARP test IPs, it worked exactly as expected with none of these custom floating rules. Examining the output of 'pfctl -sa' shows this at the very top:

pass quick proto carp all keep state
pass quick proto pfsync all keep state

I would think that would be adequate to make this work. Why do I all of a sudden need to add specific rules, and why did it stop working? I am hesitant to deploy any system that exhibits behaviour that is out of the ordinary or that I don't understand, and this definitely fits the bill in my eyes!

If I remove the floating rule and "break" CARP (both systems now think they're master), the following "block" entries are in the logs on the slave upon initiating a failover:

Apr 8 13:26:22 EXTERNAL XXX.XXX.XXX.231 224.0.0.22 IGMP
Apr 8 13:26:21 EXTERNAL XXX.XXX.XXX.231 224.0.0.22 IGMP
Apr 8 13:26:19 EXTERNAL XXX.XXX.XXX.231 224.0.0.22 IGMP
Apr 8 13:26:18 EXTERNAL XXX.XXX.XXX.231 224.0.0.22 IGMP

When I click the "X" to discover why it was blocked, I get the default deny rule:

@1 Block drop in log all label "Default deny rule"

As an extra piece of weird, if you leave it alone, the backup system eventually -does- go back into BACKUP state.

Any clues, hints or questions?

podilarius

Are you dedicating an interface for pfsync? If not, I would do so. Then you only need to add an allow all rule to just that interface with a crossover cable between the 2 FWs.

Ashmodai

Actually it's a bit more complex, but regardless, it was my understanding that CARP operates on every interface that has virtual IPs on it (so it can detect link failures as well as system failures). pfsync isn't the issue, CARP is. As far as I know, pfsync is working fine, since there are no firewall blocks in the logs for that particular protocol.

Thanks for your help, though.

podilarius

I have run that in 2 different locations and never had to create rules for VRRP packets. It is strange that you would have to do that.

cmb

IGMP is not CARP and has no direct relation to CARP. The rules to permit CARP are added automatically. User-defined floating rules can block CARP, which will cause the described scenario. It won't be the traffic you're logging as blocked there. If you're not seeing any blocked CARP in the firewall log, it's likely because you have a floating block rule, without logging enabled, that is blocking that traffic.

It's also possible it's not getting blocked, but rather not getting to the other host at all, if network connectivity between the firewalls is broken.
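To make the IGMP vs. CARP distinction concrete, here is a minimal Python sketch that labels a log entry by destination address and protocol. It relies on well-known values (CARP shares IP protocol 112 and the 224.0.0.18 group with VRRP; IGMP is IP protocol 2, and IGMPv3 membership reports go to 224.0.0.22). The `classify` helper is hypothetical and only for illustration; it is not part of pfSense.

```python
import ipaddress

# Well-known link-local multicast groups.
CARP_GROUP = ipaddress.ip_address("224.0.0.18")    # CARP/VRRP advertisements (IP proto 112)
IGMPV3_GROUP = ipaddress.ip_address("224.0.0.22")  # IGMPv3 membership reports (IP proto 2)


def classify(dst: str, proto: str) -> str:
    """Label a firewall log entry (hypothetical helper, for illustration only)."""
    addr = ipaddress.ip_address(dst)
    name = proto.upper()
    if name == "CARP" and addr == CARP_GROUP:
        return "CARP advertisement"
    if name == "IGMP" and addr == IGMPV3_GROUP:
        return "IGMPv3 membership report"
    return "other"


# The blocked entries quoted earlier in the thread: destination 224.0.0.22, protocol IGMP.
print(classify("224.0.0.22", "IGMP"))
# What a CARP advertisement would look like if it showed up in the log.
print(classify("224.0.0.18", "CARP"))
```

Under those assumptions, the entries hitting the default deny rule are IGMP membership reports, not CARP advertisements, so they do not show CARP itself being blocked by pf.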

Ashmodai

This has been resolved. I'm leaving this here because it didn't seem like anyone else had run into similar issues; I've certainly never experienced this with Cisco/Juniper/HP kit.

The culprit was IGMP snooping. This location uses Allied Telesis AT-9000/52 switches, which by default have IGMP snooping enabled on all ports in single-host-per-port mode (contrary to the manual and the configuration, actually, which indicate the opposite).

Turning off (or properly configuring) IGMP snooping on the switches allows the CARP packets to flow freely to their destination.

podilarius

Thank you for the update.

Ashmodai

@podilarius:

Thank you for the update.

Sure. It does make you wonder: why does it (sometimes) work when you explicitly permit all IGMP in, if the problem is at layer 2?

I believe that the multicast "join IGMP group" and "leave IGMP group" messages must be interpreted somehow by the CARP peer causing it to properly fall into backup state. These particular switches seem to 'lock' the particular port(s) out of an IGMP group for 260 seconds (default) following an IGMP group membership LEAVE request. Very bizarre. 5 minutes later or so they start forwarding the packets again.
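For what it's worth, 260 seconds matches the default Group Membership Interval in the IGMP specification (RFC 2236 / RFC 3376): Robustness Variable x Query Interval + Query Response Interval. Whether these switches actually derive their timeout from that formula is an assumption on my part; the sketch below only shows the arithmetic with the RFC defaults.

```python
# Default timer values from the IGMP specification (RFC 2236 / RFC 3376).
ROBUSTNESS_VARIABLE = 2
QUERY_INTERVAL_S = 125
QUERY_RESPONSE_INTERVAL_S = 10


def group_membership_interval(robustness: int, query_interval_s: int,
                              query_response_interval_s: int) -> int:
    """Seconds before a group with no reports is considered to have no members."""
    return robustness * query_interval_s + query_response_interval_s


timeout_s = group_membership_interval(
    ROBUSTNESS_VARIABLE, QUERY_INTERVAL_S, QUERY_RESPONSE_INTERVAL_S
)
print(timeout_s)       # seconds
print(timeout_s / 60)  # minutes
```

That works out to a bit over four minutes, which is in the same ballpark as the delay before the switches start forwarding again.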

Anyways, IGMP snooping isn't really a useful addition to this network so it can stay off. :)