CARP is sensitive to in/out errors: a few errors on a backup box's LAN NIC can take down a WAN ISP
-
This post is the short version of a long story; it might save you the time I lost.
Putting the end at the beginning: CARP is highly sensitive. Even a few in/out errors on any interface hosting a CARP VIP (such as the LAN interface on a backup box, while the master/primary box is working normally) can bring down otherwise unrelated and healthy WAN gateways, halting all internet traffic via that ISP.
If pfSense reports any problems with a gateway connected via a NIC that hosts a CARP VIP, or the CARP display reports any abnormalities, check the 'interfaces' display on both the primary and backup pfSense boxes. If there are ANY in/out errors on ANY CARP-enabled interface, solve those first, even if the interface with the few in/out errors seems to have nothing to do with the ISP gateway being down. Then reset the ARP tables in any ISP's gateway.
The shaggy dog story starts here, but you've already read the good stuff.
The setup is: any number of ISPs, one of which is 'WAN' in pfSense jargon, each with its own NIC. Two pfSense boxes, each with a LAN NIC and a pfsync NIC. Pretty much a duplicate of the 'typical' two-box pfSense setup shown in the diagrams, except with more than one ISP on the WAN side and a few different LANs via separate NICs.
CARP on the LAN side for the private/protected gateway, static IPv4 addresses for each of the WAN NICs, and several WAN IPv4 addresses that are CARP VIPs shared between the two boxes for failover / HA.
The setup works for months, even years, without a blip. One day, apropos of no known hardware or software changes, the dashboard reports one of the ISP gateways is 'down'. Checking shows the WAN gateway is up, but strangely neither the primary pfSense box's WAN NIC IP nor any of the WAN virtual IPs are pingable from the internet. The CARP failover status shows no problems, with the primary box being 'master' and the failover box 'backup'. The master and backup pfSense boxes can both ping the cable modem/router. Other boxes directly connected to the internet via the same cable modem/router, using IPs not associated with pfSense, work normally, outbound and inbound.
The target IP address used to test whether the gateway is 'up' is reachable from all the other ISPs, and is reachable through the suspect ISP via the static IPs not associated with pfSense.
So far, everything should be working, except it isn't. I did many further tests along the lines of the above, and all reported normal results.
I called the ISP. They reported that the ARP table in the cable modem/router listed the MAC address of the master pfSense box's WAN NIC against the first virtual IP address, and listed the primary pfSense box's virtual CARP MAC address against the IP of the primary NIC.
So the two were swapped in the router's ARP table, though correct in both pfSense boxes' ARP tables.
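For anyone wanting to spot this kind of swap quickly, here is a rough Python sketch of the comparison. It relies on CARP's virtual MAC for an IPv4 VIP having the form 00:00:5e:00:01:XX, where XX is the VHID in hex; please verify that against your own boxes. The function names, the IP addresses (documentation range) and the physical MAC are all made up for illustration, and you would have to fill in the router's table by hand from whatever the ISP tells you.

```python
def carp_virtual_mac(vhid):
    """Virtual MAC expected for a CARP VIP with this VHID (assumed format)."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be between 1 and 255")
    return "00:00:5e:00:01:%02x" % vhid


def find_arp_mismatches(seen, expected):
    """Compare two IP -> MAC mappings; return {ip: (expected_mac, seen_mac)}."""
    mismatches = {}
    for ip, want in expected.items():
        got = seen.get(ip)
        # Compare case-insensitively, since tools print MACs in either case.
        if got is None or got.lower() != want.lower():
            mismatches[ip] = (want, got)
    return mismatches


# Hypothetical values: .2 is the master's WAN NIC address, .10 is a CARP VIP.
PHYSICAL_MAC = "00:11:22:33:44:55"
expected = {
    "203.0.113.2": PHYSICAL_MAC,
    "203.0.113.10": carp_virtual_mac(1),
}

# What the ISP read out of the router's ARP table: the two MACs swapped.
router_table = {
    "203.0.113.2": carp_virtual_mac(1),
    "203.0.113.10": PHYSICAL_MAC,
}

for ip, (want, got) in sorted(find_arp_mismatches(router_table, expected).items()):
    print("%s: expected %s, router has %s" % (ip, want, got))
```

With the swapped table above, both addresses are flagged; with a correct table nothing is printed.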
Power cycling the cable modem didn't help. Having the ISP reset the cable modem's ARP table without power cycling did help, for a while: normal operations resumed. Then all the symptoms came back.
A puzzle. Much thrashing.
In the end, the trigger was a LAN NIC on the backup box that was occasionally dropping just a few packets. Apparently in the pfSense world, if one CARP interface has a problem, it puts all of them into some sort of flapping condition. That creates a cascade effect that sends out so many ARP packets on the WAN side that some cable modems end up with confused ARP tables, halting all traffic. Replacing the LAN NIC on the backup box, so that the 'interfaces' screen showed 0 in/out errors on all the CARP-enabled interfaces, restored normal operation of the ISP's cable modem. The other ISPs were not sensitive to the CARP 'flapping'; their performance was unaffected.
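If you'd rather check the error counters from a shell than from the 'interfaces' screen, something like the Python sketch below could flag them. It assumes the text layout that FreeBSD's `netstat -i` prints (columns Name, Mtu, Network, Address, Ipkts, Ierrs, Idrop, Opkts, Oerrs, Coll), so check the header on your own version first. The sample output is invented, and the function name is just for illustration.

```python
def interfaces_with_errors(netstat_output):
    """Return {interface: (input_errors, output_errors)} where either is nonzero.

    Assumes FreeBSD-style `netstat -i` text: the last six columns are
    Ipkts, Ierrs, Idrop, Opkts, Oerrs, Coll.
    """
    bad = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only the per-link rows carry error counters; address rows show "-".
        if len(fields) < 8 or not fields[2].startswith("<Link#"):
            continue
        name = fields[0].rstrip("*")  # a trailing "*" marks a down interface
        try:
            # Count from the right so a missing Address column does not matter.
            ierrs = int(fields[-5])
            oerrs = int(fields[-2])
        except ValueError:
            continue
        if ierrs or oerrs:
            bad[name] = (ierrs, oerrs)
    return bad


# Invented sample in the assumed layout; em1 plays the flaky LAN NIC.
SAMPLE = """\
Name    Mtu Network       Address              Ipkts Ierrs Idrop    Opkts Oerrs  Coll
em0    1500 <Link#1>      00:11:22:33:44:55   123456     0     0   654321     0     0
em0       - 203.0.113.0/2 203.0.113.2           1000     -     -     2000     -     -
em1    1500 <Link#2>      00:11:22:33:44:66    98765    12     0    87654     3     0
lo0   16384 <Link#3>      lo0                    100     0     0      100     0     0
"""

for name, (ierrs, oerrs) in sorted(interfaces_with_errors(SAMPLE).items()):
    print("%s: %d in errors, %d out errors" % (name, ierrs, oerrs))
```

Run it against the output from both the primary and the backup box; any interface it lists that also hosts a CARP VIP is worth a look.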
So, the moral of the story: if pfSense reports any problems with a gateway connected via a NIC that hosts a CARP VIP, or the CARP display reports any abnormalities, check the 'interfaces' display on both the primary and backup pfSense boxes. If there are ANY in/out errors on ANY CARP-enabled interface, solve those first, even if the interface with the problem has nothing to do with the ISP. Then reset the ARP tables in any ISP's gateway.
The above is my little contribution to a great project.