CARP Failures



  • I am having an issue with our CARP implementation and am hoping to discuss it here to make some progress.

    Some background:  Our network is a large wireless network using licensed microwave radio links to connect a number of communications tower sites.  At each tower we serve various services, including Internet to rural homes and commercial facilities (e.g. natural gas plants), 2-way radio voice services, and hosted services using customer-owned microwave gear for last-mile links to their own facilities.  To keep our network running as much as possible I have redundant microwave links to some sites, creating loops, as well as 2 pfSense installs running CARP and connected to our fiber.  As much as I want full redundancy, there are still single points of failure at a couple of points in our network.

    The basic network layout looks like this:

    I started off using STP, but the higher-than-wireline latency of the microwave links (0.5-1.5 ms) caused too many issues, so we moved to the proprietary protocol from Extreme Networks: EAPS, Ethernet Automatic Protection Switching.  This works really well for our network, with very quick failover times.

    Now the issue: any number of events can cause one of the loops in our network to open, e.g. microwave link failure, radio reboot, interface failure, interface disconnect, etc.  As with STP, they aren't really loops, since one loop interface of the EAPS Master is always blocking all traffic other than its control packets.  These packets are sent on dedicated VLANs (one per loop) that carry only these control packets, no IP traffic, and are only configured on the loop interfaces (I have quadruple checked this).  When a loop opens or closes (I have tested with the large loop to the right and the smaller loop on the bottom left), pfSense appears to receive CARP advertisements and changes its state.  This only happens to 2 VIPs out of 19 (VIPs 2 and 9).  When it happens, the VLANs on those 2 VIPs lose connectivity through the router(s).  It doesn't happen every time a loop changes state; it seems like about 3 out of 5 times.  The System Logs on the CARP Master show only this (from a failure this morning: the loop was restored, failed, and was restored again):

    Jul 11 08:52:39 	dnsmasq[26849]: read /etc/hosts - 28 addresses
    Jul 11 08:50:31 	dnsmasq[26849]: read /etc/hosts - 28 addresses
    Jul 11 08:49:38 	kernel: vip2: link state changed to UP
    Jul 11 08:49:38 	kernel: vip9: link state changed to UP
    Jul 11 08:49:35 	kernel: vip9: link state changed to DOWN
    Jul 11 08:49:35 	kernel: vip9: MASTER -> BACKUP (more frequent advertisement received)
    Jul 11 08:49:35 	kernel: vip2: link state changed to DOWN
    Jul 11 08:49:35 	kernel: vip2: MASTER -> BACKUP (more frequent advertisement received)
    Jul 11 08:47:31 	dnsmasq[26849]: read /etc/hosts - 28 addresses
    Jul 11 08:47:15 	dnsmasq[26849]: read /etc/hosts - 28 addresses
    Jul 11 08:47:09 	dnsmasq[26849]: read /etc/hosts - 28 addresses
    

    I see these same messages every time the routing to these 2 VIPs fails, and no entries are generated on the Backup.  During the failure, the CARP status page still shows Master for every entry on the Master and Backup for every entry on the Backup.  When routing is restored I do not see any related log entries.  Routing is restored in one of three ways: sometimes when a loop changes state again, sometimes only after a reboot, and sometimes only after both the CARP Master and Backup have been rebooted.
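For anyone who wants to tally these events across a longer log, here is a minimal sketch in Python. It assumes the kernel line format shown in the log above; the sample lines are copied from that log, and the names `carp_transitions` and `demotions_per_vip` are made up for illustration, not part of pfSense.

```python
import re
from collections import defaultdict

# Sample lines copied from the log above (newest first, as pfSense shows them).
LOG = """\
Jul 11 08:49:38 	kernel: vip2: link state changed to UP
Jul 11 08:49:38 	kernel: vip9: link state changed to UP
Jul 11 08:49:35 	kernel: vip9: link state changed to DOWN
Jul 11 08:49:35 	kernel: vip9: MASTER -> BACKUP (more frequent advertisement received)
Jul 11 08:49:35 	kernel: vip2: link state changed to DOWN
Jul 11 08:49:35 	kernel: vip2: MASTER -> BACKUP (more frequent advertisement received)
"""

# Matches only CARP state transitions, e.g. "vip9: MASTER -> BACKUP (reason)".
TRANSITION = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+kernel:\s+"
    r"(?P<vip>vip\d+):\s+(?P<old>[A-Z]+) -> (?P<new>[A-Z]+)\s+\((?P<reason>[^)]*)\)"
)


def carp_transitions(text):
    """Return (timestamp, vip, old_state, new_state, reason) for each transition line."""
    events = []
    for line in text.splitlines():
        m = TRANSITION.match(line.strip())
        if m:
            events.append((m["ts"], m["vip"], m["old"], m["new"], m["reason"]))
    return events


def demotions_per_vip(events):
    """Count MASTER -> BACKUP transitions per VIP."""
    counts = defaultdict(int)
    for _ts, vip, old, new, _reason in events:
        if old == "MASTER" and new == "BACKUP":
            counts[vip] += 1
    return dict(counts)


if __name__ == "__main__":
    for vip, n in sorted(demotions_per_vip(carp_transitions(LOG)).items()):
        print(vip, n)
```

Run against a full system log export, this would show whether any VIP other than 2 and 9 ever logs a demotion, and how the timestamps line up with loop state changes.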

    My initial thought was that somehow the Extreme control packets were getting to the pfSense boxes and messing with CARP (VRRP), but I have checked the VLAN configs and I can't see that being possible, so I am now looking at pfSense.  It's almost as if the Master is seeing its own advertisements, even though it only ever happens to the same 2 VIPs.  The advertising frequencies are Master Base 1 Skew 0, Backup Base 2 Skew 100 for every VIP, including the 2 that are affected.
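As a quick check on those timing values, here is a small Python sketch. It assumes the usual CARP convention that a host advertises every advbase + advskew/256 seconds; the function name `advert_interval` is just for illustration.

```python
from fractions import Fraction


def advert_interval(advbase, advskew):
    """Seconds between advertisements, assuming interval = advbase + advskew/256."""
    return Fraction(advbase) + Fraction(advskew, 256)


master = advert_interval(1, 0)    # Master: Base 1, Skew 0
backup = advert_interval(2, 100)  # Backup: Base 2, Skew 100

if __name__ == "__main__":
    print("master interval (s):", float(master))
    print("backup interval (s):", float(backup))
    print("backup advertises less often than master:", backup > master)
```

On this arithmetic alone, an advertisement carrying the Backup's values would be less frequent than the Master's own, so it would not be expected to demote the Master. That fits the suspicion that the advertisement being received carries the Master's own values, although how a tie is handled depends on the CARP implementation.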

    Both pfSense boxes are running 2.0-RC3  (i386) built on Thu Jun 23 13:05:26 EDT 2011 - I have experienced this with each of the RCs I have tried.

    So, I am looking for input from anyone who may be able to help or tell me what to look at and where.  Thanks in advance.

    Aaron

