Spontaneous Failover?

mcampbell

Under what circumstances would a CARP suddenly fail over to the other node, and stay there?

I have a two-node CARP cluster, and it just switched over to the slave even though the master was up and running and still accessible; I even got into its web interface just fine. The master was showing up as backup in the CARP status page, but as far as I could tell there was no reason for it. It just switched over, and the only way I could find to switch it back was to reboot the master node, even though everything else seemed fine.

Any ideas?

mcampbell

Since no one seemed interested in answering that question, I'll pose another one:

pfsense01 & pfsense02 are respectively in a master-backup configuration, with pfsync, xmlsync, CARP VIPs for LAN, WAN, & WAN2 (with respective configurations listed below), & a dedicated NIC for CARP traffic. Assuming that both are functioning properly, is pfsense01 ALWAYS going to be the master of the CARP VIPs, or can it switch to the other one without it necessarily meaning that there's something wrong with the two nodes?

| Interface | Setting | pfsense01 | pfsense02 |
|---|---|---|---|
| LAN | IP Address | 10.1.1.1/23 | 10.1.1.1/23 |
| LAN | VHID Group | 2 | 2 |
| LAN | Advertising Frequency Base | 1 | 1 |
| LAN | Advertising Frequency Skew | 0 | 100 |
| WAN | IP Address | 208.x.x.170/29 | 208.x.x.170/29 |
| WAN | VHID Group | 3 | 3 |
| WAN | Advertising Frequency Base | 1 | 1 |
| WAN | Advertising Frequency Skew | 0 | 100 |
| WAN2 | IP Address | 71.x.x.18/29 | 71.x.x.18/29 |
| WAN2 | VHID Group | 4 | 4 |
| WAN2 | Advertising Frequency Base | 1 | 1 |
| WAN2 | Advertising Frequency Skew | 0 | 100 |

jasonlitka

Did you ever figure this out? I'm having the same issue on my systems at the office. Every few days I'll notice that everything is running from the backup box. Disabling and enabling CARP on the first box, or simply rebooting the first box, will shift everything back over.

mattb253

What are the actual interface IPs?

mcampbell

I never did figure this out, and my workaround is the same as yours, Jason. But that's not a very good workaround.

Matt, my apologies, I didn't realize I put this up without the actual interface IPs. Here's a revised chart:

| Interface | Setting | pfsense01 | pfsense02 |
|---|---|---|---|
| LAN | CARP IP Address | 10.1.1.1/23 | 10.1.1.1/23 |
| LAN | IP Address | 10.1.1.2/23 | 10.1.1.3/23 |
| LAN | VHID Group | 2 | 2 |
| LAN | Advertising Frequency Base | 1 | 1 |
| LAN | Advertising Frequency Skew | 0 | 100 |
| WAN | CARP IP Address | 208.x.x.170/29 | 208.x.x.170/29 |
| WAN | IP Address | 208.x.x.171/29 | 208.x.x.172/29 |
| WAN | VHID Group | 3 | 3 |
| WAN | Advertising Frequency Base | 1 | 1 |
| WAN | Advertising Frequency Skew | 0 | 100 |
| WAN2 | CARP IP Address | 71.x.x.18/29 | 71.x.x.18/29 |
| WAN2 | IP Address | 71.x.x.19/29 | 71.x.x.20/29 |
| WAN2 | VHID Group | 4 | 4 |
| WAN2 | Advertising Frequency Base | 1 | 1 |
| WAN2 | Advertising Frequency Skew | 0 | 100 |

jasonlitka

I'm running about 25 IPs with CARP. All are set to 1/0 on the master and 1/100 on the backup, just as yours are.

I'm not showing any downtime in OpManager from either the pfSense boxes or the switches (a stacked pair of Dell 6248s), so I'm really not sure what is causing it. If there had been a momentary glitch, I'd have thought the IPs would switch back to the primary as soon as the glitch was over.
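For reference, the "1/0" and "1/100" pairs above are the advertising base and skew. As a rough sketch, assuming the usual CARP rule that a node advertises every base + skew/256 seconds, the two nodes' intervals work out as below. The function name `carp_advert_interval` is made up for illustration and is not part of pfSense.

```python
def carp_advert_interval(advbase: int, advskew: int) -> float:
    """Seconds between CARP advertisements.

    Assumption: interval = advbase + advskew / 256, i.e. the skew is
    expressed in 1/256ths of a second on top of the base.
    """
    return advbase + advskew / 256


# Settings from this thread: master is 1/0, backup is 1/100.
master_interval = carp_advert_interval(1, 0)
backup_interval = carp_advert_interval(1, 100)

print(f"master advertises every {master_interval:.3f} s")
print(f"backup advertises every {backup_interval:.3f} s")
```

With these settings the backup is only about 0.39 s "slower" than the master, so under this assumption the primary should normally win the role back once its advertisements are being seen again.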

mcampbell

I monitor my setup with Nagios, and it's not reported any glitches either. I've looked in the logs as well, but I've been unable to find anything (though it's possible I'm looking in the wrong spot, as there isn't a tab specifically for CARP).

I also would have thought it would switch back, but that's definitely not the observed behavior.

jasonlitka

I've spent the last hour or so digging through my syslog server (Kiwi sucks, BTW) and I found that last Friday morning (05:08:04) I had a ton of:

192.168.1.252 - kernel: wan_vip18: link state changed to DOWN
192.168.1.252 - kernel: wan_vip18: MASTER -> BACKUP (more frequent advertisement received)
192.168.1.253 - kernel: wan_vip18: link state changed to UP
192.168.1.253 - kernel: wan_vip18: BACKUP -> MASTER (preempting a slower master)

This happened within seconds of a cron job (this may be unrelated) that ran /usr/local/bin/vnstat -u at 05:08:00. Incidentally, that command doesn't actually exist. I'm going to remove it from cron to see if the issue goes away, but my hopes aren't high.
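Rather than scrolling through the syslog server by hand, transitions like the ones quoted above can be pulled out with a small script. This is only a sketch: it assumes the exported lines look exactly like the four shown in this post (`<host> - kernel: <vip>: OLD -> NEW (reason)`), and the names `CarpTransition` and `parse_carp_transitions` are invented for the example.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class CarpTransition(NamedTuple):
    host: str
    vip: str
    old_state: str
    new_state: str
    reason: Optional[str]


# Assumed line layout, taken from the log excerpt in this thread:
#   192.168.1.252 - kernel: wan_vip18: MASTER -> BACKUP (more frequent ...)
_TRANSITION = re.compile(
    r"^(?P<host>\S+) - kernel: (?P<vip>[^:\s]+): "
    r"(?P<old>MASTER|BACKUP|INIT) -> (?P<new>MASTER|BACKUP|INIT)"
    r"(?: \((?P<reason>.*)\))?\s*$"
)


def parse_carp_transitions(lines: Iterable[str]) -> List[CarpTransition]:
    """Return only the CARP state changes; link-state lines are skipped."""
    found = []
    for line in lines:
        m = _TRANSITION.match(line.strip())
        if m:
            found.append(
                CarpTransition(m["host"], m["vip"], m["old"], m["new"], m["reason"])
            )
    return found


sample = [
    "192.168.1.252 - kernel: wan_vip18: link state changed to DOWN",
    "192.168.1.252 - kernel: wan_vip18: MASTER -> BACKUP (more frequent advertisement received)",
    "192.168.1.253 - kernel: wan_vip18: link state changed to UP",
    "192.168.1.253 - kernel: wan_vip18: BACKUP -> MASTER (preempting a slower master)",
]

for t in parse_carp_transitions(sample):
    print(f"{t.host} {t.vip}: {t.old_state} -> {t.new_state} ({t.reason})")
```

Running it over a full export and sorting by timestamp (not captured in this sketch, since the excerpt doesn't show where the timestamp sits in the line) would show whether every VIP flips at the same moment.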

mcampbell

Good luck, Jason. How often does it happen to you? It's been really sporadic for me (1-2 times a month, but seemingly always at the worst possible time), so I don't even have any recent logs to look at right now.

I looked in /etc/crontab on pfsense01, and didn't see anything like that entry in mine. This is the entirety of my crontab:

0       *       *       *       *       root    /usr/bin/nice -n20 newsyslog
1,31    0-5     *       *       *       root    /usr/bin/nice -n20 adjkerntz -a
1       3       1       *       *       root    /usr/bin/nice -n20 /etc/rc.update_bogons.sh
*/60    *       *       *       *       root    /usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 sshlockout
1       1       *       *       *       root    /usr/bin/nice -n20 /etc/rc.dyndns.update
*/60    *       *       *       *       root    /usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 virusprot
30      12      *       *       *       root    /usr/bin/nice -n20 /etc/rc.update_urltables
0       */24    *       *       *       root    /etc/rc.backup_rrd.sh
0       */24    *       *       *       root    /etc/rc.backup_dhcpleases.sh

jasonlitka

About the same. A couple of times per month. I typically don't notice until someone comes to me and says something isn't working. Usually it's an IPSec tunnel (when you change something on the master it replicates to the backup, but without restarting the services the config isn't applied).

mcampbell

Heh, that is precisely how I find out too :) In my case, the first time it happened, I did see problems with our IPSec Site-to-Site VPN connection, but after that first time, it's worked fine in subsequent failovers. After that, it's historically been PPTP connections not being able to connect outside of the office (e.g., VPN user connects to the office, can connect to anything they want in the office, but then can't connect to anything outside of the office). Everything else works fine, and even people inside the office don't notice any problems, just VPN users.

I have been working on integrating a Nagios NRPE plugin I found that will tell me whether or not pfsense01 is the master in the CARP layout, hopefully giving me the edge in tracking this mug down.
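The plugin I found isn't reproduced here, but the idea behind such a check can be sketched roughly as follows. It assumes the firewall's `ifconfig` output contains lines of the form `carp: MASTER vhid 3 advbase 1 advskew 0`, and it uses the standard Nagios exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). The function names are hypothetical, and how the `ifconfig` text gets from the firewall to the check is left out.

```python
import re
from typing import Dict, Tuple

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3

# Assumed ifconfig line format: "carp: MASTER vhid 3 advbase 1 advskew 0"
_CARP_LINE = re.compile(r"carp: (MASTER|BACKUP|INIT) vhid (\d+)")


def carp_states(ifconfig_text: str) -> Dict[int, str]:
    """Map each VHID found in the ifconfig output to its CARP state."""
    return {int(vhid): state for state, vhid in _CARP_LINE.findall(ifconfig_text)}


def check_carp_master(ifconfig_text: str) -> Tuple[int, str]:
    """Return (exit_code, message); CRITICAL if any VHID is not MASTER."""
    states = carp_states(ifconfig_text)
    if not states:
        return UNKNOWN, "CARP UNKNOWN: no carp lines found in ifconfig output"
    not_master = {v: s for v, s in states.items() if s != "MASTER"}
    if not_master:
        detail = ", ".join(f"vhid {v} is {s}" for v, s in sorted(not_master.items()))
        return CRITICAL, f"CARP CRITICAL: {detail}"
    return OK, f"CARP OK: all {len(states)} VHIDs are MASTER"


# Illustrative input only; the VHIDs match the three VIPs in this thread.
example = """\
    carp: MASTER vhid 2 advbase 1 advskew 0
    carp: BACKUP vhid 3 advbase 1 advskew 0
    carp: MASTER vhid 4 advbase 1 advskew 0
"""
code, message = check_carp_master(example)
print(code, message)
```

Run against pfsense01, a check like this would go CRITICAL the moment any VIP drops to BACKUP, which at least gives a timestamp to line up against the logs.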

jasonlitka

How many CARP IPs do you have set up? Just the 3 in your original post?

On the assumption that the volume of CARP traffic might be causing issues, I trimmed my config down from 58 CARP IPs to 6 CARP IPs plus 52 IP Aliases. I'm not sure if this will make a difference, but it sure speeds up the failover from one node to the other.

mcampbell

Yeah, just the 3 (+ the dedicated CARP interface that I forgot to mention).

58 CARP IPs? Wow, it's good to know that pfSense can handle that many. I can see why it would speed that up; I've gathered that a comparatively large amount of stuff takes place in the handoff between nodes, and doing it 58x no doubt takes a bit of time. With just my 3, I've not seen any noticeable delay. Heck, were it not for the issues I mentioned with the PPTP server, I might never have noticed it switched (I didn't have any noticeable downtime when it switched for no apparent reason).

mcampbell

Is there anyone else who might have had any luck or insight into this issue? Months later, and I'm still not much closer to solving it than when I first posted the topic. I've ended up rebooting pfsense01 anywhere from once every couple of weeks to a couple of times a week. I'd really be interested in anyone's theories on what the problem might be.

jasonlitka

My problem went away when I switched the bulk of the IPs over from CARP to IP Aliases on a single CARP per interface. The failover, when necessary, happens much faster as well.

mcampbell

Interesting. I'm glad you got your problem solved. But in my case, I've only got one IP assigned per CARP interface, with the CARP IPs already virtualized by necessity, so I'm essentially in the same spot as you (minus 50 IP addresses).

It seems like we had all the same symptoms, but a different (but probably related) cause. I guess you never did figure out what caused it?

jasonlitka

Have you tried different cables, NICs, and/or switches?

mcampbell

It's a fair point, but which one do I change? When it does its failover, they all fail over, even though they're all on different cables, NICs, & switches. (For the record, this is a production system, so I can't just go mucking about doing trial-n-error steps.) The dedicated CARP interface is only a crossover cable going between the two interfaces, so no switch to worry about.

The NICs are far more challenging to change out though, as each node is a Mini-ITX setup with 5 onboard NICs.

jasonlitka

You can probably swap out the cables without anyone noticing. Do the backup box first, then disable CARP on the primary and change those too.

If your NICs are all built in then I'd probably go to the switch next. You may just have to declare a maintenance window on that one.