Spontaneous Failover?
-
I've spent the last hour or so digging through my syslog server (Kiwi sucks, BTW) and I found that last Friday morning (05:08:04) I had a ton of:
192.168.1.252 - kernel: wan_vip18: link state changed to DOWN
192.168.1.252 - kernel: wan_vip18: MASTER -> BACKUP (more frequent advertisement received)
192.168.1.253 - kernel: wan_vip18: link state changed to UP
192.168.1.253 - kernel: wan_vip18: BACKUP -> MASTER (preempting a slower master)
This happened within seconds of a cron job (and this may be unrelated) that ran /usr/local/bin/vnstat -u at 05:08:00. Incidentally, that command doesn't actually exist on the box. I'm going to remove it from cron to see if the issue goes away, but my hopes aren't high.
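For anyone who wants to double-check their own box, a couple of plain FreeBSD commands are enough to confirm whether the binary is really missing and where the cron entry comes from (nothing pfSense-specific here):

ls -l /usr/local/bin/vnstat   # does the file the cron job points at exist at all?
which vnstat                  # is it anywhere else in the PATH?
grep vnstat /etc/crontab      # where is the entry actually defined?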
-
Good luck, Jason. How often does it happen to you? It's been really sporadic for me (1-2 times/month), but seemingly always at the worst possible time, so I don't even have any recent logs to look at right now.
I looked in /etc/crontab on pfsense01, and didn't see anything like that entry in mine. This is the entirety of my crontab:
0 * * * * root /usr/bin/nice -n20 newsyslog
1,31 0-5 * * * root /usr/bin/nice -n20 adjkerntz -a
1 3 1 * * root /usr/bin/nice -n20 /etc/rc.update_bogons.sh
*/60 * * * * root /usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 sshlockout
1 1 * * * root /usr/bin/nice -n20 /etc/rc.dyndns.update
*/60 * * * * root /usr/bin/nice -n20 /usr/local/sbin/expiretable -v -t 3600 virusprot
30 12 * * * root /usr/bin/nice -n20 /etc/rc.update_urltables
0 */24 * * * root /etc/rc.backup_rrd.sh
0 */24 * * * root /etc/rc.backup_dhcpleases.sh
-
About the same, a couple of times per month. I typically don't notice until someone comes to me and says something isn't working. Usually it's an IPSec tunnel (when you change something on the master it replicates to the backup, but the config isn't applied until the services are restarted).
-
Heh, that is precisely how I find out too :) In my case, the first time it happened I did see problems with our IPSec site-to-site VPN connection, but it's worked fine in subsequent failovers. Since then, it's historically been PPTP connections not being able to reach anything outside of the office (e.g., a VPN user connects to the office and can reach anything they want inside it, but can't connect to anything outside of it). Everything else works fine, and even people inside the office don't notice any problems; it's just VPN users.
I have been working on integrating a Nagios NRPE plugin I found that will tell me whether or not pfsense01 is the master in the CARP layout, hopefully giving me the edge in tracking this mug down.
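The check itself doesn't need to be anything fancy. A bare-bones sketch (not the actual plugin, and it assumes ifconfig on the firewall prints "carp: MASTER ..." status lines for the configured VIPs) would look roughly like this:

#!/bin/sh
# Rough NRPE-style check: is this node currently CARP MASTER on any vhid?
if ifconfig | grep -q "carp: MASTER"; then
    echo "CARP OK - this node is MASTER"
    exit 0          # Nagios OK
else
    echo "CARP CRITICAL - this node is not MASTER"
    exit 2          # Nagios CRITICAL
fi

Run something like that via NRPE on pfsense01 and Nagios should flag the moment the box drops to BACKUP.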
-
How many CARP IPs do you have set up? Just the 3 in your original post?
Under the assumption that maybe the volume of CARP traffic was causing issues, I trimmed my config down from 58 CARP IPs to 6 CARP VIPs + 52 IP Aliases. I'm not sure yet whether it will make a difference for the spontaneous failovers, but it sure speeds up the hand-off from one node to the other.
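The speed-up makes sense from CARP's point of view: every CARP VIP is its own vhid sending its own advertisements, while an IP Alias just rides on its parent CARP VIP and fails over with it. If you're curious how many vhids a box is actually advertising (again assuming a version where ifconfig shows "carp:" status lines), a one-liner shows it:

ifconfig | grep -c "carp:"   # each matching line is a vhid with its own advertisement stream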
-
Yeah, just the 3 (+ the dedicated CARP interface that I forgot to mention).
58 CARP IPs? Wow, it's good to know that pfSense can handle that many. I can see why trimming them would speed things up; I've gathered that a comparatively large amount of work takes place in the handoff between nodes, and doing it 58 times over no doubt takes a while. With just my 3, I've not seen any noticeable delay. Heck, were it not for the issues I mentioned with the PPTP server, I might never have noticed it switched (there wasn't any noticeable downtime when it failed over for no apparent reason).
-
Is there anyone else who might have had any luck or insight into this issue? Months later, I'm still not much closer to solving it than when I first posted the topic. I've ended up rebooting pfsense01 anywhere from once every couple of weeks to a couple of times a week. I'd really be interested in anyone's theories on what the problem might be.
-
My problem went away when I switched the bulk of the IPs over from CARP to IP Aliases riding on a single CARP VIP per interface. The failover, when it's needed, happens much faster as well.
-
Interesting; I'm glad you got your problem solved in your case. In mine, though, I've only got one IP assigned per CARP interface, with the CARP IPs already virtualized by necessity, so I'm essentially in the same setup as you (minus 50 IP addresses).
It seems like we had all the same symptoms but a different (though probably related) cause. I guess you never did figure out what was actually causing it?
-
Have you tried different cables, NICs, and/or switches?
-
It's a fair point, but which one do I change? When it does its failover, they all fail over, even though they're on different cables, NICs, and switches. (For the record, this is a production system, so I can't just go mucking about with trial-and-error steps.) The dedicated CARP interface is just a crossover cable running directly between the two boxes, so there's no switch to worry about there.
The NICs are far more challenging to swap out, though, as each node is a mini-ITX build with 5 onboard NICs.
-
You can probably swap out the cables without anyone noticing. Do the backup box first, then disable CARP on the primary and change those too.
If your NICs are all built in, then I'd probably go to the switch next. You may just have to declare a maintenance window for that one.