Preempting a slower master, but why?



  • Now that I finally have CARP notifications working, I'm trying to figure out why the backup keeps taking over. It happened again this morning. I got an email from both systems saying they were changing states.

    In the log for the backup it mentions it's preempting a slower master. Can someone tell me why this is happening, or point me in a direction to find out whats going on? The systems are not underpowered, and CPU utilization on the master is almost always below 10%.

    EDIT: So, in the span of 2 minutes it just flipped back and forth 30+ times. We've turned the backup off until we can figure out why this is happening.


  • Rebel Alliance Developer Netgate

    That is almost always a layer 2 issue - switches.

    If the transitions are only seen on the backup, it must be failing to see the heartbeats from the master on the interface(s) with CARP VIPs that switched.
    If the transitions are seen on master and slave, the master must be demoting itself, in which case the system log should have some clues about what happened. Usually that would mean something like an interface lost link/went down.



  • All of the VIP's fail over, and they both send notifications about the state change. When the change occurs, the backup says it's preempting a slower master, and the master says more frequent advertisements received. This occurs on all of the VIP's.

    This last storm of flip-flopping occured immediately after we unchecked block private networks on the backup server. This setting was already unchecked on the master (We are using private IP's for a mock Internet as we set up pfSense for a remote office.)

    Would turning on debugging provide more information on what happens?



  • Looking at the logs, I notice the first VIP to fail is one of the WAN VIP's. There are 5 VIP's on the WAN interface, and the WAN interfaces are connected to a cheap 3com 100meg 8 port switch. Is it possible that the amount of VIP's on this same interface is somehow causing dropped CARP traffic? Maybe the cheap switch isn't able to keep up with the frequent advertisements?


  • Rebel Alliance Developer Netgate

    It's possible but 5 VIPs wouldn't be that much traffic - one multicast packet per node per VIP per second, plus the skew.

    So really only 10 packets per second (give or take) for 5 VIPs on the WAN. That would have to be a pretty crappy switch to fail on that. Possible? Sure, but I'm not sure how likely it is… unless it's dying of old age.

    If you have another spare switch to try, it would certainly be worth swapping it out.



  • Could the firewall or QoS be causing problems like this? The fact that they started flip-flopping immediately after making a change to the firewall makes me suspicious.


  • Rebel Alliance Developer Netgate

    They shouldn't be, the CARP traffic should be excluded from firewall rules, though there may still be a way to break it that I'm unaware of.



  • Just an update, we switched out the switch with a better gigabit switch. We are still getting that flip-flopping effect of both systems pre-empting each other over and over and over again. We've moved our OpenVPN server off the pfSense boxes to a dedicated VM, so hopefully that will reduce the interruption caused by this until we can figure out what's going on here.

    The switch on the WAN side is http://www.newegg.com/Product/Product.aspx?Item=N82E16833704131. Not the most feature rich switch, but it works.

    Is there anything I'm missing here on what would be causing this? Is this a bug, perhaps? Or a misconfiguration between the two systems somehow?



  • This can't be normal.

    Everything was fine for a few days, both systems were up. The primary was the master and the secondary was the slave. All was well. Then this happened. No one was logged into the systems, no out of the ordinary traffic, the WAN interfaces weren't overly utilized.

    They seem to just start fighting over who should be master. We recently adjusted the CARP advertising frequency on the systems, thinking maybe that was the problem. As is, our primary has  base:1 skew:0 and the backup has base:2 and skew:100

    Is there anything I can look at to figure out what exactly is going on here? My inbox would greatly appreciate any advice :D




  • Sorry to keep posting replies, but I was curious about the log file from the primary. Below is the log file from the initial failover, up until we shut the secondary system off. Interface OPT1 is the pfSync interface.

    I noticed that when it starts to failover, I get 'MASTER -> BACKUP' entries for each VIP, the interface is changed to DOWN, and an email is dispatched.

    Then, at 11:19:43, vip2 changes to UP, with no explanation in the log as to why. It doesn't say it changed to MASTER, it just comes up. Then it immediately goes back down one second later, sending off another failover notification. This happens for every single VIP, and they keep coming up and going down until we shut off the secondary.

    Is it normal for the log to just say the interface came up, and not mention that it's taking over as master?

    EDIT: Another oddity I noticed was at 11:20:13, where it says:

    
    Oct 15 11:20:13	kernel: vip4: link state changed to D
    Oct 15 11:20:13	kernel: O<wn6<br>Oct 15 11:20:13	kernel: >vip2: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:20:13	kernel: vip3: link state changed to DOWN</wn6<br> 
    

    What's up with that?

    
    Oct 15 11:11:26	kernel: vip5: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:11:26	kernel: vip5: link state changed to DOWN
    Oct 15 11:11:26	kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:11:26	kernel: vip1: link state changed to DOWN
    Oct 15 11:11:26	kernel: vip6: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:11:26	kernel: vip6: link state changed to DOWN
    Oct 15 11:11:26	php: : Message sent to it@domain.com OK
    Oct 15 11:11:27	php: : Message sent to it@domain.com OK
    Oct 15 11:11:28	php: : Message sent to it@domain.com OK
    Oct 15 11:11:28	kernel: vip4: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:11:28	kernel: vip4: link state changed to DOWN
    Oct 15 11:11:28	kernel: vip3: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:11:28	kernel: vip3: link state changed to DOWN
    Oct 15 11:11:28	kernel: vip2: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:11:28	kernel: vip2: link state changed to DOWN
    Oct 15 11:11:28	php: : Message sent to it@domain.com OK
    Oct 15 11:11:29	php: : Message sent to it@domain.com OK
    Oct 15 11:11:30	php: : Message sent to it@domain.com OK
    Oct 15 11:19:43	kernel: vip2: link state changed to UP
    Oct 15 11:19:44	php: : Message sent to it@domain.com OK
    Oct 15 11:19:44	kernel: vip2: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:19:44	kernel: vip2: link state changed to DOWN
    Oct 15 11:19:45	php: : Message sent to it@domain.com OK
    Oct 15 11:20:12	kernel: vip4: link state changed to UP
    Oct 15 11:20:12	kernel: vip3: link state changed to UP
    Oct 15 11:20:12	kernel: vip2: link state changed to UP
    Oct 15 11:20:13	php: : Message sent to it@domain.com OK
    Oct 15 11:20:13	php: : Message sent to it@domain.com OK
    Oct 15 11:20:13	kernel: vip4: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:20:13	kernel: vip3: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:20:13	kernel: vip4: link state changed to D
    Oct 15 11:20:13	kernel: O<wn6<br>Oct 15 11:20:13	kernel: >vip2: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:20:13	kernel: vip3: link state changed to DOWN
    Oct 15 11:20:13	kernel: vip2: link state changed to DOWN
    Oct 15 11:20:14	php: : Message sent to it@domain.com OK
    Oct 15 11:20:15	php: : Message sent to it@domain.com OK
    Oct 15 11:20:15	php: : Message sent to it@domain.com OK
    Oct 15 11:20:16	php: : Message sent to it@domain.com OK
    Oct 15 11:20:20	kernel: vip1: link state changed to UP
    Oct 15 11:20:21	kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:20:21	kernel: vip1: link state changed to DOWN
    Oct 15 11:20:21	php: : Message sent to it@domain.com OK
    Oct 15 11:20:22	php: : Message sent to it@domain.com OK
    Oct 15 11:20:53	kernel: vip6: link state changed to UP
    Oct 15 11:20:53	kernel: vip3: link state changed to UP
    Oct 15 11:20:53	kernel: vip2: link state changed to UP
    Oct 15 11:20:53	kernel: vip1: link state changed to UP
    Oct 15 11:20:53	php: : Message sent to it@domain.com OK
    Oct 15 11:20:54	kernel: vip6: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:20:54	kernel: vip3: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:20:54	kernel: vip2: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:20:54	kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:20:54	kernel: vip6: link state changed to DOWN
    Oct 15 11:20:54	kernel: vip3: link state changed to DOWN
    Oct 15 11:20:54	kernel: vip2: link state changed to DOWN
    Oct 15 11:20:54	kernel: vip1: link state changed to DOWN
    Oct 15 11:20:56	php: : Message sent to it@domain.com OK
    Oct 15 11:20:58	php: : Message sent to it@domain.com OK
    Oct 15 11:20:58	php: : Message sent to it@domain.com OK
    Oct 15 11:20:59	kernel: vip2: link state changed to UP
    Oct 15 11:20:59	kernel: vip1: link state changed to UP
    Oct 15 11:21:00	php: : Message sent to it@domain.com OK
    Oct 15 11:21:00	kernel: vip2: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:21:00	kernel: vip1: MASTER -> BACKUP (more frequent advertisement received)
    Oct 15 11:21:00	kernel: vip2: link state changed to DOWN
    Oct 15 11:21:00	kernel: vip1: link state changed to DOWN
    Oct 15 11:21:00	php: : Message sent to it@domain.com OK
    Oct 15 11:21:01	php: : Message sent to it@domain.com OK
    Oct 15 11:21:02	php: : Message sent to it@domain.com OK
    Oct 15 11:21:02	php: : Message sent to it@domain.com OK
    Oct 15 11:21:03	php: : Message sent to it@domain.com OK
    Oct 15 11:21:03	kernel: vip4: link state changed to UP
    Oct 15 11:21:03	kernel: vip6: link state changed to UP
    Oct 15 11:21:03	kernel: vip3: link state changed to UP
    Oct 15 11:21:03	kernel: vip2: link state changed to UP
    Oct 15 11:21:03	kernel: vip1: link state changed to UP
    Oct 15 11:21:04	php: : Message sent to it@domain.com OK
    Oct 15 11:21:04	php: : Message sent to it@domain.com OK
    Oct 15 11:21:04	kernel: vip5: BACKUP -> MASTER (preempting a slower master)
    Oct 15 11:21:04	kernel: vip5: link state changed to UP
    Oct 15 11:21:05	php: : Message sent to it@domain.com OK
    Oct 15 11:21:06	php: : Message sent to it@domain.com OK
    Oct 15 11:21:06	php: : Message sent to it@domain.com OK
    Oct 15 11:21:07	php: : Message sent to it@domain.com OK
    Oct 15 11:21:07	php: : Message sent to it@domain.com OK
    Oct 15 11:21:08	php: : Message sent to it@domain.com OK
    Oct 15 11:22:14	kernel: re0: link state changed to DOWN
    Oct 15 11:22:14	check_reload_status: Linkup starting re0
    Oct 15 11:22:15	check_reload_status: Linkup starting re0
    Oct 15 11:22:15	kernel: re0: link state changed to UP
    Oct 15 11:22:19	php: : Hotplug event detected for opt1 but ignoring since interface is configured with static IP (192.168.100.1)
    Oct 15 11:22:20	php: : Hotplug event detected for opt1 but ignoring since interface is configured with static IP (192.168.100.1)
    Oct 15 11:22:20	check_reload_status: rc.newwanip starting re0
    Oct 15 11:22:25	php: : rc.newwanip: Informational is starting re0.
    Oct 15 11:22:25	php: : rc.newwanip: on (IP address: 192.168.100.1) (interface: opt1) (real interface: re0).
    Oct 15 11:22:25	apinger: Exiting on signal 15.
    Oct 15 11:22:26	check_reload_status: Reloading filter
    Oct 15 11:22:26	apinger: Starting Alarm Pinger, apinger(16230)</wn6<br> 
    


  • Just an update, I think we've solved the problem. We replaced our replacement switch with yet another one, this time we used a 3com managed switch. It looks like our pfsense boxes have finally stopped flipping between states.

    So, it looks like a cheap 3com consumer switch didn't work. A TP-Link gigabit switch didn't work, but the managed 3com did work.

    Edit: Does anyone know if there is a list of "approved" switches that are known to work with CARP? As is, it looks like consumer stuff can't handle it.

    Edit 2: Forget it. It went fine for 4 days, thought all was well, thought we had figured it out. I make this post, and now it flipped. However, it flipped on the LAN VIP first. So, at least it's doing something different this time. We might try putting another 3com on the LAN side to see what happens then.



  • nullifi, have you ever figured this out?

    We have been struggling with sporadic failovers (master demotes itself) for over two years now on one cluster.
    This never happens on any other cluster, just this particular one, and we can't seem to find out why. Both hosts run VMWare 4.1 with all MAS forgery, promiscuous mode, this NetReverseFiltering = 1. None of the other PFSense clusters on the two VMWares have this problem, just this one, so I think this may be PFSense/Config related rather than Layer 2, as the other boxes share the vswitch on VMWare.

    Also, on the CARP status page, clicking "Disable" then "Enable" repairs it until it demotes itself again.

    Any ideas?


Log in to reply