Backup node taking over CARP Virtual IP

jypsilantis

I run a 1 x main/1 x standby HA cluster. Recently, I rebuilt the primary node following a disk failure. Replication and services all look OK and the cluster works as expected, but I have noticed that the backup occasionally takes over the LAN CARP IP whilst still showing that it is in backup mode from CARP's perspective. When I try to access the web interface via the CARP VIP address, I am directed to the backup.

Of particular interest, my OpenVPN endpoint still appears to work, even though the service is not running on the backup (when it holds the CARP VIP). So it looks as if the primary still has control of the VIP, but the behaviour of the web interface is counterintuitive.

XMLRPC replication is set up from the primary to the backup only - no reverse configuration as per the documentation.

I am running the latest stable release of the software on both nodes: "2.5.0-RELEASE"

Any pointers or assistance would be really appreciated.

(Ed: I have checked the WAN CARP VIP, which is DHCP-allocated by my broadband modem. The modem reports that the WAN VIP is held by the backup as well. Nevertheless, the backup node is showing "backup" for both LAN and WAN CARP statuses.)

Derelict

@jypsilantis CARP/pfSense HA is incompatible with dynamic addresses like those obtained via DHCP. I would say if it ever worked it was a fluke.

In a normal CARP setup the only reason a VIP in the BACKUP state would go MASTER is if that interface stopped receiving "better" advertisements from the MASTER node.
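To make that rule concrete, here is a minimal sketch of the decision a VIP makes. It is a deliberately simplified model, not the FreeBSD implementation: it ignores preemption, demotion and the actual timers, and the advbase/advskew values are illustrative rather than taken from this cluster. The `Advert` class and `next_state` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Advert:
    """One CARP advertisement, reduced to the two fields that rank nodes."""
    advbase: int  # base advertisement interval, in seconds
    advskew: int  # extra skew, in 1/256ths of a second

    def interval(self) -> float:
        # A shorter interval means more frequent adverts, i.e. a "better" node.
        return self.advbase + self.advskew / 256


def next_state(own: Advert, heard: Iterable[Advert]) -> str:
    """State a VIP should take, given the adverts heard before the timeout.

    Simplified model: stay BACKUP only while at least one received advert
    is better than our own; with none arriving, become MASTER.
    """
    if any(a.interval() < own.interval() for a in heard):
        return "BACKUP"
    return "MASTER"


primary = Advert(advbase=1, advskew=0)      # illustrative values
secondary = Advert(advbase=1, advskew=100)  # illustrative values

print(next_state(secondary, [primary]))  # better adverts arrive: BACKUP
print(next_state(secondary, []))         # adverts stop arriving: MASTER
print(next_state(primary, [secondary]))  # only worse adverts heard: MASTER
```

In this model the secondary can only go MASTER when the primary's advertisements stop reaching it, which is why attention usually turns to the layer 2 path between the nodes.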

jypsilantis

@derelict thank you for the quick reply.

My LAN NICs are set to static IP addresses and the same is happening on these as well. I can try changing over to statics on the WAN side as well, but I think it won't make much difference.

The strange thing is that the backup is still showing "backup" even though it has control of the VIP.

It is almost as if there is some kind of load balancing happening - the backup appears to be slightly less loaded than the primary for the most part.

Everything seems to be working properly so I am not too concerned on that part, just a bit confusing when you try to log onto the active master and end up on the backup.

Derelict

@jypsilantis In general, unless the primary node is in maintenance mode, all CARP VIPs on the primary should be MASTER and all CARP VIPs on the secondary should be BACKUP.

If that is not the case the problem is generally a layer 2 / multicast/broadcast domain problem in the path between the nodes on that network.

There is a sticky at the top of this category in which I attempted to explain the various parts of an HA cluster.

jypsilantis

@derelict thanks for this.

I looked at the persistent article that you mentioned. The symptoms in my case are different - I do not have a master/master situation, so it looks like the nodes are correctly resolving and establishing priority orders.

I have a managed switch on the LAN side, and the modem has handled CARP without missing a beat for at least 3 years now. The problem appears to have started with the upgrade to the latest version of pfSense that I installed a few days ago.

Derelict

@jypsilantis Like I said, CARP is not compatible with interfaces that obtain their addressing from DHCP and never has been. I am probably misunderstanding what you actually have there.

jypsilantis

@derelict, one pair of interfaces (WAN) is on DHCP and the other pair (LAN) is truly statically addressed (not DHCP pseudo-static). The problem occurs on both.

If DHCP were the issue on the WAN, I would see, for example

Interface          CARP state   Ownership of VIP

Primary LAN NIC    PRIMARY      Yes
Backup LAN NIC     BACKUP       No
Primary WAN NIC    BACKUP       No
Backup WAN NIC     PRIMARY      Yes

What I am actually seeing:

Interface          CARP state   Ownership of VIP

Primary LAN NIC    PRIMARY      No
Backup LAN NIC     BACKUP       Yes
Primary WAN NIC    PRIMARY      No
Backup WAN NIC     BACKUP       Yes

Derelict

@jypsilantis That doesn't make much sense. You might want to just post screen shots of the CARP status pages or, better, output from both nodes of ifconfig -vvvvma

Some terminology so everyone is on the same page: Nodes are primary/secondary, VIPs are MASTER/BACKUP.

jypsilantis

@derelict thanks for this.

Here are some screenshots.

fw1.local is the primary member of the HA cluster, and fw2.local is the backup

The WAN-side NICs share address 10.1.0.10, which is presented by the active member to the broadband modem/router. The modem/router assigns "primary" IP addresses to each member via DHCP: 10.1.0.97 for fw1 and 10.1.0.85 for fw2

The LAN-side NICs share address 10.0.0.3. fw1 has an intrinsic static IP of 10.0.0.1 and fw2 has 10.0.0.2.

From the screenshots, you can see that fw2 is running backup CARP on both of its NICs, and conversely fw1 is running MASTER for both of its interfaces. As such, there is no split master/backup or dual master/master. These statuses appear to persist.

However, when I access the web interface via 10.0.0.3, I land on fw2. Similarly, the modem/router reports that fw2 has control of 10.1.0.10.

If I reboot the backup, then fw1 takes over 10.0.0.3 and I get to its web interface via this address. However, several minutes after fw2 comes back up, it resumes control of 10.0.0.3 and the status as per the screenshots returns.

This is counterintuitive, but strangely everything seems to be working fine in all other respects.

[edit: just noticed that the netmask for the LAN-side CARP VIP is wrong - it should be /16. I have made the change. However, it had no effect on the above behaviour: fw2 took over 10.0.0.3 shortly after reboot]
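A quick way to sanity-check that a CARP VIP and the interface addresses agree on the subnet is sketched below, using Python's standard `ipaddress` module. The LAN addresses are the ones quoted in this thread; the /16 on the interface addresses and the /24 in the mismatch case are assumptions for illustration, since the thread does not state them. `same_subnet` is a hypothetical helper name.

```python
import ipaddress


def same_subnet(vip: str, *interfaces: str) -> bool:
    """True if every interface address is in exactly the VIP's network."""
    vip_net = ipaddress.ip_interface(vip).network
    return all(ipaddress.ip_interface(i).network == vip_net for i in interfaces)


# LAN addresses from this thread, all assumed to carry the corrected /16 mask.
print(same_subnet("10.0.0.3/16", "10.0.0.1/16", "10.0.0.2/16"))  # True

# Hypothetical mismatch: VIP left at /24 while the interfaces use /16.
print(same_subnet("10.0.0.3/24", "10.0.0.1/16", "10.0.0.2/16"))  # False
```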

[Attached screenshots: Screen Shot 2021-03-17 at 3.30.16 pm.png, Screen Shot 2021-03-17 at 3.31.09 pm.png]

Derelict

@jypsilantis You'll need to look at layer 2 and see what is happening with the CARP MAC address. Everything there looks fine. Be sure you're also not doing something like port forwarding the webgui connections around.
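For reference when checking layer 2: a CARP VIP answers from a virtual MAC address in the range 00:00:5e:00:01:xx, where the last octet is the VHID in hexadecimal. The small sketch below derives the address to look for in the switch's MAC table or a client's ARP cache. The VHID values are hypothetical, as the thread does not give them, and `carp_mac` is an illustrative helper name.

```python
def carp_mac(vhid: int) -> str:
    """Virtual MAC address used by a CARP VIP with the given VHID."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be between 1 and 255")
    # The last octet is the VHID, written as two hex digits.
    return f"00:00:5e:00:01:{vhid:02x}"


for vhid in (1, 2):  # hypothetical VHIDs, not taken from the thread
    print(vhid, carp_mac(vhid))
```

If the switch port learning that MAC is the one facing the secondary node while the primary reports MASTER, that points to the switching path rather than to CARP itself.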

jypsilantis

@derelict I may have found the problem. Possibly a corrupt or failing disk.

I replaced the disk on the backup node today, rebuilt it and restored the configuration from a recent backup file. Everything looks fine now.

I will keep monitoring in case the problem recurs, but it may be something as simple as this.

A really strange symptom if it is in fact a failing disk. SMART status was OK, so perhaps some corruption from the recent power outage that took out my primary firewall disk.

For anyone else who may experience this issue, try rebooting with the disk repair option, and/or change out the disk and rebuild/restore.

Thanks for your help and guidance.