XML_RPC High Availability sync failing

y3sgroup

Hi, I have two boxes in HA sync running the latest build (2.3.4-RELEASE amd64).

They work fine, acting in a master/slave private IP failover setup, but the XML_RPC connection over the HA link keeps failing with the following error:

A communications error occurred while attempting XMLRPC sync with username jblokes https://10.155.0.2:443

The log on the master unit shows:

Jun 14 00:45:15 php-fpm 54775 /rc.filter_synchronize: New alert found: A communications error occurred while attempting XMLRPC sync with username jblokes https://10.155.0.2:443.
Jun 14 00:45:15 php-fpm 54775 /rc.filter_synchronize: A communications error occurred while attempting XMLRPC sync with username jblokes https://10.155.0.2:443.
Jun 14 00:45:15 php-fpm 54775 /rc.filter_synchronize: XML_RPC_Client: Connection to RPC server 10.155.0.2:443 failed. Operation timed out 103

As a result of this, the interface firewall rules aren't being synced to the slave unit.

After googling the above error, I found that there used to be a bug in pfSense some time ago, but I could not work out whether the fix is already included in my build or whether it's something I need to patch manually.
https://redmine.pfsense.org/issues/5329

y3sgroup

UPDATE:

I have now established that the XML_RPC sync actually works, but only once. When I set it up and it runs, all the virtual IPs, the firewall rules and the rest are successfully copied across. However, after that the connectivity over TCP/UDP port 443 is lost and cannot be restored, even though the firewall rule for the HA net on both boxes is "allow any traffic to and from HA net to and from any port on HA net".
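For anyone wanting to check the same thing from another machine on the HA net, here is a minimal sketch of a TCP probe. It is not part of pfSense; the function name `probe_tcp` is made up for illustration, and it only tells "open" apart from "refused" and "timeout". It tests TCP only, since the XMLRPC sync URL in the error is https.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try one TCP connection and report how it ended.

    Returns "open", "refused", "timeout", or "error: <details>".
    Hypothetical helper for illustration, not a pfSense tool.
    """
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are probably being dropped somewhere.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. no route to host.
        return f"error: {exc}"
```

Calling `probe_tcp("10.155.0.2", 443)` before and after the first sync would show whether the port really goes from reachable to unreachable. A "timeout" result would be consistent with the "Operation timed out" line in the log, while "refused" would point at the web server on the backup unit rather than at filtering.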

Initially I thought this might have something to do with the fact that the master firewall was built about a year ago, while the backup unit is brand new. The names of the interfaces on the two boxes differed slightly. For example, while the "LAN1" interface was mapped to "lan and lagg1" on the master unit, the backup unit's "LAN1" was mapped to "opt3 and lagg1", and so on. As a direct result, when the rules were synced (once), they were assigned to the wrong interfaces. A similar thing also happened to the virtual IPs.

So I have rebuilt the backup unit to correspond precisely with the underlying pfSense naming scheme of the master unit (after realizing that, as far as pfSense is concerned, re-labelling the interfaces does not change their originally assigned names).
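A rough sketch of how the mismatch could be spotted by comparing the two units' configuration backups. It assumes the backup XML has an `<interfaces>` section whose child tags are the internal names (wan, lan, opt3, ...) with `<if>` and `<descr>` elements inside, which is how I understand config.xml to be laid out; check against your own backup. The sample fragments and the function names are invented for illustration, with only the LAN1 / lan / opt3 / lagg1 values taken from my setup.

```python
import xml.etree.ElementTree as ET

# Invented, heavily trimmed fragments; real config.xml files are much larger.
PRIMARY = """<pfsense><interfaces>
  <wan><if>em0</if><descr>WAN</descr></wan>
  <lan><if>lagg1</if><descr>LAN1</descr></lan>
</interfaces></pfsense>"""

SECONDARY = """<pfsense><interfaces>
  <wan><if>em0</if><descr>WAN</descr></wan>
  <opt3><if>lagg1</if><descr>LAN1</descr></opt3>
</interfaces></pfsense>"""


def interface_map(config_xml):
    """Map internal name -> (device, label) from the assumed layout."""
    section = ET.fromstring(config_xml).find("interfaces")
    if section is None:
        return {}
    return {
        iface.tag: (iface.findtext("if", default=""),
                    iface.findtext("descr", default=""))
        for iface in section
    }


def mismatches(primary, secondary):
    """List internal names whose device or label differs between units."""
    diffs = []
    for name in sorted(set(primary) | set(secondary)):
        if primary.get(name) != secondary.get(name):
            diffs.append((name, primary.get(name), secondary.get(name)))
    return diffs
```

With the fragments above, `mismatches(interface_map(PRIMARY), interface_map(SECONDARY))` reports `lan` as missing on the secondary and `opt3` as missing on the primary, even though both carry the same "LAN1" label. That is the situation where synced rules end up on the wrong interface.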

Unfortunately, although the rules and virtual IPs are now being (seemingly) synced correctly, it only works once and then I get the infamous "A communications error occurred while attempting XMLRPC sync with username" error again and the TCP/UDP port 443 on the backup unit becomes unavailable again.

I'm running out of ideas here…

dotdash

Your interfaces need to match: the underlying interfaces, such as em1, igb0, etc., need to be the same on both units. You can work around this by using LAGG interfaces if you can't match the hardware.
Try using the admin account to sync. Historically, using other accounts has been problematic.

whitwye

Same problem. Interfaces match. Using admin:

An error code was received while attempting XMLRPC sync with username admin https://192.168.100.2:443 - Code 6: The requested method didn't return an XML_RPC_Response object. @ 2017-06-18 16:40:30

y3sgroup

Yeah, mine match too now, but I'm still getting the same error after the first sync. We're getting paid pfSense support to have a look into it for us. I'll post the solution if we find one.

BTW, jblokes is just a name I made up really :) It is the admin account I'm using.

whitwye

Now having a further problem. The two systems can ping each other just fine. But the backup decided to do a CARP failover/takeover. Really bad behavior.

The systems are cabled directly, with no switch on this interface. I've run Linux servers with CARP (well, UCARP) failover for years, on directly-cabled NICs, and never seen CARP fail like this. The CARP troubleshooting all assumes the crossover is handled through a switch, and describes how some switches introduce problems. Is FreeBSD just bad at NIC support?

dotdash

I use direct connections for sync most of the time; it shouldn't be a problem. Make sure you are allowing everything between the boxes on the sync interface. Also, the switches on the other interfaces (LAN, WAN, etc.) can introduce problems: the two boxes need to be able to see each other on all of their CARP-enabled interfaces.

whitwye

When you say the switches on other interfaces can create problems, is that only the case if there is subnet overlap? Since we don't have that, can I rule the switches out?

dotdash

I mean that if the two WAN interfaces (for example) can't exchange CARP information, they will have problems.

whitwye

@dotdash:

I mean that if the two WAN interfaces (for example) can't exchange CARP information, they will have problems.

Why should the WAN interfaces need to exchange CARP information? Is the CARP setup here different from what I've always set up when doing it by hand, where CARP is exchanged only on the direct line between the servers?

Derelict

CARP is done on the interfaces themselves.

There is a far-too-common misconception that the SYNC interface has something to do with CARP. It does not. It is generally used for state sync (pfsync) and configuration sync (XMLRPC sync).

If you have two WAN interfaces, each with an address and sharing a CARP VIP, those interfaces themselves need to be able to exchange CARP heartbeats. The same is true for all CARP/HA interfaces. The heartbeats are multicast to 224.0.0.18.

https://portal.pfsense.org/docs/book/highavailability/index.html

A useful troubleshooting step is to packet capture for CARP traffic on the interface you think should be BACKUP but is MASTER instead. The built-in packet capture will decode CARP and show you the base/advskews, etc. In the default configuration the base/advskew of the primary should be 1/0. It should be 1/100 on the secondary. You will probably see nothing but 1/100 being sent by the secondary: it should be receiving the 1/0 heartbeats from the primary, but it is not. If it were, it would assume BACKUP status and stop sending the 1/100 heartbeats.
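To make that reasoning concrete, here is a small sketch that turns the base/advskew pairs you read off a capture into a verdict. It does not parse capture files; you type in the pairs by hand. The function name and the result strings are made up for illustration, and the 1/0 and 1/100 defaults are the ones described above.

```python
def diagnose(observed, primary=(1, 0), secondary=(1, 100)):
    """Interpret (advbase, advskew) pairs seen on the secondary's interface.

    `observed` is a list of pairs copied by hand from a packet capture.
    Illustrative helper only; the verdict strings are not pfSense output.
    """
    seen = set(observed)
    if not seen:
        return "no CARP advertisements captured"
    if primary in seen and secondary not in seen:
        # Normal case: the secondary hears the primary and stays quiet.
        return "primary heartbeats arriving; secondary should be BACKUP"
    if secondary in seen and primary not in seen:
        # The symptom described above: the secondary never hears 1/0.
        return "primary heartbeats missing; secondary will claim MASTER"
    if primary in seen and secondary in seen:
        return "both units advertising; capture again after it settles"
    return "unexpected advbase/advskew values; check the VIP settings"
```

For example, `diagnose([(1, 100), (1, 100)])` returns the "primary heartbeats missing" verdict, which is the case where you go looking at the switch path or VLAN tagging between the two interfaces.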

A very common misconfiguration is adding an interface but not tagging the new VLANs through all the way between interfaces, etc.