HA Sync problems since updating to 2.4.4


  • LAYER 8 Moderator

    Hi,

    I know there's already a thread going on, but we have a similar problem with (IMHO) a different cause, or a bug.
    Since updating to the 2.4.4 series (no problems up to 2.4.3), the standby firewall very often responds with an

    A communications error occurred ... (restore_config_section)
    

    error during XMLRPC sync. It is very often triggered simply by adding/editing firewall rules or even aliases. Right now I'm sitting on site watching the error counter climb while the admin staff here adds some simple firewall aliases - not even rules.

    As these errors are often caused by state killing on gateway events, that's the first thing we checked. The setup has:

    • Multi WAN (2)
    • SYNC
    • LAN/XFER network to a core switch
    • a few small VLANs

    Nothing out of the ordinary. Multi-WAN has a GW failover group that is currently not even in use, as the main traffic still goes via the old system we are replacing with this cluster. So besides about 15 OVPN tunnels and 2 remote-access OVPN servers, there's not much traffic on the system, and the VPNs work great so far (on the primary node). But since updating to 2.4.4 a few weeks ago and to 2.4.4-p3 today, these new sync errors keep piling up - errors that simply didn't exist before.

    The only GWs that are "DOWN" on the standby node are the VPN servers (of course), and all GWs belonging to them (for routing purposes) are configured with "no down action" so as not to trigger anything but still show whether the tunnel peer is up or down. The advanced setting "State Killing on Gateway Failure" is not active either, so I'm completely baffled as to why those errors started appearing and why the hell they keep piling up on the standby node.

    As @jimp checked in https://forum.netgate.com/topic/136573/xmlrpc-sync-errors-since-upgrade-to-2-4-4/3 we already did:

    • Ping Sync interfaces on both sides - no problem, they see and can reach each other just fine
    • Test Port: no problem
    • no blocked entries in firewall rules belonging to the SYNC interface
    • no firewall logs about SYNC
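
    For anyone wanting to re-run the first two checks after each change, they can be scripted. A minimal sketch - the SYNC peer address is a placeholder, and it assumes the default webConfigurator port, since XMLRPC sync runs over the GUI port:

```shell
#!/bin/bash
# Connectivity checks toward the standby node's SYNC address.
# PEER and PORT are hypothetical values -- substitute your own.

PEER="172.16.99.2"   # standby node's SYNC address (placeholder)
PORT=443             # XMLRPC sync uses the webConfigurator port (HTTPS default)

# check_tcp HOST PORT: succeed only if a TCP connect completes within 5s
check_tcp() {
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

ping -c 2 "$PEER" >/dev/null 2>&1 && echo "ping OK"        || echo "ping FAILED"
check_tcp "$PEER" "$PORT"         && echo "sync port OK"   || echo "sync port FAILED"
```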

    Running packages are:

    • OVPN client export
    • sudo
    • system patches
    • mtr-nox11
    • cron

    so nothing serious. BSNMPd is running on an internal interface, too; unbound and dnsmasq are both off.

    All VPN interfaces are bound to localhost, BTW, as we port-forward incoming connections on both WANs to them, so nothing should even have to touch the OVPN server interfaces at all, as there is no need to down/up them on localhost. We even implemented @Derelict's suggestion of explicit-exit-notify.
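
    For reference, the relevant lines of such a localhost-bound server instance would look roughly like this (the port and values are illustrative, not taken from our actual config):

```
# Sketch of an OpenVPN server instance bound to loopback; WAN NAT rules
# port-forward incoming connections on both WANs to 127.0.0.1:1194.
local 127.0.0.1          # bind only to loopback
proto udp4
port 1194                # example port, forwarded from both WAN addresses
explicit-exit-notify 1   # notifies clients -- but only when the daemon exits
```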

    Still no visible change. On every reload/save of an alias, rule, route, etc., the standby system triggers loads of log entries about interfaces having changed, things needing to be reloaded, and so on, like this:

    <date> php-fpm   /xmlrpc.php: Resyncing OpenVPN instances.    -> // why?
    <date> php-fpm   OpenVPN terminate old pid: xxxxx
    <date> kernel   ovpns1: link state changed to DOWN
    

    etc. etc. Then rc.newwanip fires for all the VPN GW addresses (as if they had new WAN IPs), SNMPd restarts, all OpenVPN instances restart, and dpinger instances restart (even though monitoring is disabled on those gateways) - adding up to roughly 350 log lines for one small alias change and sync.
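
    To put a number on that flood, the syslog can be split into per-sync chunks. A rough shell/awk sketch - the "Resyncing OpenVPN instances" marker is taken from the excerpt above; everything else is an assumption:

```shell
# count_sync_lines: read syslog text on stdin and report how many log
# lines each XMLRPC-sync event produces, treating the "Resyncing OpenVPN
# instances" message (seen in the excerpt) as the start-of-event marker.
count_sync_lines() {
  awk '
    /Resyncing OpenVPN instances/ {
      if (started) print "sync event:", n, "lines"
      started = 1; n = 0
    }
    started { n++ }   # count every line inside the current event
    END { if (started) print "sync event:", n, "lines" }
  '
}

# On pfSense the system log is circular, so feed it via clog, e.g.:
#   clog /var/log/system.log | count_sync_lines
```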

    Any chance there's something we could debug or improve? The standby system jumps up to a load of 3-4.0 with that, too, while the main node runs smoothly at 0.3 with no problems - only because the sync triggers wave after wave of unnecessary reloads, even though there's nothing obvious (to me) that would need reloading. The standby still has not taken over the VPN tunnels, thus no GWs are needed and nothing changed GW-wise, but still everything gets reloaded, downed and upped so much that the system seems to run out of steam (PHP/nginx-wise) and even becomes unresponsive to webUI calls until we restart the webGUI/PHP processes via SSH.

    I know the linked thread above already established that 2.4.4 has nothing obvious in it that would trigger this behavior, but we are seeing the problem appear on clusters (like this one) that operated absolutely flawlessly before. So anything we could try/debug/add would be appreciated!

    Greets,
    Jens


  • LAYER 8 Netgate

    @JeGr said in HA Sync problems since updating to 2.4.4:

    so nothing should even have to touch the OVPN server interfaces at all, as there is no need to down/up them on localhost. We even implemented @Derelict's suggestion of explicit-exit-notify.

    explicit-exit-notify does not apply in that case because the OpenVPN instance does not exit on failover. That means the clients will have to time out and when they attempt to reconnect they'll get whatever system holds the CARP VIP at that time.


  • LAYER 8 Moderator

    @Derelict said in HA Sync problems since updating to 2.4.4:

    explicit-exit-notify does not apply in that case because the OpenVPN instance does not exit on failover. That means the clients will have to time out and when they attempt to reconnect they'll get whatever system holds the CARP VIP at that time.

    Ah, thanks for adding that. So it would only work for a VPN configured on a failover-gateway-group type of WAN?

    Edit: We also got the problem to appear less often by setting the "disable gateway monitoring" switch on all 15 VPN gateways. But:

    • that eliminates the possibility of any grouping of VPN interfaces
    • that blocks monitoring and status/availability checks on those gateways
    • it is irritating to the admin staff (as all VPNs always show as "on" even if they don't work)

    etc.
    We can definitely track this down to something that changed with 2.4.4: with 2.4.3 and ~6-7 VPNs active at the time (with their gateways configured active), this problem didn't happen at all.

    So situation is:

    • with 2.4.4 - for whatever reason - VPN GWs that are assigned and configured are "tilting out" the standby node
    • even without gateway monitoring, the standby node spits out ~300 lines of syslog with every config save on the master, as every single VPN interface is set down, then up, then reconfigured, then detected with a "new IP" (even if it didn't change at all), etc.
    • all of that starts after every simple config change

    I understand that config sync is non-trivial, but it strikes me as odd that changes to aliases or rules trigger that whole cascade of interface action just to reload and activate some simple ruleset or alias changes.

    Thanks for any insight,
    Jens

