CARP switch Master/Backup every 15 minutes
-
Hi
We have the same problem as described here:
Topic 172700 (pfSense 2.6.0)
We receive these mails every 15 minutes from the slave pfSense (no mail from the master):
21:45:12 HA cluster member "(X.X.13.254@lagg0.1113): (V1113)" has resumed CARP state "MASTER" for vhid 16
21:45:12 HA cluster member "(X.X.11.254@lagg0.1411): (V1411)" has resumed CARP state "BACKUP" for vhid 64
21:45:12 HA cluster member "(X.X.166.254@lagg0.1266): (V1266)" has resumed CARP state "MASTER" for vhid 35
21:45:12 HA cluster member "(X.X.50.254@lagg0.1429): (V1429)" has resumed CARP state "MASTER" for vhid 34
21:45:12 HA cluster member "(X.X.18.254@lagg0.1118): (V1118)" has resumed CARP state "BACKUP" for vhid 21
21:45:12 HA cluster member "(X.X.18.254@lagg0.1118): (V1118)" has resumed CARP state "MASTER" for vhid 21
21:45:12 HA cluster member "(X.X.39.254@lagg0.1139): (V1139)" has resumed CARP state "MASTER" for vhid 68
21:45:12 HA cluster member "(X.X.22.254@lagg0.1122): (V1122)" has resumed CARP state "MASTER" for vhid 25
21:45:12 HA cluster member "(X.X.16.254@lagg0.1116): (V1116)" has resumed CARP state "MASTER" for vhid 19
21:45:12 HA cluster member "(X.X.2.254@lagg0.1402): (V1402)" has resumed CARP state "MASTER" for vhid 31
21:45:12 HA cluster member "(X.X.20.254@lagg0.1120): (V1120)" has resumed CARP state "MASTER" for vhid 23
21:45:12 HA cluster member "(X.X.1.254@lagg0.1401): (V1401)" has resumed CARP state "MASTER" for vhid 30
21:45:12 HA cluster member "(X.X.10.254@lagg0.1410): (V1410)" has resumed CARP state "MASTER" for vhid 61
21:45:12 HA cluster member "(X.X.1.254@lagg0.1101): (V1101)" has resumed CARP state "MASTER" for vhid 4
21:45:12 HA cluster member "(X.X.7.254@lagg0.1407): (V1407)" has resumed CARP state "MASTER" for vhid 42
21:45:12 HA cluster member "(X.X.166.254@lagg0.1266): (V1266)" has resumed CARP state "BACKUP" for vhid 35
21:45:12 HA cluster member "(X.X.20.254@lagg0.1120): (V1120)" has resumed CARP state "BACKUP" for vhid 23
21:45:12 HA cluster member "(X.X.2.254@lagg0.1402): (V1402)" has resumed CARP state "BACKUP" for vhid 31
21:45:12 HA cluster member "(X.X.10.254@lagg0.1410): (V1410)" has resumed CARP state "BACKUP" for vhid 61
21:45:12 HA cluster member "(X.X.22.254@lagg0.1122): (V1122)" has resumed CARP state "BACKUP" for vhid 25
21:45:12 HA cluster member "(X.X.1.254@lagg0.1401): (V1401)" has resumed CARP state "BACKUP" for vhid 30
21:45:12 HA cluster member "(X.X.1.254@lagg0.1101): (V1101)" has resumed CARP state "BACKUP" for vhid 4
21:45:12 HA cluster member "(X.X.7.254@lagg0.1407): (V1407)" has resumed CARP state "BACKUP" for vhid 42
21:45:12 HA cluster member "(X.X.50.254@lagg0.1429): (V1429)" has resumed CARP state "BACKUP" for vhid 34
21:45:12 HA cluster member "(X.X.39.254@lagg0.1139): (V1139)" has resumed CARP state "BACKUP" for vhid 68
21:45:12 HA cluster member "(X.X.13.254@lagg0.1113): (V1113)" has resumed CARP state "BACKUP" for vhid 16
21:45:12 HA cluster member "(X.X.16.254@lagg0.1116): (V1116)" has resumed CARP state "BACKUP" for vhid 19
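For anyone who wants to summarize notifications like the ones above, here is a minimal Python sketch. It assumes the message format is exactly as pasted in this post; the function names (carp_states_by_vhid, flapped) are made up for illustration and are not part of pfSense.

```python
import re
from collections import defaultdict

# Matches one notification in the format pasted above (assumed stable).
CARP_RE = re.compile(
    r'(?P<time>\d{2}:\d{2}:\d{2}) HA cluster member '
    r'"\((?P<addr>[^@]+)@(?P<iface>[^)]+)\): \((?P<descr>[^)]+)\)" '
    r'has resumed CARP state "(?P<state>MASTER|BACKUP)" for vhid (?P<vhid>\d+)'
)

def carp_states_by_vhid(text):
    """Return {vhid: [state, ...]} in the order the events appear in text."""
    states = defaultdict(list)
    for match in CARP_RE.finditer(text):
        states[int(match.group("vhid"))].append(match.group("state"))
    return dict(states)

def flapped(states):
    """Return the sorted vhids that reported both MASTER and BACKUP."""
    return sorted(
        vhid for vhid, seen in states.items()
        if {"MASTER", "BACKUP"} <= set(seen)
    )

# Three entries copied from the mail above, joined on one line as received.
sample = (
    '21:45:12 HA cluster member "(X.X.13.254@lagg0.1113): (V1113)" '
    'has resumed CARP state "MASTER" for vhid 16 '
    '21:45:12 HA cluster member "(X.X.11.254@lagg0.1411): (V1411)" '
    'has resumed CARP state "BACKUP" for vhid 64 '
    '21:45:12 HA cluster member "(X.X.13.254@lagg0.1113): (V1113)" '
    'has resumed CARP state "BACKUP" for vhid 16'
)

states = carp_states_by_vhid(sample)
print(flapped(states))
```

Because the regex is applied with finditer, it works whether the entries arrive on one line or one per line. A vhid that only ever reports BACKUP (like vhid 64 in the sample) is not counted as flapping.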
We found nothing in the slave logs at 21:45
On the master at this time we found only this in php-fpm.log:
Nov 24 21:45:12 X.X.X.X php-fpm[31372]: /rc.carpmaster: HA cluster member "(X.X.22.254@lagg0.1122): (V1122)" has resumed CARP state "MASTER" for vhid 25
And we lost something like 4 or 5 pings (enough to freeze a video session)
The lagg0 is LACP on an HPE 5700. We tried an HP A5500, same result.
We tried changing the LACP hash from the default to destination-ip source-ip. We have also upgraded the BIOS of our Dell R440 and the network firmware.
We have 122 CARP interfaces, and 52 IP aliases on one of them.
The CARP interfaces are gateways for VLANs and NAT addresses (outbound). The IP aliases are public addresses for servers.
We are going to reinstall the slave first, then the master.
But perhaps someone has an idea to help us before then ;) ?
-
So, we have reinstalled the BACKUP.
Then we shut down the MASTER. The BACKUP became MASTER with no problem.
But the issue is still present. Every 15 minutes we lose the network for 4 or 5 pings and then it comes back.
So I don't think it's an HA issue.
In the pfSense logs we see nothing.
We are lost.
-
@marcolefo So in your initial description they are both master? Sounds like a connectivity loss between them every 15 minutes. Switch problem?
-
@steveits no, the backup becomes master (as the log says) but on the GUI it's still BACKUP.
And note that the network loss also happens when the MASTER has been shut down.
I don't know if there is a link, but there is a cron job every 15 minutes that launches /etc/rc.filter_configure_sync/rc.filter_configure_sync
-
We have powered on the MASTER.
Since the BACKUP reinstall we get XMLRPC error notifications with "Operation timed out":
A communications error occurred while attempting to call XMLRPC method host_firmware_version: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:33:24
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:33:26
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:34:10
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:34:53
A communications error occurred while attempting to call XMLRPC method host_firmware_version: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:35:04
A communications error occurred while attempting to call XMLRPC method host_firmware_version: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:35:05
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:35:08
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:35:37
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:35:51
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:36:21
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:36:34
A communications error occurred while attempting to call XMLRPC method merge_installedpackages_section: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:37:04
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:37:18
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:37:48
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:38:01
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:38:31
A communications error occurred while attempting to call XMLRPC method merge_installedpackages_section: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:38:45
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:39:28
A communications error occurred while attempting to call XMLRPC method exec_php: Unable to connect to tls://X.X.X.252:443. Error: Operation timed out @ 2022-12-01 17:40:12
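To see which XMLRPC calls are timing out and how often, the notices above could be tallied with a short Python sketch like this one. It assumes the notice text is exactly as pasted here; count_xmlrpc_errors is a hypothetical helper name, not something shipped with pfSense.

```python
import re
from collections import Counter

# Matches one notice in the format pasted above (assumed stable).
XMLRPC_RE = re.compile(
    r"call XMLRPC method (?P<method>\w+): "
    r"Unable to connect to (?P<target>\S+?)\. "
    r"Error: (?P<error>.+?) @ "
    r"(?P<when>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)

def count_xmlrpc_errors(text):
    """Count failed XMLRPC calls per method name."""
    return Counter(m.group("method") for m in XMLRPC_RE.finditer(text))

# Three notices copied from the list above.
sample = (
    "A communications error occurred while attempting to call XMLRPC method "
    "host_firmware_version: Unable to connect to tls://X.X.X.252:443. "
    "Error: Operation timed out @ 2022-12-01 17:33:24 "
    "A communications error occurred while attempting to call XMLRPC method "
    "exec_php: Unable to connect to tls://X.X.X.252:443. "
    "Error: Operation timed out @ 2022-12-01 17:33:26 "
    "A communications error occurred while attempting to call XMLRPC method "
    "exec_php: Unable to connect to tls://X.X.X.252:443. "
    "Error: Operation timed out @ 2022-12-01 17:34:10"
)

for method, count in count_xmlrpc_errors(sample).most_common():
    print(method, count)
```

Every notice in this post points at the same target (port 443 on the peer) with the same error, so the per-method count mostly shows that all sync calls fail, not one in particular.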
-
@marcolefo said in CARP switch Master/Backup every 15 minutes:
no the backup become master (as the log says)
But then there should be a log on the primary that it became backup? If you're saying it successfully/correctly moves?
but on the GUI it's still BACKUP
The log entries you posted are all within the same second so it would be tough to catch in the GUI.
/etc/rc.filter_configure_sync/rc.filter_configure_sync
Not /etc/rc.filter_configure_sync? That shorter path is a normal file and is from time based rules: https://forum.netgate.com/topic/137911/cron-job-etc-rc-filter_configure_sync. If that causes a connectivity break that could cause the flapping.
Do you have a lot of rules? The System Patches package has a patch for https://redmine.pfsense.org/issues/12827.
-
@steveits we have swapped the switch, with no change.
Perhaps an LACP problem?
-
@steveits said in CARP switch Master/Backup every 15 minutes:
@marcolefo said in CARP switch Master/Backup every 15 minutes:
no the backup become master (as the log says)
But then there should be a log on the primary that it became backup? If you're saying it successfully/correctly moves?
but on the GUI it's still BACKUP
The log entries you posted are all within the same second so it would be tough to catch in the GUI.
/etc/rc.filter_configure_sync/rc.filter_configure_sync
Not /etc/rc.filter_configure_sync? That shorter path is a normal file and is from time based rules: https://forum.netgate.com/topic/137911/cron-job-etc-rc-filter_configure_sync. If that causes a connectivity break that could cause the flapping.
Yes, my fingers slipped ;). The cron entry is exactly:
0,15,30,45 * * * * root /etc/rc.filter_configure_sync
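To check whether the outages line up with this cron entry, here is a minimal Python sketch. It only understands a comma-separated minute list like the one above (no ranges or step values), and the helper names (cron_minutes, matches_schedule) are made up for illustration.

```python
from datetime import datetime

def cron_minutes(field):
    """Expand a comma-separated cron minute list such as '0,15,30,45'."""
    return sorted(int(part) for part in field.split(","))

def matches_schedule(timestamp, cron_line):
    """True if an HH:MM:SS timestamp falls on a minute listed in cron_line.

    Only the minute field is checked; the hour, day and month fields are
    assumed to be '*', as in the entry above.
    """
    minute_field = cron_line.split()[0]
    moment = datetime.strptime(timestamp, "%H:%M:%S")
    return moment.minute in cron_minutes(minute_field)

CRON_LINE = "0,15,30,45 * * * * root /etc/rc.filter_configure_sync"

# 21:45:12 is the time of the CARP notifications reported earlier.
print(matches_schedule("21:45:12", CRON_LINE))
```

The CARP notifications at 21:45:12 fall on one of the scheduled minutes, which is consistent with the filter reload being the trigger.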
Do you have a lot of rules? The System Patches package has a patch for https://redmine.pfsense.org/issues/12827.
Yes, we have a lot of rules. I will take a look at this link now.
-
@steveits said in CARP switch Master/Backup every 15 minutes:
Do you have a lot of rules? The System Patches package has a patch for https://redmine.pfsense.org/issues/12827.
I am quite a noob... I can't find how to install the patch...
-
@marcolefo OK, RTFM: https://docs.netgate.com/pfsense/en/latest/development/system-patches.html
I need to sleep, sorry for the flood.
-
@steveits thanks a lot!
The patch is working. No more lost pings, no more Zoom freezes, happiness.
I have another problem: no more sync between MASTER and BACKUP. I will start another thread ;).
Thanks again.
-
OK, everything is fine now.
The sync problem was a bad rule on the pfsync interface.
Thanks again for your help and have a nice weekend.