BACKUP/MASTER/BACKUP flapping when a change is saved on MASTER.
-
I'm seeing fast flapping between BACKUP/MASTER/BACKUP whenever a configuration change is saved on MASTER.
Xeon(R) 4316 CPU @ 2.30GHz
64 GB RAM
bge0 1 Gbps Uplink to router (4 CARPs)
mce0 100 Gbps Mellanox with 42 VLANs, each with 1 CARP.
2.6.0-RELEASE (amd64)
built on Mon Jan 31 19:57:53 UTC 2022
FreeBSD 12.3-STABLE
I captured traffic on MASTER and BACKUP, on bge0 and on one VLAN (mce0.2032). For some reason, MASTER stops sending advertisements for 5 seconds on bge0 and for 4 seconds on mce0.2032 (and possibly on the other 41 VLANs). As advbase is 1, BACKUP becomes MASTER for 1 second and then goes back to BACKUP.
Why does MASTER stop sending advertisements for a few seconds?
I looked for a cause and found this bug (https://reviews.freebsd.org/D18882), but I think the fix was already merged into pfSense (https://github.com/pfsense/FreeBSD-src/commits/devel-main/sys/netpfil/pf/if_pfsync.c).
"pfctl -sr | wc -l" reports 1800 lines (rules). This firewall runs squid, DNS forwarder and DHCP relay. MBUF usage is 12%, memory usage 5%, CPU usage 2%, load average 1.5.
Isn't the hardware enough to process everything, do the sync, etc.? Or is there something in the process that makes MASTER stop sending advertisements for a few seconds?
Below are all the dumps and comments.
MASTER:
- At 16:36:00 I disabled a simple rule on interface mce0.2032.
- System log shows just:
Jan 11 16:36:00 check_reload_status 444 Reloading filter.
bge0:
16:36:02.516572 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=203 advbase=1 advskew=0 authlen=7 counter=1140012247149333877
16:36:02.516577 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=201 advbase=1 advskew=0 authlen=7 counter=4824419756534567893
16:36:02.516580 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=204 advbase=1 advskew=0 authlen=7 counter=214875457368310347
16:36:02.516583 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=224 advbase=1 advskew=0 authlen=7 counter=3915711932258403305
16:36:07.317758 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=203 advbase=1 advskew=0 authlen=7 counter=1140012247149333877
16:36:07.317761 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=201 advbase=1 advskew=0 authlen=7 counter=4824419756534567893
16:36:07.317764 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=204 advbase=1 advskew=0 authlen=7 counter=214875457368310347
16:36:07.317767 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=224 advbase=1 advskew=0 authlen=7 counter=3915711932258403305
4 CARPs (vhid 203, 201, 204 and 224). Advertisements stop at second :02 and return at second :07.
mce0.2032:
16:36:01.516511 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0 authlen=7 counter=9385040083016279888
16:36:02.517589 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0 authlen=7 counter=9385040083016279888
16:36:06.316579 IP 10.0.0.253 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=100 authlen=7 counter=5756808289631439201
16:36:06.316889 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0 authlen=7 counter=9385040083016279888
16:36:07.317648 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0 authlen=7 counter=9385040083016279888
16:36:08.318762 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0 authlen=7 counter=9385040083016279888
Advertisements stop at second :02 and return at second :06. At second 16:36:06.316579 BACKUP sent one advertisement (IP .253 and advskew=100).
BACKUP:
System log shows:
Jan 11 16:36:05 check_reload_status 444 Carp master event
Jan 11 16:36:05 kernel carp: [vhids]@[interfaces]: BACKUP -> MASTER (master timed out) (FOR EACH INTERFACE)
Jan 11 16:36:05 check_reload_status 444 Carp master event (LOT OF MESSAGES)
Jan 11 16:36:06 check_reload_status 444 Carp master event
Jan 11 16:36:06 kernel carp: [vhids]@[interfaces]: MASTER -> BACKUP (more frequent advertisement received) (FOR EACH INTERFACE)
Jan 11 16:36:06 check_reload_status 444 Carp backup event (LOT OF MESSAGES)
Jan 11 16:36:07 php 23070 notify_monitor.php: Message sent to [email@address] OK
E-mail with:
HA cluster member [each interface and address] has resumed CARP state "MASTER" for [each vhid]
HA cluster member [each interface and address] has resumed CARP state "BACKUP" for [each vhid]
Just a single BACKUP -> MASTER -> BACKUP flap, at 16:36:06.
bge0:
16:36:02.516819 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=203 advbase=1 advskew=0 authlen=7 counter=1140012247149333877
16:36:02.516824 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=201 advbase=1 advskew=0 authlen=7 counter=4824419756534567893
16:36:02.516827 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=204 advbase=1 advskew=0 authlen=7 counter=214875457368310347
16:36:02.516830 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=224 advbase=1 advskew=0 authlen=7 counter=3915711932258403305
16:36:05.936560 IP ___.242 > 224.0.0.18: CARPv2-advertise 36: vhid=224 advbase=1 advskew=100 authlen=7 counter=3160232997241480219
16:36:05.936602 IP ___.242 > 224.0.0.18: CARPv2-advertise 36: vhid=204 advbase=1 advskew=100 authlen=7 counter=4860494927553491288
16:36:05.937644 IP ___.242 > 224.0.0.18: CARPv2-advertise 36: vhid=201 advbase=1 advskew=100 authlen=7 counter=1454516894218236533
16:36:05.938688 IP ___.242 > 224.0.0.18: CARPv2-advertise 36: vhid=203 advbase=1 advskew=100 authlen=7 counter=1212293663596441848
16:36:07.318178 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=203 advbase=1 advskew=0 authlen=7 counter=1140012247149333877
16:36:07.318216 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=201 advbase=1 advskew=0 authlen=7 counter=4824419756534567893
16:36:07.318229 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=204 advbase=1 advskew=0 authlen=7 counter=214875457368310347
16:36:07.318242 IP ___.241 > 224.0.0.18: CARPv2-advertise 36: vhid=224 advbase=1 advskew=0 authlen=7 counter=3915711932258403305
4 CARPs (vhid 203, 201, 204 and 224). Advertisements stop at second :02 and return at second :07. At second :05 BACKUP sent advertisements (IPs .242 and advskew=100).
mce0.2032:
16:36:01.516730 IP 10.___.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0 authlen=7 counter=9385040083016279888
16:36:02.517809 IP 10.___.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0 authlen=7 counter=9385040083016279888
16:36:05.951322 IP 10.___.253 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=100 authlen=7 counter=5756808289631439201
16:36:06.317115 IP 10.___.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0 authlen=7 counter=9385040083016279888
16:36:07.317878 IP 10.___.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0 authlen=7 counter=9385040083016279888
Advertisements stop at second :02 and return at second :06. At second 16:36:05.951322 BACKUP sent one advertisement (IP .253 and advskew=100).
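The gaps above can also be measured with a script instead of by eye. Below is a minimal Python sketch, not part of pfSense. It assumes tcpdump text output in exactly the format shown in this thread; the function name find_gaps and the 1.5 s threshold are illustrative choices. The sample data is the mce0.2032 capture from MASTER.

```python
import re
from datetime import datetime

# Matches tcpdump lines in the format shown in this thread, e.g.
# "16:36:02.517589 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0 ..."
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{6}) IP (?P<src>\S+) > 224\.0\.0\.18: "
    r"CARPv2-advertise \d+: vhid=(?P<vhid>\d+) "
)


def find_gaps(lines, threshold=1.5):
    """Return (vhid, src, gap_seconds) for consecutive advertisements from the
    same sender and vhid that are more than `threshold` seconds apart.
    Timestamps carry no date, so a capture crossing midnight is not handled."""
    last_seen = {}
    gaps = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not a CARP advertisement line
        ts = datetime.strptime(m.group("ts"), "%H:%M:%S.%f")
        key = (int(m.group("vhid")), m.group("src"))
        prev = last_seen.get(key)
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold:
                gaps.append((key[0], key[1], gap))
        last_seen[key] = ts
    return gaps


# Timestamps, addresses and fields copied from the MASTER mce0.2032 dump above
# (authlen and counter are omitted because the parser does not use them).
SAMPLE = """\
16:36:01.516511 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0
16:36:02.517589 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0
16:36:06.316579 IP 10.0.0.253 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=100
16:36:06.316889 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0
16:36:07.317648 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0
16:36:08.318762 IP 10.0.0.252 > 224.0.0.18: CARPv2-advertise 36: vhid=32 advbase=1 advskew=0
"""

for vhid, src, gap in find_gaps(SAMPLE.splitlines()):
    print(f"vhid={vhid} src={src} gap={gap:.3f}s")
```

On this sample it reports a single gap of about 3.8 s for vhid 32 from 10.0.0.252, which matches the "about 4 seconds" of silence on mce0.2032.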
-
I figured out that CARP uses the formula 3 * (advbase + (advskew / 256)) to detect that MASTER is down. This is documented in the OpenBSD ifconfig man page:
Taken together, the advbase and advskew indicate how frequently, in seconds, the host will advertise the fact that it considers itself master of the virtual host. The formula is advbase + (advskew / 256). If the master does not advertise within three times this interval, this host will begin advertising as master.
So, since my dumps show that MASTER stops advertising for about 5 seconds, I changed advbase to 2. Now BACKUP becomes MASTER only after 6 seconds without seeing advertisements from MASTER.
Although the firewall has good hardware, it takes a few seconds to apply new configurations. I think 6 s for failover is fine here. It seems that advbase = 1 is not always feasible.
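As a quick check of the arithmetic, here is a minimal Python sketch of the formula exactly as quoted from the OpenBSD man page. The function name master_down_timeout is hypothetical, and the FreeBSD kernel underneath pfSense may compute its timer somewhat differently, so treat the numbers as approximate.

```python
def master_down_timeout(advbase, advskew):
    """Seconds without advertisements before a host starts advertising as
    master, using the formula quoted above: 3 * (advbase + advskew / 256)."""
    return 3 * (advbase + advskew / 256)


# advskew=0 is the primary's value in the dumps, advskew=100 the secondary's.
for advbase in (1, 2):
    for advskew in (0, 100):
        timeout = master_down_timeout(advbase, advskew)
        print(f"advbase={advbase} advskew={advskew}: {timeout:.2f} s")
```

With advskew ignored this gives 3 s for advbase=1 and 6 s for advbase=2. With the secondary's advskew of 100 the quoted formula gives roughly 4.2 s and 7.2 s, so either way advbase=2 is longer than the roughly 5 seconds of silence seen on bge0.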
Thanks!
:)
-
Dear @correajl
Thank you for sharing. We hit a similar issue after a recent update from 2.7.0 to 2.7.2, and your workaround seems to work.
How did you manage to keep the advbase setting on the backup permanent?
For me it constantly gets overridden by the configuration sync.
Best regards,
Alex
-
Hi Alex!
First, in the synchronization (HA) options you need to check Virtual IPs.
You can see this in the Configuring CARP docs.
Then you need to configure the CARP address only on the primary node and it will be automatically copied to the secondary.
In these docs, under "Advertising Frequency" -> "Skew", we can see the instruction "A primary node is typically set to 0 or 1, secondary nodes will be 100 or higher. This adjustment is handled automatically by XMLRPC synchronization".
So, I think you only need to enable Virtual IPs synchronization in the HA settings and then configure the virtual IPs.
PS: related to this thread, there is a note in the docs: "If CARP appears to be too sensitive to latency on a given network, adjusting the Base by adding one second at a time is recommended until stability is achieved."
-
@correajl thank you for the reply.
I thought that you had found a way to set different advbase values on both nodes.
Anyway, I found my issue, and it was not the same as yours: as I am not very familiar with Netgear switches, I missed that storm-control was enabled for multicast.
Storm-control turned out to be the root cause of the issue.