MLAG switch reboot freaks out LACP & CARP
-
I have an HA FW pair connected via LACP LAG in fast mode to a pair of EdgeCore switches in an MCLAG/MLAG configuration.
If I pull FWA cable ix0 or ix1, LACP sees the link down and CARP notices and logs some info, but FWB does not become CARP master because the port channel does not go down. This is the behavior I expect.
If I cold boot switch ECA (simulating a hardware or power failure), one port goes down right away and is logged by pfSense; ~15 seconds later the other port stops distributing, the lagg goes down, and FWB becomes master for all the CARP VIPs even though ix1 never goes down.
The lagg comes back up without issue and is stable until ECA is rebooted. It's also odd that I can reboot ECB and FWA is not impacted, other than seeing ix0 transition down/up.
I have another piece of gear (an HA storage controller) on another port channel which does not react and fail over; I'm still trying to figure out if things recover before it would initiate a failover. I suspect pfSense has an issue, but I'm not fluent at this level and not sure of that. I did a tcpdump on ix1 and loaded it into Wireshark, but I can't make heads or tails of it.
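For anyone trying the same capture, filtering down to just the LACP control traffic keeps the trace readable (the interface and file name here are just examples):

# capture only slow-protocols frames (LACP/marker, ethertype 0x8809) on one lagg member
tcpdump -i ix1 -w /tmp/lacp-ix1.pcap ether proto 0x8809

In Wireshark the same frames can be isolated with the display filter eth.type == 0x8809, which should make any gap in LACPDUs around the switch reboot easy to spot.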
FWA-----------FWB
 |   \       /  |
 |  (ix1)   /   |
 |     \   /    |
(ix0)   \ /     |
 |       X      |
 |      / \     |
 |     /   \    |
ECA-----------ECB

Trying to determine if there is a switch issue or something to tune on the firewalls. Here is some debug output from when I cold boot switch ECA. Right away FWA sees something happening on the port channel; ~15 secs later the second interface reacts, the lagg goes down, and the CARP master moves to FWB.
Latest release on NetGate/Supermicro.
Super Micro 1541
Version 23.05.1-RELEASE (amd64)
built on Wed Jun 28 03:57:27 UTC 2023
FreeBSD 14.0-CURRENT

Does anyone know if it's pfSense and, if so, how to fix or debug this?
I was really hoping to keep LACP for the extra bandwidth as opposed to going with Active/Passive Failover mode.
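For reference, the per-member LACP state can also be watched from the firewall shell while the switch reboots (lagg0 is just the lagg name on my box):

# list the lagg members and their LACP flags (ACTIVE / COLLECTING / DISTRIBUTING)
ifconfig -v lagg0

That should line up with the "stopped DISTRIBUTING" messages in the kernel log below.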
Jul 25 12:55:25 FWA kernel: ix0: Interface stopped DISTRIBUTING, possible flapping
Jul 25 12:55:40 FWA kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
Jul 25 12:55:40 FWA kernel: lagg0: link state changed to DOWN
Jul 25 12:55:40 FWA kernel: carp: 24@lagg0.51: MASTER -> INIT (hardware interface down)
Jul 25 12:55:40 FWA kernel: carp: demoted by 240 to 240 (interface down)
Jul 25 12:55:40 FWA kernel: carp: 25@lagg0.51: MASTER -> INIT (hardware interface down)
Jul 25 12:55:40 FWA kernel: carp: demoted by 240 to 480 (interface down)
Jul 25 12:55:40 FWA kernel: lagg0.51: link state changed to DOWN
Jul 25 12:55:40 FWA kernel: carp: 71@lagg0.71: MASTER -> INIT (hardware interface down)
Jul 25 12:55:40 FWA kernel: carp: demoted by 240 to 720 (interface down)
Jul 25 12:55:40 FWA kernel: lagg0.71: link state changed to DOWN
Jul 25 12:55:40 FWA kernel: carp: 97@lagg0.11: MASTER -> INIT (hardware interface down)
Jul 25 12:55:40 FWA kernel: carp: demoted by 240 to 960 (interface down)
Jul 25 12:55:40 FWA kernel: lagg0.11: link state changed to DOWN
Jul 25 12:55:40 FWA kernel: carp: 68@igb4: MASTER -> BACKUP (more frequent advertisement received)
Jul 25 12:55:40 FWA kernel: carp: 208@igb5: MASTER -> BACKUP (more frequent advertisement received)
Jul 25 12:55:40 FWA check_reload_status[831]: Linkup starting lagg0
Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
Jul 25 12:55:40 FWA check_reload_status[831]: Linkup starting lagg0.51
Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
Jul 25 12:55:40 FWA check_reload_status[831]: Linkup starting lagg0.71
Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
Jul 25 12:55:40 FWA check_reload_status[831]: Linkup starting lagg0.11
Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
Jul 25 12:55:41 FWA php-fpm[84838]: /rc.linkup: Hotplug event detected for D1(opt1) static IP address (4: a.b.c.2)
Jul 25 12:55:41 FWA check_reload_status[831]: Reloading filter
Jul 25 12:55:41 FWA check_reload_status[831]: Reloading filter
-
@jlacalamita I am experiencing this same issue now on the latest pfSense version. Did you ever get this resolved?
Firewall hardware:
Super Micro 1537
BIOS Vendor: American Megatrends Inc.
Version: 2.0c
Release Date: Thu Jun 27 2019

Version 24.03-RELEASE (amd64)
built on Mon May 13 8:17:00 EDT 2024
FreeBSD 15.0-CURRENT

Switch hardware:
Super Micro SSE-X3548S(R)
Firmware Version FW: 1.4.2.2
-
@jlacalamita I’m pretty certain this relates to how the remaining switch handles the loss of its MCLAG partner. Most likely it temporarily stops responding/sending the required LACP control frames for LACP Fast mode.
-
@jlacalamita Just off the top of my head, LACP fast operates at a 5 sec. interval, and three lost frames equals a lost partner.
That's 15 secs and fits the bill with your log entries.
So take a look at your switch firmware / release notes and see if there is a fix. Otherwise slow mode might be the solution.
EDIT: Looked it up. LACP fast is a control frame every 1 second, and the timeout is three lost frames. But perhaps your switch uses a 5 sec interval for its LACPDU frames?
-
@keyser LACP fast operates at 1 sec. intervals.
-
@pfsense555 Yeah, I probably edited that when you posted this.
Is the switch interval set to fast, and have you verified that it is also 1 sec?
Because the interval setting on pfSense actually makes no difference here.
The LACP timeout is three lost frames at the interval the other side is configured with and announcing. So if the switch is at 5 sec intervals, it fits the bill.
Actually, it likely still fits the bill either way, because you also have to include the time it takes for the remaining switch to recognize that the peer switch has failed and to time it out.
So that might be the first 3-10 seconds, and then the remaining switch tries to reconfigure its LACP announcements and fails. 3 secs later your pfSense determines it has received no LACP frames for 3 sec and downs the LAGG.
-
@pfsense555 The easy way to find out is to do a packet capture on pfSense and see what happens to the LACP control frames when you remove power from one switch.
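Something like this from the pfSense shell (interface name is just an example) should show whether the LACPDUs from the surviving switch actually stop, and what timeout the switch announces in its actor state:

# decode LACPDUs live; -vv prints the actor/partner state bits, including the short-timeout flag
tcpdump -e -vv -i ix1 ether proto 0x8809

If the PDUs from the surviving switch disappear for more than three intervals while it reconverges, that is your lagg timeout right there.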