Netgate Discussion Forum

    MLAG switch reboot freaks out LACP & CARP

    L2/Switching/VLANs
    7 Posts 3 Posters 516 Views
    • J
      jlacalamita

      I have an HA FW pair connected via LACP LAG in fast mode to a pair of EdgeCore switches in an MCLAG/MLAG configuration.
      If I pull the cable on FWA's ix0 or ix1, LACP sees the link go down and CARP notices and logs some info, but FWB does not become CARP master because the port channel itself does not go down. This is the behavior I expect.

      If I cold boot switch ECA (simulating a HW or power failure), one port goes down right away and is logged by pfSense. About 15 seconds later the other port stops distributing, the lagg goes down, and FWB becomes master for all the CARP VIPs, even though ix1 never goes down.

      Lagg comes up without issue and is stable until ECA is rebooted. Also odd that I can reboot ECB and FWA is not impacted other than seeing ix0 transitions down/up.

      I have another piece of gear (HA controller storage) on another port channel which does not react or fail over. I'm still trying to figure out whether things recover before it would initiate a failover. I suspect pfSense has an issue, but I'm not fluent at this level and not sure of it. I ran a tcpdump on ix1 and loaded it into Wireshark, but I can't make heads or tails of it.

      FWA-----------FWB
      | \         / |
      | (ix1)    /  |
      |    \    /   |
    (ix0)           |
      |    /    \   |
      |   /      \  |
      ECA-----------ECB

      I'm trying to determine whether this is a switch issue or something to tune on the firewalls. Below is some debug output from when I cold boot the ECA switch. Right away FWA sees something happening on the port channel. About 15 seconds later the second interface reacts, the lagg goes down, and CARP master moves to FWB.

      Latest release on NetGate/Supermicro.
      Super Micro 1541
      Version 23.05.1-RELEASE (amd64)
      built on Wed Jun 28 03:57:27 UTC 2023
      FreeBSD 14.0-CURRENT

      Does anyone know if it's pfSense, and if so, how to fix or debug this?

      I was really hoping to keep LACP for the extra bandwidth, as opposed to going with Active/Passive Failover mode.

      Jul 25 12:55:25 FWA kernel: ix0: Interface stopped DISTRIBUTING, possible flapping
      Jul 25 12:55:40 FWA kernel: ix1: Interface stopped DISTRIBUTING, possible flapping
      Jul 25 12:55:40 FWA kernel: lagg0: link state changed to DOWN
      Jul 25 12:55:40 FWA kernel: carp: 24@lagg0.51: MASTER -> INIT (hardware interface down)
      Jul 25 12:55:40 FWA kernel: carp: demoted by 240 to 240 (interface down)
      Jul 25 12:55:40 FWA kernel: carp: 25@lagg0.51: MASTER -> INIT (hardware interface down)
      Jul 25 12:55:40 FWA kernel: carp: demoted by 240 to 480 (interface down)
      Jul 25 12:55:40 FWA kernel: lagg0.51: link state changed to DOWN
      Jul 25 12:55:40 FWA kernel: carp: 71@lagg0.71: MASTER -> INIT (hardware interface down)
      Jul 25 12:55:40 FWA kernel: carp: demoted by 240 to 720 (interface down)
      Jul 25 12:55:40 FWA kernel: lagg0.71: link state changed to DOWN
      Jul 25 12:55:40 FWA kernel: carp: 97@lagg0.11: MASTER -> INIT (hardware interface down)
      Jul 25 12:55:40 FWA kernel: carp: demoted by 240 to 960 (interface down)
      Jul 25 12:55:40 FWA kernel: lagg0.11: link state changed to DOWN
      Jul 25 12:55:40 FWA kernel: carp: 68@igb4: MASTER -> BACKUP (more frequent advertisement received)
      Jul 25 12:55:40 FWA kernel: carp: 208@igb5: MASTER -> BACKUP (more frequent advertisement received)
      Jul 25 12:55:40 FWA check_reload_status[831]: Linkup starting lagg0
      Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
      Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
      Jul 25 12:55:40 FWA check_reload_status[831]: Linkup starting lagg0.51
      Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
      Jul 25 12:55:40 FWA check_reload_status[831]: Linkup starting lagg0.71
      Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
      Jul 25 12:55:40 FWA check_reload_status[831]: Linkup starting lagg0.11
      Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
      Jul 25 12:55:40 FWA check_reload_status[831]: Carp backup event
      Jul 25 12:55:41 FWA php-fpm[84838]: /rc.linkup: Hotplug event detected for D1(opt1) static IP address (4: a.b.c.2)
      Jul 25 12:55:41 FWA check_reload_status[831]: Reloading filter
      Jul 25 12:55:41 FWA check_reload_status[831]: Reloading filter
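
To put a number on the delay in the log above, here is a rough Python sketch that pulls the two "stopped DISTRIBUTING" kernel lines out and measures the gap between them. The helper names are made up for illustration, and it assumes both events fall on the same day, so only the HH:MM:SS field is parsed.

```python
from datetime import datetime

# Three kernel lines copied from the log above.
LOG_LINES = [
    "Jul 25 12:55:25 FWA kernel: ix0: Interface stopped DISTRIBUTING, possible flapping",
    "Jul 25 12:55:40 FWA kernel: ix1: Interface stopped DISTRIBUTING, possible flapping",
    "Jul 25 12:55:40 FWA kernel: lagg0: link state changed to DOWN",
]


def parse_time(line):
    """Parse only the HH:MM:SS field (third token); assumes same-day events."""
    return datetime.strptime(line.split()[2], "%H:%M:%S")


def distributing_stops(lines):
    """Map each port name to the time it stopped DISTRIBUTING."""
    events = {}
    for line in lines:
        if "stopped DISTRIBUTING" in line:
            # Text after "kernel:" looks like " ix0: Interface stopped ..."
            port = line.split("kernel:")[1].split(":")[0].strip()
            events[port] = parse_time(line)
    return events


events = distributing_stops(LOG_LINES)
gap_seconds = (events["ix1"] - events["ix0"]).total_seconds()
print(f"ix1 stopped distributing {gap_seconds:.0f} s after ix0")
```

Feeding it a longer slice of the system log should work the same way, as long as the lines keep the standard syslog prefix.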

      • P
        pfsense555 @jlacalamita

        @jlacalamita I am experiencing this same issue now on the latest pfSense version. Did you ever get this resolved?

        Firewall hardware:
        Super Micro 1537

        BIOS Vendor: American Megatrends Inc.
        Version: 2.0c
        Release Date: Thu Jun 27 2019

        Version 24.03-RELEASE (amd64)
        built on Mon May 13 8:17:00 EDT 2024
        FreeBSD 15.0-CURRENT

        Switch hardware:
        Super Micro SSE-X3548S(R)
        Firmware Version FW: 1.4.2.2

        • keyserK
          keyser Rebel Alliance @jlacalamita

          @jlacalamita I’m pretty certain this relates to how the remaining switch handles the loss of its MCLAG partner. Most likely it temporarily stops responding/sending the required LACP control frames for LACP Fast mode.

          Love the no fuss of using the official appliances :-)

          • keyserK
            keyser Rebel Alliance @jlacalamita
            last edited by keyser

            @jlacalamita Just off the top of my head, LACP fast operates at a 5 sec. interval, and three lost frames equals a lost partner.
            That's 15 secs, which fits the bill with your log entries.

            So take a look at your switch firmware release notes and see if there is a fix. Otherwise slow mode might be the solution.

            EDIT: Looked it up. LACP fast is a control frame every 1 second, and the timeout is three lost frames. But perhaps your switch uses a 5 sec. interval for its LACP PDU frames?
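
For reference, the timer arithmetic discussed here can be sketched in a few lines of Python. The constants are the standard IEEE 802.1AX values as I understand them (LACPDU every 1 s in fast mode, every 30 s in slow mode, partner timed out after three periods); the function names are hypothetical and the 15 s figure is simply the gap seen in the original poster's log.

```python
FAST_PERIODIC_S = 1    # LACPDU interval with the short (fast) timeout
SLOW_PERIODIC_S = 30   # LACPDU interval with the long (slow) timeout
MISSED_PDUS = 3        # partner is timed out after three missed intervals


def lacp_timeout(fast):
    """Seconds without LACPDUs before the partner is considered lost."""
    interval = FAST_PERIODIC_S if fast else SLOW_PERIODIC_S
    return MISSED_PDUS * interval


def implied_interval(observed_s):
    """PDU interval that would explain an observed delay as a pure timeout."""
    return observed_s / MISSED_PDUS


def unexplained_delay(observed_s, fast):
    """Part of an observed delay that the LACP timeout alone does not cover."""
    return observed_s - lacp_timeout(fast)


OBSERVED_S = 15  # gap between the ix0 and ix1 events in the log

print(lacp_timeout(fast=True))                   # 3
print(lacp_timeout(fast=False))                  # 90
print(implied_interval(OBSERVED_S))              # 5.0
print(unexplained_delay(OBSERVED_S, fast=True))  # 12
```

So a 15 s gap matches either a 5 s PDU interval, or a normal 3 s fast timeout preceded by roughly 12 s in which something else was going on. The numbers alone cannot tell the two apart.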

            • P
              pfsense555 @keyser

              @keyser LACP fast operates at 1 sec. intervals.

              • keyserK
                keyser Rebel Alliance @pfsense555

                @pfsense555 Yeah, I probably edited that when you posted this.

                Is the switch interval set to fast, and have you verified that it is also 1 sec.?
                Because the interval setting on pfSense actually makes no difference here.
                LACP timeout is three lost frames on the interval the other side is configured with and announcing. So if the switch is using 5 sec. intervals, it fits the bill.

                Actually - it likely still fits the bill, because you also have to include the time it takes for the remaining switch to recognize that the peer switch has failed and time it out.
                So that might be the first 3 - 10 seconds, and then the remaining switch tries to reconfigure its LACP announcements and fails. 3 secs later your pfSense determines it has received no LACP frames for 3 sec. and downs the LAGG.

                • keyserK
                  keyser Rebel Alliance @pfsense555

                  @pfsense555 The easy way to find out is to do a packet capture on pfSense and see what happens to the LACP control frames when you remove power from one switch.
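
When reading such a capture, the most telling field is usually the one-byte actor/partner state in each LACPDU. Below is a small Python sketch for decoding it. The bit layout follows IEEE 802.1AX as I understand it; the function name and the sample values are illustrative only, not taken from anyone's capture in this thread.

```python
# Bit positions of the LACP state byte, least significant bit first.
LACP_STATE_BITS = [
    (0x01, "Activity"),         # set = active LACP, clear = passive
    (0x02, "Timeout"),          # set = short (fast) timeout, clear = long
    (0x04, "Aggregation"),
    (0x08, "Synchronization"),
    (0x10, "Collecting"),
    (0x20, "Distributing"),
    (0x40, "Defaulted"),        # set = using default partner info (no PDUs seen)
    (0x80, "Expired"),          # set = receive state machine has timed out
]


def decode_lacp_state(state):
    """Return the names of the flags set in a LACP state byte."""
    return [name for bit, name in LACP_STATE_BITS if state & bit]


# Illustrative values: a healthy fast-mode port, then one that has
# dropped out of Collecting/Distributing.
print(decode_lacp_state(0x3F))
print(decode_lacp_state(0x0F))
```

If the switch's frames show the Timeout bit clear, or Synchronization dropping shortly after the peer switch loses power, that would point at the switch side rather than at pfSense.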

                  Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.