Netgate Discussion Forum

    SG-5100 WAN failover at gigabit saturation

    Official Netgate® Hardware
      ashlm

      I have an SG-5100 configured with a gigabit-down WAN on igb0 as Tier 1 and an 80Mb PPPoE link on igb1 as Tier 2. When the igb0 connection is set as the gateway, the connection is stable through to saturation (120MB/s down) for prolonged periods (>5 mins), with RTTsd remaining below 400ms. When I set the gateway to the failover gateway group and saturate the connection, the RTTsd ramps up to over 1000ms over approximately 30s, then the Tier 1 interface drops due to latency and failover occurs.

      Any ideas on how to resolve?

        SteveITS Galactic Empire @ashlm

        @ashlm We had a similar issue once with a client.
        I had a thread about it here a couple years ago give or take, which basically consisted of, "wow 1000ms is bad you should have the ISP fix that." Of course it was transient. Unfortunately at the time the fail-back didn't work (we had to set up a cron job), and the second ISP had problems so it was problematic.

        You can adjust the thresholds in System/Routing/Gateways/(edit gateway) under Advanced. Or if you can find the source of the traffic create a limiter, or enable traffic shaping to deprioritize it.

        Also in the Gateway Group there is a setting "Trigger Level." Without digging up the thread, as I recall we had some trouble tuning that and the latency settings to work as expected per the docs, so you may have to experiment a bit.

        Pre-2.7.2/23.09: Only install packages for your version, or risk breaking it. Select your branch in System/Update/Update Settings.
        When upgrading, allow 10-15 minutes to restart, or more depending on packages and device speed.
        Upvote 👍 helpful posts!

          ashlm @SteveITS

          @steveits Thanks for your reply. Just seems odd that the gateway stays live and has no latency issues when it is the only gateway but once failover is introduced it starts misbehaving.

          Will look into shaping / limiting, but it's definitely a band-aid and not a solution.

            SteveITS Galactic Empire @ashlm

            @ashlm The latency triggers the failover. Raising the latency threshold to, say, 1500ms would keep the failover from triggering, as would increasing the "Time Period" on the gateway, which makes it average over a longer time. That's of course not ideal if it is always that slow, but that's what we found to avoid the 30-second-busy failovers.
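To make the threshold vs. "Time Period" trade-off concrete, here is a rough sketch in Python. It is not how pfSense's gateway monitoring is implemented; it only assumes an alarm fires when the mean of the most recent probes exceeds a threshold. The function name `first_alarm`, the one-probe-per-second rate, and the trace values are all made up for illustration.

```python
from collections import deque


def first_alarm(samples_ms, window, threshold_ms):
    """Return the index of the first probe at which the rolling mean of the
    last `window` samples exceeds `threshold_ms`, or None if it never does."""
    recent = deque(maxlen=window)
    total = 0.0
    for i, rtt in enumerate(samples_ms):
        if len(recent) == window:
            total -= recent[0]  # this sample is about to be evicted
        recent.append(rtt)
        total += rtt
        if total / len(recent) > threshold_ms:
            return i
    return None


# Hypothetical trace, one probe per second:
# 60 s idle at 20 ms, a 30 s saturation burst at 1200 ms, then idle again.
trace = [20.0] * 60 + [1200.0] * 30 + [20.0] * 60

short_window = first_alarm(trace, window=10, threshold_ms=500)    # alarms during the burst
long_window = first_alarm(trace, window=120, threshold_ms=500)    # burst is averaged away
high_threshold = first_alarm(trace, window=10, threshold_ms=1500) # never reached
print(short_window, long_window, high_threshold)
```

In this toy model the short window alarms a few seconds into the burst, while either a longer averaging window or a higher threshold rides it out. The cost is the same in both cases: a link that really is slow takes longer to be detected, or is not detected at all.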

            And yes limiter/shaping is in some ways a band aid but avoids the latency. IOW it's not really a pfSense problem, the problem is the device is flooding the connection, so pfSense is doing what it's been told and failing over when latency spikes.

            In our case it was a client and we aren't on site so it took a long time to catch it while it was happening and track it down to a Mac, by MAC address. We think it was doing a backup or maybe a long video upload, never quite figured that out as we didn't get a great answer from the person. (which is why I think it was a backup)

              ashlm @SteveITS

              @steveits Thanks again for the reply, it's very helpful.

              @steveits said in SG-5100 WAN failover at gigabit saturation:

              The latency triggers the failover.

              Yes, but latency on the gigabit interface reaches the failover threshold (>1s) only when failover is enabled. RTTsd remains below 400ms, well below the failover threshold, when the gigabit interface is set as the solitary gateway, and the gigabit interface remains up for the entire test.

              RTTsd only exceeds 1000ms when failover is enabled.

                SteveITS Galactic Empire @ashlm

                @ashlm Oh, I get what you're saying now! I hadn't noticed that but wasn't looking for it. That would explain why we only saw it at that client. We thought it was the Mac because that's the only device we ever saw "cause" the problem, on several occasions.

                Since it sounds like you can reproduce it, I suggest opening a case at redmine.pfsense.org and linking to this thread.

                  ashlm @SteveITS

                  @steveits Thanks, have done so: "Enabling gateway failover introduces latency increase and causes artificial failover scenario"

                    SteveITS Galactic Empire @ashlm

                    @ashlm Is that the right URL? It talks about traffic shaping, and is from 8 days ago. :)

                      stephenw10 Netgate Administrator

                      You're testing that in 21.02?

                      Can you upgrade to 22.01 and see if it's still happening?

                      Steve

                        ashlm @stephenw10

                        @stephenw10 Apologies, that's a mistake. "22.01-RELEASE (amd64)
                        built on Mon Feb 07 16:37:59 UTC 2022."

                          stephenw10 Netgate Administrator

                          Ah, Ok. Do you know if this is new behaviour in 22.01?

                            ashlm @stephenw10

                            @stephenw10 The same failover scenario manifested in 21.02 on the SG-3100, though that device couldn't achieve gigabit down on the WAN interface and was replaced with the SG-5100 without further testing with a solitary gateway. I updated the SG-5100 to the latest release before deployment, so I can't say for certain whether it would happen on 21.02 on the SG-5100.

                              ashlm @ashlm

                              @ashlm The issue is resolved, or rather it is not an issue: my original description was not accurate. The same latency increase to >1s was recorded while testing the solitary gateway config this morning, so it is not confined or attributable to enabling failover.

                                stephenw10 Netgate Administrator

                                Ah, Ok thanks for the update. I couldn't replicate it here.

                                Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.