SG-5100 WAN failover at gigabit saturation
-
I've an SG-5100 configured with gigabit down WAN via igb0 as Tier 1 and an 80Mb PPPoE on igb1 as Tier 2. When the igb0 connection is set as the gateway, the connection is stable through to saturation (120MB/s down) for prolonged periods (>5 mins), with RTTsd remaining below 400ms. When I set the gateway to the failover gateway group and saturate the connection, the RTTsd ramps up to over 1000ms over approximately 30s, then the Tier 1 interface drops due to latency and failover occurs.
Any ideas on how to resolve?
-
@ashlm We had a similar issue once with a client.
I had a thread about it here a couple of years ago, give or take, which basically consisted of, "wow 1000ms is bad you should have the ISP fix that." Of course it was transient. Unfortunately at the time the fail-back didn't work (we had to set up a cron job), and the second ISP had problems of its own, so it was problematic. You can adjust the thresholds in System/Routing/Gateways/(edit gateway) under Advanced. Or, if you can find the source of the traffic, create a limiter, or enable traffic shaping to deprioritize it.
Also in the Gateway Group there is a setting "Trigger Level." Without digging up the thread, as I recall we had some trouble tuning that and the latency settings to work as expected per the docs, so you may have to experiment a bit.
-
@steveits Thanks for your reply. Just seems odd that the gateway stays live and has no latency issues when it is the only gateway but once failover is introduced it starts misbehaving.
Will look into shaping / limiting, but it's definitely a band-aid and not a solution.
-
@ashlm The latency triggers the failover. Raising the latency threshold to, say, 1500ms would keep it from triggering the failover, as would increasing the "Time Period" on the gateway, which makes it average over a longer time. That's of course not ideal if it is always that slow, but that's what we found to avoid the 30-second-busy failovers.
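As a rough illustration of why either knob helps, here is a minimal Python sketch. It assumes the gateway monitor behaves like a plain moving average over the last N probes compared against a threshold. That is a simplification for illustration, not pfSense's actual monitoring code, and the function name, window sizes, and RTT samples are all made up.

```python
from collections import deque


def first_alarm(samples_ms, window, threshold_ms):
    """Return the index of the first probe at which the moving average of
    the last `window` RTT samples exceeds threshold_ms, or None if never."""
    recent = deque(maxlen=window)  # oldest samples fall off automatically
    for i, rtt in enumerate(samples_ms):
        recent.append(rtt)
        if sum(recent) / len(recent) > threshold_ms:
            return i
    return None


# Hypothetical probes, one per second: 60 s idle at 20 ms,
# a 30 s burst at 1200 ms, then 60 s idle again.
samples = [20] * 60 + [1200] * 30 + [20] * 60

print(first_alarm(samples, window=60, threshold_ms=500))   # 84 (25 s into the burst)
print(first_alarm(samples, window=60, threshold_ms=1500))  # None: threshold above the burst
print(first_alarm(samples, window=300, threshold_ms=500))  # None: burst averaged away
```

With these made-up numbers, a 30-second burst trips a short window with a low threshold, but not a higher threshold or a longer averaging window. The trade-off is the one already noted: a genuinely dead or slow link also takes longer to be detected.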
And yes limiter/shaping is in some ways a band aid but avoids the latency. IOW it's not really a pfSense problem, the problem is the device is flooding the connection, so pfSense is doing what it's been told and failing over when latency spikes.
In our case it was a client and we aren't on site so it took a long time to catch it while it was happening and track it down to a Mac, by MAC address. We think it was doing a backup or maybe a long video upload, never quite figured that out as we didn't get a great answer from the person. (which is why I think it was a backup)
-
@steveits Thanks again for the reply, it's very helpful.
@steveits said in SG-5100 WAN failover at gigabit saturation:
The latency triggers the failover.
Yes, but latency on the gigabit interface reaches the failover threshold (>1s) only when failover is enabled. RTTsd remains below 400ms, well below the failover threshold, when the gigabit interface is set as the solitary gateway, and the gigabit interface remains up for the entire test.
RTTsd only exceeds 1000ms when failover is enabled.
-
@ashlm Oh, I get what you're saying now! I hadn't noticed that but wasn't looking for it. That would explain why we only saw it at that client. We thought it was the Mac because that's the only device we ever saw "cause" the problem, on several occasions.
Since it sounds like you can reproduce it, I suggest opening a case at redmine.pfsense.org and linking to this thread.
-
-
@ashlm Is that the right URL? It talks about traffic shaping, and is from 8 days ago. :)
-
You're testing that in 21.02?
Can you upgrade to 22.01 and see if it's still happening?
Steve
-
@stephenw10 Apologies, that's a mistake. "22.01-RELEASE (amd64)
built on Mon Feb 07 16:37:59 UTC 2022."
-
Ah, Ok. Do you know if this is new behaviour in 22.01?
-
@stephenw10 The same failover scenario manifested in 21.02 on the SG-3100, though that device couldn't achieve gigabit down on the WAN interface and was replaced with the SG-5100 without further testing with a solitary gateway. I updated the SG-5100 to the latest release before deployment, so can't say for certain if it would happen on 21.02 on the SG-5100.
-
@ashlm The issue is resolved, or rather is not an issue / not an accurate description. The same latency increase to >1s was recorded while testing the solitary gateway config this morning, so it is no longer confined / attributable to enabling failover.
-
Ah, Ok thanks for the update. I couldn't replicate it here.