Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN
-
I tried disabling the DSL interface (WAN01 - Tier 2) and wouldn't you know, the Starlink interface (WAN02 - Tier 1) starts to work without issue. Re-enable the DSL interface (WAN01 - Tier 2) and within an hour I am seeing the same issue, with packet loss shooting up to 100% on the Starlink connection.
The DSL connection has a static IP address, but for years now I have just left the interface IPv4 Configuration Type as "DHCP" without issue. As a quick test I switched it over to "Static IPv4" along with its assigned IP address. It has been hours now with both the DSL and Starlink interfaces active and no issues. Everything is running like it was a couple of weeks back. Will continue to monitor for the rest of the day.
For now, while I monitor I need to sit here and think about why this appears to be the solution for me, and why it is only a recent problem.
@jimeez or @preston do either of you have a static IP for your respective DSL connections?
-
@knoppolis said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
@jimeez or @preston do either of you have a static IP for your respective DSL connections?
My CenturyLink DSL connection is not static. This is a good data point though, thanks. Let us know how it does.
-
@knoppolis said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
@jimeez or @preston do either of you have a static IP for your respective DSL connections?
Also a no here.
I am very curious to see if this holds up for you. Although, if it does, my and @preston's issue will be an even bigger mystery.
-
So @preston mentioned something to me in a private chat that got my wheels turning. He brought up the fact that, prior to this issue, his StarLink connection would drop out around 4AM most days and then come right back up. Mine did this too. Like clockwork. I always thought the reason was that the StarLink unit was receiving an update and restarting or something. But now I'm wondering if that 24-hour cycle is somehow related to this problem. Only now, instead of every 24 hours, it's happening every 15 minutes.
I went back and checked my notification logs. This 24-hour dropout was very consistent. Then on August 24th the 15-minute dropout started happening.
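For anyone who wants to run the same check on their own notification logs, here is a rough Python sketch that measures the gaps between dropout alarms and tests whether they follow a fixed period. The timestamps are hypothetical placeholders, not taken from my logs, and the function names and the 5-second tolerance are just assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List


def gaps_between(events: List[datetime]) -> List[timedelta]:
    """Return the time gaps between consecutive events, oldest first."""
    ordered = sorted(events)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def is_periodic(events: List[datetime], period: timedelta,
                tolerance: timedelta = timedelta(seconds=5)) -> bool:
    """True if every gap between events is within `tolerance` of `period`."""
    gaps = gaps_between(events)
    return bool(gaps) and all(abs(gap - period) <= tolerance for gap in gaps)


# Hypothetical alarm times: five dropouts spaced exactly 15 minutes apart.
alarms = [datetime(2000, 1, 1, 4, 0, 0) + i * timedelta(minutes=15)
          for i in range(5)]

print(is_periodic(alarms, timedelta(minutes=15)))
print(is_periodic(alarms, timedelta(hours=24)))
```

Feeding it the real alarm times from the notification log would show whether the dropouts are truly on a fixed timer or only roughly so.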
-
So I lobbed a support ticket to StarLink. Referenced this thread. Their response as follows:
Would you be able to confirm how you currently have your health checks set up for a failover to occur? The typical recommendation we provide our enterprise customers is to relax health checks (i.e. pings, etc.) to deal with occasional connection drops from Starlink. Checking every 10 seconds & getting 5 fails in a row would be a good threshold to start with.
Would anyone be able to tell me where to go look in pfsense to find the answer to their question?
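To make their suggestion concrete, here is a minimal Python sketch of a "5 fails in a row" health check. It only illustrates the rule as support described it. As far as I can tell, the pfSense gateway monitor is configured with intervals and loss percentages rather than a consecutive-fail count, so this is not how pfSense itself decides. The probe results are hypothetical.

```python
from typing import Iterable, Optional


def first_failover_probe(results: Iterable[bool],
                         fails_required: int = 5) -> Optional[int]:
    """Return the 1-based probe number at which failover would trigger.

    `results` holds one entry per probe: True = reply received, False = lost.
    Returns None if the run of consecutive failures never gets long enough.
    """
    consecutive = 0
    for number, ok in enumerate(results, start=1):
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= fails_required:
            return number
    return None


PROBE_INTERVAL_S = 10  # the interval Starlink support suggested

# Hypothetical probe histories, not real measurements.
brief_drop = [True, False, False, True, True, True]
real_outage = [True, False, False, False, False, False, True]

print(first_failover_probe(brief_drop))   # None: two lost probes are ignored
print(first_failover_probe(real_outage))  # 6: fifth consecutive loss
print(5 * PROBE_INTERVAL_S)               # 50 seconds of loss before failover
```

With those numbers, a drop has to last about 50 seconds before a failover is triggered, which is why the rule tolerates short blips.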
-
Assuming they are talking about System>Routing>Gateways and making edits to the details for your Starlink gateway.
Here is what I currently have for my Starlink gateway:
I have been playing around with the "Packet Loss Thresholds" to keep the failover from happening with a low of 30 and a high of 60. I also played around with the other intervals but it really made no difference.
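As a rough illustration of what those two numbers mean, here is a small Python sketch that turns probe counts into a loss percentage and compares it against a low and a high threshold. The status names and the strict "greater than" comparison are my assumptions, not confirmed pfSense behavior, and the probe counts are made up.

```python
def loss_percent(sent: int, received: int) -> float:
    """Percentage of probes that got no reply."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


def classify(loss: float, low: float = 30.0, high: float = 60.0) -> str:
    """Map a loss percentage to a status using low/high thresholds.

    Assumption: above `high` the gateway is treated as down, above `low`
    it is only flagged with a warning.
    """
    if loss > high:
        return "down"
    if loss > low:
        return "warning"
    return "online"


# Hypothetical probe counts over one averaging window.
print(classify(loss_percent(20, 15)))  # 25% loss
print(classify(loss_percent(20, 10)))  # 50% loss
print(classify(loss_percent(20, 0)))   # 100% loss
```

With a low of 30 and a high of 60, anything up to 30% loss is left alone, which is more forgiving than tight thresholds but does nothing when loss goes straight to 100%.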
I used this reddit post as a starting point/reference for these adjustments.
https://www.reddit.com/r/PFSENSE/comments/1eg0wpk/starlink_monitoring_in_pfsense/
-
Update on the status/health of my set up. Everything has been running fine for the last 9 hours.
Here is what the packet loss situation looked like for the last 48 hours. No issues after setting a static IP for my DSL connection.
-
@knoppolis said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Update on the status/health of my set up. Everything has been running fine for the last 9 hours.
Here is what the packet loss situation looked like for the last 48 hours. No issues after setting a static IP for my DSL connection.
Interesting. So how does one set a static IP for their DSL connection? Doesn't the provider set that?
-
@jimeez my DSL service came with a static IP address. In pfSense, go to the interface in question and change the IPv4 Configuration Type to Static IPv4; below that you will then be able to set the IPv4 Address to that static address.
-
@knoppolis said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
@jimeez my DSL service came with a static IP address...
I guess that's what I'm asking. You can't just go and assign a static IP to an interface if it's not set up that way with your service provider, can you?
-
@jimeez ahh, sorry. No, you can't just give yourself whatever IP address you want, but you could try entering whatever IP you have currently been granted by your provider to test whether you get the same results. Honestly not sure if that would work. The big catch is that as soon as the assigned IP changed, your DSL connection would go down until you went back to the DHCP setting for the interface in pfSense.
-
Surprisingly enough, I got a (for now) positive response from StarLink. They are telling me they are going to look into this. Their 1st level support staff asked me some questions which I answered. I then got a reply thanking me for the input and saying that they would dig into it. I was NOT expecting that response. Will see what happens.
-
That's great news. I hope they get back to you with something.
When I contacted them (in the beginning of all of this) they thought it might be my original gen 1 circular dish causing the problems. They sent me a new gen 3 dish and router....but same results.
-
Turns out their response was one that was already bounced around here and on Reddit.
While I do not have any exact guidance for how to configure this specific router, the probe interval does seem very strict, as it is set to check every 500 milliseconds (0.5 seconds) compared to the general recommendation of checking every 10 seconds (10000 milliseconds). The frequent failovers may be improved if you attempt relaxing these health checks to deal with the occasional drops in service due to utilizing a satellite internet service.
I suspected this would not work but did it anyway so I could report back to them with factual info. And unfortunately, it did not fix it. Every 15 minutes, like clockwork, to the second, the StarLink interface fails due to high packet loss and eventually is perceived to be offline...even though it is not. After a bit it comes back up. Then it fails again exactly 15 minutes later. Turn off the second interface and everything works fine. Weirdest, most frustrating thing.
Couple more questions for you regarding your config. There has to be something here that will eventually lead to an answer.
- Do you use pfBlocker?
- You already confirmed that you don't use NUT, but have you noticed any other services that fail when you activate the DSL interface like NUT does for me?
- Assuming you have some port forwarding configured, what do you use for the Dest. Address? Individual interfaces? Any? Perhaps something else?
The NUT service failing really has me scratching my head and I believe it must be a clue to what's going on. Why would that service fail immediately upon activation of the second (DSL) interface? It never used to. Only after August 22nd....
-
I also tried changing the probe interval. No help.
-
I am not running pfBlocker.
-
The services that fail (most of the time, but not always) when the Starlink goes offline are EITHER the kea-dhcp4 or the kea-dhcp6 server. Which is what was taking me down the DHCP rabbit hole.
-
The only 'extra' package that I have running is Tailscale.
-
I applied the recommended pfSense 24.03 system patches (through the Netgate System Patch package) last week with no help.
-
Do you use kea-dhcp, and if so does it fail for you?
-
@preston said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Do you use kea-dhcp, and if so does it fail for you?
I do not use it. Had this disabled previously and didn't even know it until someone suggested this as a fix.
Last night I got my hands on a 4-port EdgeRouter. Did some reading last night and think I have enough knowledge to test this thing out in a dual-WAN scenario. Hope to get to it this weekend and see what happens. This should give us some insight into where the problem lies: CenturyLink, StarLink, or pfSense.
-
I'm leaning toward it being a pfSense issue. Starlink works fine by itself, CenturyLink works fine by itself. Some setting has to be wrong, or some bug has cropped up.
Side note, my Starlink has been very reliable. I don't really need the Centurylink anymore, but I want (and the whole reason I went with Netgate/pfSense) the failover option.
-
@preston said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
I'm leaning toward it being a pfSense issue. Starlink works fine by itself, CenturyLink works fine by itself. Some setting has to be wrong, or some bug has cropped up.
I do tend to agree, although both of us stated that we made no changes to our hardware or configurations prior to the onset of this issue. I mean I went about a year and a half with no issue using the same config. So who knows.
Side note, my Starlink has been very reliable. I don't really need the Centurylink anymore, but I want (and the whole reason I went with Netgate/pfSense) the failover option.
I am in the exact same boat. With the exception of VERY heavy rain and snow storms StarLink has been rock solid. Which was not the case early on. But SL works great now. I also use the DSL line for a Dynamic DNS client.
Hopefully the EdgeRouter setup gives us more insight. Will reply as soon as I get a chance to test it.
-
@jimeez said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
I am in the exact same boat. With the exception of VERY heavy rain and snow storms StarLink has been rock solid. Which was not the case early on. But SL works great now. I also use the DSL line for a Dynamic DNS client.
Wow, the similarities continue to amaze me.
Only the heavy rain seems to take SL down for a minute or so here. I use (or at least I did before all this started) the Centurylink WAN for all the IOT devices around here, a DynDNS, and an OpenVPN server.
I eliminated all of that for this troubleshooting.