Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN
-
Turns out their response was one that was already bounced around here and on Reddit.
While I do not have exact guidance on how to configure this specific router, the probe interval does seem very strict: it is set to check every 500 milliseconds (0.5 seconds), compared to the general recommendation of checking every 10 seconds (10,000 milliseconds). The frequent failovers may improve if you relax these health checks to tolerate the occasional drops in service that come with a satellite internet service.
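For anyone wanting to sanity-check that advice, here is a rough Python sketch of how the probe interval relates to the packet-loss alarm. It uses a simplified model, not the gateway monitor's actual algorithm. The 60-second averaging window and 20% loss threshold are assumed placeholder values, so substitute whatever your gateway's advanced settings show.

```python
import math

def probes_per_window(probe_interval_ms: int, window_ms: int) -> int:
    """Number of probes sent during one averaging window."""
    return window_ms // probe_interval_ms

def lost_probes_to_trip(probe_interval_ms: int, window_ms: int,
                        loss_threshold_pct: int) -> int:
    """Smallest number of lost probes whose loss percentage reaches the threshold."""
    total = probes_per_window(probe_interval_ms, window_ms)
    return math.ceil(total * loss_threshold_pct / 100)

# Assumed values for illustration only; check your own gateway settings.
WINDOW_MS = 60_000
LOSS_THRESHOLD_PCT = 20

for interval_ms in (500, 10_000):
    total = probes_per_window(interval_ms, WINDOW_MS)
    needed = lost_probes_to_trip(interval_ms, WINDOW_MS, LOSS_THRESHOLD_PCT)
    print(f"{interval_ms} ms interval: {total} probes per window, "
          f"{needed} lost probes reach {LOSS_THRESHOLD_PCT}% loss")
```

In this simplified model a longer interval means fewer samples per window, so each lost probe moves the loss percentage further. That suggests the loss threshold and window may matter at least as much as the interval itself.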
I suspected this would not work but did it anyway so I could report back to them with factual info. And, unfortunately, it did not fix it. Every 15 minutes, like clockwork, to the second, the StarLink interface fails due to high packet loss and eventually is perceived to be offline...even though it is not. After a bit it comes back up. Then it fails again exactly 15 minutes later. Turn off the second interface and everything works fine. Weirdest, most frustrating thing.
Couple more questions for you regarding your config. There has to be something here that will eventually lead to an answer.
- Do you use pfBlocker?
- You already confirmed that you don't use NUT, but have you noticed any other services that fail when you activate the DSL interface like NUT does for me?
- Assuming you have some port forwarding configured, what do you use for the Dest. Address? Individual interfaces? Any? Perhaps something else?
The NUT service failing really has me scratching my head and I believe must be a clue to what's going on. Why would that service fail immediately upon activation of the second (DSL) interface? It never used to. Only after August 22nd....
-
I also tried changing the probe interval. No help.
-
I am not running pfBlocker.
-
The services that fail (most of the time, but not always) when the Starlink goes offline are EITHER the kea-dhcp4 or the kea-dhcp6 server. Which is what was taking me down the DHCP rabbit hole.
-
The only 'extra' package that I have running is Tailscale.
-
I applied the recommended pfSense 24.03 system patches (through the Netgate System Patch package) last week with no help.
-
-
Do you use kea-dhcp, and if so does it fail for you?
-
@preston said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Do you use kea-dhcp, and if so does it fail for you?
I do not use it. Had this disabled previously and didn't even know it until someone suggested this as a fix.
Last night I got my hands on a 4-port EdgeRouter. Did some reading last night and think I have enough knowledge to test this thing out in a dual-WAN scenario. Hope to get to it this weekend and see what happens. This should give us some insight into where the problem lies: CenturyLink, StarLink, or pfSense.
-
I'm leaning toward it being a pfSense issue. Starlink works fine by itself, CenturyLink works fine by itself. Some setting has to be wrong, or some bug has cropped up.
Side note, my Starlink has been very reliable. I don't really need the Centurylink anymore, but I want (and the whole reason I went with Netgate/pfSense) the failover option.
-
@preston said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
I'm leaning toward it being a pfSense issue. Starlink works fine by itself, CenturyLink works fine by itself. Some setting has to be wrong, or some bug has cropped up.
I do tend to agree, although both of us stated that we made no changes to our hardware or configurations prior to the onset of this issue. I mean I went about a year and a half with no issue using the same config. So who knows.
Side note, my Starlink has been very reliable. I don't really need the Centurylink anymore, but I want (and the whole reason I went with Netgate/pfSense) the failover option.
I am in the exact same boat. With the exception of VERY heavy rain and snow storms StarLink has been rock solid. Which was not the case early on. But SL works great now. I also use the DSL line for a Dynamic DNS client.
Hopefully the EdgeRouter setup gives us more insight. Will reply as soon as I get a chance to test it.
-
@jimeez said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
I am in the exact same boat. With the exception of VERY heavy rain and snow storms StarLink has been rock solid. Which was not the case early on. But SL works great now. I also use the DSL line for a Dynamic DNS client.
Wow, the similarities continue to amaze me.
Only the heavy rain seems to take SL down for a minute or so here. I use (or at least I did before all this started) the Centurylink WAN for all the IoT devices around here, a DynDNS, and an OpenVPN server.
I eliminated all of that for this troubleshooting.
-
So I think I just eliminated pfSense from the equation. Fired up an EdgeRouter Lite-3 this afternoon. Got it up and running for dual-WAN failover, no problem. (it was amazingly simple)
15 minute mark on THE nose, ETH1 (StarLink) went down and ETH2 (CenturyLink) took over.
Not really sure what to make of this. So whatever is going on it's universal to these two connections and independent of the gateway device. I haven't really investigated CenturyLink...not sure where to go from here.
-
Well, that is some good data anyway! I feel like we are at least narrowing it down. With all the hours spent on this, I have seen a lot of posts about how "Starlink doesn't play nice" with third party routers... I also know what a horrible ISP CenturyLink is and haven't ruled them out as the culprit, especially since it's the 15 minutes (900 seconds) that is associated with the CenturyLink interface.
I have been reading more about DHCP and have some new ideas to try.
Looking at Interfaces -> WAN (Starlink or Centurylink). On the configuration page under DHCP Client Information, there is an advanced option. When you check that box, it gives you Protocol Timings at the bottom. I'm going to mess around with that later tonight and see if it does anything. Not real hopeful, but worth a try.
-
@jimeez random question. For the NICs that your two WAN connections come in on: is it a single card with dual or quad ports, or two physically separate NICs? I had a single quad port Intel NIC, with two of the ports used for my WAN connections and one for my LAN.
As soon as I get a chance I am going to try putting another NIC in the box and see if having two isolated NICs makes any difference.
-
Three Single NIC cards. All Intel. One of the first things I did when this started happening was replace the NICs.
Tonight I'm going to experiment with the EdgeRouter a bit more and put CenturyLink as the main connection with StarLink as the failover. See if I get the same result.
-
So there we have it. The problem HAS to be on StarLink's end. At least this is my unprofessional conclusion.
I feel silly for not trying this before now, but tonight I re-inserted the EdgeRouter into my network but this time I made CenturyLink the primary and StarLink the secondary fail-over. Guess what? It's been working fine for the last several hours.
Whatever is happening every 15 minutes when StarLink is the primary WAN is beyond me. But it is currently working fine in a dual-WAN fail-over environment on an EdgeRouter Lite-3. I suppose the real test will be to see if I get the same result on the pfSense box.
-
One more item to add to the list of discoveries: I tried making CenturyLink the primary interface this morning with StarLink the backup fail-over. No issues. The connection remained solid for several hours with both interfaces active. As soon as I switched it back to StarLink as primary and CenturyLink as fail-over, the drops started all over again. Every 15 minutes on the exact 15 minute mark.
I don't know how this is NOT a StarLink issue.
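To put a number on "every 15 minutes on the exact mark", the gaps between gateway-down events can be checked with a few lines of Python. This is only a sketch: the timestamps below are made up for illustration and are not from anyone's real log. The timestamp format, the function names, and the 5-second tolerance are all assumptions you would adapt to your own system log.

```python
from datetime import datetime

def gaps_seconds(timestamps):
    """Seconds between consecutive events, after sorting the timestamps."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]

def looks_periodic(timestamps, period_s=900, tolerance_s=5):
    """True if every gap is within tolerance_s of period_s (900 s = 15 min)."""
    gaps = gaps_seconds(timestamps)
    return bool(gaps) and all(abs(g - period_s) <= tolerance_s for g in gaps)

# Hypothetical gateway-down times, for illustration only.
sample = [
    "2024-01-01 10:00:02",
    "2024-01-01 10:15:02",
    "2024-01-01 10:30:03",
]

print(gaps_seconds(sample))
print(looks_periodic(sample))
```

If the gaps hold at roughly 900 seconds over many events, that points toward something timer-driven (a lease, a keepalive, a scheduled task) rather than random signal loss.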
-
I can confirm the same on my end. When CL is primary, it stays up.
Question about your setup: What are you using for DNS servers? Are you using something like 1.1.1.1 or 8.8.8.8? I've tried different combinations and nothing seems to matter.
-
@preston said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Question about your setup: What are you using for DNS servers? Are you using something like 1.1.1.1 or 8.8.8.8? I've tried different combinations and nothing seems to matter.
Yeah, I use four DNS servers. In this order
8.8.8.8 8.8.4.4 (and then two other DNS servers from a geographically local ISP)
-
I still haven't found a solution...
-
Starlink pushed an update yesterday. I applied it, no change.
-
I finally fiddled with the Advanced DHCP configuration, no change.
-
I switched back to ISC DHCP again (the Kea service keeps crashing/stopping when adding Centurylink), no change.
-
-
Same here. I have spread this information to several folks that are much more knowledgeable than I am and have not figured out a solution.
One of these people suggested using tcpdump to capture activity on both the LAN and WAN side when the drop occurs on the 15 minute mark. I haven't done that yet. Might mess around with it today.
I really think you're on to something with the "dhclient 86826 bound to 76.0.28.79 – renewal in 900 seconds". But I don't know what to do with it. I have not dug around in the CenturyLink modem settings yet. But maybe there is something that changed on CenturyLink's end? Would be curious to see the date of the most recent firmware and if it corresponds to the onset of our problem.
Assume you are using the CenturyLink modem in Transparent Bridge mode?
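On the dhclient line quoted above: one way to test the theory would be to pull the renewal time out of the DHCP log and see whether the Starlink drops line up with the next expected CenturyLink lease renewal. Here is a minimal Python sketch of that. The log line is the one quoted in this thread. The bind timestamp is hypothetical, and the regex is an assumption about the line format (it accepts "--", "–", or "-" as the separator).

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "bound to 76.0.28.79 – renewal in 900 seconds".
RENEWAL_RE = re.compile(r"bound to (\S+)\s+(?:--|–|-)\s+renewal in (\d+) seconds")

def parse_renewal(line):
    """Return (ip, renewal_seconds) from a dhclient 'bound to' line, or None."""
    match = RENEWAL_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

def next_renewal(bound_at, renewal_s):
    """Time at which the client is expected to try renewing the lease."""
    return bound_at + timedelta(seconds=renewal_s)

line = "dhclient 86826 bound to 76.0.28.79 – renewal in 900 seconds"
ip, renewal_s = parse_renewal(line)

# Hypothetical time the lease was bound; take the real one from your log.
bound_at = datetime(2024, 1, 1, 10, 0, 0)
print(ip, renewal_s, next_renewal(bound_at, renewal_s))
```

If the gateway-down times consistently land within a few seconds of the computed renewal times, that would support the lease renewal on the DSL side being the trigger. If they don't, that idea can probably be crossed off.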
-
What happens if you put your Centurylink as tier 3?
Have any of you put your Centurylink modem back to modem mode and tried to let it handle the PPP?
-
@jimeez said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Assume you are using the CenturyLink modem in Transparent Bridge mode?
Yep, I am running the Centurylink modem (Zyxel C110Z running Firmware CZW007-4.16.012.15) in Transparent bridging mode. The firmware was one version out of date when all this started. I updated it to the latest firmware I could find early on in our troubleshooting.
-
@chpalmer said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
What happens if you put your Centurylink as tier 3?
Have any of you put your Centurylink modem back to modem mode and tried to let it handle the PPP?
I haven't tried putting the CL modem back in modem mode (for fear of a double NAT), but that is a good idea. I will give that a try when I can and report back.
As far as Tier 3, I only have two WANs.
Like @jimeez, when I make Centurylink Tier 1 and Starlink Tier 2, Starlink seems to stay online (as far as I can tell). Although, I'm starting to get lost in everything I've tried so far...