Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN
-
I'm leaning toward it being a pfSense issue. Starlink works fine by itself, CenturyLink works fine by itself. Some setting has to be wrong, or some bug has cropped up.
Side note, my Starlink has been very reliable. I don't really need the Centurylink anymore, but I want (and the whole reason I went with Netgate/pfSense) the failover option.
-
@preston said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
I'm leaning toward it being a pfSense issue. Starlink works fine by itself, CenturyLink works fine by itself. Some setting has to be wrong, or some bug has cropped up.
I do tend to agree, although both of us stated that we made no changes to our hardware or configurations prior to the onset of this issue. I mean I went about a year and a half with no issue using the same config. So who knows.
Side note, my Starlink has been very reliable. I don't really need the Centurylink anymore, but I want (and the whole reason I went with Netgate/pfSense) the failover option.
I am in the exact same boat. With the exception of VERY heavy rain and snow storms StarLink has been rock solid. Which was not the case early on. But SL works great now. I also use the DSL line for a Dynamic DNS client.
Hopefully the EdgeRouter setup gives us more insight. Will reply as soon as I get a chance to test it.
-
@jimeez said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
I am in the exact same boat. With the exception of VERY heavy rain and snow storms StarLink has been rock solid. Which was not the case early on. But SL works great now. I also use the DSL line for a Dynamic DNS client.
Wow, the similarities continue to amaze me.
Only the heavy rain seems to take SL down for a minute or so here. I use (or at least I did before all this started) the Centurylink WAN for all the IOT devices around here, a DynDNS, and an OpenVPN server.
I eliminated all of that for this troubleshooting.
-
So I think I just eliminated pfSense from the equation. Fired up an EdgeRouter Lite-3 this afternoon. Got it up and running for dual-WAN failover, no problem. (it was amazingly simple)
15 minute mark on THE nose, ETH1 (StarLink) went down and ETH2 (CenturyLink) took over.
Not really sure what to make of this. So whatever is going on it's universal to these two connections and independent of the gateway device. I haven't really investigated CenturyLink...not sure where to go from here.
-
Well, that is some good data anyway! I feel like we are at least narrowing it down. With all the hours spent on this, I have seen a lot of posts about how "Starlink doesn't play nice" with third party routers... I also know what a horrible ISP CenturyLink is and haven't ruled them out as the culprit, especially since the 15-minute (900-second) interval is associated with the CenturyLink interface.
I have been reading more about DHCP and have some new ideas to try.
Looking at Interfaces -> WAN (Starlink or CenturyLink). On the configuration page under DHCP Client Information, there is an advanced option. When you check that box, it gives you Protocol Timings at the bottom. I'm going to mess around with that later tonight and see if it does anything. Not real hopeful, but worth a try.
-
@jimeez random question: for the NICs that your two WAN connections come in on, is it a single card with dual or quad ports, or two physically separate NICs? I had a single quad port Intel NIC, with two of the ports used for my WAN connections and one for my LAN.
As soon as I get a chance I am going to try putting another NIC in the box and see if having two isolated NICs makes any difference.
-
Three Single NIC cards. All Intel. One of the first things I did when this started happening was replace the NICs.
Tonight I'm going to experiment with the EdgeRouter a bit more and put CenturyLink as the main connection with StarLink as the failover. See if I get the same result.
-
So there we have it. The problem HAS to be on StarLink's end. At least this is my unprofessional conclusion.
I feel silly for not trying this before now, but tonight I re-inserted the EdgeRouter into my network but this time I made CenturyLink the primary and StarLink the secondary fail-over. Guess what? It's been working fine for the last several hours.
Whatever is happening every 15 minutes when StarLink is the primary WAN is beyond me. But it is currently working fine in a dual-WAN fail-over environment on an EdgeRouter Lite-3. I suppose the real test will be to see if I get the same result on the pfSense box.
-
One more item to add to the list of discoveries: I tried making CenturyLink the primary interface this morning with StarLink the backup fail-over. No issues. The connection remained solid for several hours with both interfaces active. As soon as I switched it back to StarLink as primary and CenturyLink as fail-over, the drops started all over again. Every 15 minutes, on the exact 15 minute mark.
I don't know how this is NOT a StarLink issue.
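For anyone who wants to confirm the timing from logs rather than by watching the clock, here is a minimal Python sketch that checks whether a list of drop timestamps is spaced roughly 900 seconds apart. The timestamps below are made-up placeholders, not taken from anyone's actual logs, and the function names and the 5-second tolerance are assumptions for illustration only.

```python
from datetime import datetime

def intervals_seconds(timestamps):
    """Return the gaps, in whole seconds, between consecutive timestamps."""
    times = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

def looks_periodic(timestamps, period=900, tolerance=5):
    """True if every gap is within `tolerance` seconds of `period`."""
    gaps = intervals_seconds(timestamps)
    return bool(gaps) and all(abs(g - period) <= tolerance for g in gaps)

# Hypothetical sample data: replace with the times your gateway log shows
# the Tier 1 WAN going down.
drops = [
    "2024-01-01 10:00:02",
    "2024-01-01 10:15:01",
    "2024-01-01 10:30:03",
]

print(intervals_seconds(drops))  # gaps between consecutive drops, in seconds
print(looks_periodic(drops))     # whether they match a 900-second cycle
```

If the gaps come out near 900 every time, that at least backs up the idea that something on a 15-minute timer is involved, rather than random line trouble.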
-
I can confirm the same on my end. When CL is primary, it stays up.
Question about your setup: What are you using for DNS servers? Are you using something like 1.1.1.1 or 8.8.8.8? I've tried different combinations and nothing seems to matter.
-
@preston said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Question about your setup: What are you using for DNS servers? Are you using something like 1.1.1.1 or 8.8.8.8? I've tried different combinations and nothing seems to matter.
Yeah, I use four DNS servers, in this order:
8.8.8.8, 8.8.4.4 (and then two other DNS servers from a geographically local ISP)
-
I still haven't found a solution...
-
Starlink pushed an update yesterday. I applied it, no change.
-
I finally fiddled with the Advanced DHCP configuration, no change.
-
I switched back to ISC DHCP again (the Kea service keeps crashing/stopping when adding Centurylink), no change.
-
Same here. I have spread this information to several folks that are much more knowledgeable than I am and have not figured out a solution.
One of these people suggested using tcpdump to capture activity on both the LAN and WAN side when the drop occurs on the 15 minute mark. I haven't done that yet. Might mess around with it today.
I really think you're on to something with the "dhclient 86826 bound to 76.0.28.79 – renewal in 900 seconds". But I don't know what to do with it. I have not dug around in the CenturyLink modem settings yet. But maybe there is something that changed on CenturyLink's end? Would be curious to see the date of the most recent firmware and if it corresponds to the onset of our problem.
Assume you are using the CenturyLink modem in Transparent Bridge mode?
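As a rough illustration of how that dhclient line ties back to the 15 minute mark, here is a small Python sketch that pulls the leased address and renewal interval out of a log line. The regular expression is my own guess at the format based only on the single line quoted above (it accepts either the dash shown here or a plain "--"), and `parse_renewal` is a hypothetical helper name, not part of pfSense or dhclient.

```python
import re

# Matches e.g. "bound to 76.0.28.79 -- renewal in 900 seconds".
# The separator between the address and "renewal" is matched loosely,
# since the forum may have altered the original characters.
RENEWAL_RE = re.compile(
    r"bound to (\d{1,3}(?:\.\d{1,3}){3})\s*\S+\s*renewal in (\d+) seconds"
)

def parse_renewal(line):
    """Return (leased_ip, renewal_seconds), or None if the line does not match."""
    match = RENEWAL_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

# The log line as quoted in this thread.
line = "dhclient 86826 bound to 76.0.28.79 – renewal in 900 seconds"

ip, seconds = parse_renewal(line)
print(ip, seconds / 60)  # 76.0.28.79 15.0
```

900 seconds is exactly 15 minutes, which is why the lease renewal on that interface looks like a suspect. Running the same parse over the full DHCP log would show whether every renewal lines up with a drop.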
-
What happens if you put your Centurylink as tier 3?
Have any of you put your Centurylink modem back to modem mode and tried to let it handle the PPP?
-
@jimeez said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Assume you are using the CenturyLink modem in Transparent Bridge mode?
Yep, I am running the Centurylink modem (Zyxel C110Z running Firmware CZW007-4.16.012.15) in Transparent bridging mode. The firmware was one version out of date when all this started. I updated it to the latest firmware I could find early on in our troubleshooting.
-
@chpalmer said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
What happens if you put your Centurylink as tier 3?
Have any of you put your Centurylink modem back to modem mode and tried to let it handle the PPP?
I haven't tried putting the CL modem back in modem mode (for fear of a double NAT), but that is a good idea. I will give that a try when I can and report back.
As far as Tier 3, I only have two WANs.
Like @jimeez, when I make Centurylink Tier 1 and Starlink Tier 2, Starlink seems to stay online (as far as I can tell). Although, I'm starting to get lost in everything I've tried so far...
-
@chpalmer said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Have any of you put your Centurylink modem back to modem mode and tried to let it handle the PPP?
You may be on to something there!
I took the CenturyLink modem out of Transparent Bridging mode and connected a LAN port on the CenturyLink modem to my pfSense box (WAN2), and I made it past the 15 minute mark without losing Starlink (WAN1). Also, no errors in the DHCP logs. I am also using ISC DHCP instead of Kea DHCP.
Edit: It's been about 30 minutes and things are still working (Starlink WAN1 staying online). I'm out of time tonight to do any more troubleshooting, but will over the next few days. Will report back.
-
Everything I have ever read or watched about connecting a DSL modem to pfSense instructs that the modem be placed in transparent bridge mode. Curious to know what settings you applied in pfSense to get this to work.
-
@chpalmer said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
What happens if you put your Centurylink as tier 3?
As @preston already stated, both of us have only two WANs.
Have any of you put your Centurylink modem back to modem mode and tried to let it handle the PPP?
Sounds like this may have worked for @preston. Assuming it did/does, I have a couple questions:
- With the DSL modem no longer in transparent bridge mode, I assume that it will assign that WAN interface a local IP address of 192.168.1.xxx. If this assumption is correct, is this connection now sitting behind a double NAT?
- If that's the case, I guess we can no longer use pfSense to resolve our Dynamic DNS clients as that interface will no longer have an outside IP address.
You'll have to pardon my ignorance here. I only know enough to be slightly dangerous.
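One quick way to answer the first question is to look at the address pfSense shows on that WAN interface: if it falls in a private range, the modem is doing NAT in front of pfSense. Below is a minimal Python sketch using the standard `ipaddress` module. The function name is made up for illustration, and 192.168.1.50 is a hypothetical address from the range assumed above, while 76.0.28.79 is the address from the dhclient line quoted earlier in the thread.

```python
import ipaddress

def wan_ip_is_private(wan_ip):
    """True if the WAN interface holds a private (non-routable) address,
    which suggests another NAT device sits between pfSense and the internet."""
    return ipaddress.ip_address(wan_ip).is_private

# Hypothetical address handed out by a modem in routed (non-bridge) mode.
print(wan_ip_is_private("192.168.1.50"))  # True

# Public address seen on the WAN in transparent bridge mode.
print(wan_ip_is_private("76.0.28.79"))    # False
```

A `True` result would also mean a Dynamic DNS client cannot simply read the interface address; it would have to learn the public address some other way.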
-
Things are still working here with the CenturyLink modem out of transparent bridging mode.
Here is what I did (I connected my laptop via ethernet to the CenturyLink modem for the setup):
- Factory reset the CenturyLink modem (again), disabled the CL modem WiFi, reset the admin password and so on.
- My CenturyLink modem's default GUI address is 192.168.0.1 - I left that as is.
- I connected LAN 1 on CL modem to my WAN2 pfSense port.
- Under DHCP reservations in the CL modem, I assigned my pfSense box an IP of 192.168.0.2.
- I disabled the DHCP server on the CL modem.
- Rebooted the CL modem.
- Unplugged my laptop from the CL modem.
- I reconnected to my pfSense network and set up the CL interface and gateway.
- My DNS servers and monitor IPs are 1.1.1.1 for Starlink and 8.8.8.8 for CenturyLink respectively.
- My pfSense LAN is in the 192.168.1.xxx range.
- The pfSense dashboard shows the CL WAN IP as 192.168.0.2, but when I check sites like infosniper.net I can see the CL IP address.
- As an added bonus I can now access the CL modem GUI (192.168.0.1) via the pfSense network without having to fiddle with additional pfSense settings.
- I'm not sure about Dynamic DNS, but I have been using Tailscale with Starlink and it has worked great.
-
@chpalmer may just be our hero!
As far as Double NAT while using the CL WAN, I really don't know (or understand it completely), but here is my Traceroute from the CL WAN to www.google.com:
1 192.168.0.1 0.544 ms 0.425 ms 0.402 ms
2 184.102.159.254 28.701 ms 28.475 ms 28.817 ms
3 71.33.4.9 28.078 ms 28.604 ms 28.296 ms
4 4.68.144.169 59.685 ms 46.815 ms 42.017 ms
5 4.68.127.114 44.359 ms 55.718 ms 63.480 ms
6 * * *
7 142.251.60.10 42.169 ms 216.239.51.116 43.287 ms 209.85.255.172 42.109 ms
8 209.85.247.117 42.379 ms 192.178.249.234 43.372 ms 209.85.247.117 42.327 ms
9 142.251.233.230 43.373 ms 44.324 ms 142.250.190.4 41.627 ms
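To read the traceroute with the double NAT question in mind, here is a rough Python sketch that picks out the first address reported at each hop and lists the private ones. The parsing is an assumption based only on the output pasted above (hop number followed by an IPv4 address), and `hop_addresses` and `private_hops` are hypothetical helper names.

```python
import ipaddress
import re

# Hop number, then the first IPv4 address on the line. Lines such as
# "6 * * *" have no address and are skipped.
HOP_RE = re.compile(r"^\s*(\d+)\s+(\d{1,3}(?:\.\d{1,3}){3})")

def hop_addresses(traceroute_text):
    """Return [(hop_number, first_ip), ...] for hops that reported an address."""
    hops = []
    for line in traceroute_text.splitlines():
        match = HOP_RE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2)))
    return hops

def private_hops(traceroute_text):
    """Return the hop addresses that fall in private ranges."""
    return [ip for _, ip in hop_addresses(traceroute_text)
            if ipaddress.ip_address(ip).is_private]

# The traceroute from the CL WAN, as posted in this thread.
TRACE = """\
1 192.168.0.1 0.544 ms 0.425 ms 0.402 ms
2 184.102.159.254 28.701 ms 28.475 ms 28.817 ms
3 71.33.4.9 28.078 ms 28.604 ms 28.296 ms
4 4.68.144.169 59.685 ms 46.815 ms 42.017 ms
5 4.68.127.114 44.359 ms 55.718 ms 63.480 ms
6 * * *
7 142.251.60.10 42.169 ms 216.239.51.116 43.287 ms 209.85.255.172 42.109 ms
8 209.85.247.117 42.379 ms 192.178.249.234 43.372 ms 209.85.247.117 42.327 ms
9 142.251.233.230 43.373 ms 44.324 ms 142.250.190.4 41.627 ms
"""

print(private_hops(TRACE))  # ['192.168.0.1']
```

The only private hop is the CL modem at 192.168.0.1, and everything after it is public. As far as I can tell, that means the modem is translating addresses for the pfSense WAN, so a LAN client behind pfSense would be translated twice on this path, which is the usual meaning of double NAT.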
-