Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN
-
I'm at my wits' end. I've tried just about everything I could find through searches and reading here. Nothing has worked so far. I'm starting to think I might have a hardware issue, but thought I'd make a post before taking the time to deploy a new gateway device.
(I'll do my best to provide all the pertinent info...but will likely miss something)
This is a simple home setup: pfSense 2.7.2-RELEASE (amd64) running on an old Lenovo desktop tower with an Intel Core i7-7700 CPU @ 3.60GHz, 16 GB RAM, an NVMe drive, and three single-port Intel PCIe NICs. Dual WAN setup. WAN1 is StarLink. WAN2 is CenturyLink DSL. A gateway group is configured to fail over to the DSL connection when StarLink goes down (Tier 1 and Tier 2 respectively; the trigger level is "member down"). This box has been rock solid for a few years now. Zero issues. And NO configuration changes prior to the start of this problem. This configuration has been up and running since March of 2023 and worked flawlessly until last Thursday.
For reasons unknown to me, the StarLink connection's packet loss goes through the roof and the member goes down switching over to the DSL connection. Once the packet loss settles down it switches back over to the StarLink connection. Just like it's supposed to, right? However, this happens over and over and over again every 3-10 minutes all day long. It happens more frequently while "under load" (i.e. downloading large files). If I disable either of the two interfaces, everything works just fine. As soon as I re-enable the disabled interface this cycle of fail-over starts back up and does not stop. Doesn't matter which one I leave enabled and which one I disable, the connection works fine if only one interface is enabled.
Things I have done so far to diagnose and (unsuccessfully) fix the problem.
- By-passed all internal networking equipment and connected a couple machines directly to both the DSL and StarLink modems. Zero issues. Solid functioning connections.
- Installed new NIC cards. Did not change anything.
- Changed the monitor IP address of each gateway. They were previously identical to the gateway IP. Switched them over to Google and OpenDNS respectively. No change.
- Disabled "Gateway Monitoring" and "Gateway Monitoring Action" on each gateway. Made no difference.
I wasn't quite sure which log entries to post, but the following is an excerpt from the gateway log.
Aug 25 17:48:59 dpinger 454 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% alarm_hold 10000ms dest_addr 8.8.4.4 bind_addr 75.165.107.163 identifier "CENTURYLINK_DHCP "
Aug 25 17:48:59 dpinger 747 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% alarm_hold 10000ms dest_addr 8.8.8.8 bind_addr 100.77.12.203 identifier "STARLINK_DHCP "
Aug 25 17:49:01 dpinger 747 STARLINK_DHCP 8.8.8.8: Alarm latency 0us stddev 0us loss 100%
Aug 25 17:50:24 dpinger 747 STARLINK_DHCP 8.8.8.8: Clear latency 29647us stddev 10607us loss 5%
Aug 25 18:04:36 dpinger 747 STARLINK_DHCP 8.8.8.8: Alarm latency 22415us stddev 9977us loss 22%
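For anyone who wants to measure how long each outage lasts rather than eyeballing the log, here is a minimal Python sketch. It assumes the dpinger Alarm/Clear line format shown in the excerpt above; the function names and the year (syslog lines carry none) are my own assumptions, not anything pfSense provides.

```python
import re
from datetime import datetime

# Matches dpinger status lines in the format seen in the excerpt, e.g.
#   Aug 25 17:49:01 dpinger 747 STARLINK_DHCP 8.8.8.8: Alarm latency 0us stddev 0us loss 100%
EVENT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) dpinger \d+ "
    r"(?P<gw>\S+) (?P<monitor>\S+): (?P<state>Alarm|Clear) "
    r"latency (?P<latency>\d+)us stddev (?P<stddev>\d+)us loss (?P<loss>\d+)%"
)


def parse_events(lines, year=2024):
    """Return Alarm/Clear events. The year is an assumption (the log omits it)."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if m is None:
            continue  # parameter dump lines and anything else are skipped
        events.append({
            "time": datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S"),
            "gateway": m["gw"],
            "state": m["state"],
            "latency_ms": int(m["latency"]) / 1000,
            "loss": int(m["loss"]),
        })
    return events


def outages(events):
    """Pair each Alarm with the next Clear for the same gateway."""
    open_alarm = {}
    result = []
    for ev in events:
        if ev["state"] == "Alarm":
            open_alarm.setdefault(ev["gateway"], ev["time"])
        elif ev["gateway"] in open_alarm:
            start = open_alarm.pop(ev["gateway"])
            result.append((ev["gateway"], start, ev["time"] - start))
    return result


# The three status lines from the excerpt above.
SAMPLE = [
    "Aug 25 17:49:01 dpinger 747 STARLINK_DHCP 8.8.8.8: Alarm latency 0us stddev 0us loss 100%",
    "Aug 25 17:50:24 dpinger 747 STARLINK_DHCP 8.8.8.8: Clear latency 29647us stddev 10607us loss 5%",
    "Aug 25 18:04:36 dpinger 747 STARLINK_DHCP 8.8.8.8: Alarm latency 22415us stddev 9977us loss 22%",
]

events = parse_events(SAMPLE)
for gw, start, duration in outages(events):
    print(f"{gw} down at {start:%H:%M:%S} for {duration.total_seconds():.0f} s")
```

An Alarm with no matching Clear (like the last line of the excerpt) is simply left open, so only completed outages are reported.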
I'm sure there is other relevant info I should be including, but it's escaping me at the moment. Grateful for any suggestions.
Many thanks in advance.
-
I am having the same issue with a very similar setup! I have had this setup working for nearly 2 years with no issues, until about 10-14 days ago (about the same time your issues started). I've been working for days on this and can't solve it.
I am running pfsense 24.03 on a Netgate 4100 box.
My setup is WAN1: Starlink and WAN2: Centurylink DSL. Every 15 minutes Starlink is marked down due to 100% packet loss and I failover to Centurylink. After about 2 minutes Starlink returns and I switch back to Starlink.
When I disable the WAN2 Centurylink interface, Starlink is rock solid (1% packet loss max over the last 12 hours). The moment I enable WAN2 Centurylink, I can start the countdown and WAN1 Starlink will show 100% packet loss and marked offline every 15 minutes, fail over to Centurylink, a minute or so later, fail back to Starlink.
I started over and deleted all gateways and gateway groups, disabled all policy routing rules, changed monitor IPs, and fiddled with thresholds for packet loss and latency, with no change in results. I factory reset (and restored from a backup) with the same results.
Bottom line, Starlink works fine by itself. The moment I add Centurylink as a second WAN (with or without gateway groups being created), Starlink will go offline every 15 minutes.
-
@preston
Well, you may have just saved me some time. I was just about to build, configure, and deploy a new device. Seeing your post, though, gives me relief to know that someone else is experiencing this. A couple of questions for you:
- Do you have NUT installed?
- If so, does the NUT service stop when you re-enable CenturyLink?
- Have you tried doing a hard reboot of both the StarLink and CenturyLink modems?
Right now, my theory is that something changed recently that causes pfsense to "see" the packet loss on the StarLink connection differently. It sees loss when there is none. Something happened. Can't figure out what it is.
-
I do not have the NUT package installed.
I have done a power cycle on Starlink. As a matter of fact, I just upgraded to Starlink Gen 3 yesterday (Starlink router running in bypass mode). My original gen1 circular dish was showing the exact same 100% packet loss and gateway down every 15 minute problem.
I honestly can't remember if I power cycled the Centurylink modem. I've done so much that I'm starting to lose track. I'll try that and report back.
Looking back at the traffic graphs, it seems like my problems began on 8-24-2024 at about 4:00 am Central Time. I think that is when my Starlink usually installs its updates and reboots, so I thought maybe it was a Starlink change, maybe a dpinger issue, or DHCP. So many rabbit holes to go down, and I am certainly no pfSense expert.
Just like you, though, Starlink works great when it is the only Gateway.
-
@preston Well, there goes that theory. I thought NUT was somehow interfering. Whenever I re-enable the CenturyLink interface the NUT service stops and I have to restart the service. It's other weird behavior that I thought maybe had something to do with it.
Well, hopefully this post gathers attention as others search for this same problem. See what turns up.
If I end up figuring it out, I will be sure to post here with details.
-
A little more information.
I power cycled the Centurylink modem (Zyxel C110Z running Firmware CZW007-4.16.012.15) running in Transparent Bridging Mode. I haven't made any recent changes to the CL modem.
I ENABLED the Centurylink interface on pfSense, left the Centurylink gateway DISABLED, and after 15 minutes the Starlink gateway went down.
-
Still working the issue... Is it DHCP???
I noticed that when the Starlink WAN goes down, many times (not always) the kea-dhcp4 or kea-dhcp6 server service stops and will not restart even when clicking the 'play' button on the dashboard.
I see a reference in the DHCP logs that has the dhclient bound to the Centurylink IP (ix2 interface) with a renewal in 900 seconds (15 minutes). 15 minutes is EXACTLY the time it takes for Starlink to go down after enabling the Centurylink interface.
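One way to test the "is it the 900-second renewal?" hunch is to compare the gaps between consecutive Starlink alarms against that interval. A rough Python sketch follows; the alarm timestamps and the 60-second tolerance are made-up illustration values, not data from my logs, and the function name is hypothetical.

```python
from datetime import datetime

RENEWAL_SECONDS = 900  # renewal time reported in the DHCP log for the Centurylink lease


def gaps_vs_renewal(alarm_times, renewal=RENEWAL_SECONDS, tolerance=60):
    """Return (gap_seconds, matches) for each pair of consecutive alarms.

    `matches` is True when the gap lies within `tolerance` seconds of a
    whole multiple of the renewal interval.
    """
    rows = []
    for earlier, later in zip(alarm_times, alarm_times[1:]):
        gap = (later - earlier).total_seconds()
        nearest = max(1, round(gap / renewal)) * renewal
        rows.append((gap, abs(gap - nearest) <= tolerance))
    return rows


# Hypothetical alarm times, for illustration only.
alarms = [
    datetime(2024, 8, 24, 4, 0, 0),
    datetime(2024, 8, 24, 4, 15, 10),
    datetime(2024, 8, 24, 4, 30, 5),
    datetime(2024, 8, 24, 4, 52, 0),
]

rows = gaps_vs_renewal(alarms)
for gap, matches in rows:
    print(f"gap {gap:.0f} s, lines up with renewal: {matches}")
```

If nearly every gap lines up, that supports the DHCP theory; a scatter of unrelated gaps would argue against it. Either way it only shows correlation, not cause.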
-
@preston Interesting. I'll see if I can reproduce this later this evening. I too haven't given up. Just not sure where to go with it next.
-
The time that Starlink WAN goes down, I see this in the logs.
-
One thing I haven't tried yet is a factory reset and starting all over without restoring from my backup.
I did do a reset and restore from backup, but now I wonder if restoring the backup just transferred the bad settings back. A no-op.
-
This is exactly what I was about to do when I saw your first reply to my post. Literally I was about to do a fresh install on an old PC I have lying (laying?) around. Had just installed a dual NIC card and was about to start the process of deploying a "fresh" device. No restore from backup. No other packages. Just a clean start, dual wan fail-over setup.
But then I saw your reply. And I thought, "What are the odds that two of us have the exact same problem at roughly the exact same time?" It's very unlikely. Something changed somewhere. Either at CenturyLink's end or StarLink's end. Or elsewhere. But I'm 99% positive it can't be our equipment or the configuration.
I have this other PC ready to go. Maybe I'll give it a shot some night this week just for shits and giggles.
-
If you do a traceroute to something outside what is the first address that answers?
I would seriously consider another address to monitor than 8.8.8.8 on that gateway.
I have had issues using that address in the past.
-
Thank you for the reply. I have tried a few different monitoring addresses. Doesn't seem to make a difference. And the problem ONLY exists when both interfaces are active. I can have the routing set to only one of the gateways...no fail-over....and it still cycles a down member due to packet loss. As soon as I disable one of the two interfaces everything works fine again.
1 <1 ms <1 ms <1 ms fw1.xxxxxx.localdomain [192.168.1.1]
2 19 ms 20 ms 19 ms 100.64.0.1
3 16 ms 20 ms 23 ms 172.16.252.90
4 19 ms 19 ms 23 ms undefined.hostname.localhost [206.224.65.136]
5 17 ms 27 ms 16 ms undefined.hostname.localhost [206.224.64.173]
6 60 ms 47 ms 20 ms 140.248.126.222
7 17 ms 21 ms 20 ms 151.101.67.5
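For reading the traceroute: hop 2 (100.64.0.1) sits in 100.64.0.0/10, the shared address space reserved for carrier-grade NAT (RFC 6598), and hop 3 is in RFC 1918 private space. Below is a small Python sketch that labels each hop using only the standard `ipaddress` module; the label strings and function name are my own.

```python
import ipaddress

# Address ranges checked explicitly rather than relying on is_private,
# so the CGNAT block is reported separately from RFC 1918 space.
RANGES = {
    "RFC 1918 private": ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"],
    "RFC 6598 shared (CGNAT)": ["100.64.0.0/10"],
}


def classify(addr):
    """Return a label for the range an IPv4 address falls in, else 'other'."""
    ip = ipaddress.ip_address(addr)
    for label, nets in RANGES.items():
        if any(ip in ipaddress.ip_network(net) for net in nets):
            return label
    return "other"


# Hop addresses copied from the traceroute above.
hops = ["192.168.1.1", "100.64.0.1", "172.16.252.90", "206.224.65.136"]
for hop in hops:
    print(f"{hop:16} {classify(hop)}")
```

The same check puts the Starlink bind_addr from my dpinger log (100.77.12.203) in the CGNAT block as well.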
-
@jimeez said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Thank you for the reply. I have tried a few different monitoring addresses. Doesn't seem to make a difference. And the problem ONLY exists when both interfaces are active. I can have the routing set to only one of the gateways...no fail-over....and it still cycles a down member due to packet loss. As soon as I disable one of the two interfaces everything works fine again.
My issue is exactly the same as jimeez's. I too have tried different monitoring addresses with no change. Simply enabling the Centurylink interface causes 100% packet loss after 15 minutes on the Starlink WAN. After a minute or so, the Starlink WAN will return online, but then fail with 100% packet loss every 15 minutes.
-
But look at the ultimate latency of both links. One is satellite which by default will have a higher latency.. and the other is DSL which with interleaving will generally have 38ms or so.. (educated guess)
If you do not have the second, faster (lower latency) interface, then the system will simply stay on the only gateway it sees.
If (and I have not had the opportunity yet to play with Starlink although at work we will be soon..) your Starlink interface sees a change in latency that is drastic enough then I can see your system trying to switch to a more stable link..
Try this. From a Windows command prompt, run ping -t 8.8.8.8 and let that run for an hour or so. Watch the latency there and see if it changes much. If it does not then I am probably barking up the wrong tree.. But my SWAG says you will probably see some latency swings. Of course, take your second link down and only allow it on the Starlink.
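If watching an hour of ping output by eye is too tedious, the output can be saved to a file and summarized afterwards. A minimal Python sketch, assuming the usual Windows ping reply format ("Reply from ... time=23ms TTL=...", "Request timed out."); the sample lines are invented for illustration and the function name is hypothetical.

```python
import re
import statistics

# Matches "time=23ms" as well as "time<1ms" in Windows ping replies.
REPLY_RE = re.compile(r"time[=<](\d+)\s*ms")


def summarize(lines):
    """Summarize loss and latency from saved ping output lines."""
    rtts, lost = [], 0
    for line in lines:
        m = REPLY_RE.search(line)
        if m:
            rtts.append(int(m.group(1)))
        elif "timed out" in line.lower():
            lost += 1
    sent = len(rtts) + lost
    return {
        "sent": sent,
        "loss_pct": 100.0 * lost / sent if sent else 0.0,
        "min_ms": min(rtts) if rtts else None,
        "max_ms": max(rtts) if rtts else None,
        "mean_ms": statistics.mean(rtts) if rtts else None,
        "stdev_ms": statistics.pstdev(rtts) if rtts else None,
    }


# Invented sample output, for illustration only.
sample = [
    "Reply from 8.8.8.8: bytes=32 time=23ms TTL=117",
    "Reply from 8.8.8.8: bytes=32 time=31ms TTL=117",
    "Reply from 8.8.8.8: bytes=32 time=27ms TTL=117",
    "Request timed out.",
    "Reply from 8.8.8.8: bytes=32 time=250ms TTL=117",
]

summary = summarize(sample)
print(summary)
```

The numbers can then be held up against the dpinger thresholds in the log posted earlier in the thread (latency_alarm 500ms, loss_alarm 20%) to see whether ordinary swings on the link would come anywhere near tripping an alarm.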
-
Thanks for the reply.
Here are stats from my Starlink for the last 24 hours. The Starlink app statistics also match the pfSense stats.
-
I don't want to get my hopes up, but it's been 62 minutes and I have not lost the Starlink WAN. Here is what I did today:
- Deleted the Centurylink gateway and Centurylink interface.
- Assigned the Centurylink interface and gateway.
- Power cycled the Centurylink modem.
- Disabled the kea-dhcp6 service.
- Under System/Routing/Gateways: changed default IPv6 to NONE.
I haven't added any Gateway groups and failover settings yet, but so far the Starlink WAN is staying up. For now (testing) I have "Block private networks and loopback addresses" and "Block bogon networks" both checked. I also haven't set up a monitor IP or DNS server for Centurylink (one thing at a time).
I really think this alone might have done the trick:
-
Oh wow. No kidding? That would be amazing if this solved things. If it does, I wonder what that means in terms of why this started happening. Something with CenturyLink perhaps?
Also, a general question regarding the IP address of your CenturyLink WAN. I've seen this in a lot of the how-to videos I watch. Why is the IP address of the CL WAN 192.168.0.1 rather than a CL-assigned IP address?
-
That screenshot above is showing the 192.168.0.1 as a monitor IP. I was trying to make as few changes as possible to see what would break it so I did not have a monitor IP or DNS server set.
I have now added a DNS server and monitor address of 8.8.8.8 to the CL connection and it is now showing the CL IP address on the dashboard correctly. After doing so, I had to disable and re-enable the CL interface to get it to pull a proper IP.
!!! After adding the DNS server to the CL connection I lost Starlink at the 15 minute mark! D@mn! !!!
Maybe I'm getting closer to the answer since Starlink stayed online for several hours and only went down when I made DNS changes to the CL connection.
-
@preston said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Thanks for the reply.
Here are stats from my Starlink for the last 24 hours. The Starlink app statistics also match the pfSense stats.
Actually I appreciate you posting those numbers.. It will help me with my day job when we get our setup for a remote site we have.. ;)