Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN
-
If anyone has any ideas, I am still working this problem.
Here are my DHCP log entries from about the time I enabled the Centurylink WAN 2 (ix2) interface at 11:55 to the time that Starlink WAN1 goes offline with 100% packet loss 15 minutes later. I hope there are some 'log whisperers' out there that can help. Am I barking up the wrong tree thinking it's a DHCP issue?
The correlation I see here is that at 11:55:30 dhclient binds to the Centurylink IP with a 900 second renewal. Exactly 900 seconds later, Starlink WAN1 goes offline with 100% packet loss. It takes Starlink WAN1 about 1-2 minutes to come back online and then the 15 minute cycle repeats.
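As a quick sanity check on that timing, here is a minimal Python sketch that adds the 900 second renewal to the 11:55:30 bind time. The year is an assumption (the syslog lines do not record it) and does not affect the result.

```python
from datetime import datetime, timedelta

# Bind time taken from the dhclient log; the year is assumed, syslog omits it.
bound_at = datetime(2024, 9, 9, 11, 55, 30)
renewal = timedelta(seconds=900)

# The moment dhclient should start renewing the Centurylink lease.
next_renewal = bound_at + renewal
print(next_renewal.strftime("%H:%M:%S"))  # prints 12:10:30
```

The first DHCPREQUEST on ix2 in the log is stamped 12:10:30, so the renewal attempt does line up with the start of the Starlink outage.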
Thank-you.
-
I haven't given up yet. While I have had zero success getting it to work on pfSense, I figured I'd give OPNsense a try next. Planning to work on it this coming weekend. Will report back with my findings.
Surely we can't be the only two having this issue.
-
Agreed. Two people with working dual WANs that suddenly stop working.
Some kind of change happened with Centurylink, Starlink, or pfSense.
-
Having basically the same issue as well. Dual WAN in a gateway group, Starlink as Tier 1 and DSL as Tier 2. No issues for the last 2+ years until around Aug. 24th when Starlink suddenly started dropping out about every 2 hours.
Will be back on site this Thursday to do more troubleshooting and will see if disabling the DSL connection provides the same results that you guys saw. I also have a second Starlink dish that I am going to add into the mix just for fun.
-
Just some of my thoughts.
@preston said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
The correlation I see here is that at 11:55:30 dhc client binds to the Centurylink IP with a 900 second renewal.
CenturyLink = interface ix3 - a renewal of 150 seconds ? Right ?
For IPv4, 900 sec or 15 minutes is already very low, but ok, as this includes 'new' technology, why not.
Then what is this CenturyLink? 900 seconds = the Starlink, right?
...
Sep 9 11:55:33 kea-dhcp4 42003 INFO [kea-dhcp4.lease-cmds-hooks.0x3156f3012000] LEASE_CMDS_DEINIT_OK unloading Lease Commands hooks library successful
Sep 9 11:55:33 kea-dhcp4 42003 INFO [kea-dhcp4.dhcp4.0x3156f3012000] DHCP4_SHUTDOWN server shutdown
Sep 9 11:55:30 kea-dhcp4 42003 INFO [kea-dhcp4.dhcp4.0x3156f3012000] DHCP4_STARTED Kea DHCPv4 server version 2.4.1 started
Sep 9 11:55:30 kea-dhcp4 42003 WARN [kea-dhcp4.dhcp4.0x3156f3012000] DHCP4_MULTI_THREADING_INFO enabled: yes, number of threads: 2, queue size: 64
...
(from bottom to top): ... and a DHCP LAN server also restarts ... why?
@preston said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Exactly 900 seconds later, Starlink WAN1 goes offline with 100% packet loss.
Here :
Sep 9 12:10:57 dhclient 86826 bound to 76.0.28.79 -- renewal in 900 seconds.
Sep 9 12:10:57 dhclient 47261 Creating resolv.conf
Sep 9 12:10:57 dhclient 46263 RENEW
Sep 9 12:10:57 dhclient 86826 DHCPACK from 71.33.5.2
Sep 9 12:10:56 dhclient 86826 DHCPREQUEST on ix2 to 71.33.5.2 port 67
Sep 9 12:10:45 dhclient 86826 DHCPREQUEST on ix2 to 71.33.5.2 port 67
Sep 9 12:10:39 dhclient 86826 DHCPREQUEST on ix2 to 71.33.5.2 port 67
Sep 9 12:10:36 dhclient 86826 DHCPREQUEST on ix2 to 71.33.5.2 port 67
Sep 9 12:10:34 dhclient 86826 DHCPREQUEST on ix2 to 71.33.5.2 port 67
Sep 9 12:10:32 dhclient 86826 DHCPREQUEST on ix2 to 71.33.5.2 port 67
Sep 9 12:10:31 dhclient 86826 DHCPREQUEST on ix2 to 71.33.5.2 port 67
Sep 9 12:10:30 dhclient 86826 DHCPREQUEST on ix2 to 71.33.5.2 port 67
(from bottom to top)
At 12:10:30 it's renewal time ... DHCPREQUEST on ix2 but no answer.
So a one second delay: ... DHCPREQUEST on ix2 but no answer.
2 seconds delay ... DHCPREQUEST on ix2 but no answer.
4 seconds delay ... DHCPREQUEST on ix2 but no answer.
8 seconds ... DHCPREQUEST on ix2 but no answer, etc. Everything is fine here, the back-off delay roughly doubles at every request - that's normal.
and suddenly :
Sep 9 12:10:57 dhclient 86826 DHCPACK from 71.33.5.2
An answer from the 'starlink' DHCP server came back 27 seconds later - ouf! Not too bad, I guess, as I don't know where the DHCP 'starlink' server is, how many inter-linked laser hops between satellites the packet made, where the ground station is, etc.
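To put numbers on the back-off, here is a small Python sketch that computes the gaps between the DHCPREQUEST lines and the total wait until the DHCPACK. The timestamps are copied from the log above; the helper name `to_time` is just for illustration.

```python
from datetime import datetime

# DHCPREQUEST timestamps on ix2, oldest first, copied from the log above.
requests = ["12:10:30", "12:10:31", "12:10:32", "12:10:34",
            "12:10:36", "12:10:39", "12:10:45", "12:10:56"]
ack = "12:10:57"  # DHCPACK from 71.33.5.2


def to_time(stamp):
    """Parse an HH:MM:SS syslog time (hypothetical helper, date ignored)."""
    return datetime.strptime(stamp, "%H:%M:%S")


times = [to_time(s) for s in requests]
# Seconds between each request and the next one.
gaps = [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]
# Seconds from the first request to the answer.
wait = int((to_time(ack) - times[0]).total_seconds())

print(gaps)  # [1, 1, 2, 2, 3, 6, 11]
print(wait)  # 27
```

The gaps grow but do not double exactly. As far as I know dhclient adds some randomness to its back-off, which would explain that, but treat this as an assumption.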
Let's say ... the link was bad for a moment? Chinese space junk in the way? The link was overloaded?
(we'll never know) At this moment, the same IPv4 = 76.0.2x.79 came back, thus RENEW.
Still, you said : "2 minutes later", counting from the start of the DHCP renewal, the connection is 'dead'.
My question: is this related to the fact that a simple 'DHCPREQUEST' packet took 30 seconds to be answered? If the connection is that bad at that moment, then yeah, the connection will be considered very bad by dpinger (huge pings) ... and it will 'reset' the connection for sure.
edit: wait: satellites are not geo-locked in the sky, they really do move ... was the dish syncing to a new satellite? How long should that take?
Does that change the DHCP server - does the gateway change ?
I know, sorry, more questions than answers. Btw: if the IPv6 gateway has been shut down, why not also silence the LAN IPv6 DHCP server?
Also: why not test with the good old 'ISC-DHCP' stuff instead of KEA, just to be sure?
-
Thanks for the response.
- I can rule out Starlink as a bad connection as I can monitor its stability via the app. It also remains up 99.99% of the time when it is the only interface enabled.
- I have tried reverting to ISC-DHCP with the same results.
- I tried disabling IPv6 everywhere with the same results.
- The 900 seconds is for the CenturyLink DSL (ix2) connection.
- I've tried DNS resolver and DNS forwarder with the same results.
- I'm not 100% certain it's a DHCP issue...just guessing since I found the 900 second entry in the log which is exactly how long it takes for the Starlink WAN to go down.
- I've factory reset and changed the CenturyLink modem's address to 172.16.0.1 (instead of the default 192.168.0.1) with the same results.
- When Starlink WAN1 goes down, it takes about 2 minutes for it to return and then the 15 minute cycle repeats.
- @jimeez also found some reddit posts with people having a similar issue.
Crazy thing is, everything was working fine for a long time, and I didn't make any changes (no updates, no new packages, nothing) when the failure began.
-
I tried disabling the DSL interface (WAN01 - Tier 2) and wouldn't you know, the Starlink interface (WAN02 - Tier 1) starts to work without issue. Re-enable the DSL interface (WAN01 - Tier 2) and within an hour I am seeing the same issue with packet loss shooting up to 100% on the Starlink connection.
The DSL connection has a static IP address, but for years now I have just left the interface IPv4 Configuration Type as "DHCP" without issue. As a quick test I switched it over to "Static IPv4" along with its assigned IP address. Hours now with both the DSL and Starlink interfaces active with no issues. Everything is running like it was a couple weeks back. Will continue to monitor for the rest of the day.
For now, while I monitor I need to sit here and think about why this appears to be the solution for me, and why it is only a recent problem.
@jimeez or @preston do either of you have a static IP for your respective DSL connections?
-
@knoppolis said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
@jimeez or @preston do either of you have a static IP for your respective DSL connections?
My CenturyLink DSL connection is not static. This is a good data point though, thanks. Let us know how it does.
-
@knoppolis said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
@jimeez or @preston do either of you have a static IP for your respective DSL connections?
Also a no here.
I am very curious to see if this holds up for you. Although, if it does, my and @preston's issue will be an even bigger mystery.
-
So @preston mentioned something to me in a private chat that got my wheels turning. He brought up the fact that, prior to this issue, his StarLink connection would drop out around 4AM most days then come right back up. Mine did this too. Like clockwork. I always thought the reason was that the StarLink unit was receiving an update and restarting or something. But now I'm wondering if that 24 hour cycle is somehow related to this problem. Only now instead of every 24 hours it's happening every 15 minutes.
I went back and checked my notification logs. This 24 hour drop out was very consistent. Then on August 24th the 15 minute dropout started happening.
-
So I lobbed a support ticket to StarLink. Referenced this thread. Their response as follows:
Would you be able to confirm how you currently have your health checks set up for a failover to occur? The typical recommendation we provide our enterprise customers is to relax health checks (i.e. pings, etc.) to deal with occasional connection drops from Starlink. Checking every 10 seconds & getting 5 fails in a row would be a good threshold to start with.
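For anyone trying to picture the threshold Starlink describes, here is a rough Python sketch of a "5 fails in a row at a 10 second interval" check. It is only a model of their wording: the function name and constants are made up for illustration, and as far as I know pfSense's dpinger works from averaged loss and latency rather than a consecutive-failure counter.

```python
def is_down(probe_results, fails_needed=5):
    """Return True once `fails_needed` probes in a row have failed.

    probe_results: iterable of booleans, True means the probe was answered.
    Hypothetical helper, not a pfSense or dpinger API.
    """
    streak = 0
    for ok in probe_results:
        streak = 0 if ok else streak + 1
        if streak >= fails_needed:
            return True
    return False


PROBE_INTERVAL_S = 10  # seconds between probes, per Starlink's suggestion

# Four missed probes followed by a reply: tolerated.
print(is_down([True, False, False, False, False, True]))   # False
# Five misses in a row: declared down.
print(is_down([True, False, False, False, False, False]))  # True
# Time without any reply before failover is triggered, in seconds.
print(5 * PROBE_INTERVAL_S)  # 50
```

With those numbers a link has to be silent for roughly 50 seconds before failover, so short drops would be ignored.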
Would anyone be able to tell me where to go look in pfsense to find the answer to their question?
-
Assuming they are talking about System>Routing>Gateways and making edits to the details for your Starlink gateway.
Here is what I currently have for my Starlink gateway:
I have been playing around with the "Packet Loss Thresholds" to keep the failover from happening with a low of 30 and a high of 60. I also played around with the other intervals but it really made no difference.
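To show what I understand those two numbers to do, here is a simplified Python sketch: loss below the low threshold is fine, between low and high is a warning, and at or above high the gateway is treated as down. This is an assumption-laden model, not dpinger's actual code. The function name, the fixed window of probes and the exact comparison (>= rather than >) are all mine.

```python
def gateway_state(replies, low_pct=30, high_pct=60):
    """Classify a gateway from a window of probe results (True = reply received).

    Hypothetical model of low/high packet loss thresholds, not pfSense code.
    """
    if not replies:
        raise ValueError("need at least one probe result")
    lost = sum(1 for ok in replies if not ok)
    loss_pct = 100 * lost / len(replies)
    if loss_pct >= high_pct:
        return "down"
    if loss_pct >= low_pct:
        return "warning"
    return "online"


window = [True] * 6 + [False] * 4    # 4 of 10 probes lost = 40% loss
print(gateway_state(window))         # warning
print(gateway_state([False] * 10))   # down
print(gateway_state([True] * 10))    # online
```

Note that in my case the loss goes to 100%, so no threshold short of disabling the monitoring action would keep the gateway marked up.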
I used this reddit post as a starting point/reference for these adjustments.
https://www.reddit.com/r/PFSENSE/comments/1eg0wpk/starlink_monitoring_in_pfsense/
-
Update on the status/health of my set up. Everything has been running fine for the last 9 hours.
Here is what the packet loss situation looked like for the last 48 hours. No issues after setting a static IP for my DSL connection.
-
@knoppolis said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
Update on the status/health of my set up. Everything has been running fine for the last 9 hours.
Here is what the packet loss situation looked like for the last 48 hours. No issues after setting a static IP for my DSL connection.
Interesting. So how does one set a static IP for their DSL connection? Doesn't the provider set that?
-
@jimeez my DSL service came with a static IP address. In pfSense, go to the interface in question and change the IPv4 Configuration Type to Static IPv4. Below that you will then be able to set the IPv4 Address to said static address.
-
@knoppolis said in Dual WAN Fail-over Issue - Tier 1 WAN frequently failing upon activation of the second Tier 2 WAN:
@jimeez my DSL service came with a static IP address...
I guess that's what I'm asking. You can't just go and assign a static IP to an interface if it's not set up that way with your service provider, can you?
-
@jimeez ahh, sorry. So you can't just give yourself whatever IP address you want, but you could try entering whatever IP you have currently been granted from your provider to test if you get the same results? Honestly not sure if that would work, the big thing would be that as soon as the assigned IP changed your DSL connection would go down until you went back to the DHCP setting for the interface in pfsense.
-
Surprisingly enough, I got a (for now) positive response from StarLink. They are telling me they are going to look into this. Their 1st level support staff asked me some questions which I answered. I then got a reply thanking me for the input and saying that they would dig into it. I was NOT expecting that response. Will see what happens.
-
That's great news. I hope they get back to you with something.
When I contacted them (in the beginning of all of this) they thought it might be my original gen 1 circular dish causing the problems. They sent me a new gen 3 dish and router....but same results.
-
Turns out their response was one that was already bounced around here and on Reddit.
While I do not have any exact guidance for how to configure this specific router, the probe interval does seem very strict, as it is set to check every 500 milliseconds or 0.5 seconds compared to the general recommendation of checking every 10 seconds or 10000 milliseconds. The frequent failovers may be improved if you attempt relaxing these health checks to deal with the occasional drops in service due to utilizing a satellite internet service.
I suspected this would not work but did it anyway so I could report back to them with factual info. And, unfortunately, it did not fix it. Every 15 minutes, like clockwork, to the second, the StarLink interface fails due to high packet loss and eventually is perceived to be offline...even though it is not. After a bit it comes back up. Then it fails again exactly 15 minutes later. Turn off the second interface and everything works fine. Weirdest, most frustrating thing.
Couple more questions for you regarding your config. There has to be something here that will eventually lead to an answer.
- Do you use pfBlocker?
- You already confirmed that you don't use NUT, but have you noticed any other services that fail when you activate the DSL interface like NUT does for me?
- Assuming you have some port forwarding configured, what do you use for the Dest. Address? Individual interfaces? Any? Perhaps something else?
The NUT service failing really has me scratching my head and I believe must be a clue to what's going on. Why would that service fail immediately upon activation of the second (DSL) interface? It never used to. Only after August 22nd....