Multi-WAN gateway failover not switching back to tier 1 gw after back online
-
@mo10 my problem is that post recovery, WAN1 never goes back to being default. I have to use a script to bring down WAN2 so that WAN1 becomes default again. Not an ideal solution but it works.
-
@ibbetsion
Would you be able to run a test: reset your pfSense (save the configuration first), set up only the multi-WAN, and try again?
-
Hello!
Assuming a pretty standard multiwan: WAN1 -> tier1, WAN2 -> tier2, PREFWAN1/PREFWAN2/BALANCE gwgroups.
Whether you have states left open on WAN2 after WAN1 comes back up (sticky connections?), or the default route in the system routing table doesn't switch back to WAN1 after it recovers (make sure you don't have BALANCE as the default gateway), I believe the best approach is to policy route everything.
After WAN1 comes back, does traffic routed to a PREFWAN1 gwgroup still go out WAN2?
John
-
Yes, it's quite possible that I moved the config to a new box.
And yes, when WAN1 comes back, new connections still go out WAN2.
It does appear to recover some time later though - maybe by the following day? I've not looked into it carefully.
Bob
-
Are you using DHCP on the WANs or what are you using?
-
I have what looks like the same problem. Gateway group with three gateways. When the tier1 goes down (packet loss) tier2 is used. When tier1 comes back, it does not get used and requires manual reconfigure or reboot. No changes I'm aware of to trigger this behaviour. No hints in the logs.
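The behaviour expected from a tiered failover group can be sketched roughly as follows. This is only an illustration of the expectation described in this post, not pfSense's actual code; the `Member` class, the `pick_active` function, and the gateway names are all made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    name: str     # gateway name (hypothetical, e.g. "WAN1")
    tier: int     # 1 is the most preferred tier
    online: bool  # outcome of whatever monitoring is in use


def pick_active(members: List[Member]) -> Optional[Member]:
    """Return the online member with the lowest tier, or None if all are down."""
    candidates = [m for m in members if m.online]
    return min(candidates, key=lambda m: m.tier) if candidates else None


group = [Member("WAN1", 1, True), Member("WAN2", 2, True), Member("WAN3", 3, False)]
print(pick_active(group).name)  # WAN1

group[0].online = False         # tier 1 starts dropping packets
print(pick_active(group).name)  # WAN2

group[0].online = True          # tier 1 recovers
print(pick_active(group).name)  # WAN1 (the step that is not happening here)
```

The selection is re-evaluated from scratch on every status change, so a recovered tier 1 should win again as soon as it is marked online.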
-
@idiotzoo
Are you using DHCP on the WANs or what are you using on each WAN?
Please don't use DHCP; use static instead and report back. Set your main WAN as upstream Gateway.
-
@mo10 The tier2 link is using PPPoE; correct me if I'm wrong, but I can't use PPPoE with a static IPv4 config.
I'm not sure what you mean by "Set your main WAN as upstream Gateway".
The main WAN link is static. This is a WISP link with a local NAT gateway connected via a VLAN, so the physical link never goes down from pfSense's point of view. The gateway (a Ubiquiti radio) is also no use as an indicator of the connection's health, so I have to ping something further upstream and use packet loss to determine the link's state.
This was working. The only change is the tier3 link appears to have failed entirely, so this is sitting in a pending state. I'm wondering if this is causing the gateway group to behave incorrectly. Next time the issue occurs I'll remove it and see what happens.
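For anyone unfamiliar with the idea, judging a link by packet loss amounts to something like the sketch below. It is a simplified illustration, not pfSense's monitoring daemon; the `LossMonitor` class, the window size, and the 20% threshold are assumptions chosen for the example.

```python
from collections import deque


class LossMonitor:
    """Track the most recent probe results and flag the link by loss percentage."""

    def __init__(self, window: int = 20, down_at: float = 20.0):
        self.results = deque(maxlen=window)  # True = reply received
        self.down_at = down_at               # loss % at which the link counts as down

    def record(self, reply_received: bool) -> None:
        self.results.append(reply_received)

    def loss_percent(self) -> float:
        if not self.results:
            return 0.0
        lost = sum(1 for ok in self.results if not ok)
        return 100.0 * lost / len(self.results)

    def is_down(self) -> bool:
        return self.loss_percent() >= self.down_at


mon = LossMonitor(window=10, down_at=20.0)
for _ in range(10):
    mon.record(True)
print(mon.loss_percent(), mon.is_down())  # 0.0 False

for _ in range(3):
    mon.record(False)                     # three probes go unanswered
print(mon.loss_percent(), mon.is_down())  # 30.0 True

for _ in range(10):
    mon.record(True)                      # link recovers
print(mon.loss_percent(), mon.is_down())  # 0.0 False
```

Because the window slides, the link is reported as up again once enough successful probes have pushed the lost ones out, which is the recovery signal a gateway group would be expected to act on.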
-
This sounds like a setup error (pending). Do as you say: delete tier 3 from the group and delete the tier 3 interface, then add everything again.
I was asking about DHCP because that was the cause of my problems. I heard dual PPPoE can cause problems as well, but I am not sure.
-
@mo10 Someone on site has verified the tier3 connection is borked at layer1 so removing that isn't going to hurt anything. Certainly having a dead link on a lower priority (higher tier) shouldn't cause any issues with the gateway group behaviour and if it does, this is a bug.... but it would be nice to know why a functioning system has stopped working. As this line failure is the only change I'm hopeful that at least explains the issue.
-
I have found that there are really strange problems when unplugging and replugging a cable on any WAN port while using DHCP on it.
So maybe you can reproduce your problem by physically unplugging and replugging on your interfaces.
What helped me without needing to reboot: just hit save on any interface.
-
I removed the failed wan link from the gateway group, no difference. I've now disabled that interface entirely, still doesn't work.
I'm at a bit of a loss.
Anybody know if there's any debugging I can look at? Right now I only know there's a problem if the users tell me. The gateways all look fine, it just doesn't switch back to the tier1 as it should.
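One thing that could at least be checked from a shell or a cron job is which gateway the system default route currently points at. Below is a minimal sketch that parses routing-table text. It assumes the FreeBSD `netstat -rn -f inet` column layout (Destination, Gateway, Flags, Netif, Expire); the `default_route` helper, the sample table, and the interface names are invented for illustration.

```python
from typing import Optional, Tuple


def default_route(netstat_output: str) -> Optional[Tuple[str, str]]:
    """Return (gateway, interface) of the default route, or None if there is none."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: Destination Gateway Flags Netif [Expire]
        if len(fields) >= 4 and fields[0] == "default":
            return fields[1], fields[3]
    return None


# Made-up sample output, using documentation addresses.
SAMPLE = """\
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            192.0.2.1          UGS        igb0
127.0.0.1          link#4             UH          lo0
192.0.2.0/24       link#1             U          igb0
"""

print(default_route(SAMPLE))  # ('192.0.2.1', 'igb0')
print(default_route("Destination Gateway Flags Netif Expire\n"))  # None
```

On the firewall itself the text could come from `subprocess.run(["netstat", "-rn", "-f", "inet"], capture_output=True, text=True).stdout`. Logging the result every minute would show when the default route stays on the tier2 gateway after tier1 has recovered, or disappears altogether.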
-
@idiotzoo said in Multi-WAN gateway failover not switching back to tier 1 gw after back online:
I have what looks like the same problem. Gateway group with three gateways. When the tier1 goes down (packet loss) tier2 is used. When tier1 comes back, it does not get used and requires manual reconfigure or reboot. No changes I'm aware of to trigger this behaviour. No hints in the logs.
Hello!
What gateway group (failover/loadbalance) are you using as the Default Gateway on System -> Routing -> Gateways?
What gateway group(s) are you using for all your rules with outbound WAN traffic?
John
-
@serbus sorry for the delay in replying.
The system default is wan1 (the fast wan link)
Outbound traffic with a source on the LAN uses a gateway group called office_internet, with wan1 as tier1 and a slower PPPoE ADSL link as tier2.
-
I've got this issue at a client's office.
If I set a 2-tier gateway group as the default gateway for IPv4, on failure of tier 1, tier 2 takes over but doesn't switch back to tier 1 when tier 1 recovers (confirmed on the gateway status page). It doesn't switch back even after waiting for an hour.
Interestingly, if I set default gateway to the tier 1 link (which then works as expected) and back to the gateway group, the group is still stuck at tier 2.
This is the same with 'member down' and 'packet loss' options. Tier 1 is PPPoE with dynamic gateway (if that makes a difference).
One thing that may be relevant is that both tier 1 and tier 2 have the same gateway IP. Tier 1 is PPPoE over VDSL, tier 2 is L2TP to the same ISP.
pfSense seems to fail to create correct default routes after failures, and I'm often left with no default route despite having working gateways set and active. I need to disable and re-enable the interface to bring it back.
Is this a PPPoE thing?
-
@basicmonkey said in Multi-WAN gateway failover not switching back to tier 1 gw after back online:
working gateways set and active. I need to disable and re-enable the interface to bring it back.
You might be on to something with it being a PPP connection. The tier2 gateway on my network used to be via a second router; now it’s PPPoE from pfSense. That’s probably when it stopped working.
I’m still convinced this is a bug and pfsense is broke.
-
I’m still convinced this is a bug and pfsense is broke.
You are correct. The odd thing is that it sometimes works. I've told it to drop states when there is a change, which is definitely a prerequisite, but even so it only sometimes goes back to the tier 1 gateway. I really wish this would get fixed; it's one of the big advantages pfSense has (if it worked right!)
Bob
-
@idiotzoo What's your tier 1? Is it the same gateway IP?
I've not come across any of this before as my pf install sits behind a router that deals with all the different WAN options seamlessly. It's only recently that 2 clients have wanted to go down the pf route, have both bought 7100s and now the WAN failover isn't working as well as it should.
Since both sites have all their NAT and other services bound to Virtual IPs, killing states isn't too much of an issue. The ISP just moves their /30 and /29s over when the main line goes down.
The issue is gateway group recovery to tier 1, and repeated states of no default route in the table even though all gateways are available.
-
@basicmonkey Not the same, different ISPs. The site has a WISP router gateway as the tier1 and pppoe connection (ADSL) from pfsense as tier2. The failover is never going to be nice and seamless but actually failover to tier2 always works, it just never goes back to tier1.
It’s setup this way because the WISP link has three radio hops and a very long piece of cat5 between the pfsense box and the nearest tower, which I think is at least one more radio hop to the fibre backhaul. It’s all best efforts with more single points of failure than I care to count. Various bits of the chain are prone to random power outages. It’s all amazingly reliable considering but the ADSL line, though slow, tends to always work.... unless too much water gets into the junction boxes. Isn’t it fun looking after a rural network.
I’m minded to set up the ADSL on a separate router so pfSense isn’t handling the PPPoE and see if it starts working. To be honest I’d consider ditching pfSense at this stage and trying other things, but we bought Netgate hardware so I’m stuck with it for now.
@nleaudio Thought I’d say a quick Hi Bob from an ampmix user :)
-
@idiotzoo I'm rural too, but thanks to some UK investment we've got 2x fibre lines with 330/50 on each! Very lucky. Due to our location, there's a few miles of core between us and the exchange so I need a backup. I can get some 4G off a cell a distance away using a directional and a Teltonika router. Our ISP will route our /29s down L2TP over 4G if fibres go down.
All of the above sits on a Cisco 2921 which handles per-packet load sharing across the fibres and then failover to the L2TP/4G.
On a failure, we lose a second or so but connections stay alive as main IPs don't change. It really helps having a router in front of the pf. Let the pf do the things it's great at and let the router do the thing it's great at.
The 2921 is a bit of an old beast so looking at MikroTik CCR1009-7G-1C-1S+ for these 2 client offices. They have their own way of doing per-packet. Never tried them before, config looks a bit tricky but always a learning curve!