Multi-WAN gateway failover not switching back to tier 1 gw after back online
-
Are you using DHCP on the WANs or what are you using?
-
I have what looks like the same problem. Gateway group with three gateways. When the tier1 goes down (packet loss) tier2 is used. When tier1 comes back, it does not get used and requires manual reconfigure or reboot. No changes I'm aware of to trigger this behaviour. No hints in the logs.
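For reference, the behaviour being expected here can be modelled as below. This is only a sketch of the expectation (always use the lowest-numbered tier that has an online member), not pfSense's actual code; the `Gateway` class, `pick_active` function, and the gateway names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gateway:
    name: str      # hypothetical label for the link
    tier: int      # 1 is the most preferred tier
    online: bool   # verdict from whatever health monitor is in use


def pick_active(gateways):
    """Return the online gateways in the lowest-numbered tier that has any."""
    online = [g for g in gateways if g.online]
    if not online:
        return []
    best_tier = min(g.tier for g in online)
    return [g for g in online if g.tier == best_tier]


# Tier 1 is back online, so a failover group should be using it again.
group = [
    Gateway("WAN1", tier=1, online=True),
    Gateway("WAN2", tier=2, online=True),
    Gateway("WAN3", tier=3, online=False),
]
print([g.name for g in pick_active(group)])
```

Under this model a dead tier 3 member never affects the result while tier 1 or tier 2 is online, which is why failing to return to tier 1 looks like a fault rather than a configuration choice.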
-
@idiotzoo
Are you using DHCP on the WANs or what are you using on each WAN?
Please don't use DHCP; use static instead and report back. Set your main WAN as the upstream gateway.
-
@mo10 The tier2 link is using PPPoE. Correct me if I'm wrong, but I can't use PPPoE with a static IPv4 config.
I'm not sure what you mean by "Set your main WAN as upstream Gateway".
The main WAN link is static. This is a WISP link with a local NAT gateway connected via a VLAN, so the physical link never goes down from pfSense's point of view. The gateway (a Ubiquiti radio) is also no use as an indicator of the connection's health, so I have to ping something and use packet loss to determine the link's state.
This was working. The only change is the tier3 link appears to have failed entirely, so this is sitting in a pending state. I'm wondering if this is causing the gateway group to behave incorrectly. Next time the issue occurs I'll remove it and see what happens.
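The "ping something and use packet loss" approach can be sketched as a sliding window of probe results. This is a rough illustration only: the `LossMonitor` class, the window size, and the 10%/20% thresholds are assumptions chosen for the example, not settings taken from this setup.

```python
from collections import deque


class LossMonitor:
    """Track the most recent probe results and classify the link by loss."""

    def __init__(self, window=60, warn_pct=10.0, down_pct=20.0):
        # Window and thresholds are illustrative assumptions.
        self.results = deque(maxlen=window)
        self.warn_pct = warn_pct
        self.down_pct = down_pct

    def record(self, replied):
        """Store one probe result: True if a reply came back."""
        self.results.append(bool(replied))

    def loss_pct(self):
        """Percentage of probes in the window that got no reply."""
        if not self.results:
            return 0.0
        lost = sum(1 for ok in self.results if not ok)
        return 100.0 * lost / len(self.results)

    def state(self):
        loss = self.loss_pct()
        if loss >= self.down_pct:
            return "down"
        if loss >= self.warn_pct:
            return "warning"
        return "online"


monitor = LossMonitor(window=10)
for replied in [True] * 7 + [False] * 3:
    monitor.record(replied)
print(monitor.loss_pct(), monitor.state())
```

Because old results age out of the window, a link that recovers should be reported as online again once enough good probes have arrived, and that is the point at which a return to tier 1 would be expected.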
-
This sounds like a setup error (pending). Do as you say: delete tier 3 from the group and delete the tier 3 interface, then add everything again.
I was asking about DHCP because that was the reason I had problems. I heard dual PPPoE can cause problems as well, but I am not sure.
-
@mo10 Someone on site has verified the tier3 connection is borked at layer1 so removing that isn't going to hurt anything. Certainly having a dead link on a lower priority (higher tier) shouldn't cause any issues with the gateway group behaviour and if it does, this is a bug.... but it would be nice to know why a functioning system has stopped working. As this line failure is the only change I'm hopeful that at least explains the issue.
-
I have found that there are really strange problems when unplugging and replugging a cable on any WAN port while using DHCP on it.
So maybe you can reproduce your problem by physically unplugging and replugging on your interfaces.
What helped me without needing to reboot: just hit save on any interface.
-
I removed the failed wan link from the gateway group, no difference. I've now disabled that interface entirely, still doesn't work.
I'm at a bit of a loss.
Anybody know if there's any debugging I can look at? Right now I only know there's a problem if the users tell me. The gateways all look fine, it just doesn't switch back to the tier1 as it should.
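One way to find out before the users do would be a small check that flags when tier 1 has been healthy for a while but traffic is still on a higher tier. The sketch below shows only the decision logic; `failback_overdue` and the 300-second grace period are hypothetical, and how the two inputs are collected from the firewall is deliberately left out.

```python
def failback_overdue(tier1_online_since, active_tier, now, grace_seconds=300):
    """Return True if tier 1 has been healthy for at least the grace period
    while traffic is still using a higher tier.

    tier1_online_since: timestamp (seconds) when tier 1 last came back,
                        or None if tier 1 is currently down.
    active_tier:        tier number the gateway group is currently using.
    now:                current timestamp in seconds.
    """
    if tier1_online_since is None:
        return False  # tier 1 is down, so staying on tier 2 is correct
    if active_tier == 1:
        return False  # already failed back
    return (now - tier1_online_since) >= grace_seconds


# Tier 1 recovered 15 minutes ago but the group is still on tier 2.
print(failback_overdue(tier1_online_since=0, active_tier=2, now=900))
```

The grace period is there so a link that is flapping does not raise an alert on every brief recovery.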
-
@idiotzoo said in Multi-WAN gateway failover not switching back to tier 1 gw after back online:
I have what looks like the same problem. Gateway group with three gateways. When the tier1 goes down (packet loss) tier2 is used. When tier1 comes back, it does not get used and requires manual reconfigure or reboot. No changes I'm aware of to trigger this behaviour. No hints in the logs.
Hello!
What gateway group (failover/loadbalance) are you using as the Default Gateway on System -> Routing -> Gateways?
What gateway group(s) are you using for all your rules with outbound WAN traffic?
John
-
@serbus sorry for the delay in replying.
The system default is wan1 (the fast wan link)
Outbound traffic with a source on the LAN is using a gateway group called office_internet, with wan1 as tier1 and a slower PPPoE ADSL link as tier2.
-
I've got this issue at a client's office.
If I set a 2 tier gateway group as default gateway for IPv4, on failure of tier 1, tier 2 takes over but doesn't switch back to tier 1 on tier 1 recovery (confirmed on gateway status page). This doesn't happen even after waiting for an hour.
Interestingly, if I set default gateway to the tier 1 link (which then works as expected) and back to the gateway group, the group is still stuck at tier 2.
This is the same with 'member down' and 'packet loss' options. Tier 1 is PPPoE with dynamic gateway (if that makes a difference).
One thing that may be relevant is that both tier 1 and tier 2 have the same gateway IP. Tier 1 is PPPoE over VDSL, tier 2 is L2TP to the same ISP.
pfSense seems to fail to create correct default routes after failures, and I'm often left with no default route despite having working gateways set and active. I need to disable and re-enable the interface to bring it back.
Is this a PPPoE thing?
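A quick way to confirm the missing default route is to scan the routing table for a `default` entry. The sketch below parses text in the layout that FreeBSD's `netstat -rn -f inet` normally prints; the `default_gateway` helper and the sample table (using documentation addresses) are illustrative, not captured from this box.

```python
def default_gateway(netstat_output):
    """Return the gateway of the first default route listed, or None."""
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "default":
            return fields[1]
    return None


# Illustrative sample only; real output will differ per system.
SAMPLE = """\
Routing tables

Internet:
Destination        Gateway            Flags     Netif Expire
default            192.0.2.1          UGS        igb0
127.0.0.1          link#4             UH          lo0
"""

gateway = default_gateway(SAMPLE)
print(gateway if gateway else "no default route")
```

Running a check like this on a schedule, and logging when it returns nothing while the gateways show as online, would at least give a timestamp to match against the system logs.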
-
@basicmonkey said in Multi-WAN gateway failover not switching back to tier 1 gw after back online:
working gateways set and active. I need to disable and re-enable the interface to bring it back.
You might be on to something with it being a PPP connection. The tier2 gateway on my network used to be via a second router; now it’s PPPoE from pfSense. That’s probably when it stopped working.
I’m still convinced this is a bug and pfsense is broke.
-
I’m still convinced this is a bug and pfsense is broke.
You are correct. The odd thing is that it sometimes works. I've told it to drop states when there is a change, which definitely is a prerequisite, but it sometimes does go back to the tier 1 gateway. I really wish this would get fixed, it's one of the big advantages PFSense has (if it worked right!)
Bob
-
@idiotzoo What's your tier 1? Is it the same gateway IP?
I've not come across any of this before as my pf install sits behind a router that deals with all the different WAN options seamlessly. It's only recently that 2 clients have wanted to go down the pf route, have both bought 7100s and now the WAN failover isn't working as well as it should.
Since both sites have all their NAT and other services bound to Virtual IPs, killing states isn't too much of an issue. The ISP just moves their /30 and /29s over when the main line goes down.
The issue is gateway group recovery to tier 1, and repeated states of no default route in the table even though all gateways are available.
-
@basicmonkey Not the same, different ISPs. The site has a WISP router gateway as the tier1 and pppoe connection (ADSL) from pfsense as tier2. The failover is never going to be nice and seamless but actually failover to tier2 always works, it just never goes back to tier1.
It’s set up this way because the WISP link has three radio hops and a very long piece of Cat5 between the pfSense box and the nearest tower, which I think is at least one more radio hop to the fibre backhaul. It’s all best efforts with more single points of failure than I care to count. Various bits of the chain are prone to random power outages. It’s all amazingly reliable considering, but the ADSL line, though slow, tends to always work.... unless too much water gets into the junction boxes. Isn’t it fun looking after a rural network.
I’m minded to set up the ADSL on a separate router so pfSense isn’t handling the PPPoE and see if it starts working. To be honest I’d consider ditching pfSense at this stage and trying other things, but we bought Netgate hardware so I’m stuck with it for now.
@nleaudio Thought I’d say a quick Hi Bob from an ampmix user :)
-
@idiotzoo I'm rural too, but thanks to some UK investment we've got 2x fibre lines with 330/50 on each! Very lucky. Due to our location, there's a few miles of core between us and the exchange so I need a backup. I can get some 4G off a cell a distance away using a directional and a Teltonika router. Our ISP will route our /29s down L2TP over 4G if fibres go down.
All of the above sits on a Cisco 2921 which handles per-packet load sharing across the fibres and then failover to the L2TP/4G.
On a failure, we lose a second or so but connections stay alive as main IPs don't change. It really helps having a router in front of the pf. Let the pf do the things it's great at and let the router do the thing it's great at.
The 2921 is a bit of an old beast so looking at MikroTik CCR1009-7G-1C-1S+ for these 2 client offices. They have their own way of doing per-packet. Never tried them before, config looks a bit tricky but always a learning curve!
-
@basicmonkey that’s some nice fibre action. Sadly the outfit I support, near Scarborough, has no such connectivity. They’re only a couple of miles from the nearest exchange, but the infrastructure is ancient. Having said the ADSL is reliable, it has been, but one of the two lines was taken out during some rain and never fixed.
The Mikrotik gear is incredibly full featured but I find the UI and config opaque and it’s certainly a steep learning curve.
I’m really using pfSense as a router with failover.... there’s minimal firewalling required; it’s the multi-WAN functionality that I’ve used for years, always successfully until the recent problems.
The problem with doing anything at the ISP level is we only have the WISP (up to 21mb) or ADSL (1-2mb). 4G is available but patchy, though I plan to get one of the Mikrotik 4G routers with a built-in antenna; they can do a great job in an area with marginal signal.
Fortunately there’s nothing requiring the IPs stay the same... though I am interested in playing with mptcp router to load balance across the available links. It will depend if our WISP is willing to offer a termination at their end, it’s probably not going to pan out but it looks interesting.
-
@idiotzoo Feel for you having to use any RF. It's an absolute pain!
I'll have a look at the MikroTik config, does look a bit alien when I'm so used to Cisco. Might take a punt on one of the cheaper ones. Have you used one in anger?
I'd recommend the Teltonika 4G gear, very good. Have just moved from Sierra.
It'd be great if I could get the pf multi-wan working smoothly as it would save another bit of kit for my clients.
-
@basicmonkey I’ve got a Mikrotik Hex as the home broadband router. It’s amazingly capable. I’ve got something like 18 subnets for lab work at the moment, a couple of IPsec site-to-site tunnels with tplink and pfSense at the other ends, and BGP and OSPF peering to things running in GNS3. To be honest there’s nothing I’ve tried that it can’t do, providing I can work out how to do it. For £60 or thereabouts it’s worth it to at least get some familiarity with the OS. I find the firewall and NAT to be deeply unintuitive. There seem to be about 4 places to define a VLAN and I’m still not sure what’s actually the right way to do some things. I’ve ended up with weird MTU settings on SVIs so I’ve definitely done something wrong, but it works.
-
@idiotzoo I'll give it a try! Always good to learn something new. All my local layer 3 is done in a stack of 3850s, firewall and NAT in pf. Literally just need a packet pusher that can do per-packet load sharing and failover.
-