Multi-WAN gateway failover not switching back to tier 1 gw after back online
-
:( :(
That's bad news. Yes, I already use the Google DNS servers 8.8.8.8 and 8.8.4.4 as monitor IPs, and the DNS-to-gateway association is selected in the General section. I'm pretty sure the configuration is OK (at this point, I've reviewed it a hundred times :) ) and this looks like a pfSense problem.
I've opened a bug (https://redmine.pfsense.org/issues/5090#change-20401). It would be good if you added your comments there, so they can investigate the problem.
-
I tend to believe the same: pfSense has an issue. I spent a week trying to figure this out. I will try one more setup from scratch and take snapshots in VMware.
-
…and this looks like a pfsense problem...
I cannot second that!
I have had this working for quite some time now with WAN1 (100 Mb cable) and a rather old WAN2 (6 Mb DSL).
I get failover to W2 if W1 is down, and it switches back to W1 immediately when W1 is available again.
Show us your System | Routing | Gateway Groups page.
-
Hi. Thanks for the reply. Please see the screenshots.
-
Do you have one or two Gateway Groups defined? The one with time stamp 02-25-44.
What you call "WANGROUP" is easier to handle when called "PPPoE 2 UPC".
Now you need an additional "UPC 2 PPPoE" group with reversed tiers.
Add another firewall rule for that one as well and it should work.
And start by setting both "Trigger levels" to "Member Down".
-
Hi again
But is this a temporary workaround you've found to make it work, or is it the normal configuration for failover? I can't understand why we have to create two gateway groups, because we only want one failover direction, not the opposite one. I've been reviewing the documentation and online info, and you should only need to create one gateway group.
When you create the second firewall rule with the inverted tier numbers, if that rule goes after the normal rule, the firewall should never reach it: rules are evaluated sequentially, so the first rule matches and directs the traffic through the main group (the group is online, because WAN2 is online). If the second rule is being reached in your case, I think something strange is happening.
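To make the first-match argument above concrete, here is a minimal toy model in Python. It is only a sketch of the reasoning, not pfSense's actual rule engine; the `Rule` class, the `first_match` helper, the 192.168.1.0/24 LAN network and the group name "REVERSED_GROUP" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass
class Rule:
    description: str
    source: str          # source network the rule matches (CIDR)
    gateway_group: str   # gateway group the rule routes through


def first_match(rules, src_ip):
    """Return the first rule whose source network contains src_ip, else None."""
    addr = ip_address(src_ip)
    for rule in rules:
        if addr in ip_network(rule.source):
            return rule  # evaluation stops at the first matching rule
    return None


# Two rules matching the same LAN traffic; only their gateway groups differ.
RULES = [
    Rule("LAN via primary group", "192.168.1.0/24", "WANGROUP"),
    Rule("LAN via reversed group", "192.168.1.0/24", "REVERSED_GROUP"),
]

hit = first_match(RULES, "192.168.1.50")
print(hit.gateway_group)  # WANGROUP
```

In this model the second rule can never be selected for LAN traffic, which is exactly why a second rule placed below the first one would not be expected to change anything.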
I'm probably wrong or missing something, but I just want to clarify whether your configuration works because it is the normal config, or whether some other problem makes it work even though it is not the right configuration.
I can't test it at my client's site right now, because it is a production system, but I'll try to set up a demo in our office to see if I can verify your configuration. If it works, it could be a good temporary patch, but I still think something is not right. The group should recover gateways and restore the order of preference automatically (that's what the tier is for).
Thanks for your help
-
Hi. Indeed, a second rule makes no sense to me either, but I will test it as advised (for the second time, actually). I've even tried with 3 rules, with the same results.
In the meantime I've done more testing and reached a conclusion, but first please let me know: how did you simulate the main WAN failure?
-
A second rule and gateway group are not necessary unless you want some traffic to prefer the second route and fail over the other way.
You only need the one Tier 1 to Tier 2 group to fail all traffic over in that direction.
It certainly should recover and "fail back" when the Tier 1 route comes back up.
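The expected behavior described above can be sketched as a small Python model: among the gateways that are currently up, the lowest tier number wins, so failback happens as soon as the Tier 1 gateway is reported online again. This is an illustration of the intended semantics only, not pfSense code; the function name `active_gateways` and the gateway names in `TIERS` are assumptions.

```python
def active_gateways(tiers, online):
    """Return the online gateways with the lowest tier number.

    tiers:  dict mapping gateway name -> tier number (1 is most preferred)
    online: set of gateway names currently considered up
    """
    candidates = {gw: tier for gw, tier in tiers.items() if gw in online}
    if not candidates:
        return []  # every member of the group is down
    best = min(candidates.values())
    return sorted(gw for gw, tier in candidates.items() if tier == best)


TIERS = {"WAN1": 1, "WAN2": 2}

print(active_gateways(TIERS, {"WAN1", "WAN2"}))  # ['WAN1']  normal operation
print(active_gateways(TIERS, {"WAN2"}))          # ['WAN2']  WAN1 down: failover
print(active_gateways(TIERS, {"WAN1", "WAN2"}))  # ['WAN1']  WAN1 back: failback
```

If a single group behaves differently from this model after the Tier 1 gateway recovers, the gateway's monitoring status is a reasonable first thing to check, since the selection depends entirely on which members are considered online.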
-
Well, in my case it doesn't. WAN1 is a PPPoE connection, and after I unplug and re-plug the Ethernet cable on WAN1, all connections still go through OPT1.
For testing purposes, I added another router in front of pfSense so it doesn't have to use a PPPoE connection, and assigned WAN1 a static IP, like OPT1 has. In this case it works to some extent if I unplug/re-plug the connection on the first router (taking down the ISP interface, the fiber media converter), so both WAN and OPT1 stay up in pfSense. Still, some sites refuse to load in Chrome, with the following error: DNS_PROBE_FINISHED_NXDOMAIN
So, for now my only conclusion is that there is a problem with pfSense when you unplug and re-plug the cable on an interface using a PPPoE connection. The DNS error is still a mystery to me; I still need to figure it out.
-
Well, you need to fix your DNS. It sounds like it might not be working right on one or both WANs. Are you using the forwarder or the resolver?
It shouldn't matter which WAN the resolver uses, because it should only be trying to talk to authoritative name servers, which should accept queries from everywhere.
The problem lies with forwarders: you usually point the forwarder at ISP caching servers, which might only accept connections from their own network, so it matters which DNS servers are used out of which interface.
-
I tried both the resolver and the forwarder; some sites are just not resolved. Unfortunately, I don't think I can use pfSense in a production environment. For me, at least, failover is not working with PPPoE :(.
Should "State Killing on Gateway Failure" be on?
Thanks
-
Depends on whether or not you want states killed on a gateway failure.
-
Well, isn't it better to have them reset on a gateway failure? The description of this option is a bit tricky.
-
I only skimmed through this thread so I apologize if this was already suggested but – are you certain your clients are set to use the pfSense IP as their DNS resolver? If e.g. you have a gateway defined with a custom monitor IP of 8.8.8.8 or the DNS servers on your General settings page are locked to a specific gateway, then static routes are built which will force traffic out that specific gateway, even if it's down. So this could result in DNS being "dead" when one of the gateways goes down. Is this possibly what's happening?
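The effect described above can be illustrated with a small Python model. It assumes, as the post says, that a DNS server locked to a gateway is only reachable through that gateway; the function name `usable_dns_servers`, the gateway names and the particular server-to-gateway assignment are hypothetical and not taken from anyone's real configuration.

```python
def usable_dns_servers(dns_bindings, gateways_up):
    """Return the DNS servers that can still be reached.

    dns_bindings: dict mapping DNS server IP -> gateway it is locked to,
                  or None if the server is not locked to any gateway
    gateways_up:  set of gateway names currently up
    """
    usable = []
    for server, gateway in dns_bindings.items():
        if gateway is None:
            # Not pinned: follows normal routing, so any live gateway will do.
            if gateways_up:
                usable.append(server)
        elif gateway in gateways_up:
            # Pinned by a static route: only usable while that gateway is up.
            usable.append(server)
    return usable


# Hypothetical assignment: one DNS server locked to each WAN.
BINDINGS = {"8.8.8.8": "WAN1", "8.8.4.4": "WAN2"}

print(usable_dns_servers(BINDINGS, {"WAN1", "WAN2"}))  # ['8.8.8.8', '8.8.4.4']
print(usable_dns_servers(BINDINGS, {"WAN2"}))          # ['8.8.4.4']
```

In this model, if every configured DNS server is pinned to the gateway that failed, nothing remains usable and name resolution appears "dead" even though the other WAN is up.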
-
Monitor IPs are currently set to one from each ISP. In General I have a pair of DNS servers set for each gateway (four servers in total). The clients' DNS is manually set to 192.168.1.1 (pfSense).
-
I tried both the resolver and the forwarder, some sites are just not resolved.
If you do not know how to get more information than that about what's actually happening, you are probably in over your head.
-
Oh, nice. What can I say? Thanks? :)… Thanks.
-
Hi
yanakis, in my case we have the fiber media converter and the router (not PPPoE), and the same thing happens. I switched off or disconnected the router, but never tried switching off the media converter (good idea).
In my recent installations I don't usually use the DNS forwarder/resolver for localhost, but in this case I do (I configured it in the past and never changed it). Have you tried deactivating that option in General Settings? Just to see if something changes.
I understand luckman212's concerns about DNS and the static routes pfSense creates for each DNS server associated with a WAN, but in my case we had two different DNS servers configured and working, and it still failed. And in any case, once the WAN recovers, DNS works again and everything should work again.
By the way, I tried with "State Killing on Gateway Failure" on and off, and recovery fails in both cases. I keep it unchecked, because with external SIP connections that is mandatory for failover to work (at least in my case). And I personally prefer to reset states if a gateway fails, to avoid problems.
Regards
PS: I don't think you are in over your head… Thanks for everything.
-
Well, I left the DNS fields in General empty, but failback to WAN still doesn't work after WAN recovery unless I change something in Firewall or Routing and apply the changes :(
-
…and this looks like a pfsense problem...
I cannot second that!
I have had this working for quite some time now with WAN1 (100 Mb cable) and a rather old WAN2 (6 Mb DSL).
I get failover to W2 if W1 is down, and it switches back to W1 immediately when W1 is available again.
Show us your System | Routing | Gateway Groups page.
Hi Cris. Can you please post your setup? Thanks