Dual WAN Failover bounces on and off…



  • Hello,

    I have a pretty simple setup.  I have 2 WAN connections, one is a cable modem, the other is DSL.  I simply want pfSense to failover to the slower DSL connection if the cable modem goes down.

    With the way I have this configured it "kind of" works.  When I unplug my cable modem, it correctly detects that it went offline, and initially still shows the DSL (OPT1) as Online.  And I can usually hit websites for 5 or 10 seconds.  Then the status of my DSL connection continually flips back and forth between Online and Offline.  And depending on when I catch it I can make a connection to the outside.

    I followed all the recommendations from http://doc.pfsense.org/index.php/MultiWanVersion1.2, although I do not have any DMZs set up, and I just have one gateway pool for the failover.

    My primary and secondary DNS servers for pfSense are set to a DNS server from each of my ISPs, and each is set as the monitor IP for its respective connection.  I have tried many, many different options for the monitor IP: the gateway IP, external web IPs, even internal IPs on the DSL side, just trying to get it not to bounce to Offline.

    The only other thing to note is that my WAN connection is hooked directly to my cable modem and I am getting an IP directly from my ISP via DHCP.  The DSL connection (OPT1) connects via static IP to the gateway of my DSL modem/router.

    Any ideas would be appreciated.  It would work great if the secondary DSL connection would just stay Online.

    Thanks,

    Sean



  • Can you post a screenshot of your pools?



  • Sure, I just have the one pool:

    Thanks again!

    -Sean



  • Looks fine.

    My next step would be something like this:
    use the DSL line as the gateway for a while
    change WAN to DSL, OPT1 to Cable
    bridge the DSL modem if you can
    try a different PC and use the live CD



  • Last night I switched the two connections.  Everything seemed to work fine: when I pulled the plug on the DSL, it switched over to the cable modem and seemed to stay active.

    When I switched it back the other way around, my issue persisted unfortunately.  The DSL connection in the failover role is just really flaky.  I really think it's something with my setup and the monitoring, but I'm not sure.

    But it's not a big deal.  It just would have been nice to get it working.  Thanks for your help though.

    -Sean



  • The other thing to note, in case anyone has any weird things to try, is that I flipped my pool to load balancing, and everything is working fine.  Both connections show Online constantly, and I verified it was balancing between the two.



  • @tseanf:

    The other thing to note, in case anyone has any weird things to try, is that I flipped my pool to load balancing, and everything is working fine.  Both connections show Online constantly, and I verified it was balancing between the two.

    Tseanf,
    I am finishing setting up multi-WAN with failover, and I was wondering: how or where do you see whether one of the links is "Online" or not? Where do you go to see that status?
    Thanks!



  • Current status can be seen at Status > Load Balancer, and historical changes at Status > System Logs, Load Balancer.



  • You know, I noticed a similar issue with failover "flapping" just today on my Multi WAN setup (T1 and DSL line, static IPs on both). Running pfSense 1.2.

    I have three pools, a load balance pool, and a WAN1-failto-WAN2 and a WAN2-failto-WAN1 pool so that I can prefer certain gateways depending on the type of traffic.

    What happened was that over a period of about 5-10 minutes (I can get logs tomorrow), both WAN connections would fail ping tests for a few seconds, then they'd be OK for a few seconds, then they'd fail again for a few seconds, and so on.  It sounds strangely familiar to the issue tseanf describes, so I will try testing failover of each line tomorrow to see if one line failing triggers the flapping for some reason.



  • I didn't know about the load balancer logs until Hoba mentioned them.  I will also play with it more tonight and report what logs I have when my connection is flapping.



  • Status > RRD Graphs, quality graphs might be interesting as well.  If the connections become unreliable you should see it there too.  By comparing the other graphs (states, pps, …) against the times when your WANs go down, you might even be able to tell if something else is going on.



  • Looking at my incident yesterday, here is a log snippet:

    Apr 29 16:12:16 	slbd[427]: Service WAN1FailsToWAN2 changed status, reloading filter policy
    Apr 29 16:12:16 	slbd[427]: ICMP poll succeeded for 12.213.4.24, marking service UP
    Apr 29 16:12:16 	slbd[427]: ICMP poll succeeded for 68.94.156.1, marking service UP
    Apr 29 16:12:15 	slbd[427]: Switching to sitedown for VIP 127.0.0.1:666
    Apr 29 16:12:15 	slbd[427]: Switching to sitedown for VIP 127.0.0.1:666
    Apr 29 16:12:14 	slbd[427]: Service WAN2FailsToWAN1 changed status, reloading filter policy
    Apr 29 16:12:14 	slbd[427]: ICMP poll succeeded for 68.94.156.1, marking service UP
    Apr 29 16:12:14 	slbd[427]: ICMP poll succeeded for 12.213.4.24, marking service UP
    Apr 29 16:12:11 	slbd[427]: Service WAN1FailsToWAN2 changed status, reloading filter policy
    Apr 29 16:12:11 	slbd[427]: ICMP poll failed for 12.213.4.24, marking service DOWN
    Apr 29 16:12:11 	slbd[427]: ICMP poll failed for 68.94.156.1, marking service DOWN
    Apr 29 16:12:10 	slbd[427]: Switching to sitedown for VIP 127.0.0.1:666
    Apr 29 16:12:10 	slbd[427]: Switching to sitedown for VIP 127.0.0.1:666
    Apr 29 16:12:09 	slbd[427]: Service WAN2FailsToWAN1 changed status, reloading filter policy
    Apr 29 16:12:09 	slbd[427]: ICMP poll failed for 68.94.156.1, marking service DOWN
    Apr 29 16:12:09 	slbd[427]: ICMP poll failed for 12.213.4.24, marking service DOWN
    Apr 29 16:12:06 	slbd[427]: Service LoadBalance changed status, reloading filter policy
    Apr 29 16:12:06 	slbd[427]: ICMP poll failed for 68.94.156.1, marking service DOWN
    Apr 29 16:12:06 	slbd[427]: ICMP poll failed for 12.213.4.24, marking service DOWN
    

    68.94.156.1 is the monitor IP for WAN1, 12.213.4.24 is the monitor IP for WAN2.

    This basically continued for nearly 10 minutes (it started at 16:12:06 and ended at 16:21:18) with the pools going up and down every few seconds, until it mysteriously cleared up on its own.

    The quality graphs indicated heavy packet loss on both links during this time.

    Now, I also use SmokePing to monitor the pfSense box and both lines using several other machines around the world, and while one did pick up some packet loss on both lines, the others picked up some packet loss on WAN2 but did not see any at all on WAN1.

    Based on this I think it's possible that WAN2 did in fact go down for a bit, but WAN1 most likely did not.

    How does the ICMP poll detect failures and then determine that a host is down?



  • It pings the monitor IPs every few seconds, and if a run of x consecutive pings fails (not sure how many at the moment; IIRC 5 or something in that range), the link is considered down.  Maybe try some different monitor IPs and see if that makes a difference?  An unreliable monitor IP can cause link-down detection even though the link is still up.
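    The consecutive-failure idea described above can be sketched like this. Note that the threshold of 5 and the reset-on-success behaviour are assumptions based on the description, not slbd's actual implementation:

    ```python
    # Sketch of a consecutive-failure link monitor, as described above.
    # FAIL_THRESHOLD = 5 is an assumed value, not slbd's real setting.

    FAIL_THRESHOLD = 5  # consecutive failed polls before marking the link DOWN

    class LinkMonitor:
        def __init__(self, monitor_ip, threshold=FAIL_THRESHOLD):
            self.monitor_ip = monitor_ip
            self.threshold = threshold
            self.failures = 0   # current run of consecutive failures
            self.up = True      # assume the link starts Online

        def record_poll(self, success):
            """Feed in one ICMP poll result; return the resulting link state."""
            if success:
                # Any successful poll resets the failure run and marks the link UP.
                self.failures = 0
                self.up = True
            else:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.up = False
            return self.up
    ```

    With logic like this, a monitor IP that intermittently drops pings will push the counter over the threshold and then reset it a moment later, which is exactly the Online/Offline flapping described in this thread.
    
    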



  • Ok I did some testing tonight to grab the logs.

    Cable Monitor IP: 68.87.77.130
    DSL Monitor IP: 216.17.3.122

    I disconnected Cable at 17:40:00 and plugged it back in at 17:44:30

    Load Balancer Logs:

    
    Apr 30 17:40:11 	slbd[68488]: ICMP poll failed for 68.87.77.130, marking service DOWN
    Apr 30 17:40:11 	slbd[68488]: Service FailoverInternet changed status, reloading filter policy
    Apr 30 17:40:48 	slbd[68488]: ICMP poll failed for 216.17.3.122, marking service DOWN
    Apr 30 17:40:48 	slbd[68488]: Service FailoverInternet changed status, reloading filter policy
    Apr 30 17:40:49 	slbd[68488]: Switching to sitedown for VIP 127.0.0.1:666
    Apr 30 17:41:00 	last message repeated 2 times
    Apr 30 17:41:00 	slbd[68488]: ICMP poll succeeded for 216.17.3.122, marking service UP
    Apr 30 17:41:00 	slbd[68488]: Service FailoverInternet changed status, reloading filter policy
    Apr 30 17:42:03 	slbd[68488]: ICMP poll failed for 216.17.3.122, marking service DOWN
    Apr 30 17:42:03 	slbd[68488]: Service FailoverInternet changed status, reloading filter policy
    Apr 30 17:42:06 	slbd[68488]: Switching to sitedown for VIP 127.0.0.1:666
    Apr 30 17:42:21 	last message repeated 3 times
    Apr 30 17:42:25 	slbd[68488]: ICMP poll succeeded for 216.17.3.122, marking service UP
    Apr 30 17:42:25 	slbd[68488]: Service FailoverInternet changed status, reloading filter policy
    Apr 30 17:42:56 	slbd[68488]: ICMP poll failed for 216.17.3.122, marking service DOWN
    Apr 30 17:42:56 	slbd[68488]: Service FailoverInternet changed status, reloading filter policy
    Apr 30 17:42:58 	slbd[68488]: Switching to sitedown for VIP 127.0.0.1:666
    Apr 30 17:43:13 	last message repeated 3 times
    Apr 30 17:43:13 	slbd[68488]: ICMP poll succeeded for 216.17.3.122, marking service UP
    Apr 30 17:43:13 	slbd[68488]: Service FailoverInternet changed status, reloading filter policy
    Apr 30 17:43:55 	slbd[68488]: ICMP poll failed for 216.17.3.122, marking service DOWN
    Apr 30 17:43:55 	slbd[68488]: Service FailoverInternet changed status, reloading filter policy
    Apr 30 17:43:57 	slbd[68488]: Switching to sitedown for VIP 127.0.0.1:666
    Apr 30 17:44:17 	last message repeated 4 times
    Apr 30 17:44:17 	slbd[68488]: ICMP poll succeeded for 216.17.3.122, marking service UP
    Apr 30 17:44:17 	slbd[68488]: Service FailoverInternet changed status, reloading filter policy
    Apr 30 17:44:34 	slbd[68488]: ICMP poll succeeded for 68.87.77.130, marking service UP
    Apr 30 17:44:34 	slbd[68488]: Service FailoverInternet changed status, reloading filter policy
    
    

    RRD Quality Graph (Cable):

    RRD Quality Graph (DSL):

    A note on the graphs: I was messing around a bit before 5:40 too, and both were flapping.  Also, before settling on those monitor IPs (which are DNS servers for each ISP), I tried many, many different monitor IPs.

    Thanks,

    -Sean
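    Reading log snippets like the ones above by hand gets tedious. Here is a small sketch that tallies the UP/DOWN markings per monitor IP; the regex is based only on the slbd lines shown in this thread, so other message formats would be ignored:

    ```python
    import re
    from collections import Counter

    # Matches the "ICMP poll ... marking service UP/DOWN" lines from the
    # slbd log snippets posted above; timestamps and other messages are skipped.
    POLL_RE = re.compile(
        r"ICMP poll (?:succeeded|failed) for (\d+\.\d+\.\d+\.\d+), "
        r"marking service (UP|DOWN)"
    )

    def count_transitions(log_text):
        """Return a Counter keyed by (monitor_ip, state) for every poll line."""
        counts = Counter()
        for line in log_text.splitlines():
            m = POLL_RE.search(line)
            if m:
                ip, state = m.groups()
                counts[(ip, state)] += 1
        return counts
    ```

    A monitor IP that racks up many DOWN entries in a short window, as 216.17.3.122 does above, is a strong hint that either the link or the monitor host itself is unreliable.
    
    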



  • tseanf, is your Cable line OPT1 and your DSL line WAN or the other way around?

    Your logs look just like mine (except you only have one load balance pool).

    Have you checked your routing tables - are there static routes to the monitor IPs for each interface? Mine look OK now, but I have changed my loadbalance config since the flapping occurred. It seems like the pings for one of the monitor IPs are getting routed out the wrong interface.



  • Cable is WAN and DSL is OPT1…

    Yeah, I only set it up to fail over to DSL.  My cable connection is 16 Mbps compared to my 1.5 Mbps DSL, so I don't really care about load balancing.

