Dual WAN Failover bounces on and off…
-
Looks fine.
My next step would be something like this:
Use the DSL line as the gateway for a while.
Change WAN to DSL, OPT1 to Cable.
Bridge the DSL modem if you can.
Change the PC and use the live CD.
-
Last night I switched the two connections. Everything seemed to work fine: when I pulled the plug on the DSL, it switched over to the cable modem and seemed to stay active.
When I switched it back the other way around, my issue persisted, unfortunately. The DSL connection on the failover is just really flaky. I really think it's something with my setup and the monitoring, but I'm not sure.
But it's not a big deal. It just would have been nice to get it working. Thanks for your help though.
-Sean
-
The other thing to note, in case anyone has any weird things to try, is that I flipped my pool to load balance and everything is working fine. Both connections show Online constantly, and I verified it was balancing between the two.
-
Tseanf,
I am finishing setting up multi-WAN with failover, and I was wondering: how or where do you see if one of the links is "Online" or not? Where do you go to see that status?
Thanks!
-
Current status can be seen at Status > Load Balancer, and historical changes at Status > System Logs, Load Balancer tab.
-
You know, I noticed a similar issue with failover "flapping" just today on my Multi WAN setup (T1 and DSL line, static IPs on both). Running pfSense 1.2.
I have three pools: a load balance pool, a WAN1-fails-to-WAN2 pool, and a WAN2-fails-to-WAN1 pool, so that I can prefer certain gateways depending on the type of traffic.
What happened was that over a period of about 5-10 minutes (I can get logs tomorrow), both WAN connections would fail ping tests for a few seconds, then they'd be OK for a few seconds, then they'd fail again for a few seconds, and so on. It sounds strangely familiar to the issue tseanf describes, so I will try testing failover of each line tomorrow to see if one line failing triggers the flapping for some reason.
-
I didn't know about the load balancer logs until Hoba mentioned them. I will also play with it more tonight and report what logs I have when my connection is flapping.
-
Status > RRD Graphs, quality graphs might be interesting as well. If the connections become unreliable, you should see it there too. You might even be able to tell whether something else is going on by comparing other graphs (states, pps, …) against the times when your WANs go down.
-
Looking at my incident yesterday, here is a log snippet:
Apr 29 16:12:16 slbd[427]: Service WAN1FailsToWAN2 changed status, reloading filter policy
Apr 29 16:12:16 slbd[427]: ICMP poll succeeded for 12.213.4.24, marking service UP
Apr 29 16:12:16 slbd[427]: ICMP poll succeeded for 68.94.156.1, marking service UP
Apr 29 16:12:15 slbd[427]: Switching to sitedown for VIP 127.0.0.1:666
Apr 29 16:12:15 slbd[427]: Switching to sitedown for VIP 127.0.0.1:666
Apr 29 16:12:14 slbd[427]: Service WAN2FailsToWAN1 changed status, reloading filter policy
Apr 29 16:12:14 slbd[427]: ICMP poll succeeded for 68.94.156.1, marking service UP
Apr 29 16:12:14 slbd[427]: ICMP poll succeeded for 12.213.4.24, marking service UP
Apr 29 16:12:11 slbd[427]: Service WAN1FailsToWAN2 changed status, reloading filter policy
Apr 29 16:12:11 slbd[427]: ICMP poll failed for 12.213.4.24, marking service DOWN
Apr 29 16:12:11 slbd[427]: ICMP poll failed for 68.94.156.1, marking service DOWN
Apr 29 16:12:10 slbd[427]: Switching to sitedown for VIP 127.0.0.1:666
Apr 29 16:12:10 slbd[427]: Switching to sitedown for VIP 127.0.0.1:666
Apr 29 16:12:09 slbd[427]: Service WAN2FailsToWAN1 changed status, reloading filter policy
Apr 29 16:12:09 slbd[427]: ICMP poll failed for 68.94.156.1, marking service DOWN
Apr 29 16:12:09 slbd[427]: ICMP poll failed for 12.213.4.24, marking service DOWN
Apr 29 16:12:06 slbd[427]: Service LoadBalance changed status, reloading filter policy
Apr 29 16:12:06 slbd[427]: ICMP poll failed for 68.94.156.1, marking service DOWN
Apr 29 16:12:06 slbd[427]: ICMP poll failed for 12.213.4.24, marking service DOWN
68.94.156.1 is the monitor IP for WAN1, 12.213.4.24 is the monitor IP for WAN2.
This basically continued for nearly 10 minutes (it started at 16:12:06 and ended at 16:21:18), with the pools going up and down every few seconds, until it mysteriously cleared itself up.
The quality graphs indicated heavy packet loss on both links during this time.
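In case anyone wants to quantify the flapping from a saved copy of these logs, here's a minimal Python sketch that counts slbd UP/DOWN poll events per monitor IP (the "slbd.log" filename is just an assumption; save the log view to whatever file you like):

import re
from collections import Counter

# Match lines like:
#   Apr 29 16:12:11 slbd[427]: ICMP poll failed for 12.213.4.24, marking service DOWN
pattern = re.compile(r"ICMP poll (succeeded|failed) for ([\d.]+)")

events = Counter()
with open("slbd.log") as fh:
    for line in fh:
        m = pattern.search(line)
        if m:
            status, ip = m.groups()
            events[(ip, status)] += 1

for (ip, status), count in sorted(events.items()):
    print(f"{ip}: poll {status} x{count}")

A high failed count on both IPs at the same timestamps, like above, points at something common to both paths (or at the box itself) rather than at one flaky line.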
Now, I also use SmokePing to monitor the pfSense box and both lines using several other machines around the world, and while one did pick up some packet loss on both lines, the others picked up some packet loss on WAN2 but did not see any at all on WAN1.
Based on this I think it's possible that WAN2 did in fact go down for a bit, but WAN1 most likely did not.
How does the ICMP poll detect failures and then determine that a host is down?
-
It pings the monitor IPs every few seconds, and if a run of x consecutive pings fails (not sure how many at the moment, IIRC 5 or something in that range), the link is considered down. Maybe try some different monitor IPs and see if that makes a difference? An unreliable monitor IP can trigger link-down detection even though the link itself is still up.
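For illustration, the detection logic amounts to something like this consecutive-failure loop (a minimal Python sketch, not slbd's actual code; the interval and threshold values are assumptions):

import subprocess
import time

POLL_INTERVAL = 3   # seconds between polls (assumed value)
FAIL_THRESHOLD = 5  # consecutive failures before marking DOWN (assumed value)

def ping_once(ip, timeout=2):
    # One ICMP echo request; returncode 0 means a reply arrived.
    # These are the common Linux ping(8) flags; BSD ping spells the
    # wait option differently.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), ip],
        capture_output=True,
    )
    return result.returncode == 0

def monitor(ip):
    failures = 0
    up = True
    while True:
        if ping_once(ip):
            failures = 0
            if not up:
                up = True
                print(f"ICMP poll succeeded for {ip}, marking service UP")
        else:
            failures += 1
            if up and failures >= FAIL_THRESHOLD:
                up = False
                print(f"ICMP poll failed for {ip}, marking service DOWN")
        time.sleep(POLL_INTERVAL)

The point being: if the monitor IP itself drops a handful of pings in a row, the pool flaps even though the WAN link is perfectly healthy, which is why the choice of monitor IP matters so much.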
-
OK, I did some testing tonight to grab the logs.
Cable Monitor IP: 68.87.77.130
DSL Monitor IP: 216.17.3.122
I disconnected Cable at 17:40:00 and plugged it back in at 17:44:30.
Load Balancer Logs:
Apr 30 17:40:11 slbd[68488]: ICMP poll failed for 68.87.77.130, marking service DOWN
Apr 30 17:40:11 slbd[68488]: Service FailoverInternet changed status, reloading filter policy
Apr 30 17:40:48 slbd[68488]: ICMP poll failed for 216.17.3.122, marking service DOWN
Apr 30 17:40:48 slbd[68488]: Service FailoverInternet changed status, reloading filter policy
Apr 30 17:40:49 slbd[68488]: Switching to sitedown for VIP 127.0.0.1:666
Apr 30 17:41:00 last message repeated 2 times
Apr 30 17:41:00 slbd[68488]: ICMP poll succeeded for 216.17.3.122, marking service UP
Apr 30 17:41:00 slbd[68488]: Service FailoverInternet changed status, reloading filter policy
Apr 30 17:42:03 slbd[68488]: ICMP poll failed for 216.17.3.122, marking service DOWN
Apr 30 17:42:03 slbd[68488]: Service FailoverInternet changed status, reloading filter policy
Apr 30 17:42:06 slbd[68488]: Switching to sitedown for VIP 127.0.0.1:666
Apr 30 17:42:21 last message repeated 3 times
Apr 30 17:42:25 slbd[68488]: ICMP poll succeeded for 216.17.3.122, marking service UP
Apr 30 17:42:25 slbd[68488]: Service FailoverInternet changed status, reloading filter policy
Apr 30 17:42:56 slbd[68488]: ICMP poll failed for 216.17.3.122, marking service DOWN
Apr 30 17:42:56 slbd[68488]: Service FailoverInternet changed status, reloading filter policy
Apr 30 17:42:58 slbd[68488]: Switching to sitedown for VIP 127.0.0.1:666
Apr 30 17:43:13 last message repeated 3 times
Apr 30 17:43:13 slbd[68488]: ICMP poll succeeded for 216.17.3.122, marking service UP
Apr 30 17:43:13 slbd[68488]: Service FailoverInternet changed status, reloading filter policy
Apr 30 17:43:55 slbd[68488]: ICMP poll failed for 216.17.3.122, marking service DOWN
Apr 30 17:43:55 slbd[68488]: Service FailoverInternet changed status, reloading filter policy
Apr 30 17:43:57 slbd[68488]: Switching to sitedown for VIP 127.0.0.1:666
Apr 30 17:44:17 last message repeated 4 times
Apr 30 17:44:17 slbd[68488]: ICMP poll succeeded for 216.17.3.122, marking service UP
Apr 30 17:44:17 slbd[68488]: Service FailoverInternet changed status, reloading filter policy
Apr 30 17:44:34 slbd[68488]: ICMP poll succeeded for 68.87.77.130, marking service UP
Apr 30 17:44:34 slbd[68488]: Service FailoverInternet changed status, reloading filter policy
RRD Quality Graph (Cable):
A note on the graphs: I was messing around a bit before 5:40 too, and both were flapping. Also, before settling on those monitor IPs (which are DNS servers for each ISP), I tried many, many different monitor IPs.
Thanks,
-Sean
-
tseanf, is your Cable line OPT1 and your DSL line WAN or the other way around?
Your logs look just like mine (except you only have one load balance pool).
Have you checked your routing tables? Are there static routes to the monitor IPs for each interface? Mine look OK now, but I have changed my load balance config since the flapping occurred. It seems like the pings for one of the monitor IPs are getting routed out the wrong interface; there's a quick way to check sketched below.
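If it helps, here's a small Python sketch you could run on the pfSense box to see which interface each monitor IP actually routes out of (it assumes FreeBSD's route(8) "get" output; the monitor IPs here are the ones from this thread, so substitute your own):

import subprocess

MONITOR_IPS = {"WAN1": "68.94.156.1", "WAN2": "12.213.4.24"}

def route_interface(ip):
    # "route -n get <ip>" prints the matching route entry, including
    # an "interface:" line naming the egress interface.
    out = subprocess.run(
        ["route", "-n", "get", ip],
        capture_output=True, text=True,
    ).stdout
    for line in out.splitlines():
        if "interface:" in line:
            return line.split(":", 1)[1].strip()
    return "no route found"

for wan, ip in MONITOR_IPS.items():
    print(f"{wan} monitor {ip} goes out via {route_interface(ip)}")

Each monitor IP should leave via its own WAN's interface; if both leave via the default gateway's interface, the poll for the other WAN will fail (and flap) whenever that one link has trouble.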
-
Cable is WAN and DSL is OPT1…
Yeah, I only set it up to fail over to DSL. My Cable connection is 16 Mbps compared to my 1.5 Mbps DSL, so I don't really care about load balancing.