Gateway Monitoring (Advanced - Down)

biggsy

I must say that my preference is for "interval".

I like the suggestion from adam65535 and agree with phil.davis about real-world use.

If the down timer starts at the point where a probe is determined to have failed, then exceeding a simple limit on sequential failed probes should be sufficient to consider the interface as "down".
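To make that concrete, here is a minimal sketch of a "limit on sequential failed probes" rule. This is not how pfSense or its monitoring daemon implements it; the class name `DownDetector` and the `limit` parameter are made up for illustration, and the sketch assumes a single successful probe resets the count.

```python
class DownDetector:
    """Hypothetical helper: mark a gateway down after `limit` consecutive failed probes."""

    def __init__(self, limit):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit      # number of sequential failures tolerated before "down"
        self.failures = 0       # current run of consecutive failed probes
        self.down = False

    def record(self, probe_ok):
        """Feed in one probe result and return True if the gateway is considered down."""
        if probe_ok:
            # Assumption: one successful probe clears the failure run and the down state.
            self.failures = 0
            self.down = False
        else:
            self.failures += 1
            if self.failures >= self.limit:
                self.down = True
        return self.down


detector = DownDetector(limit=3)
states = [detector.record(ok) for ok in [False, False, True, False, False, False]]
```

With a limit of 3, the gateway is only flagged down on the last probe in that sequence, because the one success in the middle resets the count.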

adam65535

What I like about expressing the count is that it makes it obvious how many probes will be sent: the user enters that number explicitly instead of inferring it from some overall timeout.

Between clearing this up and either finding a way to let the user specify a carp/pfsync queue or figuring out how to exempt carp/pfsync traffic from queues, I suspect there will be far fewer people with improper configs that cause the gateway or HA to flap back and forth.

Discussion about carp/pfsync being forced into default queue:
http://forum.pfsense.org/index.php?topic=45045

wallabybob

There are quite a few different topics in the forums dealing with WAN links not recovering from "down". I suspect that a combination of gateway monitoring parameter values and gateway behaviour might contribute to this.

Start of thought experiment
Suppose there is no mechanism to block a "gateway up" process until an already running "gateway down" process has terminated.

Having both "gateway down" and "gateway up" processes running concurrently would probably yield "ugly" results.

"Gateway up" processing starts on a single successful probe, so the minimum time between the start of "gateway down" and the start of "gateway up" is something less than the probe interval. There could be cases where this minimum time is not long enough to ensure that "gateway down" processing completes before "gateway up" processing starts.

Requiring two consecutive successful probes would mean that at least the probe interval would elapse between start of "gateway down" and start of "gateway up". But if the system is "busy" will that be long enough?
End of thought experiment
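A small sketch of the "two consecutive successful probes" idea from the thought experiment. The function name `up_trigger_index` and the `required` parameter are hypothetical, and this only models the counting rule, not anything pfSense actually does.

```python
def up_trigger_index(probes, required=2):
    """Return the index of the probe that would trigger "gateway up", or None.

    `probes` is a sequence of booleans (True = successful probe).
    `required` is the number of consecutive successes needed.
    """
    streak = 0
    for i, ok in enumerate(probes):
        streak = streak + 1 if ok else 0   # a failed probe resets the run
        if streak >= required:
            return i
    return None


probes = [False, True, False, True, True]
single = up_trigger_index(probes, required=1)   # triggers on the first success
double = up_trigger_index(probes, required=2)   # needs two successes in a row
```

With `required=2` the trigger can never come sooner than one full probe interval after the first success, which is the extra breathing room described above. Whether that is enough on a busy system is still the open question.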

Is there an effective mechanism to block "gateway up" processing until pending "gateway down" processing has completed? If there isn't, why is such a mechanism not needed?
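For illustration only, this is roughly what I mean by a blocking mechanism: both handlers take the same lock, so "gateway up" work cannot start until any running "gateway down" work has finished. The class `GatewayEventHandler` and its methods are invented for this sketch; I am not claiming pfSense has (or lacks) anything like it.

```python
import threading


class GatewayEventHandler:
    """Hypothetical handler that serializes "down" and "up" processing with one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.log = []   # records the order in which processing started and ended

    def _run(self, name, work):
        # Only one event's processing may hold the lock at a time.
        with self._lock:
            self.log.append(name + ":start")
            work()
            self.log.append(name + ":end")

    def on_down(self, work=lambda: None):
        self._run("down", work)

    def on_up(self, work=lambda: None):
        # Blocks here until any in-progress "down" processing releases the lock.
        self._run("up", work)
```

If `on_up` is called from another thread while `on_down` is still running, it simply waits, so the two sets of changes never interleave. The same lock could in principle cover hardware "link down" and "link up" handlers too.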

Similar questions should probably be asked of the processing triggered by hardware "link down" and "link up" events and their relationship with "gateway down" and "gateway up" events. For example, if a hardware "link down" event happens during "gateway up" processing, is the system guaranteed to be left in a "consistent" state?

Tease distraction question: Would the description of my thought experiment have been clearer if I had used "Frequency probe" rather than "probe interval"?

ZPrime

wallabybob, there's a lot of insight there :)

I've been occasionally hit by WAN link problems as you mentioned, so I've been searching all over for the best place to discuss and try to help resolve them. The one thing that is baffling to me is that I did not have any problems on 1.2.x or earlier 2.0 builds, I only started to see issues on 2.0.2 or 2.0.3, which makes me wonder if "stuff was changed."

I only have a single gateway, so my monitoring settings are at the defaults (interval 1 / 10 sec trigger)… so we shouldn't be too quick to point the finger at "ID_10T error" due to misunderstanding of the monitoring options. ;)

wallabybob

@bradenmcg:

I've been occasionally hit by WAN link problems as you mentioned, so I've been searching all over for the best place to discuss and try to help resolve them.

I see you have recently posted to http://forum.pfsense.org/index.php/topic,57258.0.html (DHCP on WAN suddenly started failing) and http://forum.pfsense.org/index.php/topic,63599.0.html (WAN link goes down every 12 hours (DHCP related?)). I suggest you continue the discussion in the more appropriate of those two topics.

@bradenmcg:

The one thing that is baffling to me is that I did not have any problems on 1.2.x or earlier 2.0 builds, I only started to see issues on 2.0.2 or 2.0.3, which makes me wonder if "stuff was changed."

A change in version number almost always means stuff was changed. There is a link on the pfSense doc pages (http://doc.pfsense.org) to "release notes" summarizing changes in recent releases.