Gateway Monitoring (Advanced - Down)

NOYB

Frequency probe, probe frequency, and probe interval are not always the same things.

I'm with the others that assert probe interval is the better / more appropriate description / label for what is being configured here.

To me, frequency probe is a device that measures the frequency of something.

phil.davis

I can't resist replying to this. wallabybob is correct. A frequency is the number of events/occurrences/cycles/repeats per unit of time (humans nearly always use "per second"). A frequency of 2 in units of "per second" means it happens twice each second, i.e. the interval/period between each event/occurrence/cycle/repeat is 0.5 seconds. A frequency of 10 is an interval/period of 0.1. A frequency of 0.1 is an interval/period of 10.
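
The reciprocal relationship above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name is made up for illustration and has nothing to do with the pfSense code.

```python
def interval_from_frequency(frequency_per_second: float) -> float:
    """Return the interval in seconds between events for a given frequency.

    Hypothetical helper for illustration only: interval = 1 / frequency.
    """
    if frequency_per_second <= 0:
        raise ValueError("frequency must be positive")
    return 1.0 / frequency_per_second


# The three examples from the paragraph above: frequencies of 2, 10 and 0.1.
for frequency in (2, 10, 0.1):
    print(frequency, "per second ->", interval_from_frequency(frequency), "seconds")
```

So a user who reads "frequency" and wants a probe every 10 seconds would be tempted to type 0.1, when the field actually expects 10.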

The number being entered is simply NOT a frequency. That is the scientific and language reality. The description should use words like "interval" or "period" - take your pick.

I do think this should be fixed in the GUI. There are enough mathematically/scientifically trained people who read the GUI text, see the word "frequency" and then get exactly the inverse idea about what to type in. (I did the first time, until I read the rest of the notes and decided it actually must be an interval/period.)

NOYB

Nope. Period is not correct either. Mathematical equivalents are not necessarily language equivalents.

Period indicates a duration or a time frame, such as n seconds, or the span from one point in time to another.

Saying “probe period” in this case would indicate either the duration or the time frame of the probe, not the interval between probes, which is what is being configured here.

Of the terminologies suggested thus far, “Probe Interval” is the most appropriate label for this setting.

In English, “Frequency Probe” is an incorrect label for this setting with respect to what is being set/configured. No ifs, ands or buts about it. It just is. No wonder there is so much confusion surrounding its configuration. Changing it to "Probe Interval" will undoubtedly alleviate much of the confusion.

Please, I'm begging. Pretty please with mounds of chocolate on top. Change it. ;D

adam65535

The combination of values used for this seems strange. Even people on this board seem to be having trouble understanding it.

Why not set a time between probes and then set a count for the number of retries to determine down? It would be easier to think about IMHO, coming from someone like me without a math or physics background :). It is much more logical for any timeout to sync with the probe interval (or whatever you call it). Why allow someone to set a probe interval of 30 seconds and a timeout of 40 seconds? That seems confusing to me and would make me second-guess what I am doing.

You could then display the resulting overall time before down, though that might just make it confusing again :).

phil.davis

I agree with NOYB - period is appropriate for things like waveforms, that have a cycle. Interval is better for things that happen at a repeating point in time. The underlying field in the code is already called 'interval'.
I have submitted pull request https://github.com/pfsense/pfsense/pull/673 to make it say "Probe Interval" everywhere. IMHO that is the way to go for now.

@adam65535 has just suggested that the down time should only be able to be a multiple of the probe interval. In that case, the user can specify the number of probes, rather than down time. That would remove any ambiguity. I can't think of any real-world use-cases where this more-restricted behaviour would be a problem.
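
To make the "multiple of the probe interval" idea concrete, here is a rough sketch of the arithmetic. The function and parameter names are invented for this post and are not taken from the pfSense source; it only shows how a down time would map to a probe count, and why a pair like interval 30 / down 40 does not map cleanly.

```python
def probes_until_down(probe_interval_s: int, down_time_s: int) -> int:
    """Return how many whole probe intervals make up the down time.

    Hypothetical helper: raises ValueError when the down time is not an
    exact multiple of the probe interval.
    """
    if probe_interval_s <= 0 or down_time_s <= 0:
        raise ValueError("both values must be positive")
    count, remainder = divmod(down_time_s, probe_interval_s)
    if remainder != 0:
        raise ValueError(
            f"down time {down_time_s}s is not a multiple of "
            f"probe interval {probe_interval_s}s"
        )
    return count


# An interval of 1 second with a down time of 10 seconds is 10 probes.
print(probes_until_down(1, 10))

# An interval of 30 seconds with a down time of 40 seconds is rejected.
try:
    probes_until_down(30, 40)
except ValueError as error:
    print("rejected:", error)
```

If the GUI asked for the probe count directly, this check would not be needed at all, and the overall down time would simply be count times interval.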

However, I suspect it is well and truly too late to be doing that for 2.1.

biggsy

I must say that my preference is for "interval".

I like the suggestion from adam65535 and agree with phil.davis about real-world use.

If the down timer starts at the point where a probe is determined to have failed, then exceeding a simple limit on sequential failed probes should be sufficient to consider the interface as "down".
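
As a sketch of what I mean (not how the gateway monitor is actually implemented; the names here are made up), the check could be as simple as counting failed probes in a row and resetting the count on any success:

```python
from typing import Iterable


def is_down(probe_results: Iterable[bool], failure_limit: int) -> bool:
    """Return True once more than `failure_limit` probes in a row have failed.

    `probe_results` is a sequence of booleans, True for a successful probe.
    Hypothetical illustration only.
    """
    consecutive_failures = 0
    for probe_ok in probe_results:
        if probe_ok:
            # Any successful probe resets the run of failures.
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures > failure_limit:
                return True
    return False
```

With a limit of 2, three failed probes in a row would mark the interface down, while failures separated by a successful probe would not.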

adam65535

What I like about expressing the count is that you make it obvious how many probes will be sent, by having the user explicitly input that number instead of just inferring it from some overall timeout.

Between clearing this up and either finding a way to let the user specify a carp/pfsync queue or figuring out how to keep carp/pfsync traffic out of the queues, I suspect there will be far fewer people with improper configs that cause gateways or HA to flap back and forth.

Discussion about carp/pfsync being forced into default queue:
http://forum.pfsense.org/index.php?topic=45045

wallabybob

There are quite a few different topics in the forums dealing with WAN links not recovering from "down". I am suspecting that a combination of values of gateway monitoring parameters and gateway behaviour might contribute to this.

Start of thought experiment
Suppose there is no mechanism to block a "gateway up" process until an already running "gateway down" process had terminated.

Having both "gateway down" and "gateway up" processes running concurrently would probably yield "ugly" results.

"Gateway up" processing starts on a single successful probe therefore the minimum time between start of "gateway down" and start of "gateway up" is something less than the probe interval. There could be cases where this minimum time is not long enough to ensure the "gateway down" processing completes before "gateway up" processing starts.

Requiring two consecutive successful probes would mean that at least the probe interval would elapse between start of "gateway down" and start of "gateway up". But if the system is "busy" will that be long enough?
End of thought experiment
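
A rough sketch of the "two consecutive successful probes" variant, for anyone who wants to play with it. This is purely illustrative: the class and method names are invented, a single failed probe marks the gateway down (a simplification), and it only models when "gateway up" would be declared. It says nothing about whether "gateway down" processing has actually finished, which is the real question.

```python
class GatewayState:
    """Track up/down state, requiring several successes in a row before 'up'.

    Hypothetical model for the thought experiment, not pfSense code.
    """

    def __init__(self, successes_required: int = 2) -> None:
        self.successes_required = successes_required
        self.is_up = False
        self._success_streak = 0

    def record_probe(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting up/down state."""
        if probe_ok:
            self._success_streak += 1
            if self._success_streak >= self.successes_required:
                self.is_up = True
        else:
            # Simplification: any failed probe resets the streak and marks down.
            self._success_streak = 0
            self.is_up = False
        return self.is_up


gateway = GatewayState(successes_required=2)
for result in (True, True, False, True, True):
    print(result, "->", "up" if gateway.record_probe(result) else "down")
```

With two successes required, "up" cannot be declared until at least one full probe interval after the failure that triggered "down".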

Is there an effective mechanism to block "gateway up" processing until pending "gateway down" processing has completed? If there isn't, why is such a mechanism not needed?

Similar questions should probably be asked of the processing triggered by hardware "link down" and "link up" events and their relationship with "gateway down" and "gateway up" events. For example, if a hardware "link down" event happens during "gateway up" processing, is the system guaranteed to be left in a "consistent" state?

Tease distraction question: Would the description of my thought experiment have been clearer if I had used "Frequency probe" rather than "probe interval"?

ZPrime

wallabybob, there's a lot of insight there :)

I've been occasionally hit by WAN link problems as you mentioned, so I've been searching all over for the best place to discuss and try to help resolve them. The one thing that is baffling to me is that I did not have any problems on 1.2.x or earlier 2.0 builds, I only started to see issues on 2.0.2 or 2.0.3, which makes me wonder if "stuff was changed."

I only have a single gateway, so my monitoring settings are at the defaults (interval 1 / 10 sec trigger)… so we shouldn't be too quick to point the finger at "ID_10T error" due to misunderstanding of the monitoring options. ;)

wallabybob

@bradenmcg:

I've been occasionally hit by WAN link problems as you mentioned, so I've been searching all over for the best place to discuss and try to help resolve them.

I see you have recently posted to http://forum.pfsense.org/index.php/topic,57258.0.html (DHCP on WAN suddenly started failing) and http://forum.pfsense.org/index.php/topic,63599.0.html (WAN link goes down every 12 hours (DHCP related?)). I suggest you continue the discussion in the more appropriate of those two topics.

@bradenmcg:

The one thing that is baffling to me is that I did not have any problems on 1.2.x or earlier 2.0 builds, I only started to see issues on 2.0.2 or 2.0.3, which makes me wonder if "stuff was changed."

A change in version number almost always means stuff was changed. There is a link on the pfSense doc pages (http://doc.pfsense.org) to "release notes" summarizing changes in recent releases.