Gateway Monitoring (Advanced - Down)

jimp

I get asked about it often enough I figured it was worth the block of text and also to ease/erase confusion. It wouldn't be quite as bad if it wasn't in a narrow box :-)

NOYB

Thank you. Think that is clearer.

wallabybob

While I agree the revised text is an improvement I think it perpetuates the problem by calling an interval a frequency:

. . . your Down time is 40 seconds but on a 30 second frequency,

I appreciate that history might make it difficult to do this, but at least in user visible text, the configuration item Frequency Probe should be changed to the more correct Probe Interval and explanatory text changed appropriately, for example, the changed text replaced by

The Down time specifies the length of time before the gateway is marked as down, but the accuracy is controlled by the Probe Interval value. For example, if your Down time is 40 seconds and the Probe Interval 30 seconds, only one probe would have to fail before the gateway is marked down at the 40 second mark. By default, the gateway is considered down after 10 seconds, and the probe interval is 1 second, so 10 probes would have to fail before the gateway is marked down.

While I am in a pedantic mood, when does the Down Time start? Presumably when a probe is considered to have "failed". In that case for the example values, I expect two consecutive probes would have to "fail" (one to start the Down timer and one 30 seconds later) for the gateway to be considered down. But that assumes a particular mechanism.

There are significant consequences in pfSense considering a gateway down. I expect a significant proportion of pfSense users would not want "gateway down" processing triggered by a single failed gateway probe. So how does one choose values for these parameters so that a single failed probe won't cause a gateway to be considered "down"?

adam65535

Why not have the values/example mentioned in the comment be dynamically adjusted to what is set.

You set 40 Down time
you set 30 second probe interval

The Down time specifies the length of time before the gateway is marked as down, but the accuracy is controlled by the Probe Interval value. For example, the value that is currently set for Down time is 40 seconds and the Probe Interval is 30 seconds. Only 1 probe would have to fail before the gateway is marked down at the 40 second mark.
WARNING: A single probe failure would cause the gateway to be set to down with currently set values. At least 2 is recommended. Increase Down time or decrease probe interval.

Have some kind of warning text below added and yellow or red color if it is only 1 probe till failure stating that it is only 1 and that at least 2 is recommended.

Yea… easy to say when I am not a php programmer so I wouldn't be coding it :).

jimp

@wallabybob:

While I agree the revised text is an improvement I think it perpetuates the problem by calling an interval a frequency:

. . . your Down time is 40 seconds but on a 30 second frequency,

I appreciate that history might make it difficult to do this, but at least in user visible text, the configuration item Frequency Probe should be changed to the more correct Probe Interval and explanatory text changed appropriately, for example, the changed text replaced by

The Down time specifies the length of time before the gateway is marked as down, but the accuracy is controlled by the Probe Interval value. For example, if your Down time is 40 seconds and the Probe Interval 30 seconds, only one probe would have to fail before the gateway is marked down at the 40 second mark. By default, the gateway is considered down after 10 seconds, and the probe interval is 1 second, so 10 probes would have to fail before the gateway is marked down.

While I am in a pedantic mood, when does the Down Time start? Presumably when a probe is considered to have "failed". In that case for the example values, I expect two consecutive probes would have to "fail" (one to start the Down timer and one 30 seconds later) for the gateway to be considered down. But that assumes a particular mechanism.

There are significant consequences in pfSense considering a gateway down. I expect a significant proportion of pfSense users would not want "gateway down" processing triggered by a single failed gateway probe. So how does one choose values for these parameters so that a single failed probe won't cause a gateway to be considered "down"?

To be even more pedantic, it is, as defined, a frequency. It is one probe per X seconds. It could also be called an interval, but calling it a frequency in this context is valid. :-)

"X seconds between probes" and "one probe per X seconds" are equivalent. Both are valid.

Down time starts exactly when a failed probe happens. If the frequency is less than or equal to half the down time, then you'll get multiple probes before a failure.

Down = 10, Probe = 5 - two probes exactly to be down
Down = 5, Probe = 3 - one probe to be down
Down = 10, Probe = 1 - 10 failures to be down.
Down = 120, Probe = 10 - 12 failures to be down
Down = 120, Probe = 90 - 1 failure to be down

@adam65535:

Why not have the values/example mentioned in the comment be dynamically adjusted to what is set.

You set 40 Down time
you set 30 second probe interval

The Down time specifies the length of time before the gateway is marked as down, but the accuracy is controlled by the Probe Interval value. For example, the value that is currently set for Down time is 40 seconds and the Probe Interval is 30 seconds. Only 1 probe would have to fail before the gateway is marked down at the 40 second mark.
WARNING: A single probe failure would cause the gateway to be set to down with currently set values. At least 2 is recommended. Increase Down time or decrease probe interval.

Have some kind of warning text below added and yellow or red color if it is only 1 probe till failure stating that it is only 1 and 2 is recommended.

Yea… easy to say when I am not a php programmer so I wouldn't be coding it :).

I thought I had done that at one point, maybe it was that way before the note on 2.1 was removed, I'd have to go check the history. That defeats the purpose of an example though. Knowing the actual values is good, but an example is there to show what would happen with different settings than what you're using. Both useful, but in different ways.

wallabybob

@jimp:

To be even more pedantic, it is, as defined, a frequency. It is one probe per X seconds. It could also be called an interval, but calling it a frequency in this context is valid. :-)

"X seconds between probes" and "one probe per X seconds" are equivalent. Both are valid.

I agree that "X seconds between probes" and "one probe per X seconds" are equivalent statements and that both are valid.

My issue is with calling the X in "one probe per X seconds" the frequency. In "common speech" a statement about frequency is a statement about how often something occurs. "Once every three weeks" is a statement about how often something occurs and so is a statement about its frequency. However, in Physics, Electrical Engineering, Electronics, etc., frequency has a more precise meaning: the number of events PER UNIT TIME (see, for example, the definition of frequency on http://dictionary.reference.com). There is a defined unit (Hertz, abbreviated as Hz) for events per second.

Therefore an event occurring every 8 seconds has a frequency (in the "science" sense) of (1 event)/(8 seconds) = 0.125 events per second or 0.125Hz.

Frequency (in the science sense) of events and the interval between events are related mathematically by frequency being the reciprocal of the interval (frequency = 1/interval) and vice versa. The fact that frequency and interval are simply related does not mean the terms can be used interchangeably.
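
As a small illustration of the reciprocal relationship described above, here is a minimal Python sketch. The function names are made up for this example and are not taken from pfSense.

```python
def interval_to_frequency_hz(interval_seconds: float) -> float:
    """Events per second (Hz) for a given number of seconds between events."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return 1.0 / interval_seconds


def frequency_hz_to_interval(frequency_hz: float) -> float:
    """Seconds between events for a given number of events per second (Hz)."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    return 1.0 / frequency_hz


# An event every 8 seconds, as in the example above.
print(interval_to_frequency_hz(8))
# 0.1 probes per second expressed as an interval in seconds.
print(frequency_hz_to_interval(0.1))
```

Typing 10 into a field that expects an interval and typing 10 into a field that expects a frequency in Hz describe probe rates that differ by a factor of 100, which is why the label matters.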

I expect that if you continue to call the interval between events its frequency you will continue to confuse your readers. For example, I am running pfSense

2.1-RC0 (i386)
built on Mon Jun 17 15:48:36 EDT 2013
FreeBSD 8.3-RELEASE-p8

On System -> Routing -> Gateways, if I edit a gateway and click on the Advanced button, I have the opportunity to configure Frequency Probe. My decades in science lead me to think this requires me to specify the number of probes per second - that is, I should type 0.1 to get 0.1 probes per second or one probe every 10 seconds. Reading on, I see that the frequency is supposed to be in SECONDS - if I hadn't had the benefit of reading the appropriate forum topics I could take that to mean the unit time for the frequency calculation is seconds rather than minutes or hours.

@jimp:

Down time starts exactly when a failed probe happens. If the frequency is less than or equal to half the down time, then you'll get multiple probes before a failure.

Down = 10, Probe = 5 - two probes exactly to be down
Down = 5, Probe = 3 - one probe to be down
Down = 10, Probe = 1 - 10 failures to be down.
Down = 120, Probe = 10 - 12 failures to be down
Down = 120, Probe = 90 - 1 failure to be down

This is an interesting explanation. Is it based on an analysis of the appropriate code?

This description suggests that the down timer is effectively ignored if the probe interval is more than half the down time. Since there was no mention made of the down timer ever being ignored I thought that the combination of probe interval 3, down timer 5 would have required two consecutive probe failures to trigger the down event: one to start the down timer and a second (at 3 seconds after the first probe) to maintain the down state for the duration of the down timer. But that glosses over the matter of what exactly is a probe failure? In the absence of an explanation I have assumed that a probe failure occurs when a response to a ping is not received within a certain interval (please don't call it a frequency!) but how long is that interval? Is it related to the probe interval? Does the down timer start when it has been determined that a probe failed OR considered to have started when the "failed" probe was sent OR something else again?

As I have been writing this reply I have been wondering again if the issue reported by a number of users of their WAN links flapping around for lengthy periods might be related to a failure to understand (and describe precisely) how gateway monitoring works. Could someone produce a timeline showing the relationship of the various parameters?

jimp

We'll have to agree to disagree on frequency vs interval. The definition I'm going by is "the number of occurrences within a given time period" (WordNet, M-W has a similar one) only we're defining the time period, not the number of occurrences. If you want to play scientific, we're both wrong, we'd really be defining the period. :-)

My examples are based on actual observations experimenting with the code and watching pings and the gateway events. I was a bit wrong though as I was failing to consider the probes as they really would have happened. Any probe time less than the down time will have to fail at least twice. Not sure how that slipped my mind as I was typing it up.

The down timer starts when a probe fails (a ping fails to return, though I don't recall the specific timeout on that), and it does NOT care about subsequent failures or probes. If it comes back up on the next probe, it will be marked online. A down time of 10 means it will be down in 10 seconds after a failure unless there is a successful probe.

So consider a down time of 40 and a probe of 30. If the gateway drops two pings but recovers the next, it goes like this:

0: ping sent, no response, down timer starts
30: ping sent, no response
40: Gateway marked down.
60: ping sent, response received, gateway marked as up.
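
The timeline above can be reproduced with a small simulation. This is only a sketch of the behaviour as described in this post, not the actual monitoring code. It assumes probes go out at 0, interval, 2 x interval and so on, that a probe's result is known at the moment it is sent (the ping timeout is ignored), and that a down timer expiring at exactly the time of a probe is handled before that probe. The function name and event strings are invented for the example.

```python
def simulate(down_time, probe_interval, probe_results):
    """Return a list of (time, event) tuples.

    probe_results is a list of booleans, one per probe, True meaning
    a reply was received.
    """
    events = []
    timer_start = None  # time of the failed probe that started the down timer
    is_down = False

    def expire_timer_if_due(now):
        # Mark the gateway down once down_time has passed since the timer started.
        nonlocal is_down
        if timer_start is not None and not is_down and timer_start + down_time <= now:
            events.append((timer_start + down_time, "gateway marked down"))
            is_down = True

    for i, reply_received in enumerate(probe_results):
        now = i * probe_interval
        expire_timer_if_due(now)
        if reply_received:
            events.append((now, "ping sent, response received"))
            if is_down:
                events.append((now, "gateway marked up"))
            timer_start = None  # a successful probe cancels the down timer
            is_down = False
        else:
            events.append((now, "ping sent, no response"))
            if timer_start is None:
                timer_start = now
                events.append((now, "down timer starts"))

    # Let a pending timer expire if it would do so by the next (unsimulated) probe.
    expire_timer_if_due(len(probe_results) * probe_interval)
    return events


# Down time 40, probe every 30 seconds: two lost pings, then a reply.
for when, what in simulate(40, 30, [False, False, True]):
    print(f"{when}: {what}")
```

With a single lost ping followed by a reply, `simulate(40, 30, [False, True])` produces no "gateway marked down" event, because the reply at 30 cancels the timer before it reaches 40.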

So really it's:

Down = 10, Probe = 5 - 2 probes to be down (0, 5)
Down = 5, Probe = 3 - 2 probes to be down (0, 3)
Down = 10, Probe = 1 - 10 failures to be down. (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Down = 120, Probe = 10 - 12 failures to be down (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110)
Down = 120, Probe = 90 - 2 failures to be down (0, 90)

The one ping/flap oddities start when you have a probe time longer than the down time, but we warn against that in the GUI.
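
The corrected counts listed above follow a simple rule: every probe sent before the down timer expires has to fail, so the count is the down time divided by the probe time, rounded up. A minimal Python sketch of that arithmetic follows. The function name is made up, and it assumes the first failed probe starts the timer at 0 and that a probe due at exactly the expiry time does not count.

```python
def failed_probes_before_down(down_time: int, probe_interval: int) -> int:
    """Consecutive failed probes needed before the gateway is marked down."""
    if down_time <= 0 or probe_interval <= 0:
        raise ValueError("both values must be positive")
    # Probes go out at 0, interval, 2*interval, ...; count those sent
    # strictly before down_time has elapsed (integer ceiling division).
    return (down_time + probe_interval - 1) // probe_interval


for down, probe in [(10, 5), (5, 3), (10, 1), (120, 10), (120, 90), (40, 30)]:
    count = failed_probes_before_down(down, probe)
    print(f"Down = {down}, Probe = {probe} - {count} failed probes to be down")
```

Under the same assumptions, a probe time longer than the down time (for example Down = 30, Probe = 40) gives a count of 1, which is the single-ping case the GUI warns against.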

NOYB

Frequency probe, probe frequency, and probe interval are not always the same things.

I'm with the others that assert probe interval is the better / more appropriate description / label for what is being configured here.

To me, frequency probe is a device that measures the frequency of something.

phil.davis

I can't resist replying to this. wallabybob is correct. A frequency is the number of events/occurrences/cycles/repeats per unit of time (humans nearly always use "per second"). A frequency of 2 in units of "per second" means it happens twice each second, i.e. the interval/period between each event/occurrence/cycle/repeat is 0.5 seconds. A frequency of 10 is an interval/period of 0.1. A frequency of 0.1 is an interval/period of 10.

The number being entered is simply NOT a frequency. That is the scientific and language reality. The description should use words like "interval" or "period" - take your pick.

I do think this should be fixed up on the GUI. There are enough mathematically/scientifically trained people who read the GUI text that see the word "frequency" and then get exactly the inverse idea about what to type in. (I did the first time until I read the rest of the notes and decided it actually must be an interval/period)

NOYB

Nope. Period is not correct either. Mathematical equivalents are not necessarily language equivalents.

Period indicates a duration or a time frame. Such as n seconds or a point in time to another point in time.

Saying “probe period” in this case would indicate either the duration or time frame of the probe. Not the interval between probes. Which is what is being configured here in this case.

Of the terminologies suggested thus far, “Probe Interval” is the most appropriate label for this setting.

For the English language, “Frequency Probe” is an incorrect labeling of this setting with respect to what is being set/configured. No ifs, ands, or buts about it. It just is. No wonder there is so much confusion surrounding its configuration. Changing it to "Probe Interval" will undoubtedly alleviate much of the confusion.

Please, I'm begging. Pretty please with mounds of chocolate on top. Change it. ;D

adam65535

The combination of values used for this seems strange. Even people on these boards seem to be having trouble understanding it.

Why not set a time between probes and then set a count for the number of retries to determine down? It would be easier to think about IMHO, coming from someone like me without a math or physics background :). It is much more logical for any timeout to sync with the probe interval (or whatever you call it). Why allow someone to set a probe interval of 30 seconds and a timeout of 40 seconds? That seems confusing to me and would make me second guess what I am doing.

You could then display the resulting overall timeout time before down or that just might make it confusing again though :).
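
To make the idea concrete, here is a rough Python sketch of settings expressed as an interval plus a count, with the overall time derived from them. Nothing here is pfSense code. The class and field names are invented, and the derived value assumes the timer behaviour jimp described earlier in the thread (N consecutive failed probes correspond to a down time of N x interval).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorSettings:
    probe_interval: int  # seconds between probes
    failed_probes: int   # consecutive failed probes before "down"

    def __post_init__(self):
        if self.probe_interval < 1 or self.failed_probes < 1:
            raise ValueError("interval and count must both be at least 1")

    @property
    def equivalent_down_time(self) -> int:
        """Seconds of down time that would require this many failed probes."""
        return self.probe_interval * self.failed_probes


# The current defaults (1 second interval, down after 10 seconds)
# expressed as an interval and a count.
defaults = MonitorSettings(probe_interval=1, failed_probes=10)
print(defaults.equivalent_down_time)
```

Because the down time is always a whole multiple of the interval, a combination like interval 30 and timeout 40 simply cannot be entered.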

phil.davis

I agree with NOYB - period is appropriate for things like waveforms, that have a cycle. Interval is better for things that happen at a repeating point in time. The underlying field in the code is already called 'interval'.
I have submitted pull request https://github.com/pfsense/pfsense/pull/673 to make it say "Probe Interval" everywhere. IMHO that is the way to go for now.

@adam65535 has just suggested that the down time should only be able to be a multiple of the probe interval. In that case, the user can specify the number of probes, rather than down time. That would remove any ambiguity. I can't think of any real-world use-cases where this more-restricted behaviour would be a problem.

However, I suspect it is well and truly too late to be doing that for 2.1.

biggsy

I must say that my preference is for "interval".

I like the suggestion from adam65535 and agree with phil.davis about real-world use.

If the down timer starts at the point where a probe is determined to have failed, then exceeding a simple limit on sequential failed probes should be sufficient to consider the interface as "down".

adam65535

What I like about expressing the count is that you are making it obvious how many probes will be sent by making the user explicitly input that instead of just inferring it from some overall timeout.

Between clearing this up and finding a way to allow the user to specify a carp/pfsync queue or figuring out how to bypass carp/pfsync traffic from queues, I suspect there will be far fewer people with improper configs that cause gateway or HA flapping back and forth.

Discussion about carp/pfsync being forced into default queue:
http://forum.pfsense.org/index.php?topic=45045

wallabybob

There are quite a few different topics in the forums dealing with WAN links not recovering from "down". I suspect that a combination of values of gateway monitoring parameters and gateway behaviour might contribute to this.

Start of thought experiment
Suppose there is no mechanism to block a "gateway up" process until an already running "gateway down" process had terminated.

Having both "gateway down" and "gateway up" processes running concurrently would probably yield "ugly" results.

"Gateway up" processing starts on a single successful probe therefore the minimum time between start of "gateway down" and start of "gateway up" is something less than the probe interval. There could be cases where this minimum time is not long enough to ensure the "gateway down" processing completes before "gateway up" processing starts.

Requiring two consecutive successful probes would mean that at least the probe interval would elapse between start of "gateway down" and start of "gateway up". But if the system is "busy" will that be long enough?
End of thought experiment

Is there an effective mechanism to block "gateway up" processing until pending "gateway down" processing has completed? If there isn't, why is such a mechanism not needed?
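
For readers wondering what such a blocking mechanism might look like in principle, here is a generic Python sketch using a lock to serialize event handlers. It is purely illustrative. It says nothing about how pfSense actually handles these events, and the class name, method name and timings are invented.

```python
import threading
import time


class GatewayEventSerializer:
    """Runs event handlers one at a time so they cannot interleave."""

    def __init__(self):
        self._lock = threading.Lock()
        self.log = []

    def handle(self, event: str, work_seconds: float) -> None:
        # A second event blocks here until the first handler has finished.
        with self._lock:
            self.log.append(f"{event} start")
            time.sleep(work_seconds)  # stand-in for the real processing
            self.log.append(f"{event} end")


serializer = GatewayEventSerializer()
down = threading.Thread(target=serializer.handle, args=("gateway down", 0.2))
up = threading.Thread(target=serializer.handle, args=("gateway up", 0.0))

down.start()
time.sleep(0.05)  # "gateway up" arrives while "gateway down" is still running
up.start()
down.join()
up.join()

print(serializer.log)
```

Whichever handler gets the lock first runs to completion before the other starts, so each "start" entry in the log is immediately followed by its own "end". A lock like this only prevents overlap. It does not by itself decide whether a stale "down" should still be processed once the gateway is already back up.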

Similar questions should probably be asked of the processing stimulated by hardware "link down" and "link up" events and their relationship with "gateway down" and "gateway up" events. For example, if a hardware "link down" event happens during "gateway up" processing, is the system guaranteed to be left in a "consistent" state?

Tease distraction question: Would the description of my thought experiment have been clearer if I had used "Frequency probe" rather than "probe interval"?

ZPrime

wallabybob, there's a lot of insight there :)

I've been occasionally hit by WAN link problems as you mentioned, so I've been searching all over for the best place to discuss and try to help resolve them. The one thing that is baffling to me is that I did not have any problems on 1.2.x or earlier 2.0 builds, I only started to see issues on 2.0.2 or 2.0.3, which makes me wonder if "stuff was changed."

I only have a single gateway, so my monitoring settings are at the defaults (interval 1 / 10 sec trigger)… so we shouldn't be too quick to point the finger at "ID_10T error" due to misunderstanding of the monitoring options. ;)

wallabybob

@bradenmcg:

I've been occasionally hit by WAN link problems as you mentioned, so I've been searching all over for the best place to discuss and try to help resolve them.

I see you have recently posted to http://forum.pfsense.org/index.php/topic,57258.0.html (DHCP on WAN suddenly started failing) and http://forum.pfsense.org/index.php/topic,63599.0.html (WAN link goes down every 12 hours (DHCP related?)). I suggest you continue the discussion in the more appropriate of those two topics.

@bradenmcg:

The one thing that is baffling to me is that I did not have any problems on 1.2.x or earlier 2.0 builds, I only started to see issues on 2.0.2 or 2.0.3, which makes me wonder if "stuff was changed."

Change in version number almost always means stuff was changed. There is a link on the pfSense doc pages (http://doc.pfsense.org) to "release notes" summarizing changes in recent releases.