Gateway Monitoring (Advanced - Down)

NOYB

Says down value is "The number of seconds of failed probes before the alarm will fire. Default is 10."

Then at the bottom of the Advanced section it says this. "The total time before a gateway is down is the product of the Frequency Probe and the Down fields. By default this is 1*10=10 seconds."

If the default frequency probe value (1) is used, then both of those statements are true. But if it is changed from the default value, then both of those statements cannot be true.

Which one is true? And shouldn't the other one be corrected or removed?
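
To make the conflict concrete, here is a minimal Python sketch of the two readings of the help text. The function names are hypothetical and nothing here is taken from the pfSense code; it only restates the two quoted sentences as arithmetic.

```python
def down_as_seconds(interval, down):
    """Reading 1: Down is already the number of seconds before the alarm fires."""
    return down


def down_as_product(interval, down):
    """Reading 2: total time is the product of Frequency Probe and Down."""
    return interval * down


# With the default Frequency Probe of 1 the two readings agree...
print(down_as_seconds(1, 10), down_as_product(1, 10))
# ...but with any other Frequency Probe value they give different totals.
print(down_as_seconds(5, 10), down_as_product(5, 10))
```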

Some previous discussion:
Gateway Monitoring (Advanced - Down)
http://forum.pfsense.org/index.php/topic,51604.0.html

wallabybob

The frequency probe would be more accurately described as the Probe Interval - interval in seconds between probes.

phil.davis

It is definitely "The number of seconds of failed probes before the alarm will fire."
If you have a probe every 5 seconds, and down=60 then after 12 failed probes the alarm fires.
IMHO this commit is the problem: https://github.com/pfsense/pfsense/commit/dd6882695dce3a65891acdb442adf83025533eec
It put back that (incorrect) text about multiplying the interval and down times, which I had removed some time ago while getting all the validation correct for the advanced fields.

Here is a sample section of apinger.conf where the advanced parameters end up being written, the "down" is clearly a time in seconds:

alarm loss "WANGWloss" {
	percent_low 40
	percent_high 50
}
alarm delay "WANGWdelay" {
	delay_low 4000ms
	delay_high 5000ms
}
alarm down "WANGWdown" {
	time 30s
}
target "8.8.4.4" {
	description "WANGW"
	srcip "10.1.1.1"
	interval 2s
	alarms override "WANGWloss","WANGWdelay","WANGWdown";
	rrd file "/var/db/rrd/WANGW-quality.rrd"
}
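
For illustration only, the two values discussed in this thread map onto the sample like this. This is a Python sketch with hypothetical helper names, not the code pfSense uses to write apinger.conf; it just reproduces the layout of the "alarm down" stanza and the "interval" line shown above.

```python
def render_down_alarm(name, down_seconds):
    """Render an 'alarm down' stanza in the layout of the sample above."""
    return 'alarm down "%sdown" {\n\ttime %ds\n}\n' % (name, down_seconds)


def render_interval_line(interval_seconds):
    """Render the 'interval' line that sits inside a target stanza."""
    return "\tinterval %ds" % interval_seconds


# Values taken from the sample: Down = 30 seconds, probe every 2 seconds.
print(render_down_alarm("WANGW", 30), end="")
print(render_interval_line(2))
```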

phil.davis

@wallabybob:

The frequency probe would be more accurately described as the Probe Interval - interval in seconds between probes.

Yes, I agree - the units are "sec" whereas frequency is "1/sec". I wish I had fixed this when I did the validation code!
The question is, cmb and jimp are trying to finish a book with 2.1 screenshots. Every GUI text change means fixing up screenshots and text in the book. Is there a possibility to fix this text for 2.1?

NOYB

Is there a difference between “alarm will fire” and “gateway is down”? Seems that would be the only way both statements could be true when frequency probe (interval) is something other than 1.

Agreed, Probe Interval would be a better label.

phil.davis

"alarm will fire" is terminology in apinger. When the alarm fires, the pfSense code is invoked to implement "gateway is down" (doing whatever failover, restarting of VPN things… is configured). Thus any statements about timing apply the same to both phrases.

jimp

@NOYB:

Then at the bottom of the Advanced section it says this. "The total time before a gateway is down is the product of the Frequency Probe and the Down fields. By default this is 1*10=10 seconds."

If the default frequency probe value (1) is used, then both of those statements are true. But if it is changed from the default value, then both of those statements cannot be true.

Hence the reason why it used to say "By default this is…" before the sample calculation. :-) (But it does seem to be wrong in general)

@phil.davis:

It is definitely "The number of seconds of failed probes before the alarm will fire."
If you have a probe every 5 seconds, and down=60 then after 12 failed probes the alarm fires.
IMHO this commit is the problem: https://github.com/pfsense/pfsense/commit/dd6882695dce3a65891acdb442adf83025533eec
It put back that (incorrect) text about multiplying the interval and down times, which I had removed some time ago while getting all the validation correct for the advanced fields.

Here is a sample section of apinger.conf where the advanced parameters end up being written, the "down" is clearly a time in seconds:

If the note is wrong, the note should definitely be fixed, but not removed. It's a common enough question that giving as much detail there as possible is warranted.

How's this?
https://github.com/pfsense/pfsense/commit/94744c271070bfef6bdf0495120b5c2abce371c3

phil.davis

Yep, that note looks good. You wrote a whole essay on the topic - I hope it fits on the screen :)

jimp

I get asked about it often enough I figured it was worth the block of text and also to ease/erase confusion. It wouldn't be quite as bad if it wasn't in a narrow box :-)

NOYB

Thank you. Think that is clearer.

wallabybob

While I agree the revised text is an improvement I think it perpetuates the problem by calling an interval a frequency:

. . . your Down time is 40 seconds but on a 30 second frequency,

I appreciate that history might make it difficult to do this, but at least in user visible text, the configuration item Frequency Probe should be changed to the more correct Probe Interval and explanatory text changed appropriately, for example, the changed text replaced by

The Down time specifies the length of time before the gateway is marked as down, but the accuracy is controlled by the Probe Interval value. For example, if your Down time is 40 seconds and the Probe Interval 30 seconds, only one probe would have to fail before the gateway is marked down at the 40 second mark. By default, the gateway is considered down after 10 seconds, and the probe interval is 1 second, so 10 probes would have to fail before the gateway is marked down.

While I am in a pedantic mood, when does the Down Time start? Presumably when a probe is considered to have "failed". In that case for the example values, I expect two consecutive probes would have to "fail" (one to start the Down timer and one 30 seconds later) for the gateway to be considered down. But that assumes a particular mechanism.

There are significant consequences in pfSense considering a gateway down. I expect a significant proportion of pfSense users would not want "gateway down" processing triggered by a single failed gateway probe. So how does one choose values for these parameters so that a single failed probe won't cause a gateway to be considered "down"?

adam65535

Why not have the values/example mentioned in the comment be dynamically adjusted to what is set.

You set 40 Down time
you set 30 second probe interval

The Down time specifies the length of time before the gateway is marked as down, but the accuracy is controlled by the Probe Interval value. For example, the value that is currently set for Down time is 40 seconds and the Probe Interval is 30 seconds. Only 1 probe would have to fail before the gateway is marked down at the 40 second mark.
WARNING: A single probe failure would cause the gateway to be set to down with currently set values. At least 2 is recommended. Increase Down time or decrease probe interval.

Have some kind of warning text below added and yellow or red color if it is only 1 probe till failure stating that it is only 1 and that at least 2 is recommended.

Yea… easy to say when I am not a php programmer so I wouldn't be coding it :).

jimp

@wallabybob:

While I agree the revised text is an improvement I think it perpetuates the problem by calling an interval a frequency:

. . . your Down time is 40 seconds but on a 30 second frequency,

I appreciate that history might make it difficult to do this, but at least in user visible text, the configuration item Frequency Probe should be changed to the more correct Probe Interval and explanatory text changed appropriately, for example, the changed text replaced by

The Down time specifies the length of time before the gateway is marked as down, but the accuracy is controlled by the Probe Interval value. For example, if your Down time is 40 seconds and the Probe Interval 30 seconds, only one probe would have to fail before the gateway is marked down at the 40 second mark. By default, the gateway is considered down after 10 seconds, and the probe interval is 1 second, so 10 probes would have to fail before the gateway is marked down.

While I am in a pedantic mood, when does the Down Time start? Presumably when a probe is considered to have "failed". In that case for the example values, I expect two consecutive probes would have to "fail" (one to start the Down timer and one 30 seconds later) for the gateway to be considered down. But that assumes a particular mechanism.

There are significant consequences in pfSense considering a gateway down. I expect a significant proportion of pfSense users would not want "gateway down" processing triggered by a single failed gateway probe. So how does one choose values for these parameters so that a single failed probe won't cause a gateway to be considered "down"?

To be even more pedantic, it is, as defined, a frequency. It is one probe per X seconds. It could also be called an interval, but calling it a frequency in this context is valid. :-)

"X seconds between probes" and "one probe per X seconds" are equivalent. Both are valid.

Down time starts exactly when a failed probe happens. If the frequency is less than or equal to half the down time, then you'll get multiple probes before a failure.

Down = 10, Probe = 5 - two probes exactly to be down
Down = 5, Probe = 3 - one probe to be down
Down = 10, Probe = 1 - 10 failures to be down.
Down = 120, Probe = 10 - 12 failures to be down
Down = 120, Probe = 90 - 1 failure to be down

@adam65535:

Why not have the values/example mentioned in the comment be dynamically adjusted to what is set.

You set 40 Down time
you set 30 second probe interval

The Down time specifies the length of time before the gateway is marked as down, but the accuracy is controlled by the Probe Interval value. For example, the value that is currently set for Down time is 40 seconds and the Probe Interval is 30 seconds. Only 1 probe would have to fail before the gateway is marked down at the 40 second mark.
WARNING: A single probe failure would cause the gateway to be set to down with currently set values. At least 2 is recommended. Increase Down time or decrease probe interval.

Have some kind of warning text below added and yellow or red color if it is only 1 probe till failure stating that it is only 1 and 2 is recommended.

Yea… easy to say when I am not a php programmer so I wouldn't be coding it :).

I thought I had done that at one point, maybe it was that way before the note on 2.1 was removed, I'd have to go check the history. That defeats the purpose of an example though. Knowing the actual values is good, but an example is there to show what would happen with different settings than what you're using. Both useful, but in different ways.

wallabybob

@jimp:

To be even more pedantic, it is, as defined, a frequency. It is one probe per X seconds. It could also be called an interval, but calling it a frequency in this context is valid. :-)

"X seconds between probes" and "one probe per X seconds" are equivalent. Both are valid.

I agree that "X seconds between probes" and "one probe per X seconds" are equivalent statements and that both are valid.

My issue is with calling the X in "one probe per X seconds" the frequency. In "common speech" a statement about frequency is a statement about how often something occurs. "Once every three weeks" is a statement about how often something occurs and so is a statement about its frequency. However, in Physics, Electrical Engineering, Electronics etc., frequency has a more precise meaning: the number of events PER UNIT TIME (see, for example, the definition of frequency on http://dictionary.reference.com). There is a defined unit (Hertz, abbreviated as Hz) for events per second.

Therefore an event occurring every 8 seconds has a frequency (in the "science" sense) of (1 event)/(8 seconds) = 0.125 events per second or 0.125Hz.

Frequency (in the science sense) of events and the interval between events are related mathematically by frequency being the reciprocal of the interval (frequency = 1/interval) and vice versa. The fact that frequency and interval are simply related does not mean the terms can be used interchangeably.
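
As a small illustration of that reciprocal relationship, here is a Python sketch (the function names are made up for this example):

```python
def interval_to_frequency(interval_s):
    """Events per second (Hz) for a given number of seconds between events."""
    return 1.0 / interval_s


def frequency_to_interval(freq_hz):
    """Seconds between events for a given frequency in Hz."""
    return 1.0 / freq_hz


# One event every 8 seconds, expressed as a frequency in Hz.
print(interval_to_frequency(8))
# And back again: the interval that corresponds to that frequency.
print(frequency_to_interval(interval_to_frequency(8)))
```

Typing 8 into a field that expects an interval and typing 8 into a field that expects a frequency describe very different probe rates, which is the source of the confusion.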

I expect that if you continue to call the interval between events its frequency you will continue to confuse your readers. For example, I am running pfSense

2.1-RC0 (i386)
built on Mon Jun 17 15:48:36 EDT 2013
FreeBSD 8.3-RELEASE-p8

On System -> Routing -> Gateways, if I edit a gateway and click on the Advanced button, I have the opportunity to configure Frequency Probe. My decades in science lead me to think this requires me to specify the number of probes per second - that is, I should type 0.1 to get 0.1 probes per second or one probe every 10 seconds. Reading on, I see that the frequency is supposed to be in SECONDS - if I hadn't had the benefit of reading the appropriate forum topics I could take that to mean the unit time for the frequency calculation is seconds rather than minutes or hours.

@jimp:

Down time starts exactly when a failed probe happens. If the frequency is less than or equal to half the down time, then you'll get multiple probes before a failure.

Down = 10, Probe = 5 - two probes exactly to be down
Down = 5, Probe = 3 - one probe to be down
Down = 10, Probe = 1 - 10 failures to be down.
Down = 120, Probe = 10 - 12 failures to be down
Down = 120, Probe = 90 - 1 failure to be down

This is an interesting explanation. Is it based on an analysis of the appropriate code?

This description suggests that the down timer is effectively ignored if the probe interval is more than half the down time. Since there was no mention made of the down timer ever being ignored I thought that the combination of probe interval 3, down timer 5 would have required two consecutive probe failures to trigger the down event: one to start the down timer and a second (at 3 seconds after the first probe) to maintain the down state for the duration of the down timer. But that glosses over the matter of what exactly is a probe failure? In the absence of an explanation I have assumed that a probe failure occurs when a response to a ping is not received within a certain interval (please don't call it a frequency!) but how long is that interval? Is it related to the probe interval? Does the down timer start when it has been determined that a probe failed OR considered to have started when the "failed" probe was sent OR something else again?

As I have been writing this reply I have been wondering again if the issue reported by a number of users of their WAN links flapping around for lengthy periods might be related to a failure to understand (and describe precisely) how gateway monitoring works. Could someone produce a timeline showing the relationship of the various parameters?

jimp

We'll have to agree to disagree on frequency vs interval. The definition I'm going by is "the number of occurrences within a given time period" (WordNet, M-W has a similar one) only we're defining the time period, not the number of occurrences. If you want to play scientific, we're both wrong, we'd really be defining the period. :-)

My examples are based on actual observations experimenting with the code and watching pings and the gateway events. I was a bit wrong though as I was failing to consider the probes as they really would have happened. Any probe time less than the down time will have to fail at least twice. Not sure how that slipped my mind as I was typing it up.

The down timer starts when a probe fails (a ping fails to return, though I don't recall the specific timeout on that), and it does NOT care about subsequent failures or probes. If it comes back up on the next probe, it will be marked online. A down time of 10 means it will be down in 10 seconds after a failure unless there is a successful probe.

So consider a down time of 40 and a probe of 30. If the gateway drops two pings but recovers the next, it goes like this:

0: ping sent, no response, down timer starts
30: ping sent, no response
40: Gateway marked down.
60: ping sent, response received, gateway marked as up.
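
The timeline above can be reproduced with a small simulation. This is only a Python sketch of the behaviour as described in this post, not apinger's actual logic; the function name and the assumption that the first N probes fail and the rest succeed are made up for the example.

```python
def timeline(interval, down, failed_probes, total_probes):
    """Return (time, event) pairs for probes sent at 0, interval, 2*interval, ...

    Assumption: the first `failed_probes` probes get no response and all
    later probes succeed.
    """
    events = []
    timer_start = None  # time of the failure that started the down timer
    is_down = False
    for n in range(total_probes):
        t = n * interval
        # The down timer may have expired since the previous probe.
        if timer_start is not None and not is_down and timer_start + down <= t:
            events.append((timer_start + down, "gateway marked down"))
            is_down = True
        if n < failed_probes:
            if timer_start is None:
                timer_start = t
                events.append((t, "ping sent, no response, down timer starts"))
            else:
                events.append((t, "ping sent, no response"))
        else:
            if is_down:
                events.append((t, "ping sent, response received, gateway marked as up"))
            else:
                events.append((t, "ping sent, response received"))
            timer_start = None
            is_down = False
    # A timer still running after the last probe expires with no recovery.
    if timer_start is not None and not is_down:
        events.append((timer_start + down, "gateway marked down"))
    return events


# Down = 40, Probe = 30, two dropped pings, then a recovery.
for t, event in timeline(interval=30, down=40, failed_probes=2, total_probes=3):
    print("%d: %s" % (t, event))
```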

So really it's:

Down = 10, Probe = 5 - 2 probes to be down (0, 5)
Down = 5, Probe = 3 - 2 probes to be down (0, 3)
Down = 10, Probe = 1 - 10 failures to be down. (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Down = 120, Probe = 10 - 12 failures to be down (0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110)
Down = 120, Probe = 90 - 2 failures to be down (0, 90)

The one ping/flap oddities start when you have a probe time longer than the down time, but we warn against that in the GUI.
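
The corrected list can be checked with a few lines of Python. This sketch assumes the model described above (probes at 0, interval, 2*interval, ... and the gateway marked down once Down seconds have passed since the first failure); the function name is hypothetical.

```python
def failed_probes_before_down(down, interval):
    """Count the probes sent at 0, interval, 2*interval, ... before `down` seconds.

    Both arguments are whole seconds. The result is the integer ceiling
    of down / interval, which is never less than 1.
    """
    return (down + interval - 1) // interval


for down, interval in [(10, 5), (5, 3), (10, 1), (120, 10), (120, 90)]:
    count = failed_probes_before_down(down, interval)
    print("Down = %d, Probe = %d - %d failed probes" % (down, interval, count))
```

The result only drops to 1 when the probe time is equal to or longer than the down time, which is the case the GUI warns against.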

NOYB

Frequency probe, probe frequency, and probe interval are not always the same things.

I'm with the others that assert probe interval is the better / more appropriate description / label for what is being configured here.

To me, frequency probe is a device that measures the frequency of something.

phil.davis

I can't resist replying to this. wallabybob is correct. A frequency is the number of events/occurrences/cycles/repeats per unit of time (humans nearly always use "per second"). A frequency of 2 in units of "per second" means it happens twice each second, i.e. the interval/period between each event/occurrence/cycle/repeat is 0.5 seconds. A frequency of 10 is an interval/period of 0.1. A frequency of 0.1 is an interval/period of 10.

The number being entered is simply NOT a frequency. That is the scientific and language reality. The description should use words like "interval" or "period" - take your pick.

I do think this should be fixed up on the GUI. There are enough mathematically/scientifically trained people who read the GUI text that see the word "frequency" and then get exactly the inverse idea about what to type in. (I did the first time until I read the rest of the notes and decided it actually must be an interval/period)

NOYB

Nope. Period is not correct either. Mathematical equivalents are not necessarily language equivalents.

Period indicates a duration or a time frame. Such as n seconds or a point in time to another point in time.

Saying “probe period” in this case would indicate either the duration or time frame of the probe. Not the interval between probes. Which is what is being configured here in this case.

Of the terminologies suggested thus far, “Probe Interval” is the most appropriate label for this setting.

For the English language, “Frequency Probe” is an incorrect label for this setting with respect to what is being set/configured. No ifs, ands, or buts about it. It just is. No wonder there is so much confusion surrounding its configuration. Changing it to "Probe Interval" will undoubtedly alleviate much of the confusion.

Please, I'm begging. Pretty please with mounds of chocolate on top. Change it. ;D

adam65535

The way these two values are used together seems strange. Even people on this board seem to be having trouble understanding it.

Why not set a time between probes and then set a count for the number of retries to determine down? It would be easier to think about IMHO, coming from someone like me without a math or physics background :). It is much more logical for any timeout to sync with the probe interval (or whatever you call it). Why allow someone to set a probe interval of 30 seconds and a timeout of 40 seconds? That seems confusing to me and would make me second guess what I am doing.

You could then display the resulting overall timeout time before down or that just might make it confusing again though :).
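
A rough Python sketch of that idea, for illustration only. The function names and the rule that the down time must be a whole multiple of the probe interval are assumptions, not anything that exists in pfSense.

```python
def down_time_from_count(probe_interval, failed_probe_count):
    """Overall seconds before 'down' when the user enters a probe count."""
    return probe_interval * failed_probe_count


def count_from_down_time(probe_interval, down_time):
    """Convert an existing Down value to a probe count, rejecting odd combinations."""
    if down_time % probe_interval != 0:
        raise ValueError(
            "down time %ds is not a multiple of the %ds probe interval"
            % (down_time, probe_interval)
        )
    return down_time // probe_interval


print(down_time_from_count(30, 2))    # 2 failed probes, 30 seconds apart
print(count_from_down_time(10, 120))  # an existing 120 second Down value
```

With this scheme a combination like a 30 second interval and a 40 second timeout could not be entered at all.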

phil.davis

I agree with NOYB - period is appropriate for things like waveforms, that have a cycle. Interval is better for things that happen at a repeating point in time. The underlying field in the code is already called 'interval'.
I have submitted pull request https://github.com/pfsense/pfsense/pull/673 to make it say "Probe Interval" everywhere. IMHO that is the way to go for now.

@adam65535 has just suggested that the down time should only be able to be a multiple of the probe interval. In that case, the user can specify the number of probes, rather than down time. That would remove any ambiguity. I can't think of any real-world use-cases where this more-restricted behaviour would be a problem.

However, I suspect it is well and truly too late to be doing that for 2.1.