Dpinger controls

phil.davis

1) The GUI still allows input of losslow and losshigh percentages for the acceptable packet loss. Only losshigh is passed on to dpinger, which uses it to alarm if the packet loss goes above losshigh%.
But in get_dpinger_status() the loss percentage returned from dpinger is compared to losslow to determine whether the gateway state should be marked "loss".
For example, with the default low/high of 10/20, if the loss goes above 20% then there will be a real alarm that results in gateway failover. When the loss average drops below 20%, dpinger will "unalarm" and gateway failback should happen. However, the gateway status/widget will show gateway "loss" whenever the current loss is above 10%.

Is that intentional?
Or should losslow be removed, and everything depend just on losshigh now?

2) Same for latencylow and latencyhigh msec. latencyhigh is being passed through to dpinger as the alarm point. latencylow is being used in get_dpinger_status() to determine gateway status "delay".

3) Relationship between the new set of parameters Probe Interval, Loss Interval and Time Period. The text at Additional Information on system_gateways_edit.php needs to be rewritten to explain these. As I understand it,

a) Probe Interval - how often an echo request (ping) is sent out - default 250ms means there will be 4 pings every second.

b) Loss Interval - how long to wait for an echo reply before considering that an echo request has been lost. Default 500ms means that if echo replies are taking a while to come back, there might be 2 echo requests outstanding at one time. If you set it to something high, e.g. 2000ms, then there might be up to 8 echo requests outstanding at one time. This parameter is largely independent of Probe Interval. For example, if the monitor IP is something very local and should always respond quickly (e.g. well inside 50ms) then you could set Loss Interval to 50ms and nothing bad would happen. Every Probe Interval (250ms) an echo request would be sent out. 99.9% of the time the echo reply will be received in only a few msecs. If the echo reply is not received within 50ms it will be counted as loss. dpinger will be effectively "idle" for another 200ms until it sends out the next echo request.
Implication: the validation does not need to check that Loss Interval > Probe Interval.

c) Time Period - the time over which a rolling average of the observed RTT and loss% is calculated. Default 25000ms means, at the default Probe Interval of 250ms, that it will be a rolling average of the results of the last 100 probes.
Implication: validation should check that (Time Period > Probe Interval) - and it would probably be odd to have (Time Period < 2 * Probe Interval) - that would make the gateway alarm on a single packet delay or loss, and unalarm again on a single packet success.
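
The arithmetic in a) to c) can be checked with a small Python sketch. This is only an illustration of the reasoning above, not pfSense or dpinger code: the function names are hypothetical, and the "at least 2 * Probe Interval" rule is the validation proposed here, not a confirmed implementation.

```python
import math

def max_outstanding_probes(probe_interval_ms, loss_interval_ms):
    """Upper bound on echo requests still waiting for a reply
    before the oldest one is counted as lost."""
    if probe_interval_ms <= 0 or loss_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    # Even with a very short loss interval, the probe just sent is outstanding.
    return max(1, math.ceil(loss_interval_ms / probe_interval_ms))

def probes_in_window(probe_interval_ms, time_period_ms):
    """Number of probes that contribute to the rolling average."""
    return time_period_ms // probe_interval_ms

def validate_intervals(probe_interval_ms, loss_interval_ms, time_period_ms):
    """Return a list of validation errors (empty list means OK)."""
    errors = []
    for name, value in (("Probe Interval", probe_interval_ms),
                        ("Loss Interval", loss_interval_ms),
                        ("Time Period", time_period_ms)):
        if value <= 0:
            errors.append(name + " must be positive")
    if errors:
        return errors
    # Deliberately no check of Loss Interval against Probe Interval:
    # a Loss Interval shorter than the Probe Interval is a valid setup.
    if time_period_ms < 2 * probe_interval_ms:
        errors.append("Time Period must be at least 2 * Probe Interval")
    return errors

print(max_outstanding_probes(250, 500))     # 2
print(max_outstanding_probes(250, 2000))    # 8
print(max_outstanding_probes(250, 50))      # 1
print(probes_in_window(250, 25000))         # 100
print(validate_intervals(250, 50, 25000))   # []
print(validate_intervals(1000, 500, 1500))  # one error about Time Period
```

Note that a Loss Interval of 50ms with a Probe Interval of 250ms passes validation, matching the "very local monitor IP" example in b).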

Comments please, do I understand this correctly? What is the proper explanation to go on the GUI? What is the proper validation to put in the code?

rbgarga

@phil.davis:

#1 - losslow is used to show a warning on gateway widget / gateway status. If you have losslow = 10 and losshigh = 20:

loss < 10 - status = green (Online)
loss >= 10 and loss < 20 - status = yellow (loss)
loss >= 20 - status = red (down)

#2 - Same as #1. dpinger only has a binary alarm (on/off), but the PHP code handles the warning state, when the alarm is off but latency or loss is higher than the low value.

#3

a) Correct

b) Yeah, you are right. Will you submit a PR or should I change it?

c) Your idea is good. I'm in favor of denying (Time Period < 2 * Probe Interval). Same question: will you do it or should I? :)

And please submit any text changes you judge necessary, your English is much better than mine.

Thank you!
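
The status logic described in #1 and #2 above can be sketched in Python as follows. This is a hedged illustration only: the real logic lives in the PHP function get_dpinger_status(), the "down" state actually comes from dpinger's alarm rather than from a comparison in the GUI code, and the function name, the 200/500 latency defaults (taken from values mentioned in this thread) and the precedence of "loss" over "delay" are assumptions.

```python
def gateway_status(loss_pct, latency_ms,
                   losslow=10, losshigh=20,
                   latencylow=200, latencyhigh=500):
    """Map averaged loss and latency to a display status.

    Hypothetical helper: thresholds follow the table in #1,
    applied the same way to latency as described in #2.
    """
    # High thresholds correspond to dpinger raising its alarm.
    if loss_pct >= losshigh or latency_ms >= latencyhigh:
        return "down"
    # Low thresholds only produce a warning in the widget/status page.
    if loss_pct >= losslow:
        return "loss"
    if latency_ms >= latencylow:
        return "delay"
    return "online"

print(gateway_status(5, 25))    # online
print(gateway_status(15, 25))   # loss
print(gateway_status(25, 25))   # down
print(gateway_status(0, 300))   # delay
```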

grandrivers

Getting a latency warning in gateways. It's set at 200/500 and ping is coming back at 25.xxx ms. Tried moving to 2000/5000, still shows up.

Dec 11 09:17:17 dpinger send_interval 1000ms report_interval 1000ms loss_interval 10000ms time_period 25000ms alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 8.8.4.4 bind_addr 192.168.254.1 alert_cmd "/etc/rc.gateway_alarm DSL_DHCP"

Also the known issue of no RRD data yet.

phil.davis

Pull request https://github.com/pfsense/pfsense/pull/2207

The remaining issue I see is the setting of "loss interval" and "latencyhigh". Currently the default loss interval is 500 and latencyhigh is also 500. This makes no sense to me. If a probe comes back in > 500ms then the thread that is waiting for the reply will have given up (loss interval has expired). So any packets with an RTT > 500ms will be considered lost. Therefore there will be no packets recorded with an RTT > 500ms. Therefore the average latency can never exceed 500ms = latencyhigh.
It seems to me that "loss interval" needs to be reasonably higher than "latencyhigh" in any sensible configuration.
Thoughts?

phil.davis

Looking at the dpinger source code, setting "loss interval" and "latencyhigh" to the same value is not as silly as I first thought. The recv_thread is waiting for all incoming echo replies, so they do not actually time out after "loss interval". It is just that when the alarm or report calculation is done, any entries in the array of sent packets that do not have a reply and are older than "loss interval" are counted as lost.

On a high-latency link, some of those "lost" reply packets might actually show up some time later, and at the next calculation they will be included in the average latency calculation and no longer be included in the "packet loss percentage" calculation.

But still it seems weird if these 2 parameters are set the same (or nearly the same) - it results in packets being counted as "lost" at first, then later they just turn out to be "high latency".
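
The "lost first, high latency later" effect can be reproduced with a small Python sketch of the calculation as described above. This is not dpinger's code (dpinger is written in C); the Probe class and summarize function are hypothetical names, and the way the loss percentage is computed here (lost divided by lost plus replied) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Probe:
    sent_ms: int
    rtt_ms: Optional[int] = None  # None until an echo reply arrives

def summarize(probes, now_ms, loss_interval_ms):
    """Return (average latency in ms or None, loss percentage)."""
    replied = [p for p in probes if p.rtt_ms is not None]
    # Unanswered probes only count as lost once they are older
    # than the loss interval; younger ones are still pending.
    lost = [p for p in probes
            if p.rtt_ms is None and now_ms - p.sent_ms > loss_interval_ms]
    counted = len(replied) + len(lost)
    avg = sum(p.rtt_ms for p in replied) / len(replied) if replied else None
    loss_pct = 100.0 * len(lost) / counted if counted else 0.0
    return avg, loss_pct

probes = [Probe(0, 20), Probe(250), Probe(500, 30)]

# At t=1000ms the second probe is 750ms old with no reply: counted as lost.
print(summarize(probes, now_ms=1000, loss_interval_ms=500))

# Its reply finally shows up with an 800ms RTT. At the next calculation
# it is no longer lost, but it pulls the average latency up instead.
probes[1].rtt_ms = 800
print(summarize(probes, now_ms=1250, loss_interval_ms=500))
```

With "loss interval" well above "latencyhigh", a slow reply like this would be counted as high latency from the start rather than changing category between calculations.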

rbgarga

@grandrivers:

I pushed a fix for Latency and Loss bad math. Thanks for reporting.

Also, RRD is now working.

grandrivers

looking good now thanks
will keep an eye out for issues

luckman212

How do we get these latest fixes? just do a gitsync?

cmb

gitsync only if you're on a new enough snapshot that you have the dpinger binary. The latest snapshot now available should be new enough to have caught it all.