Latency Thresholds seem to be ignored!

perholmes

Hi,

I've gone over this a hundred different ways, but the values set for latency low/high seem not to be sticking, and I could really use some input.

I have 4 WAN connections. Normal range is about 70 ms latency, 0% packet loss. But at night, they can spike to 1,000 ms latency and 5% packet loss.

For testing, I've set each connection for latency at 20 ms / 21 ms and packet loss at 1% and 2% just to be sure that they fire all of the time. I ping neighborhood monitor IPs, and down is set to 3 intervals (really aggressive).

But none of the links go down, even with these extreme settings, and it's driving me nuts. I have 3 connections right now that are spending significant time in the +500ms range. But the Gateway Status is just sitting on Warning: 500 ms Latency.

Why are these values not sticking? It seems like the load balancer is simply not using the values at all, but using other values that I can't find.

What puzzles me is that apinger.conf says:

target "216.146.35.35" {
        description "WAN4_DHCP"
        srcip "192.168.67.3"
        alarms override "WAN4_DHCPloss","WAN4_DHCPdelay","WAN4_DHCPdown";
        rrd file "/var/db/rrd/WAN4_DHCP-quality.rrd"
}

Does that mean that the latency thresholds aren't even used when you use Gateway monitoring? Then where do you set the values? Are there no values at all you can configure when you use Gateway Monitoring? And then, why isn't that explained or even hinted at?

I'm super confused, and I've tried every combination of every setting.

I'm on pfSense 2.1.4-RELEASE.

Thanks!

Per

perholmes

Hi,

Just to show how nuts this is, right now one connection is having over 1,000 ms latency, with thresholds at just 20 ms and 21 ms, but it's not shutting down the link. See images.

Best,

Per

[Attached image: img1.png]

[Attached image: img2.png]

phil.davis

Look in apinger.conf for the definitions of WAN4_DHCPloss etc. For example I have this in apinger.conf on a test system:

alarm loss "WAN_DHCPloss" {
	percent_low 40
	percent_high 50
}
alarm delay "WAN_DHCPdelay" {
	delay_low 4000ms
	delay_high 5000ms
}
target "192.168.1.1" {
	description "WAN_DHCP"
	srcip "192.168.1.116"
	alarms override "WAN_DHCPloss","WAN_DHCPdelay","down";
	rrd file "/var/db/rrd/WAN_DHCP-quality.rrd"
}
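If it helps with the checking, here is a minimal Python sketch that pulls the named alarm definitions and the per-target alarm lists out of apinger.conf text, so the thresholds actually written to the file can be compared with what was entered in the GUI. The function names (parse_alarms, parse_targets) and the regular expressions are my own illustration, not part of apinger or pfSense, and they only handle the simple block layout shown in the excerpt above (unnamed "default" blocks are skipped).

```python
import re

# Sample text copied from the apinger.conf excerpt in this post.
SAMPLE_CONF = '''
alarm loss "WAN_DHCPloss" {
    percent_low 40
    percent_high 50
}
alarm delay "WAN_DHCPdelay" {
    delay_low 4000ms
    delay_high 5000ms
}
target "192.168.1.1" {
    description "WAN_DHCP"
    srcip "192.168.1.116"
    alarms override "WAN_DHCPloss","WAN_DHCPdelay","down";
    rrd file "/var/db/rrd/WAN_DHCP-quality.rrd"
}
'''

# Hypothetical helper patterns: they match only named blocks.
ALARM_RE = re.compile(r'\balarm\s+(\w+)\s+"([^"]+)"\s*\{([^}]*)\}')
TARGET_RE = re.compile(r'\btarget\s+"([^"]+)"\s*\{([^}]*)\}')


def parse_alarms(conf_text):
    """Return {alarm_name: {"type": kind, setting: value, ...}}."""
    alarms = {}
    for kind, name, body in ALARM_RE.findall(conf_text):
        settings = {"type": kind}
        for line in body.strip().splitlines():
            parts = line.split(None, 1)  # "delay_low 4000ms" -> key, value
            if len(parts) == 2:
                settings[parts[0]] = parts[1].strip()
        alarms[name] = settings
    return alarms


def parse_targets(conf_text):
    """Return {target_ip: [alarm names on its 'alarms' line]}."""
    targets = {}
    for ip, body in TARGET_RE.findall(conf_text):
        names = []
        for line in body.splitlines():
            line = line.strip()
            if line.startswith("alarms"):
                names = re.findall(r'"([^"]+)"', line)
        targets[ip] = names
    return targets


if __name__ == "__main__":
    alarms = parse_alarms(SAMPLE_CONF)
    for ip, names in parse_targets(SAMPLE_CONF).items():
        for name in names:
            # Built-in alarms such as "down" may have no named block here.
            print(ip, name, alarms.get(name, "(no named definition found)"))
```

To use it on a real box, pass the text of your own apinger.conf instead of SAMPLE_CONF. It only reads the file, so nothing gets overwritten.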

Gateway groups should remove the "down" gateways, so traffic using those gateway groups will not be directed out the "down" gateway/s any more.
Are there messages about that in the system logs?
What gateway groups do you have, and what rules are feeding into them?
(without that, you won't see any difference when a gateway is "down")

perholmes

Hi,

So, it's very simple. All 4 gateways are part of the same group, and it's set to Packet Loss or High Latency (although I've tried all the other settings as well). A screenshot is attached. There's no hocus pocus here; this is a standard setup straight out of the manual.

Editing apinger.conf is not an option, because it's overwritten by the system. Those changes won't last 5 seconds.

There are plenty of alarms in the system log, but the gateways DO NOT go down according to the parameters. With latency set at 20/21 ms (which is way too good for the ADSL connections), the links should be down ALL THE TIME. But nothing is happening; the latency thresholds are clearly not being used.

Nov 18 07:53:54 	apinger: ALARM: WAN2_DHCP(2.118.128.9) *** WAN2_DHCPdown ***
Nov 18 07:53:55 	apinger: alarm canceled: WAN2_DHCP(2.118.128.9) *** WAN2_DHCPdown ***
Nov 18 07:54:06 	apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 07:54:08 	apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:02:59 	apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:03:00 	apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:03:07 	apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:03:08 	apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:04:18 	apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:04:19 	apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:07:38 	apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:07:46 	apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:11:31 	apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:11:50 	apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:12:00 	apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:12:10 	apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***

[Attached image: img3.png]

phil.davis

"Editing apinger.conf is not an option, because it's overwritten by the system. Those changes won't last 5 seconds."

I did not mean for you to change apinger.conf - just look in it and see that good-looking alarm definitions are in there. Based on your logs, it seems they are.
If all the gateways are declared down then the system by default fails over to default routing: System->Advanced->Miscellaneous, Gateway Monitoring, Skip rules when gateway is down - that is off by default.
To test, maybe you should let 1 of the gateways have monitor parameters that let it stay up. Then check your apparent public IP from a client to see what fails over to what.

Then of course the question is why is the alarm getting cancelled so soon after it goes off?
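To put numbers on that, here is a rough Python sketch that pairs each "ALARM" line with the matching "alarm canceled" line from the log excerpt posted above and reports how many seconds each alarm stayed raised. The regular expression and the function name alarm_durations are my own illustration, based only on the line format in that excerpt. Syslog lines carry no year, so the year=2014 default is an assumption.

```python
import re
from datetime import datetime

# Log lines copied from the excerpt posted earlier in this thread.
LOG = """\
Nov 18 07:53:54 apinger: ALARM: WAN2_DHCP(2.118.128.9) *** WAN2_DHCPdown ***
Nov 18 07:53:55 apinger: alarm canceled: WAN2_DHCP(2.118.128.9) *** WAN2_DHCPdown ***
Nov 18 07:54:06 apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 07:54:08 apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:02:59 apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:03:00 apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:03:07 apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:03:08 apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:04:18 apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:04:19 apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:07:38 apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:07:46 apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:11:31 apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:11:50 apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:12:00 apinger: ALARM: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
Nov 18 08:12:10 apinger: alarm canceled: WAN4_DHCP(216.146.35.35) *** WAN4_DHCPdelay ***
"""

# Hypothetical pattern fitted to the lines above: timestamp, event,
# gateway name, monitor IP in parentheses, alarm name between asterisks.
LINE_RE = re.compile(
    r'^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+apinger: (ALARM|alarm canceled): '
    r'(\S+?)\(([^)]+)\) \*\*\* (\S+) \*\*\*'
)


def alarm_durations(log_text, year=2014):
    """Return [(gateway, alarm, seconds_raised), ...] in log order."""
    open_alarms = {}  # (gateway, alarm) -> time the alarm was raised
    durations = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        stamp, event, gateway, _ip, alarm = m.groups()
        # The year is assumed; only the differences between stamps matter.
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        key = (gateway, alarm)
        if event == "ALARM":
            open_alarms[key] = when
        elif key in open_alarms:
            started = open_alarms.pop(key)
            seconds = int((when - started).total_seconds())
            durations.append((gateway, alarm, seconds))
    return durations


if __name__ == "__main__":
    for gateway, alarm, seconds in alarm_durations(LOG):
        print(f"{gateway} {alarm}: raised for {seconds} s")
```

On the excerpt above, no alarm stays raised for more than 19 seconds, and five of the eight are cancelled within 1 or 2 seconds.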

Do you also have 216.146.35.35 as a DNS server? (That is DynDNS public DNS)
If so, the system probably has a route to that address because of the DNS server definition. As soon as the alarm goes off, it might fail the route over to some other WAN. Then of course apinger sees 216.146.35.35 up again and cancels the alarm, the route goes back to WAN4, and around we go.
I think there is some issue there that means you have to use monitor IPs that are different to the defined DNS servers.

perholmes

Hi,

Thanks for your advice. I'll try to make sure that there's no overlap between DNS servers and monitor IPs. There could be a skeleton buried there.

As for apinger.conf, it does reflect all the CP settings correctly.

I'm right now only testing on a single WAN port with extreme settings in order to try to force it down. I'll start with making sure that this WAN interface is monitoring a unique IP.

perholmes

Hi,

There's unfortunately no overlap. My DNS servers are:

208.67.222.222 - WAN1
208.67.220.220 - WAN2
208.67.222.222 - WAN3
208.67.220.220 - WAN4

I'm testing on WAN 3, which has these settings:

Monitor IP: 95.174.20.211 (not used anywhere else)
Latency Low: 20ms
Latency High: 21ms
Packet Loss Low: 1
Packet Loss High: 2
Interval: 1 Second
Down: 3 Seconds

By any stretch of the imagination, this link should fail, but it stays up with sometimes over 500ms of latency.
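
For reference, the DNS/monitor overlap check can be written as a small Python sketch. It uses the DNS servers listed above and the monitor IPs that appear in this thread: WAN2 from the log excerpt, WAN3 from the settings above, and WAN4 from the first post. The WAN1 monitor IP was never quoted, so it is left out, and I'm assuming the other values are still the current ones. The function name overlapping_addresses is just for illustration.

```python
import ipaddress

# DNS servers per WAN, as listed in this post.
DNS_SERVERS = {
    "WAN1": "208.67.222.222",
    "WAN2": "208.67.220.220",
    "WAN3": "208.67.222.222",
    "WAN4": "208.67.220.220",
}

# Monitor IPs quoted in this thread (assumed still current).
# WAN1 is omitted because its monitor IP was never posted.
MONITOR_IPS = {
    "WAN2": "2.118.128.9",
    "WAN3": "95.174.20.211",
    "WAN4": "216.146.35.35",
}


def overlapping_addresses(dns_servers, monitor_ips):
    """Return a sorted list of addresses used both as DNS server and monitor IP."""
    dns = {ipaddress.ip_address(a) for a in dns_servers.values()}
    mon = {ipaddress.ip_address(a) for a in monitor_ips.values()}
    return [str(a) for a in sorted(dns & mon)]


if __name__ == "__main__":
    shared = overlapping_addresses(DNS_SERVERS, MONITOR_IPS)
    print("Overlap:", shared if shared else "none")
```

With the addresses above, the intersection is empty, which matches what I found by hand.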