Member Down triggering with 0% loss

SteveITS

I thought we'd be good now but I am still confused. With the gateway group set to use Packet Loss, and changes applied yesterday, we got these notifications today:

9:47:06 MONITOR: WANGW is down, omitting from routing group GWGROUP
8.8.4.4|50.x.x.x|WANGW|506.332ms|600.226ms|0.0%|down

9:47:39 28411MONITOR: WANGW is available now, adding to routing group GWGROUP
8.8.4.4|50.x.x.x|WANGW|400.057ms|594.767ms|0.0%|delay
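For anyone comparing several of these alerts, the pipe-delimited status line can be split into named fields. This is only a rough sketch. The field names are my own guesses (the notification does not label them), and in particular reading the two millisecond values as average RTT and RTT standard deviation is an assumption. `parse_status` and `GatewayStatus` are hypothetical helper names, not anything that ships with pfSense.

```python
from typing import NamedTuple


class GatewayStatus(NamedTuple):
    # Field names are assumptions; the alert email does not label them.
    monitor_ip: str
    source_ip: str
    name: str
    rtt_ms: float      # assumed: average round-trip time
    rtt_sd_ms: float   # assumed: RTT standard deviation
    loss_pct: float
    status: str


def _strip_suffix(value: str, suffix: str) -> str:
    """Remove a trailing unit such as 'ms' or '%' if present."""
    return value[: -len(suffix)] if value.endswith(suffix) else value


def parse_status(line: str) -> GatewayStatus:
    """Split one 'a|b|c|...' status line from the alert into typed fields."""
    parts = line.strip().split("|")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    monitor_ip, source_ip, name, rtt, rtt_sd, loss, status = parts
    return GatewayStatus(
        monitor_ip,
        source_ip,  # kept as a string because the address is masked (50.x.x.x)
        name,
        float(_strip_suffix(rtt, "ms")),
        float(_strip_suffix(rtt_sd, "ms")),
        float(_strip_suffix(loss, "%")),
        status,
    )


if __name__ == "__main__":
    print(parse_status("8.8.4.4|50.x.x.x|WANGW|506.332ms|600.226ms|0.0%|down"))
```

Parsed this way, the first alert shows a "down" status alongside 0.0 in the loss field, which is the mismatch I'm asking about.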

SteveITS

I upgraded the device to 2.4.4_3 five days ago. We did get a real, momentary packet loss outage yesterday with the gateway group set to Packet Loss, but we just got another alert with 0% loss and high latency. As these seem to be momentary spikes in latency, I guess I will set it back to Member Down and raise the high latency limit above the default 500, and will report back if it recurs. Overall I'm not sure the Packet Loss setting is triggering on packet loss...

FWIW this is on an SG-2440.

SteveITS

We still see this occasionally. We thought we caught one culprit, with a MacBook uploading something large, perhaps a backup? We tried putting that device in a lower priority queue but that doesn't seem to have helped much...likely other devices are generating traffic bursts (it's a church so lots of roving devices on the "guest" wireless). We tried giving ICMP a higher priority queue as well.

Am I just misunderstanding, and the Trigger Level setting in the gateway group doesn't actually have anything to do with when the gateway is considered down? Because it sure seems like latency can trigger the down state regardless of the Trigger Level choice, as I posted above and have been experiencing.

Any suggestions beyond just cranking the latency threshold setting up until the failovers end? It's set to 1200 right now, which gives fewer failovers, but it seems like a traffic burst shouldn't really cause a failover. Unless I'm chasing the wrong thing...

For reference the emails today were:
13:10:45 MONITOR: WANGW is down, omitting from routing group GWGROUP
8.8.4.4|50.x.x.x|WANGW|1220.378ms|2222.053ms|0.0%|down
13:11:45 77236MONITOR: WANGW is available now, adding to routing group GWGROUP
8.8.4.4|50.x.x.x|WANGW|869.386ms|986.797ms|3%|delay

Derelict

You can trigger on latency, loss, or either. See the advanced settings in the gateway.

SteveITS

@Derelict I see the latency and packet loss threshold settings there, that's what I've been adjusting. Are you saying that 1) there's no way to choose between the two, and/or 2) Trigger Level in the gateway group isn't used?
Thanks,

Derelict

Sorry. Look at the trigger level in the gateway group.

SteveITS

I see the Trigger Level setting, but per my earlier posts it seems to have no effect, e.g., set to Packet Loss it triggered at "8.8.4.4|50.x.x.x|WANGW|506.332ms|600.226ms|0.0%|down". Is that not 0.0% packet loss or am I misreading?

I suppose I can set it to Packet Loss and 5000 ms but it seems like that shouldn't be necessary to do both. :)

Edit: do I need to do something besides saving and Applying the changes on the gateway groups page to apply the Trigger Level?

Derelict

It triggered on 600ms latency there.

SteveITS

@Derelict said in Member Down triggering with 0% loss:

It triggered on 600ms latency there.

I figured that, but if Trigger Level is set to Packet Loss shouldn't it allow any latency number?
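To spell out what I expected the Trigger Level to do, here is a minimal sketch of the decision as I understood it. This is my mental model only, not pfSense's actual code. The function name, the trigger labels ("loss", "latency", "either") and the threshold parameters are all made up for illustration.

```python
def expected_to_drop(trigger, latency_ms, loss_pct, *, latency_high_ms, loss_high_pct):
    """Return True if I would EXPECT the member to be dropped from the group.

    trigger: hypothetical label, one of "loss", "latency", "either".
    The thresholds stand in for the high latency / high loss settings on the
    gateway; their values are whatever the admin configured.
    """
    loss_bad = loss_pct > loss_high_pct
    latency_bad = latency_ms > latency_high_ms

    if trigger == "loss":
        return loss_bad            # latency should be ignored entirely
    if trigger == "latency":
        return latency_bad         # loss should be ignored entirely
    if trigger == "either":
        return loss_bad or latency_bad
    raise ValueError(f"unknown trigger: {trigger!r}")


if __name__ == "__main__":
    # First alert in this thread: 506.332 ms, 0.0% loss, latency limit 500.
    # The 20% loss limit here is a placeholder, not a value from the thread.
    print(expected_to_drop("loss", 506.332, 0.0,
                           latency_high_ms=500, loss_high_pct=20))
```

Under that model the first alert should not have caused a failover with the group set to Packet Loss, yet it did. So either the model is wrong or the setting isn't being applied.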

SteveITS

Had to wait a bit due to the generally lower activity but it triggered again today set to Packet Loss:

13:46:51 MONITOR: WANGW is down, omitting from routing group GWGROUP
8.8.4.4|50.x.x.x|WANGW|1232.018ms|1382.056ms|0.0%|down

Derelict

I suppose if you can reproduce it readily file a bug report at https://redmine.pfsense.org/

You are going to be in a very small club wanting that gateway to remain viable at 1300ms latency.

SteveITS

Unfortunately I'm not sure how to reproduce it on demand. It seems to be transient, but it lasts just long enough for it to fail over and then fail back within a few seconds. Our best guess comes from the one time I was able to log in within a few minutes and see any sort of high traffic: there was high upload traffic from a MacBook, so maybe some sort of backup, and then the upload fails at the gateway change. I tried to make that device lower priority but it doesn't seem to have helped much. At the moment it's still happening every month or so, since we raised the latency threshold a few times and changed back to Packet Loss.

Do you happen to know if the Packet Loss trigger has a time period, like 5 seconds or 60 seconds?

I figure there is something in the code for "x% loss OR 1000ms" like no one would ever get to that point, and it's just not stated anywhere...

Derelict

@teamits I would read all the settings at the bottom of the gateway configuration page.

SteveITS

Yeah...I wasn't looking at a router at the time and I hadn't looked at this one in a month. Oops.

Although that did light a bulb for me. Loss Interval says "Time interval in milliseconds before packets are treated as lost. Default is 2000." Do packets that are "treated as lost" actually get counted in the loss percentage? With an average of 1300, perhaps a few are taking longer than 2000ms and are considered "lost" even though they arrive in, say, 2100ms, which would explain the 0% loss shown? I think I'll try using 120s for the time interval to see if that "provides smoother results."
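To make the Loss Interval question concrete, here is a small sketch of the arithmetic under the assumption that a late reply is counted as lost. Whether the monitor really accounts for it this way is exactly what I don't know. `loss_percent` is a hypothetical helper and the sample reply times are invented, not measurements from this router.

```python
def loss_percent(reply_times_ms, loss_interval_ms=2000.0):
    """Percentage of probes counted as lost.

    reply_times_ms: one entry per probe; a float is the reply time in ms,
    None means no reply ever arrived.
    ASSUMPTION: a reply slower than loss_interval_ms counts as lost.
    The 2000 default is taken from the help text quoted above.
    """
    if not reply_times_ms:
        return 0.0
    lost = sum(1 for t in reply_times_ms if t is None or t > loss_interval_ms)
    return 100.0 * lost / len(reply_times_ms)


if __name__ == "__main__":
    samples = [1100.0, 1300.0, 2100.0, 1250.0]   # invented values
    print(loss_percent(samples))                  # one of four exceeds 2000 ms
    print(loss_percent(samples, loss_interval_ms=5000.0))
```

If late replies were counted like this, I would expect the reported loss to be non-zero during a spike. The alerts show 0.0%, which is what makes me wonder how the "treated as lost" packets are handled.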

Overall the goal was just to not have the connection drop/failover now and again, with 0% loss shown. High latency isn't great but moving the traffic from cable to DSL isn't generally going to improve that if it's due to traffic.