"Member Down" problem

kevindd992002

Down == above the defined thresholds you have on the gateway for what should be considered down.

Exactly my point. Any idea why the issue is happening on my end?

luckman212

What does your System Logs > Gateways look like when this happens?

kevindd992002

Nov 11 13:12:29 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 11 13:21:23 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 11 13:22:04 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 11 14:28:51 apinger: ALARM: WAN2_DHCP(8.8.4.4) *** delay ***
Nov 11 14:28:59 apinger: alarm canceled: WAN2_DHCP(8.8.4.4) *** delay ***
Nov 11 21:42:44 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:42:54 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:48:24 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:49:04 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:51:18 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:52:10 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:52:46 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:53:01 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 12 06:06:03 apinger: ALARM: WAN2_DHCP(8.8.4.4) *** loss ***
Nov 12 06:06:11 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 06:06:44 apinger: alarm canceled: WAN2_DHCP(8.8.4.4) *** loss ***
Nov 12 06:06:59 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 06:28:57 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 06:29:43 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 17:38:58 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** loss ***
Nov 12 17:38:59 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 17:39:38 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 17:40:58 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** loss ***
Nov 12 19:28:12 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:30:50 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:30:58 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:38:44 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:38:59 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:39:28 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:43:09 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:48:12 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 13 13:20:26 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** WAN1_DHCPdown ***
Nov 13 13:20:26 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** WAN3_DHCPdown ***
Nov 13 13:20:26 apinger: ALARM: WAN2_DHCP(8.8.4.4) *** WAN2_DHCPdown ***
Nov 13 13:23:35 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** WAN3_DHCPdown ***
Nov 13 13:23:36 apinger: alarm canceled: WAN2_DHCP(8.8.4.4) *** WAN2_DHCPdown ***
Nov 13 13:23:36 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** WAN1_DHCPdown ***
Nov 13 13:25:47 apinger: ALARM: WAN2_DHCP(8.8.4.4) *** loss ***
Nov 13 13:25:50 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 13 13:26:33 apinger: alarm canceled: WAN2_DHCP(8.8.4.4) *** loss ***
Nov 13 13:26:34 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 15 04:28:55 apinger: Starting Alarm Pinger, apinger(23592)
Nov 15 04:28:59 apinger: SIGHUP received, reloading configuration.
Nov 15 04:29:00 apinger: SIGHUP received, reloading configuration.
Nov 15 04:29:03 apinger: SIGHUP received, reloading configuration.
Nov 15 17:22:49 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 15 17:22:51 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 15 17:23:01 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 15 17:23:03 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 15 17:23:14 apinger: SIGHUP received, reloading configuration.
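
For anyone reading a log like the one above, a small script can pair each ALARM with its matching "alarm canceled" line and show how long each alarm lasted. This is only a rough sketch based on the line format pasted here. The function name and the `year` parameter are my own; the log lines carry no year, so an arbitrary placeholder is used and alarms that span New Year would come out wrong.

```python
import re
from datetime import datetime

# Matches lines such as:
# Nov 11 13:21:23 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) apinger: "
    r"(?P<event>ALARM|alarm canceled): "
    r"(?P<gw>[^\s(]+)\((?P<ip>[^)]+)\) \*\*\* (?P<kind>.+?) \*\*\*$"
)

def alarm_durations(lines, year=2000):
    """Return (gateway, alarm kind, seconds) for every ALARM/canceled pair.

    `year` is a placeholder because the log lines do not include one.
    """
    open_alarms = {}
    durations = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # startup, SIGHUP and other lines are ignored
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        key = (m["gw"], m["kind"])
        if m["event"] == "ALARM":
            open_alarms[key] = ts
        elif key in open_alarms:
            start = open_alarms.pop(key)
            durations.append((m["gw"], m["kind"], int((ts - start).total_seconds())))
    return durations

sample = [
    "Nov 11 13:12:29 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***",
    "Nov 11 13:21:23 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***",
    "Nov 11 13:22:04 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***",
    "Nov 12 19:43:09 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***",
    "Nov 12 19:48:12 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***",
]
for gw, kind, seconds in alarm_durations(sample):
    print(gw, kind, seconds)
```

A cancel line with no earlier ALARM (like the first sample line) is skipped, since its start time is unknown.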

cmb

How are the latency/loss settings configured on your gateways? What latency and loss does Status>Gateways show when that happens? You can also check the quality RRD graph (Status>RRD Graph) to see past values.

kevindd992002

@cmb:

How are the latency/loss settings configured on your gateways? What latency and loss does Status>Gateways show when that happens? You can also check the quality RRD graph (Status>RRD Graph) to see past values.

My latency and loss settings in all three gateways are blank (default). What exact information do I need to check in the RRD graphs? There is a ton of information there.

EDIT: I've attached the RRD graph that I think is relevant. I just got another notification from pfSense that my WAN2_DHCP gateway went down, and the packet loss and latency at that time look quite high. But why would that affect the probing of the interface and cause it to be tagged as "down"?

[Attachment: Capture.JPG]

kejianshi

I just turned off gateway monitoring on one of mine not long ago because it was more important that my pfsense work than that I have a pretty graph.

cmb

Your averaged out loss is upwards of 18%, you're definitely getting cycles where it's over 20%, and 20% will take down the WAN. Increase the loss threshold if that's normal behavior for your WAN. I suspect you either have shaping or limiters configured in such a way that you're de-prioritizing and dropping your monitor pings, or you have an issue of some sort with that connection if it gets that bad under load.
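
To illustrate how an average of 18% can still hide stretches above 20%, here is a minimal sketch. It is not apinger's actual code. The function names, the 20-probe window and the probe pattern are made up for illustration, and the 10% "warning" level is assumed to be the default low threshold. Only the 20% "down" level comes from the explanation above.

```python
def loss_pct(results):
    """Percentage of lost probes; each entry is True if a reply arrived."""
    return 100.0 * results.count(False) / len(results)

def gateway_state(loss, low=10.0, high=20.0):
    """Classify a loss percentage against a low (warning) and high (down) threshold."""
    if loss >= high:
        return "down"
    if loss >= low:
        return "warning"
    return "online"

def worst_window_loss(results, window):
    """Highest loss percentage seen in any run of `window` consecutive probes."""
    return max(
        loss_pct(results[i:i + window])
        for i in range(len(results) - window + 1)
    )

# 100 probes with 18 lost, all of them inside one congested stretch.
probes = [True] * 40 + [False, True] * 18 + [True] * 24

overall = loss_pct(probes)             # 18.0
worst = worst_window_loss(probes, 20)  # 50.0

print(overall, gateway_state(overall))  # 18.0 warning
print(worst, gateway_state(worst))      # 50.0 down
```

The long-run average stays under the "down" threshold, but the monitor reacts to the recent probes, so a congested stretch is enough to mark the gateway down.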

kevindd992002

@cmb:

Your averaged out loss is upwards of 18%, you're definitely getting cycles where it's over 20%, and 20% will take down the WAN. Increase the loss threshold if that's normal behavior for your WAN. I suspect you either have shaping or limiters configured in such a way that you're de-prioritizing and dropping your monitor pings, or you have an issue of some sort with that connection if it gets that bad under load.

No shaping or limiters are configured. I guess it's just the normal behavior of our ISP, since I'm in the Philippines. So the packet loss there can translate to a failed "probe" for the member down criterion?

kejianshi

Mine here is Globe DSL. For sure they do LOTS of really poorly executed traffic shaping.
Especially where UDP VPNs are concerned. Pretty much only TCP 80 and 443 are reliable.

kevindd992002

@kejianshi:

Mine here is Globe DSL. For sure they do LOTS of really poorly executed traffic shaping.
Especially where UDP VPNs are concerned. Pretty much only TCP 80 and 443 are reliable.

That's on their side. What we're talking about here is a simple probe of an IP address (in my case, public DNS servers). cmb is talking about traffic shaping on the pfSense side itself, not by the ISP.

kejianshi

Could be.
But I didn't necessarily see it that way.
I think traffic shaping on the ISP side done badly is just as bad.
BTW - I have the same problem as yours with one of these running in Texas on Time Warner Cable.
No shaping on pfSense. Definitely the ISP. Just crap latency. Terrible network.
That's the one that I gave up on: I turned off gateway monitoring and things were then much improved.

BTW - My globe dsl router has a few things I had to change.
One of which was DDOS protection. Particularly "ping to death" protection.
That was screwing things up here.

kevindd992002

@kejianshi:

Could be.
But I didn't necessarily see it that way.
I think traffic shaping on the ISP side done badly is just as bad.
BTW - I have the same problem as yours with one of these running in Texas on Time Warner Cable.
No shaping on pfSense. Definitely the ISP. Just crap latency. Terrible network.
That's the one that I gave up on: I turned off gateway monitoring and things were then much improved.

Yeah, but I wouldn't want to disable gateway monitoring altogether, as failover won't work if you do that. Increasing the thresholds should fix this problem; it's a no-brainer.

kejianshi

Sounds good - Then they can mark all the apinger threads as "solved" (-;

kevindd992002

@kejianshi:

Sounds good - Then they can mark all the apinger threads as "solved" (-;

What do you mean all apinger threads solved? Why would they do that?

kevindd992002

@cmb, what value should I set for the packet loss threshold? Would 20/30 be a good try?

phil.davis

The packet loss threshold really depends on the typical link performance. For example, I have some links where if the link is being used heavily, a lot of the monitor pings get lost for whatever reason, but actually the link is passing traffic at full speed. (I probably should put some traffic shaping on that and give the ICMP some high priority and see if I can improve that behaviour…). So for those I even put 40%/50% so that the link is only declared down if it really gets bad.
For links in less remote places than mine, but with these kinds of symptoms when saturated with traffic, I guess that 20%/30% will be OK.
You really need to run a few downloads in parallel on clients and observe "normal" numbers, then set higher.
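
As a rough illustration of "observe normal numbers, then set higher", something like the sketch below could turn a handful of loss readings taken under load into a low/high pair. The function name, the 5-point margin, the rounding to a multiple of 5 and the 10-point gap are all arbitrary assumptions of mine, not a pfSense rule, and the sample readings are invented.

```python
import math

def suggest_loss_thresholds(observed_loss_pct, margin=5.0, gap=10.0):
    """Suggest (low, high) loss thresholds from loss readings taken under load.

    Takes the worst observed value, adds a safety margin, rounds up to the
    next multiple of 5, and places the high threshold `gap` points above it.
    """
    peak = max(observed_loss_pct)
    low = math.ceil((peak + margin) / 5) * 5
    low = min(low, 100 - int(gap))  # leave room for the high threshold
    high = min(low + int(gap), 100)
    return low, high

# Invented readings (percent loss) noted while running parallel downloads.
readings = [3, 8, 12, 14]
print(suggest_loss_thresholds(readings))  # (20, 30)
```

With a worst reading of 14% this lands on 20/30. A link that regularly loses more under load would push the pair higher, which is the situation where 40/50 makes sense.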

kevindd992002

Gotcha, thanks!

kejianshi

I think my gateways don't like being pinged all the time. You would think my gateway would allow as much traffic as I feel like sending, of any type I choose, up to my bandwidth limit. But I have now seen on two separate boxes that if I ping the gateway every second, my pings eventually get blocked and the gateway reads as down all the time.

I've seen it separately on one IPv4 and one IPv6 gateway on two totally separate boxes. I know this has nothing at all to do with pfSense and is just a case of the ISP being stupid, but in both cases pinging only every 10 seconds seems to result in me not getting blocked or having my pings dropped.

Separately, I also raised the thresholds as you described.

I have one IP I tried for gateway monitoring that allows 2 pings, drops the third, allows two more, drops the third… Consistently.
I'm not using that for gateway monitor, of course, but it took me a while to discover it and the behavior seems nonsense to me because, after all, why should pings from a legitimate source get thrown into bit heaven for no apparent reason?
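
Quick arithmetic on why an IP like that is useless as a monitor, written as a tiny sketch (the function name is made up):

```python
def periodic_loss_pct(allowed, dropped):
    """Steady-state loss for a target that allows `allowed` pings, then drops `dropped`."""
    return 100.0 * dropped / (allowed + dropped)

loss = periodic_loss_pct(allowed=2, dropped=1)
print(round(loss, 1))  # 33.3
```

A constant loss of about a third sits above a 20% down threshold, and above a raised 30% one as well, so the gateway would read as down even when the link is fine.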

All this silliness from those ISPs affects pfSense stability, but it's not pfSense's fault.

kevindd992002

Interesting. But in my case, the monitor IPs I'm using are DNS servers (OpenDNS's and Google's), so I don't think pinging them every second would cause the same behavior you're seeing, right?

kejianshi

Not sure - It's been plenty of time since my post, and slowing down my pings seems to have made a lot of difference on my gateways at least.