"Member Down" problem
-
Sometimes, when I download from Usenet over my tri-WAN setup, I get packet loss on one or more of my interfaces, which I think is normal with pfSense's default threshold settings. The thing is, I set up my Load_Balance and Failover routing groups with a trigger level of "Member Down", yet I still receive an email notification that the modem with the packet loss is being removed from both routing groups because it is "down".
My question is: why is packet loss translating to a "Member Down" event? By default, pfSense is set to probe every second and will only consider the member down if it gets 10 failed probes. I'm assuming pinging the monitor IP of each modem is the probing technique, right?
-
BUMP!
-
BUMP!
-
You're not gonna make too many friends doing that…
-
hehe - I was waiting for someone to make a viagra joke.
I wish I knew the answer to this problem, but I have no suggestions.
-
I'm sorry, I thought it's fine to do a daily bump?
-
I always thought "Member down" meant you had to take away the electrical signals on the physical port (unplug the cable, power off the thing at the other end of the cable…) for pfSense to consider the interface down.
I will also be happy to hear from someone who knows what the intended behaviour of "Member Down" is.
-
Well, there are "thresholds" in the advanced section of "System: Gateways: Edit gateway" that you can set for the member down feature. pfSense constantly probes the monitor IP THROUGH that specific interface and waits for replies before it considers the member down.
-
Does anybody have more ideas?
-
In my testing I was also under the impression that a "member down" event was only triggered by a physical interruption, i.e. the attached device was powered down or the cable was unplugged. That's why I usually choose "packet loss or high latency" when setting up my gateway groups. As far as I understand it, unplugging the cable certainly causes packet loss, so that usually covers both cases.
-
That is the first impression; it was mine too at first. But if you look at the threshold settings in the monitor IP settings, you'll see something like the information in the screenshot I've just attached here, and you'll realize that probing still happens first before a member is considered down.
-
Down == above the defined thresholds you have on the gateway for what should be considered down.
-
Chris thanks for the clarification … very good to know. I've definitely been misinterpreting this for a long time!
-
@cmb:
Down == above the defined thresholds you have on the gateway for what should be considered down.
Exactly my point. Any idea why the issue is happening on my end?
-
What does your System Logs > Gateways look like when this happens?
-
Nov 11 13:12:29 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 11 13:21:23 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 11 13:22:04 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 11 14:28:51 apinger: ALARM: WAN2_DHCP(8.8.4.4) *** delay ***
Nov 11 14:28:59 apinger: alarm canceled: WAN2_DHCP(8.8.4.4) *** delay ***
Nov 11 21:42:44 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:42:54 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:48:24 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:49:04 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:51:18 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:52:10 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:52:46 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 11 21:53:01 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 12 06:06:03 apinger: ALARM: WAN2_DHCP(8.8.4.4) *** loss ***
Nov 12 06:06:11 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 06:06:44 apinger: alarm canceled: WAN2_DHCP(8.8.4.4) *** loss ***
Nov 12 06:06:59 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 06:28:57 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 06:29:43 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 17:38:58 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** loss ***
Nov 12 17:38:59 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 17:39:38 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 17:40:58 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** loss ***
Nov 12 19:28:12 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:30:50 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:30:58 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:38:44 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:38:59 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:39:28 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:43:09 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 12 19:48:12 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 13 13:20:26 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** WAN1_DHCPdown ***
Nov 13 13:20:26 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** WAN3_DHCPdown ***
Nov 13 13:20:26 apinger: ALARM: WAN2_DHCP(8.8.4.4) *** WAN2_DHCPdown ***
Nov 13 13:23:35 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** WAN3_DHCPdown ***
Nov 13 13:23:36 apinger: alarm canceled: WAN2_DHCP(8.8.4.4) *** WAN2_DHCPdown ***
Nov 13 13:23:36 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** WAN1_DHCPdown ***
Nov 13 13:25:47 apinger: ALARM: WAN2_DHCP(8.8.4.4) *** loss ***
Nov 13 13:25:50 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 13 13:26:33 apinger: alarm canceled: WAN2_DHCP(8.8.4.4) *** loss ***
Nov 13 13:26:34 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 15 04:28:55 apinger: Starting Alarm Pinger, apinger(23592)
Nov 15 04:28:59 apinger: SIGHUP received, reloading configuration.
Nov 15 04:29:00 apinger: SIGHUP received, reloading configuration.
Nov 15 04:29:03 apinger: SIGHUP received, reloading configuration.
Nov 15 17:22:49 apinger: ALARM: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 15 17:22:51 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 15 17:23:01 apinger: alarm canceled: WAN3_DHCP(208.67.222.222) *** delay ***
Nov 15 17:23:03 apinger: alarm canceled: WAN1_DHCP(8.8.8.8) *** delay ***
Nov 15 17:23:14 apinger: SIGHUP received, reloading configuration.
-
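For anyone reading a log like the one above, it can help to count how often each gateway raised each kind of alarm. Here is a minimal Python sketch that does so. The line format is taken from the excerpt above. The function name `tally_alarms` is made up for illustration, and treating entries such as `WAN1_DHCPdown` as a "down" alarm is my assumption based on how they appear in the log.

```python
import re
from collections import Counter

# Matches lines such as:
#   Nov 11 13:21:23 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
# "alarm canceled" lines are skipped on purpose because the match is case-sensitive.
ALARM_RE = re.compile(
    r"apinger: ALARM: (?P<gw>\S+?)\((?P<ip>[^)]+)\) \*\*\* (?P<kind>\S+) \*\*\*"
)


def tally_alarms(lines):
    """Count raised alarms per (gateway, kind). Hypothetical helper."""
    counts = Counter()
    for line in lines:
        m = ALARM_RE.search(line)
        if not m:
            continue
        kind = m.group("kind")
        # Assumption: entries like "WAN1_DHCPdown" mean the gateway went down.
        if kind.endswith("down"):
            kind = "down"
        counts[(m.group("gw"), kind)] += 1
    return counts


sample = """\
Nov 12 06:06:03 apinger: ALARM: WAN2_DHCP(8.8.4.4) *** loss ***
Nov 12 06:06:11 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** loss ***
Nov 12 06:06:44 apinger: alarm canceled: WAN2_DHCP(8.8.4.4) *** loss ***
Nov 13 13:20:26 apinger: ALARM: WAN1_DHCP(8.8.8.8) *** WAN1_DHCPdown ***
""".splitlines()

for (gw, kind), n in sorted(tally_alarms(sample).items()):
    print(gw, kind, n)
```

Run against the full log, this makes it easier to see whether the notifications line up with loss alarms, delay alarms, or actual down events.
-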
How are your latency/loss settings configured in your gateway? What latency and loss is Status>Gateways showing when that happens, or check the quality RRD Graph (Status>RRD Graph) to see in the past.
-
@cmb:
How are your latency/loss settings configured in your gateway? What latency and loss is Status>Gateways showing when that happens, or check the quality RRD Graph (Status>RRD Graph) to see in the past.
My latency and loss settings in all three gateways are blank (default). What exact information do I need to check in the RRD graphs? There is a ton of information there.
EDIT: I've attached the RRD graph that I think is relevant. I just got another notification from pfSense that my WAN2_DHCP gateway went down, and it seems that the packet loss and latency at that time were quite high, but why would that affect the probing of the interface and cause it to be tagged as "down"?
-
I just turned off gateway monitoring on one of mine not long ago because it was more important that my pfsense work than that I have a pretty graph.
-
Your averaged-out loss is upwards of 18%, so you're definitely getting cycles where it's over 20%, and 20% will take down the WAN. Increase the loss threshold if that's normal behavior for your WAN. I suspect you either have shaping or limiters configured in such a way that you're de-prioritizing and dropping your monitor pings, or you have an issue of some sort with that connection if it gets that bad under load.
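-
To make the "down == above the thresholds" point concrete, here is a rough Python sketch of the decision as described in this thread. It is not pfSense's actual code. Only the 20% loss figure comes from the post above. The other default values, the names `Thresholds` and `classify`, and whether the comparison is "greater than" or "greater than or equal" are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    loss_warn: float = 10.0      # percent, assumed value
    loss_down: float = 20.0      # percent, the figure quoted in this thread
    latency_warn: float = 200.0  # milliseconds, assumed value
    latency_down: float = 500.0  # milliseconds, assumed value


def classify(loss_pct, latency_ms, t=Thresholds()):
    """Return 'down', 'warning' or 'online' for one monitoring window."""
    # Assumption: reaching the upper threshold counts as exceeding it.
    if loss_pct >= t.loss_down or latency_ms >= t.latency_down:
        return "down"
    if loss_pct >= t.loss_warn or latency_ms >= t.latency_warn:
        return "warning"
    return "online"


# An 18% average stays below the down threshold, but a 22% spike crosses it.
print(classify(18.0, 40.0))
print(classify(22.0, 40.0))

# Raising the loss thresholds (values picked arbitrarily) keeps the same spike
# from marking the gateway as down.
relaxed = Thresholds(loss_warn=20.0, loss_down=40.0)
print(classify(22.0, 40.0, relaxed))
```

The point is that with blank (default) settings, heavy downloads that push loss past the upper threshold are enough to mark the member as down, with no cable unplugged.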