Upstream unreachable but no ISP connection loss?

Andargor

Hello,

For several weeks, remote users connecting to an internal server through our pfSense 2.4.2 firewall have been losing their connections intermittently, several times per day. Several users are typically connected at once, and they all drop at the same time. LAN users are not disconnected from the server. A Nagios monitor on the internal network reports that the ISP gateway is unreachable at the same moment. For example (newest entries first, with recovery on the next poll):

Host Up [01-24-2018 05:15:21] HOST ALERT: Amazon-West;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 90.51 ms
Host Up [01-24-2018 05:15:11] HOST ALERT: Amazon-East;UP;SOFT;3;PING OK - Packet loss = 0%, RTA = 24.22 ms
Host Up [01-24-2018 05:15:11] HOST ALERT: videotron-gw;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 0.83 ms
Service Ok [01-24-2018 05:15:01] SERVICE ALERT: videotron-gw;PING-ISP;OK;SOFT;2;PING OK - Packet loss = 0%, RTA = 0.83 ms
Host Unreachable [01-24-2018 05:14:21] HOST ALERT: Amazon-West;UNREACHABLE;SOFT;1;CRITICAL - Host Unreachable
Host Unreachable [01-24-2018 05:14:21] HOST ALERT: Amazon-East;UNREACHABLE;SOFT;2;CRITICAL - Host Unreachable
Host Down [01-24-2018 05:14:11] HOST ALERT: videotron-gw;DOWN;SOFT;1;CRITICAL - Host Unreachable
Service Critical [01-24-2018 05:14:01] SERVICE ALERT: videotron-gw;PING-ISP;CRITICAL;SOFT;1;CRITICAL - Host Unreachable
Host Down [01-24-2018 05:14:01] HOST ALERT: Amazon-East;DOWN;SOFT;1;CRITICAL - Host Unreachable

Amazon-East/West are our datacenters, and videotron-gw is the ISP gateway. The firewall address is fixed IPv4, with a static default gateway.
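Since the alerts above are pasted newest first, it can help to sort them chronologically to see which check failed first. Here is a minimal sketch, assuming the semicolon-separated layout visible in the lines above (host;state;... for HOST ALERT, host;service;state;... for SERVICE ALERT). The names `LINE_RE` and `parse_alert` are made up for illustration and are not part of Nagios.

```python
import re
from datetime import datetime

# Matches the "[MM-DD-YYYY HH:MM:SS] HOST|SERVICE ALERT: ..." part of a pasted line.
LINE_RE = re.compile(
    r"\[(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] "
    r"(?P<kind>HOST|SERVICE) ALERT: (?P<fields>.*)"
)

def parse_alert(line):
    """Return (timestamp, host, state) for one alert line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%m-%d-%Y %H:%M:%S")
    parts = m["fields"].split(";")
    # SERVICE ALERT lines carry the service name before the state field.
    state = parts[1] if m["kind"] == "HOST" else parts[2]
    return ts, parts[0], state

# Three of the videotron-gw lines from the post, in the pasted (newest-first) order.
lines = [
    "Host Up[01-24-2018 05:15:11] HOST ALERT: videotron-gw;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 0.83 ms",
    "Host Down[01-24-2018 05:14:11] HOST ALERT: videotron-gw;DOWN;SOFT;1;CRITICAL - Host Unreachable",
    "Service Critical[01-24-2018 05:14:01] SERVICE ALERT: videotron-gw;PING-ISP;CRITICAL;SOFT;1;CRITICAL - Host Unreachable",
]

# Tuples sort by timestamp first, so this yields oldest-first order.
events = sorted(e for e in map(parse_alert, lines) if e is not None)
for ts, host, state in events:
    print(ts.isoformat(sep=" "), host, state)
```

Sorted this way, the first videotron-gw failure is the PING-ISP service check at 05:14:01, which is the timestamp to compare against the firewall's own logs.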

Initially, we thought it might be the ISP, which we investigated. However, looking at the pfSense monitoring (Status > Monitoring), there are no Quality issues reported. For example, for the same timeline as above (WANGW is videotron-gw):

Nagios does not report any internal loss of connectivity to the firewall, which rules out LAN issues. There is no other activity on the firewall at the times of the connection losses, and they occur at different times during the day. There are no interface errors, and traffic is light. The system CPU, memory and states are normal. Our only conclusion is that pfSense itself is intermittently blocking traffic for unknown reasons.

Does anyone have an idea how to resolve this or is this a bug?

Here are more graphs from Monitoring, for the same timeline as above:

Harvy66

Just making sure I'm reading this correctly. You said

However, looking at the pfSense monitoring (Status > Monitoring), there are no Quality issues reported.

then immediately after, you have a quality graph showing what looks like 100% packet loss around the time of the error log.

How is 100% loss not a quality issue?

Andargor

@Harvy66:

The Nagios alert is at 5:14, but the Quality graph shows a drop at 5:50, and both system clocks are synchronized. Unless the monitoring app is bugged and showing the wrong time?

Andargor

Looking at the system logs more closely, I am seeing a link down event at that time, strange!

Jan 24 10:14:08 kernel re1: link state changed to DOWN

(Note: the time in monitoring is local time, EST; in the system log it's UTC. 5:14 EST = 10:14 UTC.)
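To avoid doing that conversion by hand, here is a small sketch that turns a syslog-style UTC stamp into EST so it can be lined up with the Nagios times. It assumes a fixed UTC-5 offset (no daylight saving handling) and takes the year as a parameter, since the syslog line does not include one. The helper name `syslog_utc_to_est` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset EST (UTC-5). Assumption: no daylight saving adjustment is needed.
EST = timezone(timedelta(hours=-5), "EST")

def syslog_utc_to_est(stamp, year):
    """Parse a 'Jan 24 10:14:08' style stamp as UTC and convert it to EST."""
    naive = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return naive.replace(tzinfo=timezone.utc).astimezone(EST)

local = syslog_utc_to_est("Jan 24 10:14:08", 2018)
print(local.strftime("%m-%d-%Y %H:%M:%S %Z"))  # 01-24-2018 05:14:08 EST
```

The result, 05:14:08 EST, falls between the first Nagios PING-ISP failure at 05:14:01 and the videotron-gw host alert at 05:14:11.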

The firewall is connected to an ISP switch, to which the ISP's cable modem is also connected.

I've swapped cables and ports, and will monitor what happens.

Andargor

@Harvy66:

Argh, the monitor shows local time at the bottom, but the times on the graph are UTC! I was confused about the times there. Here's the correct graph, and yes, it seems the local link to the ISP went down. Narrowing the possibilities…