Help troubleshooting loss of connection

xthursdayx

Hi all, I've been using pfSense happily for about six months now on a Dell Optiplex 9020 with an Intel i350 quad-port NIC, an i5-4590 CPU (@ 3.30GHz), and 8GB of RAM. Recently, with the pandemic social isolation measures in place, there have been four people regularly working in my home and we've been experiencing regular internet outages. Sometimes we lose service once a day or once every other day, sometimes it is multiple times a day. Anecdotally, it does seem to be related to heavier usage (e.g. multiple people on video calls), but not always.

I am trying to figure out the source of the problem, but am not really sure how to diagnose it. I never had this problem previously, and my pfSense configuration hasn't changed recently, so I am not sure if the problem is just from more intense usage or what. I can usually fix the connection by resetting the states table or rebooting the pfSense box. I have two different LANs set up from two different ports on the i350 NIC, with a unique WIFI network associated with each. They both experience the same problems.

I am hoping for some advice about how best to go about diagnosing the cause of these outages and, hopefully, fixing the situation. I have ntopng installed but, to be honest, I'm not entirely sure how to use it to analyze this problem, other than just looking through the connections. Thanks very much in advance!

stephenw10

If it affects both internal interfaces it's probably WAN side. Can you still reach the pfSense webgui OK when this happens?

Can pfSense itself connect out? For example from Diag > Ping can it ping 8.8.8.8? Or google.com?

Check the system and gateway logs, it will probably show you what's happening there.

Steve

xthursdayx

Thanks for the reply. Yeah, I was thinking it might be the WAN side as well, since both interfaces are affected. However, if that is the case, I'm confused about why rebooting the firewall (and not necessarily the modem) should fix the problem.

When the outage happens I am still able to reach the pfSense webgui without any problem. If I recall correctly, the last time I tried to ping google.com from the webgui when the network was down, I was unable to connect. That was a while ago, though, so I will try it again the next time the problem arises.

I've tried looking at the system and gateway logs but didn't see anything that seemed to specifically indicate what happened, though I'm looking with relatively untrained eyes. When I check the gateway log for the time of recent outage I see entries like these:

Apr 9 08:02:07	dpinger		WAN_DHCP 99.230.192.1: sendto error: 64

(repeated hundreds of times)

and

Apr 9 08:11:35	dpinger		WAN_DHCP 192.168.100.1: sendto error: 65
Apr 9 08:07:46	dpinger		THURSDAY_VPN 8.8.8.8: Alarm latency 0us stddev 0us loss 100%
Apr 9 08:07:44	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 8.8.8.8 bind_addr 10.66.70.1 identifier "THURSDAY_VPN "
Apr 9 08:07:44	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.100.1 bind_addr 192.168.100.10 identifier "WAN_DHCP "
Apr 9 08:07:44	dpinger		THURSDAY_VPN 8.8.8.8: Alarm latency 0us stddev 0us loss 100%
Apr 9 08:07:42	dpinger		WAN_DHCP 192.168.100.1: Alarm latency 825281us stddev 0us loss 0%

Unfortunately I don't have a system log from that time because more than 2000 entries have filled the log since the last outage. I checked /var/log via ssh, but I didn't see a more complete archived log. I've now increased my log file size so that I'll have longer logs in the future.

One thing I'll mention is that I have been running a number of packages, including ntopng, Suricata, and pfBlockerNG (with DNSBL), as well as the unbound DNS resolver. This hasn't changed recently, and I feel like my hardware should be able to handle it. However, I read this troubleshooting guide which led me to run the command top -aSH from console to check CPU usage. From my Dashboard I can see that while memory usage is pretty high with all of those packages running (around 70-80%), CPU usage is usually quite low - usually around 5-15%. However, after checking Diagnostics > System Activity and running top -aSH via ssh I do see IRQ entries associated with my network card. For example:

PID USERNAME   PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND   
12   root       -92    -     0K   672K WAIT    0   1:52   0.00% [intr{irq272: igb1:que 0}]

That guide mentions that "if you observe an IRQ (interrupts) process for the network card, then the hardware you are using may be almost or completely saturated, or the NIC driver may need to be optimized." So is that possibly a problem? Again, my hardware seems more robust than what many pfSense boxes are running, so I figured I shouldn't have a problem, but now I'm a bit confused.

Thanks for the help!

stephenw10

Yeah unless you have some huge WAN, like 10Gbps, then that CPU should handle it. Those packages will increase CPU load significantly but you would see that under testing before it failed.

Those errors indicate a failure at low level. Error 64 -> the gateway stopped responding to ARP. Error 65 -> there is no route to the monitor IP.
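
For anyone scanning a long gateway log for these, here is a minimal Python sketch that tallies the dpinger "sendto error" lines by gateway and error number and attaches the FreeBSD meaning of each number. The function names (`summarize_sendto_errors`, `describe`) are made up for illustration, and the errno table is hardcoded with FreeBSD values on the assumption that the log comes from pfSense; the same numbers mean something different on Linux.

```python
import re
from collections import Counter

# FreeBSD errno values (pfSense is FreeBSD-based). Hardcoded on purpose:
# Python's errno module reports the values of the machine running the script.
FREEBSD_ERRNO = {
    64: "EHOSTDOWN (host is down, e.g. the gateway is not answering ARP)",
    65: "EHOSTUNREACH (no route to host)",
}

# Matches lines such as:
#   Apr 9 08:02:07  dpinger  WAN_DHCP 99.230.192.1: sendto error: 64
SENDTO_RE = re.compile(
    r"dpinger\s+(?P<gw>\S+) (?P<ip>\S+): sendto error: (?P<err>\d+)"
)


def summarize_sendto_errors(lines):
    """Count sendto errors per (gateway name, monitor IP, errno)."""
    counts = Counter()
    for line in lines:
        m = SENDTO_RE.search(line)
        if m:
            counts[(m["gw"], m["ip"], int(m["err"]))] += 1
    return counts


def describe(err):
    """Return the FreeBSD meaning of an errno, if it is in the table above."""
    return FREEBSD_ERRNO.get(err, f"errno {err} (not in this table)")


# Sample lines copied from the gateway log quoted earlier in this thread.
sample = [
    "Apr 9 08:02:07\tdpinger\t\tWAN_DHCP 99.230.192.1: sendto error: 64",
    "Apr 9 08:11:35\tdpinger\t\tWAN_DHCP 192.168.100.1: sendto error: 65",
    "Apr 9 08:07:46\tdpinger\t\tTHURSDAY_VPN 8.8.8.8: Alarm latency 0us stddev 0us loss 100%",
]

if __name__ == "__main__":
    for (gw, ip, err), n in summarize_sendto_errors(sample).items():
        print(f"{gw} {ip}: {n} x {describe(err)}")
```

Alarm lines are deliberately ignored here; only the sendto errors are counted.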

You should check Diag > Routes when you hit this. Is it missing a default route or using the wrong one?

If so, make sure the WANGW is set as default in System > Routing > Gateways rather than auto.
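
As a rough illustration of that check, the sketch below looks for a default route in routing-table text and compares it with the gateway you expect. It assumes the column layout printed by FreeBSD's `netstat -rn -f inet` (Destination, Gateway, Flags, Netif). The helper names and both sample tables are invented for this example, and the interface name `igb0` is a guess; only the 192.168.100.1 and 10.66.70.1 addresses come from the logs in this thread.

```python
def find_default_routes(netstat_output):
    """Return (gateway, interface) for every 'default' line in the table."""
    routes = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Destination Gateway Flags Netif [Expire]
        if len(fields) >= 4 and fields[0] == "default":
            routes.append((fields[1], fields[3]))
    return routes


def check_default_route(netstat_output, expected_gateway):
    """Classify the IPv4 default route as 'ok', 'wrong' or 'missing'."""
    gateways = [gw for gw, _ in find_default_routes(netstat_output)]
    if not gateways:
        return "missing"
    if expected_gateway in gateways:
        return "ok"
    return "wrong"


# Hypothetical routing tables, written by hand for illustration only.
SAMPLE_OK = """\
Destination        Gateway            Flags     Netif Expire
default            192.168.100.1      UGS        igb0
192.168.100.0/24   link#1             U          igb0
"""

SAMPLE_MISSING = """\
Destination        Gateway            Flags     Netif Expire
192.168.100.0/24   link#1             U          igb0
"""

if __name__ == "__main__":
    print(check_default_route(SAMPLE_OK, "192.168.100.1"))
    print(check_default_route(SAMPLE_MISSING, "192.168.100.1"))
```

Feeding it only the IPv4 table keeps an IPv6 default entry from being mistaken for a wrong IPv4 gateway.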

Steve

xthursdayx

Yeah, my connection is only a 1Gbps connection, not a 10Gbps. As I mentioned, I never really saw much strain on my CPU anyway, only much higher memory usage, so I figured that wasn't the problem. I was just a little nervous after I saw those irq entries in my system activity, even though they didn't seem to be using much or any CPU power.

I've been keeping ntopng and Suricata off, just in case. Do you think I should turn them back on? I haven't experienced any outages since having them off, but that could be coincidental, since the outages were sporadic anyway.

I checked Diagnostics > Routes and didn't see a "default" entry, so I set my WAN_DHCP gateway as default (IPv4 at least; I left IPv6 as auto since I don't use IPv6), just in case. I do have a few gateways, including a local OpenVPN gateway and a High Availability gateway group of three external VPN gateways, so perhaps that was the issue.

There's not much strain on the system right now since it's the weekend, but we'll see how things go when folks are back to work next week.

Thanks again for your help and advice!

stephenw10

The irq entries in top are the queues on the network cards as the OS services them to route traffic. That is also where the actual pf load is shown, so that is completely expected.
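
To separate idle queues from busy ones, one option is to filter the `top -aSH` output for interrupt threads and look at their WCPU column. The Python sketch below does that; the function name and the 50% default threshold are arbitrary choices for illustration, not a pfSense recommendation, and the sample line is the one posted earlier in this thread.

```python
import re

# Matches the tail of a top -aSH line, e.g.
#   ... 0.00% [intr{irq272: igb1:que 0}]
INTR_RE = re.compile(r"(?P<wcpu>\d+(?:\.\d+)?)%\s+\[intr\{(?P<name>[^}]+)\}\]")


def busy_interrupt_threads(top_output, threshold_pct=50.0):
    """Return (thread name, WCPU) for interrupt threads at or above the threshold."""
    busy = []
    for line in top_output.splitlines():
        m = INTR_RE.search(line)
        if m and float(m["wcpu"]) >= threshold_pct:
            busy.append((m["name"], float(m["wcpu"])))
    return busy


# Line copied from the top output quoted earlier in this thread.
SAMPLE_TOP = (
    "12   root       -92    -     0K   672K WAIT    0   1:52   "
    "0.00% [intr{irq272: igb1:que 0}]"
)

if __name__ == "__main__":
    # With the default threshold the idle igb1 queue is not reported.
    print(busy_interrupt_threads(SAMPLE_TOP))
    # A threshold of 0 lists every interrupt thread that was found.
    print(busy_interrupt_threads(SAMPLE_TOP, threshold_pct=0.0))
```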

Yeah, with several gateway entries I imagine the system selected a bogus default route. I'd be confident that was the cause here.

The only other thing, potentially, that might behave like that would be Suricata in in-line mode if you were running that on the WAN only.

Steve