Intermittent loss of connections
-
I am having intermittent, inconsistent connectivity disruptions on two different pfSense boxes and I cannot find a definite cause. Further, I cannot find anything conclusive in the logs to tell me where the problem is.
Sometimes a single workstation cannot access the internet while others continue to work without problems. Sometimes, for one user, web pages that are already open can still be browsed but new pages cannot be loaded. Sometimes only parts of a page fail to load, and sometimes everything stops working. This most often occurs under high traffic, but normal function does not usually resume when the traffic stops; a reboot is required. When this happens, the gateway logs often show:
dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr xx.xx.xx.xx bind_addr xx.xx.xx.xx identifier "WAN_DHCP
Less often:
dpinger WAN_DHCP xx.xx.xx.xx: Alarm latency 32231us stddev 22459us loss 22%
and sometimes even that alarm is subsequently cleared. Sometimes there is nothing from dpinger at all. In any case, this looks like recognition of the symptoms, not an explanation of the cause. Setting "Disable Gateway Monitoring Action" has no effect.
This usually occurs during heavy user traffic. An IPsec connection runs between two sites, each running 2.4.5 on similar hardware. The site with a 3215U CPU is far more susceptible than the site with an i5, although both show CPU use in the 5-15% range. I have adjusted settings for the igb hardware in /boot/loader.conf.local based on reading here (https://forum.netgate.com/topic/137835/suricata-inline-with-igb-nics/49) and some of the links listed there, as follows, but I can't tell whether any of it has made a difference. (Much of the hardware tweaking was an attempt to fix Suricata, which stops in inline mode although it seems to run without problems in legacy mode.)
kern.ipc.nmbclusters="1048576"
hw.pci.enable_msix=1
hw.em.msix=1
hw.em.smart_pwr_down=0
hw.em.num_queues=1
# https://suricata.readthedocs.io/en/suricata-4.0.5/performance/packet-capture.html#rss
# below this line is all from: https://calomel.org/freebsd_network_tuning.html
if_igb_load="YES"
hw.igb.enable_msix="1"
hw.igb.enable_aim="1"
hw.igb.rx_process_limit="100" # default
hw.igb.num_queues="3" # (default 0, queues equal the number of CPU real cores if queues available on card)
hw.igb.max_interrupt_rate="16000" # double default
coretemp_load="YES"
hw.intr_storm_threshold="9000" # default
if_em_load="YES"
hw.em.enable_msix="1"
hw.em.msix=1
autoboot_delay="-1"
net.isr.maxthreads="-1"
net.isr.bindthreads="1" # (default 0, runs randomly on any one cpu core)
# Larger buffers and TCP Large Window Extensions
net.inet.tcp.rfc1323=1
net.inet.tcp.recvbuf_inc=65536 # (default 16384)
net.inet.tcp.sendbuf_inc=65536 # (default 8192)
net.inet.tcp.sendspace=65536 # (default 32768)
net.inet.tcp.mssdflt=1460 # Option 1 (default 536)
net.inet.tcp.minmss=536 # (default 216)
# syn protection
net.inet.tcp.syncache.rexmtlimit=0 # (default 3)
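To see whether these tunables actually took effect after a reboot (and whether the box is running out of mbuf clusters or pinning one core, both common causes of stalls under load), something like the following can be run from a pfSense shell. This is just a diagnostic sketch; the igb/em sysctl names assume the drivers are loaded, and output will vary by hardware.

```shell
# Confirm the loader tunables took hold (missing ones print "unknown oid"):
sysctl kern.ipc.nmbclusters hw.igb.num_queues hw.igb.max_interrupt_rate net.isr.bindthreads

# Check for mbuf cluster exhaustion, which can stall traffic until reboot:
netstat -m | grep -i "mbuf clusters"

# See whether NIC interrupt load is concentrated on a single queue/core:
vmstat -i | grep -E 'igb|em'
```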
There is a lot of data constantly moving site to site, often saturating the upstream bandwidth. A traffic shaper (PRIQ/CoDel on each interface/queue) is used to lower the priority of the autonomous background data and raise the priority of VoIP; neither seems to have any problems, at least as far as I can see. Packet loss shown in the traffic shaper status was highly correlated with the outages. I have gradually increased the queue sizes (most default to 50) by 50 or 100 every time I see loss, and now it is usually 0. This has vastly improved reliability, but problems persist.
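The queue drops visible in the GUI can also be watched from the command line, which makes it easier to catch the exact moment loss starts during an outage. A rough sketch, assuming shell access to the firewall (queue names will match your shaper configuration):

```shell
# Verbose queue statistics, including dropped packet counts per queue:
pfctl -vs queue

# Sample every 5 seconds and timestamp it, so rising drop counters
# can be lined up against the time of a reported outage:
while true; do date; pfctl -vs queue | grep -E 'queue|dropped'; sleep 5; done
```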
I can't find the problem and I can't find anything in the syslog that is much help. I've even set up a Splunk server to sift through things, without much luck. There seems to be more in /var/log than what I see in the web interface, but I have not been successful in getting clog to work over SSH to copy the logs out as text. Can anyone tell me where to start looking?
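For what it's worth, pfSense 2.4.x stores its logs in circular (clog) format, so copying them off directly yields binary-looking junk; they have to be decoded on the box first. A possible approach (hostname is a placeholder; run the clog commands from the console shell or Diagnostics > Command Prompt):

```shell
# On the pfSense box: decode circular logs to plain text in /tmp
clog /var/log/system.log > /tmp/system.log.txt
clog /var/log/gateways.log > /tmp/gateways.log.txt
clog /var/log/filter.log > /tmp/filter.log.txt

# From your workstation: pull the plain-text copies down
scp admin@pfsense.example.com:/tmp/*.log.txt .
```

The decoded text files can then be fed to Splunk or grepped directly.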