Issue with high latency started suddenly last week
-
Hello,
We are currently running an HA setup with a lot of VLANs and CARP.
We are on 2.5.2 and have started arranging tests prior to a 2.7.x upgrade. It will impact a series of IPsec VPN links, as well as require client updates for OpenVPN users, so it needs to be planned well in advance.

The reason for the post is that we have been completely bombarded with high latency and packet drops over the last few days. Since the problem appeared suddenly, we are trying to establish what the cause may have been, but any amount of reviewing recent changes comes up blank.
My only thought is that particularly aggressive multicast traffic from one of our new networks has been forwarded and set off a cascade of replies bouncing off the firewall. Recently 32 new VLANs were established as we took over running an office space for different businesses on-site, so a separate interface was set up for each.
Today I found that a set of floating rules had been created for IPv4 and IPv6, to save on work setting up WAN rules for each network, causing all multicast traffic to also be forwarded, or at least attempted forwarded, to the WAN gateways. I removed these and set up rules on each individual subnet, limiting them to the interface net instead.
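For context, the per-interface rules boil down to roughly the following in pf.conf terms. The real rules live in the GUI and the interface name here is just a placeholder, so treat this as a sketch of the intent rather than the exact ruleset:

# Keep multicast from being pushed towards the WAN gateways,
# then only pass traffic actually sourced from the interface's own subnet.
block in quick on igb1.100 inet  from any to 224.0.0.0/4
block in quick on igb1.100 inet6 from any to ff00::/8
pass  in quick on igb1.100 inet  from igb1.100:network to any keep state
pass  in quick on igb1.100 inet6 from igb1.100:network to any keep state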
This lowered the constant traffic onto the firewalls; however, I now see an RTT of 12-20 ms and an RTTsd of 56-80 ms, with loss moving between 0 and 4%. This is a dual 1 Gbit link with low latency; to put it into perspective, the highest ping to 1.1.1.1 and Google DNS is 8 ms. The ping to the monitor IP is sub-1 ms even under high load.
Getting back to the issue: while it has settled down, I do see these spikes in ping times to our CPEs, which are the monitor IPs:
Reply from a.b.c.d: bytes=32 time=1ms TTL=254
Reply from a.b.c.d: bytes=32 time=2ms TTL=254
Reply from a.b.c.d: bytes=32 time=2ms TTL=254
Reply from a.b.c.d: bytes=32 time=1ms TTL=254
Reply from a.b.c.d: bytes=32 time=1ms TTL=254
Reply from a.b.c.d: bytes=32 time=4ms TTL=254
Reply from a.b.c.d: bytes=32 time=1ms TTL=254
Reply from a.b.c.d: bytes=32 time=226ms TTL=254
Reply from a.b.c.d: bytes=32 time=2ms TTL=254
Reply from a.b.c.d: bytes=32 time=2ms TTL=254
Pings from the firewall show the same: <1 ms, and then those spikes every once in a while, somewhere between 30 ms and 500 ms. We have ruled out internal switching through trial and error, testing pings across the network without going through pfSense: the result is a constant clean ping of 1-2 ms at most, with no deviation at all, even across multiple switches. There is no switch between the CPE and the firewall. The fail-over happens between the CPEs, which also check whether the firewall is responding.
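One way to see exactly when the spikes land, run from the firewall shell, is a timestamped ping loop along these lines (destination is a placeholder; sub-second intervals need root, which the shell on the box has):

# Prefix each ping reply with a wall-clock timestamp
ping -i 0.5 a.b.c.d | while read line; do
    echo "$(date '+%H:%M:%S') $line"
done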
Currently my observations have brought me to the state table search rate shown by pfinfo:
Status: Enabled for 5 days 13:44:13           Debug: Urgent

State Table                          Total             Rate
  current entries                    25770
  searches                     28717041580        59646.6/s
  inserts                        119215397          247.6/s
  removals                       119189640          247.6/s
That is a whopper of a number; the actual I/O doesn't justify nearly 60k searches/s.
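To check whether a handful of hosts are behind it, a rough one-liner over the state dump looks like this. As written it only handles IPv4 source addresses cleanly (IPv6 states have colons in the address), so take the field handling as an approximation:

# Count states per source address, highest first
pfctl -ss | awk '{print $3}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -20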
I found my colleague had blocked all fe00::/8 and ff00::/8 traffic in response to the rising traffic. While it may have patched the floating rule issue earlier, it caused the system to become completely overloaded. Cleaning the states smoothed it out to the current statistics, but the searches are still high.
I would love to hear any thoughts on this, besides the obvious "upgrade to the current version", as we are already working on that.
-
We found a link aggregation in a switch that looped rather than aggregating, as well as some ports that were untagged for multiple networks; these are now disabled. This has lowered the load, but not fixed the problem.
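To keep an eye on whether anything is still storming, watching per-interface packet rates from the firewall seems like the simplest check; something along these lines (interface name is a placeholder):

# Per-second packet, error and drop counters for one interface
netstat -I igb0 -w 1
# Or a live overview of all interfaces
systat -ifstat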
I saw this in the gateway log this morning:
Feb 28 04:04:32 dpinger 81222 WANGWv6_GC a:b:c::1: sendto error: 55
Feb 27 21:06:24 dpinger 80779 WAN_GlobalConnectGW a.b.c.193: sendto error: 55
Feb 27 21:06:24 dpinger 81222 WANGWv6_GC a:b:c::1: sendto error: 55
Feb 27 21:06:15 dpinger 81222 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr a:b:c::1 bind_addr a:b:c::3 identifier "WANGWv6_GC "
Feb 27 21:06:15 dpinger 80779 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr a.b.c.193 bind_addr a.b.c.195 identifier "WAN_GlobalConnectGW "
Feb 27 21:02:37 dpinger 40064 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr a:b:c::1 bind_addr a:b:c::3 identifier "WANGWv6_GC "
Feb 27 21:02:37 dpinger 36210 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr a.b.c.193 bind_addr a.b.c.195 identifier "WAN_GlobalConnectGW "
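As far as I can tell, errno 55 on FreeBSD is ENOBUFS ("No buffer space available"), i.e. the stack couldn't buffer the probe when dpinger tried to send it, which points at mbuf exhaustion or a choked NIC queue on the firewall itself rather than the upstream. A couple of things I'm checking for that:

# mbuf usage, and whether any mbuf requests have been denied or delayed
netstat -m
# per-interface error and drop counters
netstat -i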
Unlike before, it seems to affect CPU and RAM a lot too: around 8 GB of RAM in use and a load average of 4.29, 5.47, 5.91. Swap is unused.
Specs are:
Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
16 CPUs: 1 package(s) x 8 core(s) x 2 hardware threads

I did a state reset to clean up, but there was no change to latency.
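For the load itself, breaking it down per thread and per interrupt source seems like the next step, something like this from the shell:

# Per-thread view including kernel threads, to see if the time goes to interrupts/netisr
top -aSH
# Interrupt rates per device
vmstat -i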
I am starting to think this is merely traffic hitting a "threshold" that we haven't hit before, which is impacting the processing speed, yet we are not really processing any more data than we did previously, as far as I can see.
I am unsure how to proceed. Can anyone suggest good ways to troubleshoot and locate the source of the problem?