Issue with high latency started suddenly last week
-
Hello,
We are currently running an HA setup with a lot of VLANs and CARP.
We are on 2.5.2 and have started arranging tests prior to a 2.7.x upgrade. It will impact a series of IPsec VPN links, as well as require client updates for OpenVPN users, so it needs to be planned well in advance.

The reason for the post is that we have been completely bombarded with high latency and packet drops over the last few days. Since the problem appeared suddenly, we are trying to establish what the cause may have been, but any amount of reviewing recent changes comes up blank.
My only thought is that particularly aggressive multicast traffic from one of our new networks has been forwarded and set off a cascade of replies bouncing off the firewall. Recently 32 new VLANs were established as we took over running an office space for different businesses on-site, so a separate interface was set up for each.
Today I found that a set of floating rules had been created for IPv4 and IPv6, to save on work setting up WAN rules for each network, causing all multicast traffic to also be forwarded, or at least attempted forwarded, to the WAN gateways. I removed these and set up rules on each individual subnet, limiting them to the interface net instead.
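For context, the per-interface rules boil down to roughly the following in pf.conf terms. The real rules live in the GUI and the interface name here is just a placeholder, so treat this as a sketch of the intent rather than the exact ruleset:

# Keep multicast from being pushed towards the WAN gateways,
# then only pass traffic actually sourced from the interface's own subnet.
block in quick on igb1.100 inet  from any to 224.0.0.0/4
block in quick on igb1.100 inet6 from any to ff00::/8
pass  in quick on igb1.100 inet  from igb1.100:network to any keep state
pass  in quick on igb1.100 inet6 from igb1.100:network to any keep state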
This lowered the constant traffic onto the firewalls; however, I now see an RTT of 12-20 ms and an RTTsd of 56-80 ms, with loss moving between 0 and 4%. This is a dual 1 Gbit link with low latency; to put it into perspective, the highest ping to 1.1.1.1 and Google DNS is 8 ms. The ping to the monitor IP is sub-1 ms even under high load.
Getting back to the issue: while it has settled down, I do see these spikes in ping times to our CPEs, which are the monitor IPs:
Reply from a.b.c.d: bytes=32 time=1ms TTL=254
Reply from a.b.c.d: bytes=32 time=2ms TTL=254
Reply from a.b.c.d: bytes=32 time=2ms TTL=254
Reply from a.b.c.d: bytes=32 time=1ms TTL=254
Reply from a.b.c.d: bytes=32 time=1ms TTL=254
Reply from a.b.c.d: bytes=32 time=4ms TTL=254
Reply from a.b.c.d: bytes=32 time=1ms TTL=254
Reply from a.b.c.d: bytes=32 time=226ms TTL=254
Reply from a.b.c.d: bytes=32 time=2ms TTL=254
Reply from a.b.c.d: bytes=32 time=2ms TTL=254
Pings from the firewall show the same: <1 ms, and then those spikes every once in a while, somewhere between 30 ms and 500 ms. We have ruled out internal switching through trial and error, testing pings across the network without going through pfSense: the result is a constant clean ping of 1-2 ms at most, with no deviation at all, even across multiple switches. There is no switch between the CPE and the firewall. The fail-over happens between the CPEs, which also check whether the firewall is responding.
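One way to see exactly when the spikes land, run from the firewall shell, is a timestamped ping loop along these lines (destination is a placeholder; sub-second intervals need root, which the shell on the box has):

# Prefix each ping reply with a wall-clock timestamp
ping -i 0.5 a.b.c.d | while read line; do
    echo "$(date '+%H:%M:%S') $line"
done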
Currently my observations have brought me to the state table search rate shown by pfinfo:
Status: Enabled for 5 days 13:44:13           Debug: Urgent

State Table                          Total             Rate
  current entries                    25770
  searches                     28717041580        59646.6/s
  inserts                        119215397          247.6/s
  removals                       119189640          247.6/s
That is a whopper of a number; the actual I/O doesn't justify nearly 60k searches/s.
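To check whether a handful of hosts are behind it, a rough one-liner over the state dump looks like this. As written it only handles IPv4 source addresses cleanly (IPv6 states have colons in the address), so take the field handling as an approximation:

# Count states per source address, highest first
pfctl -ss | awk '{print $3}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -20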
I found my colleague had blocked all fe00::/8 and ff00::/8 traffic in response to the rising traffic. While it may have patched the floating rule issue earlier, it caused the system to become completely overloaded. Cleaning the states smoothed it out to the current statistics, but the searches are still high.
I would love to hear any thoughts on this, besides the obvious "upgrade to the current version", as we are already working on that.
-
We found a link aggregation in a switch that looped rather than aggregating, as well as some ports that were untagged for multiple networks; these are now disabled. This has lowered the load, but not fixed the problem.
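To keep an eye on whether anything is still storming, watching per-interface packet rates from the firewall seems like the simplest check; something along these lines (interface name is a placeholder):

# Per-second packet, error and drop counters for one interface
netstat -I igb0 -w 1
# Or a live overview of all interfaces
systat -ifstat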
I saw this in the gateway log this morning:
Feb 28 04:04:32 dpinger 81222 WANGWv6_GC a:b:c::1: sendto error: 55
Feb 27 21:06:24 dpinger 80779 WAN_GlobalConnectGW a.b.c.193: sendto error: 55
Feb 27 21:06:24 dpinger 81222 WANGWv6_GC a:b:c::1: sendto error: 55
Feb 27 21:06:15 dpinger 81222 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr a:b:c::1 bind_addr a:b:c::3 identifier "WANGWv6_GC "
Feb 27 21:06:15 dpinger 80779 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr a.b.c.193 bind_addr a.b.c.195 identifier "WAN_GlobalConnectGW "
Feb 27 21:02:37 dpinger 40064 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr a:b:c::1 bind_addr a:b:c::3 identifier "WANGWv6_GC "
Feb 27 21:02:37 dpinger 36210 send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 1 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr a.b.c.193 bind_addr a.b.c.195 identifier "WAN_GlobalConnectGW "
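As far as I can tell, errno 55 on FreeBSD is ENOBUFS ("No buffer space available"), i.e. the stack couldn't buffer the probe when dpinger tried to send it, which points at mbuf exhaustion or a choked NIC queue on the firewall itself rather than the upstream. A couple of things I'm checking for that:

# mbuf usage, and whether any mbuf requests have been denied or delayed
netstat -m
# per-interface error and drop counters
netstat -i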
Unlike before, it seems to affect CPU and RAM a lot too: around 8 GB of RAM in use and a load average of 4.29, 5.47, 5.91. Swap is unused.
Specs are:
Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
16 CPUs: 1 package(s) x 8 core(s) x 2 hardware threads

I did a state reset to clean up, but there was no change to latency.
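For the load itself, breaking it down per thread and per interrupt source seems like the next step, something like this from the shell:

# Per-thread view including kernel threads, to see if the time goes to interrupts/netisr
top -aSH
# Interrupt rates per device
vmstat -i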
I am starting to think this is merely traffic hitting a "threshold" that we haven't hit before, which is impacting the processing speed, yet we are not really processing any more data than we did previously, as far as I can see.
I am unsure how to proceed. Can anyone suggest good ways to troubleshoot and locate the source of the problem?