Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5

A Former User

@getcom I've not seen any indication that a fresh install of 2.4.5 is different than an updated one. I would do a fresh install of 2.4.5 if I had some indication that fresh installs don't have the issue. Otherwise going back to 2.4.4-p3 is, sadly, the way to go.

getcom

@jwj said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

VM's and bare metal both have the problem. How badly the problem is felt is dependent on how capable your system is. Anything that causes pfctl to run will cause the problem in my experience. I'm on good hardware so it's not a show stopper, more of an annoyance.

Hmm, me too. An eight-core D1541 Xeon CPU with 32GB RAM is a lot of power, but the system still has load peaks of up to 23 with 100% CPU and up to 80% memory consumption.
This is way too much. Response times are at their worst in these cases: there is zero packet loss but 5 seconds of latency. It sounds like a service that is needed for, e.g., packet inspection is blocked or overloaded. That would explain a) the high CPU usage, b) the memory consumption and c) the absence of packet loss. More traffic causes a higher load. This happens when pfB runs its updates, but it can also happen when user payload is higher. In my regular setup a pfSense can have 5-12 or more VLANs, and the pfS is the gateway for all of them. Normally I'm using trunk ports over 10GbE SFP+ laggs as parent interfaces for VLANs directly connected to Cisco core switches. If the pfS hangs, the complete infrastructure that routes through it has a problem, and this is what I have at the moment. It is traffic related, not hardware or packet related. Some lab tests with a stress test tool like Gatling shooting at endpoints in different subnets, with debugging enabled on the pfS, should shed some light on this.
The FreeBSD 11.3 changes include a lot of potential candidates for such issues: https://www.freebsd.org/releases/11.3R/relnotes.html#kernel-general.

tman222

Similar situation here as well: on a bare metal 4-core Xeon D-1518 based system with 16GB RAM running 2.4.5, latencies spike into the thousands of milliseconds for a few seconds, with load spiking as well, before things finally settle down. This happens every time I, e.g., change or update a firewall rule (and a reload occurs). No such issues were observed on 2.4.4-p3 or prior versions.
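To put numbers on spikes like these, one option is to leave a ping running from a LAN host while triggering a filter reload and then scan the captured output. The following is a minimal Python sketch, not a tool anyone in this thread used. It assumes the usual "time=... ms" field in the ping output; the function name, the 1000 ms threshold, and the sample lines are all hypothetical.

```python
import re

# Matches the "time=12.3 ms" field printed by common ping implementations.
TIME_RE = re.compile(r"time=(\d+(?:\.\d+)?)\s*ms")


def find_latency_spikes(ping_output, threshold_ms=1000.0):
    """Return (line_number, rtt_ms) for every reply at or above threshold_ms."""
    spikes = []
    for line_number, line in enumerate(ping_output.splitlines(), start=1):
        match = TIME_RE.search(line)
        if match:
            rtt = float(match.group(1))
            if rtt >= threshold_ms:
                spikes.append((line_number, rtt))
    return spikes


# Hypothetical sample for illustration only, not captured from a real system.
SAMPLE = """\
64 bytes from 192.0.2.1: icmp_seq=0 ttl=64 time=0.412 ms
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.398 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=1843.120 ms
64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=2210.554 ms
64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=0.405 ms
"""

if __name__ == "__main__":
    for line_number, rtt in find_latency_spikes(SAMPLE):
        print(f"line {line_number}: {rtt:.1f} ms")
```

Comparing the spike list from a run on 2.4.4-p3 with one from 2.4.5 under the same rule set would make the regression easier to describe in a bug report.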

getcom

@jwj said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

@getcom I've not seen any indication that a fresh install of 2.4.5 is different than an updated one. I would do a fresh install of 2.4.5 if I had some indication that fresh installs don't have the issue. Otherwise going back to 2.4.4-p3 is, sadly, the way to go.

From my perspective there is no indication of that either, but a fresh install could be done just to be sure. Since it also happens on freshly installed systems, the evidence points in another direction.
It is more likely that this is either a bug or a wrong/too small sysctl value in conjunction with a specific circumstance like a high TCP payload or traffic.

A Former User

Just did a fresh install of 2.4.5. No difference. Supermicro 5018D-FN4T.

What's left to determine is whether it is a tunable or something more fundamentally broken in FreeBSD 11.3, as mentioned by @getcom.

It's time for Netgate to put the 2.4.4-p3 images back online. Amazing, given the number of times it was stated that 2.4.5 would be ready when it was ready, better late than untested, and so on. I trusted them that when it was out it would be ready. I defended them when others whined about the time between releases. I will not make that mistake again.

bmeeks

I currently have no dog in this fight since I don't run pfBlockerNG-devel and I have not yet upgraded my SG-5100 to 2.4.5 (but I do plan to).

I found this Errata Notice in the Release Notes for FreeBSD 11/STABLE -- https://www.freebsd.org/security/advisories/FreeBSD-EN-20:04.pfctl.asc. Since several of you, and some posters in other threads, are complaining about high utilization from pfctl, the patch this errata notice talks about might be of interest.

A Former User

@bmeeks Thanks! I will add that the issue exists without pfblocker. pfblocker just adds a bunch of large aliases that exacerbate the issue.

bmeeks

Notice who is credited with the "fix" in the header of the posted Errata, and read what the fix was about: increasing table size limits (I think for IPv6 bogons, with the side effect of allowing bigger pfBlockerNG lists). I have not looked at the actual patch, so maybe it is not related, but it definitely might be worth looking into. Perhaps there are unintended adverse side effects from the change?

BBcan177

@jwj and others
Try to reduce the setting in: pfSense > Advanced > Firewall & NAT > Firewall Maximum Table Entries as per this thread:
https://www.reddit.com/r/PFSENSE/comments/fsx4mx/fix_for_tmprulesdebug28_cannot_define_table_pfb/

A Former User

So, we already have the "fix".

tman222

@BBcan177 said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

@jwj and others
Try to reduce the setting in: pfSense > Advanced > Firewall & NAT > Firewall Maximum Table Entries as per this thread:
https://www.reddit.com/r/PFSENSE/comments/fsx4mx/fix_for_tmprulesdebug28_cannot_define_table_pfb/

Thanks @BBcan177 - mine is currently set at the default (2 Million). Should it be reduced below that to help mitigate the load and latency spikes?

Thanks again.

BBcan177

@tman222
I have no idea. This is why I am asking users to try lower values and see how it responds.

bmeeks

@jwj said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

So, we already have the "fix".

It appears to me, from reading the FreeBSD 11/STABLE Release Notes, that the patch submitted for pfctl was accepted and incorporated. So I would assume that patch also made its way into pfSense-2.4.5.

It's possible that patch has some unintended side effects. And since the problems seem to be triggered by users doing things that cause pfctl to come into play (reloading filters or updating alias tables), that is more circumstantial evidence of a possible link to recent pfctl changes and the utilization/latency issue.

A Former User

@tman222 I just played about with this. Removed block bogons on my WAN. Tried various table size settings from 100000 up to what, for me, was the default: 4000000. Didn't notice any difference. Checked the system log for memory allocation issues; there were none at any setting.

BBcan177

@jwj
You need to run a "Filter Reload" and/or a Reboot to fully ensure it took effect.

A Former User

@BBcan177 I did, filter reload. It warns you if you increase it such that a reboot is required.

A Former User

@BBcan177 @bmeeks Thanks for all your help and involvement since 2.4.5 was released. Your packages are fine but you have done your best to help as many as possible.

nzkiwi68

I'm experiencing the same issues as reported here under my post "Upgrade HA cluster 2.4.4-p3 to 2.4.5 - persistent CARP maintenance mode causes gateway instability" https://forum.netgate.com/topic/151698/upgrade-ha-cluster-2-4-4-p3-to-2-4-5-persistent-carp-maintenance-mode-causes-gateway-instability

2 sites now running really badly. Both sites running 10Gbase-SR with multi VLANs on the 10 Gb interfaces.

SiteA - main problem
Cannot have both firewalls up, primary and backup. If you do, zero VPN traffic passes over direct traditional site to site IPSEC or over the VTI routed FRR interfaces.
Left with the backup firewall powered off and the site is working.

SiteB - main problem
Massive instability following a reboot, and it just carries on and on, with all three gateways on both the primary and secondary firewall going nuts. The firewalls stagger and drop packets. In the end left the backup firewall powered off and after about 10-15 minutes following a reboot, the gateways stop going offline and the firewall settles down and becomes stable.

A Former User

@bmeeks @BBcan177

I turned off bogons. I deactivated pfblocker. I set the max table size to 60000. Rebooted. The reboot was quick again, like 2.4.4-p3. I ping6 google.com from a lan side device while doing a filter reload. The ping times don't change by any meaningful amount.

Makes me wonder if the pfctl fix was overlooked when the release was built?

Set max table size to 100000. Reboot. Reboot is longer again. Turn on pfblocker and some ip lists that add up to more than 60000 but less than 100000. Problem returns. Latency when reloading the filters. Bigger tables bigger problem.

Edited to add: the max table size doesn't appear to make any difference, other than setting a hard limit on table size. The actual size of the aliases/tables is what triggers the problem.

It sure looks to me that anything over that sixty some thousand mark causes the issue. I could be wrong, wouldn't be the first time, but this sure looks like the issue.
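The observation above (trouble starts once the loaded aliases/tables grow past sixty-some thousand entries) can be checked against a given configuration by counting entries per list. This is a minimal Python sketch. It assumes plain-text lists with one address or network per line and "#" comments. The function names and the default mark of 65,535 are illustrative assumptions, not pfSense or pfBlockerNG settings.

```python
def count_entries(list_text):
    """Count address entries in one list, skipping blank lines and '#' comments."""
    count = 0
    for line in list_text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            count += 1
    return count


def summarize(tables, mark=65535):
    """Report per-table sizes, the combined total, and tables larger than mark."""
    sizes = {name: count_entries(text) for name, text in tables.items()}
    return {
        "sizes": sizes,
        "total": sum(sizes.values()),
        "over_mark": sorted(name for name, size in sizes.items() if size > mark),
    }


if __name__ == "__main__":
    # Synthetic data for illustration; the names and contents are made up.
    tables = {
        "small_list": "192.0.2.1\n# a comment\n\n192.0.2.2  # trailing note\n",
        "big_list": "\n".join("198.51.100.1" for _ in range(70000)),
    }
    print(summarize(tables))
```

Reporting both the per-table sizes and the total is deliberate: the posts here do not settle whether the trigger is one large table or the sum of all of them.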

bmeeks

The actual patch file can be accessed here: https://security.FreeBSD.org/patches/EN-20:04/pfctl.patch. What the patch does is remove the former arbitrary hardcoded limit of 65,535 (defined as PF_TABLES_MAX_REQUEST) and allows the use of a sysctl parameter instead. Deeper research into the other pf related source code would be required to determine if allowing that larger PF_TABLES_MAX_REQUEST value has an adverse impact.

Looking a bit further into what the patch actually does gives me a theory. The 65,535 number does not appear to be a limit on the number of IP addresses in a given table. It appears, instead, to be a limit on the number of tables or addresses you can add to the firewall during a single call to the corresponding ioctl() function. That limit was formerly hardcoded to 65,535. Now that a sysctl variable exists for customizing this limit, I can envision a scenario where a very high value for that sysctl overloads the other pf areas. In particular, you would be requesting "too many tables and/or addresses in a single ioctl() call". So this may well be why lowering the value improves performance! You are no longer "overloading" the other ioctl routines that actually create the tables or addresses in RAM.
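The theory above can be pictured with a toy model of how a per-request cap bounds the work handed over in one call. The Python sketch below is only an illustration of that idea. It is not pf or pfctl code, the function name is hypothetical, and whether pfctl splits or rejects oversized requests is not established in this thread.

```python
def split_into_requests(addresses, max_per_request=65535):
    """Split addresses into batches no larger than max_per_request."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    return [
        addresses[start:start + max_per_request]
        for start in range(0, len(addresses), max_per_request)
    ]


if __name__ == "__main__":
    # 100,000 placeholder entries standing in for a large alias table.
    addresses = list(range(100000))

    # With the former hardcoded cap, no single request exceeds 65,535 entries.
    capped = split_into_requests(addresses)
    print([len(batch) for batch in capped])

    # With a cap of 2,000,000, everything lands in one very large request.
    uncapped = split_into_requests(addresses, max_per_request=2000000)
    print([len(batch) for batch in uncapped])
```

Under this model, a low cap keeps each unit of work small, while a high cap lets a single request carry the whole table, which is the kind of "overloading" the theory describes.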