Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5

asan

I'm also affected.
HW: SG-4860

Whenever the pfctl process peaks at 100% CPU, ping latency is also very high.

Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=1125ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=1613ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=1190ms TTL=55
Reply from 9.9.9.9: bytes=32 time=5ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
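
For anyone who wants to pull the spikes out of a ping log like the one above, here is a minimal Python sketch. It assumes Windows-style "Reply from ... time=Nms" lines. The function name find_spikes and the 500 ms threshold are illustrative choices of mine, not anything that ships with pfSense.

```python
import re

# Matches the latency field of Windows-style ping replies, e.g.
# "Reply from 9.9.9.9: bytes=32 time=2ms TTL=55" (also "time<1ms").
REPLY_RE = re.compile(r"time[=<](\d+)ms")


def find_spikes(lines, threshold_ms=500):
    """Return (reply_number, latency_ms) for each reply slower than threshold_ms.

    reply_number counts only lines that contain a latency value, starting at 1.
    The 500 ms default is an arbitrary illustrative threshold.
    """
    spikes = []
    reply_number = 0
    for line in lines:
        match = REPLY_RE.search(line)
        if match is None:
            continue  # skip blank lines, headers, "Request timed out." etc.
        reply_number += 1
        latency = int(match.group(1))
        if latency > threshold_ms:
            spikes.append((reply_number, latency))
    return spikes


sample = [
    "Reply from 9.9.9.9: bytes=32 time=2ms TTL=55",
    "Reply from 9.9.9.9: bytes=32 time=1125ms TTL=55",
    "Reply from 9.9.9.9: bytes=32 time=2ms TTL=55",
    "Reply from 9.9.9.9: bytes=32 time=1613ms TTL=55",
]
print(find_spikes(sample))
```

Feeding it the full log together with timestamps of the pfctl peaks would make it easier to confirm that the two line up.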

Magma82

These CPU performance issues are very similar to those in another topic, on pfSense as a Hyper-V VM, which I have contributed to. As others have said, it is definitely not a pfBlocker issue.

Hyper-V performance
https://forum.netgate.com/topic/149595/2-4-5-a-20200110-1421-and-earlier-high-cpu-usage-from-pfctl/10

Physical server performance
https://forum.netgate.com/topic/151819/2-4-5-high-latency-and-packet-loss-not-in-a-vm/2

Some users have reported that assigning only 1 CPU to the VM resolves the problem. Would this suggest there is a multi-core issue with the build at this time?

Has anyone had any feedback from Netgate as yet?

Gektor

I wrote a few days ago that I was fine with pfBlocker on pfSense 2.4.5, but today I rebooted the pfSense VM and the problem returned. BUT I have found a temporary CRAZY workaround: I just need to manually update the pfBlocker feeds and the problem is gone (CPU and RAM usage is still much higher than on 2.4.4, but there is no 100% freeze every few seconds). 2.4.5 is a problematic version, and it gets more complicated with pfBlocker enabled on a fresh boot...

getcom

@Gektor said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

I wrote a few days ago that I was fine with pfBlocker on pfSense 2.4.5, but today I rebooted the pfSense VM and the problem returned. BUT I have found a temporary CRAZY workaround: I just need to manually update the pfBlocker feeds and the problem is gone (CPU and RAM usage is still much higher than on 2.4.4, but there is no 100% freeze every few seconds). 2.4.5 is a problematic version, and it gets more complicated with pfBlocker enabled on a fresh boot...

I do that after every update to be sure that everything is working. In my case it did not change anything.

Regarding my previous post about "Firewall Maximum Table Entries": reducing the value below 65535 does not change the behavior when something has to be changed.
It is not only this pfctl patch that is causing the issues. We have to check more patches.

A Former User

@cuco I have updated #10414 offering to help them reproduce the issue, which I can reproduce at will.

Krisbe

Same issue here on a bare metal cluster. After upgrading to 2.4.5 I see CPU spikes and unstable HA. After each change on the primary node, the secondary node becomes master for a second or two and then becomes a slave again. Why? Because the primary node doesn't send a heartbeat ping for 5 seconds while the filter reloads.
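
To illustrate why a stall of that length is enough to trigger a takeover, here is a rough Python sketch. It assumes the backup declares the master down after about three advertisement intervals plus a skew term (3 * advbase + advskew / 256 seconds), which is how CARP timing is commonly described. The function names and the example values (advbase 1, advskew 100) are hypothetical, so check them against your own VIP settings.

```python
def master_down_timeout(advbase_s=1, advskew=0):
    """Seconds of silence after which a backup assumes the master is gone.

    Assumption: timeout = 3 * advbase + advskew / 256, using the backup's
    own advbase and advskew values.
    """
    return 3 * advbase_s + advskew / 256


def backup_takes_over(stall_s, advbase_s=1, advskew=0):
    """True if a stall of stall_s seconds on the master outlasts the timeout."""
    return stall_s > master_down_timeout(advbase_s, advskew)


# Hypothetical backup node with advbase 1 and advskew 100.
print(master_down_timeout(1, 100))   # timeout in seconds
print(backup_takes_over(5, 1, 100))  # the 5 second filter reload stall
print(backup_takes_over(2, 1, 100))  # a shorter 2 second stall
```

Under these assumptions, any filter reload that blocks advertisements for longer than roughly 3 to 4 seconds would cause the brief master flip described above.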

I removed the values of "Firewall Maximum States" and "Firewall Maximum Table Entries" so that they fall back to the defaults (for me states = 3236000 and table entries = 200000). After a reload of the filter everything went back to normal (and after a reboot of both machines all is still good).

I now have pfBlocker removed and block bogon networks unchecked (if I check it, I have to increase table entries to 400000). When I increase to 500000 and enable bogon networks on the secondary node, the problem returns. When I uncheck bogon networks and remove the value of 'table entries' again, the problem is solved.

So for now what works for me as a temporary solution: don't increase table entries but use default values, don't enable bogon networks and don't enable pfBlocker.

A view of the CPU spikes on the master node. Around 9am I moved the table entries value to the default.

How that node looked over the last few days (updated on April 6 around 10am):

Woodsomeister

Any news on this?
The whole thing is a serious disaster, not a minor problem, in my opinion.

Gektor

I think we will not get any news on this issue until the final 2.5.0 version (which will not be released for at least a year, or maybe two).

ViniciusBr

@Woodsomeister said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

Any news on this?
The whole thing is a serious disaster, not a minor problem, in my opinion.

Unfortunately the current "solution" is to stay with the 2.4.4 version until we see a stable version.

digdug3

@Woodsomeister I agree. It's not a minor problem. It looks like at least all major virtual environments have this problem, and even physical machines with a larger network and/or IP table setup.

A Former User

It's not being ignored.

2-4-5-high-latency-and-packet-loss-not-in-a-vm

If I were a betting man I would bet it's a regression in pf in FreeBSD 11.3-STABLE. I don't see that getting priority from the FreeBSD project; 12.x is the priority. I hope I am wrong.

ScottCall

I can verify this is happening on my hardware-based system (XG-1541 in a CARP setup, purchased from Netgate).

I upgraded my standby to 2.4.5 and immediately saw the CPU spiking and bad ping results, so I did not update my primary. EDIT: as soon as I submitted this I had packet loss and 3-4 second pings from the standby, so I'm happy I didn't update the primary.

After updating the pfBlockerNG package, the ping/CPU on the standby router has returned to normal, but I'm a bit wary of proceeding with upgrading the primary.

Looking at the Redmine case and the referenced FreeBSD errata, I'm not sure if the errata is being referenced as a cause or a solution?

If it is a potential solution will an updated pfctl be made available?

Many thanks

EDIT: Also, I noticed the CPU spikes/packet timeouts were causing the secondary firewall to promote itself, so I had to disable CARP on it until this is sorted.

nzkiwi68

@ScottCall said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

EDIT: Also, I noticed the CPU spikes/packet timeouts were causing the secondary firewall to promote itself, so I had to disable CARP on it until this is sorted.

In my 2 cases I have simply had to turn off the backup firewall and switch off HA settings because it's just unusable.

Netgate XG-1537 HA pair
Netgate C-2758 HA pair

forbiddenera

I too have been having the same issues and as other users described it tends to be initiated by loading a web page.

I was getting 502: Bad Gateway when trying to visit the CP. I restored a recent config; no change. Then I restarted php-fpm from the console and it seems OK for now. But it's been coming and going since the upgrade to 2.4.5...!

I should note my setup is pretty simple; just a home setup with no HA or big lists of anything.

daNutz

I've finally managed to downgrade to 2.4.4.

Infective

So I read the entire thread. The problem, as far as I understand, is the new pfctl patch in 2.4.5. I am trying to go back to 2.4.4. The problem is I do not have a backup of 2.4.4. Can I create a backup of 2.4.5 and restore those configs to 2.4.4?
Also, where can I find the 2.4.4 version of pfSense, and do I just do a fresh boot from it and restore using the 2.4.5 backup?

jdeloach

@Infective said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

Also, where can I find the 2.4.4 version of pfSense

A search of the messages on this forum would have given you the answer to your question, but if you open a ticket with Netgate support, they will give you a link to the 2.4.4-p3 image so you can downgrade.

q54e3w

Are there any updates Netgate can share, or has shared, re possible solutions for the increased latency issues with 2.4.5? I'm evaluating options re sticking with it vs rolling back.

A Former User

@q54e3w said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

Are there any updates Netgate can share, or has shared, re possible solutions for the increased latency issues with 2.4.5? I'm evaluating options re sticking with it vs rolling back.

I don't think Netgate have said anything official. Your best solutions to try are the following:

Greatly reduce the number of entries in your firewall tables (by removing blacklists/blocklists etc.)
If virtualised, change to 1 vCPU. This seems to have fixed it for almost all Hyper-V users I've seen
Roll back to 2.4.4-p3

q54e3w

Thanks, I didn't think I'd missed anything but wanted to check. I'm not actually expecting any definitive updates, but an indication of confidence would be useful for me.
I'm running on metal and have stripped back as far as I can go re rules and anything that invokes updates, but even so, with a fairly busy albeit bursty firewall I see increased latency to around the 150-250ms level which starts to affect video call quality etc.