Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5



  • Firstly, a huge thanks to Netgate, the pfSense team and all the testers out there. I appreciate what you do and continue to do. Please don't read the following as a complaint, just feedback and my experience, in the hope that someone can help.

    Since upgrading to 2.4.5 I have noticed a large increase in memory usage. I noticed this because I used to happily run 2.4.4p3 in a 512M virtual machine; when booting 2.4.5, the firewall rules failed to load with out-of-memory errors. Increasing the VM's memory to 1G resolved this quickly enough. I can understand an increased memory requirement going from FreeBSD 11.2 to 11.3, so no major issues there.

    However the bigger problem is the large increase in CPU requirements. Again, this is a Virtual Machine (KVM / Proxmox 6.1). The VM is given 2 CPUs of an i5-5250U CPU and this has always ticked along at about 5%-10% usage with 2.4.4p3 and previous releases.

    You can see today at about 7am where I upgraded my pfSense instance:

    c6e20904-0773-40fe-af79-65637210612d-image.png

    There are a few reboots in there too, obviously, as I struggled to work out why pf wouldn't load my rules (the memory problem).

    You can see the big spikes though and those happen when I reload one of my OpenVPN clients, just by pressing the little reload icon on the services status widget.
    When I do this, pfctl goes up to 100% CPU and just gobbles it for about 4 minutes, though it seems to drop off a few times and then come back. For those 4 minutes, latency goes through the roof, and dpinger reports that all gateways are seeing packet loss and high latency.
    After 4 minutes, it comes right again.

    The only thing I think that's different about my pfSense from the average user is that I have 4 large ipfire blacklists I load, using the Aliases->URL feature.

    Does anyone else see similar? Even in normal usage (where ping times are fine and performance is good) the CPU usage is much higher than it used to be as you can see on my librenms graph.

    Is this standard and expected? Does anyone else see the big latency/CPU spikes and dropped/delayed packets when reloading services that cause pfctl to reload?

    Here are the rules I have on my WAN interface (a PPPoE interface):

    981b1168-1fdc-472b-9589-050a19e9617b-image.png

    Here are the aliases I am using:

    ff807cb0-b02f-4e5e-87ca-28ba5af6b6d5-image.png

    There are 18k entries in the L3 list, for example:

    {12:39}~ ➭ cat firehol_level3.netset| wc -l
    18290
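
    A raw line count also includes any comment or blank lines in the file. As a minimal sketch, assuming the list marks comments with `#` (as FireHOL `.netset` files generally do) and using a made-up helper name `count_netset_entries`, the number of actual entries that would end up in the alias could be counted like this:

    ```sh
    # Count the address/network entries in a blocklist file, skipping blank
    # lines and lines that start with '#'. (Helper name is hypothetical.)
    count_netset_entries() {
        grep -Evc '^[[:space:]]*(#|$)' "$1"
    }
    ```

    Usage would be `count_netset_entries firehol_level3.netset`; the result should be a little lower than the `wc -l` figure if the file carries a comment header.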
    

    But as mentioned, this all worked perfectly with 2.4.4p3. I could make any changes I wanted and the CPU wouldn't be swamped. It's as if pfctl has had a large performance regression.

    Anyone else see similar?

    Thanks!
    Edit to add: I am using vtnet interfaces (virtio ethernet interfaces) not e1000.
    Edit2: Hi Abzstrak.



  • I haven’t quite figured out what’s going on yet, but I upgraded a 2.4.4-p3 installation that was working fine, and the CPU is absolutely pegged with pfctl using 99% of the CPU nonstop, as well.

    I’m running OpenVPN but no blacklists in aliases that large.



  • Are you using KVM and/or virtio interfaces?



  • No. The pfSense instance is on Hyper-V (Windows Server 2019). The interfaces are synthetic NICs with VMQ. Been running this way for at least 4 or 5 months on 2.4.4 builds without ever seeing this kind of CPU usage.

    For what it’s worth, I killed the OpenVPN processes along with everything except “bare essentials” like dhcpd and unbound. Pfctl is still running away with all the CPU. The only way I can get it to stop is to roll back to my 2.4.4-p3 snapshot.



  • Interesting. Must be something to do with virtualisation then. Mine settles down after a while, does yours not? I take it a simple reboot doesn't fix the CPU problem?



  • I rebooted it twice so far. It hangs booting up from the point that it initializes the queues onward. Every step of the bootup after that takes far longer than it used to with 2.4.4-p3. The entire bootup process is over 2 minutes. With 2.4.4-p3, it’s about 40 seconds.

    And nope, it never settles down. I let it sit untouched for over 30 minutes at one point and pfctl was still churning away at 99%.



  • @xpxp2002 I tried upgrading this afternoon and saw the same thing: pfctl at 100%. I'm also using Hyper-V on Server 2019 and had to revert to a snapshot. I've been using pfSense under Hyper-V for years and this is the first time I've had to revert an upgrade.



  • @swinn Perhaps there is something about these virtualized instances that is a problem, as @muppet suggested. Looking through the release notes, I don’t see anything specifically calling out virtualization that seems like it would cause this. Unless simply going to FreeBSD 11 is the issue.

    I’ve also run pfSense on Server 2016 for years without issues prior to this. Ran into an issue with Server 2019 and receive segment coalescing causing weird packet drop issues when I first went to 2019, but once I disabled RSC the issue went away. But this is the first time a pfSense upgrade didn’t go smoothly for me either.



  • We haven't had a big FreeBSD jump though, 11.2p10 to 11.3

    I can't find any release notes either that say something major/odd has changed with Virtualization in 11.3



  • I’ve been using the snapshot to test individual settings and packages that seem like possible culprits. As I mentioned before, the bootup hangs on the first “configuring firewall...” so I’ve tried removing settings and packages that I expect would be initialized when the filter rule load is occurring, then performed the upgrade.

    So far, I’ve ruled out queues/limiters, pfBlocker-NG-devel, and Service Watchdog.



  • The only packages I have installed are:

    • Avahi
    • OpenVPN-Export

    Oh and I have fq_codel configured.



  • @muppet I also have both of those. I’m out of time for testing but either one of those could be the culprit.

    My first thought goes to Avahi trying to come up on an interface where it isn’t supported, but it could also be the OVPN export package struggling with the new version of OVPN.



  • Maybe avahi could cause problems, that I could understand.
    OpenVPN export isn't even called until you visit that page.



  • I upgraded six (6) pfSense production servers at the same time from 2.4.4_p3, and I had problems with connectivity. The ping time is very high, above 7,000 ms.

    I also tried upgrading my pfSense server at home from 2.4.4_p3, this time taking a snapshot on VMware first, and the problem is the same. The ping time is very high and browsing has a lot of problems.

    I restored the snapshot, and everything returned to normal.

    On all servers I have these packages installed:

    Open-VM-Tools
    openvpn-client-export
    squid
    snort
    zabbix-agent4

    I tried reinstalling all the packages, but the problem persists.



  • Same trouble after upgrading from 2.4.4 to 2.4.5 on Hyper-V (Windows Server 2019): 100% CPU usage (by the pfctl process), a long boot, and pfSense runs with spikes and hangs.
    It seems that 2.4.5 is not compatible with Hyper-V on Windows Server 2019.
    Maybe this is related:
    https://forum.netgate.com/topic/149595/2-4-5-a-20200110-1421-and-earlier-high-cpu-usage-from-pfctl/8



  • I'm having the same problem. I'm running 2.4.4-p3 on Server 2016 with Hyper-V. I tried upgrading my 2nd CARP node to 2.4.5 yesterday, but it pegged the CPU and never became stable. I reverted to that snapshot, shut it down and tried to upgrade my 1st CARP node, but hit the same problem. I've reverted both nodes to their snapshots.

    pfSense on Hyper-V has been rock solid up until now and all previous upgrades have been flawless.

    If I have time, I'll try installing a 2.4.5 VM from scratch to see if the problem occurs there too.



  • I did a clean reinstall, restoring the config from the upgraded system. The first boot was fast, then all the packages were restored (installed); after that the system got stuck at boot and lagged afterwards.
    Then I found the source of the problem: pfBlockerNG! When it's disabled, everything works well; after enabling pfBlockerNG the system lags badly.



  • @Gektor This is interesting. I had pfBlockerNG-devel installed on 2.4.4-p3. One of my earlier tests was to roll back to 2.4.4-p3, uninstall that package, then upgrade; and my system was still slow. Did you simply disable it, or uninstall the package?

    I will try this later today when I have an outage window.



  • Mine is pfBlockerNG version 2.1.4_21; with this setting everything works well:
    7574e7a6-a678-4ee0-b6d6-5da00e69d698-изображение.png
    I then disabled all GeoIP lists but enabled DNSBL and pfBlockerNG, and for now there are no problems with pfSense 2.4.5 on Hyper-V. The system goes "crazy" when GeoIP lists are enabled in pfBlockerNG.
    I have made a post about it; maybe it will be helpful:
    https://forum.netgate.com/topic/151726/pfblockerng-2-1-4_21-totally-lag-system-after-pfsense-upgrade-from-2-4-4-to-2-4-5



  • @Gektor I deleted all the installed packages:

    Open-VM-Tools
    openvpn-client-export
    squid
    snort
    zabbix-agent4

    and I disabled the non-priority OpenVPN links, and the system's connectivity was restored.



  • @gusfersa On another production server with the same packages installed, I only disabled the OpenVPN link to another pfSense 2.4.5 server, and the system's connectivity was restored.



  • I've noticed something similar in terms of memory usage, but in my case cpu nice dropped in half and otherwise everything else seems status quo.

    I'm not however noticing any latency outages or anything of that nature, but i've got plenty of free RAM so maybe that's the difference.

    memory usage



  • @digitalgimpus said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

    I've noticed something similar in terms of memory usage, but in my case cpu nice dropped in half and otherwise everything else seems status quo.

    I'm not however noticing any latency outages or anything of that nature, but i've got plenty of free RAM so maybe that's the difference.

    memory usage

    Same here, memory utilization spikes up from <20% before upgrade to 2.4.5 (w/all the same settings and packages) to 65-80% after upgrade.

    Miscreant isolated to pfBlockerNG-devel (when uninstalled, memory use goes back to <20%) - running on netgate amd64 hardware, 8gb ram.

    @BBcan177 any ideas on this, did this come up in the extensive testing done for 2.4.5? Any setting that could be tweaked (memory, feeds) or is this something that will require some coding/patching?



  • In quick testing here, it appears related to the pfBlocker "MaxMind GeoIP settings". Either deleting the key or checking the box "disable maxmind csv database updates" makes the pfBlocker pages respond near-instantly again and gets rid of the long boot hang. I'm assuming that hang is what breaks everything else, causing flapping in a loop as it keeps trying to reload because of high latency and other things!
    I haven't tested further than that and cannot guarantee that's the only issue at hand. This was tested on a minimally configured VM with almost no traffic, but it slows things way down in many functions.


  • Moderator

    @t41k2m3
    You are running on a physical machine and it looks like you are not experiencing any issues other than higher memory usage. That can be attributed to how many entries are in DNSBL, especially with TLD enabled. I assume it was the same as before but you didn't notice it. DNSBL in Unbound will create a pointer in memory for each domain and it can eat memory. Nothing I can do about that. The upcoming Unbound python integration will make a significant improvement in memory usage tho.


  • Moderator

    @taz3146
    Are you in a virtualized environment like the others in this thread? There seems to be some issue with pfctl (which is used to create and update the IP aliases for the firewall rules) under some virtualization software.
    I have tested with VMware ESXi and can't reproduce these issues. I've sent a message to the devs to see if they have any other guidance. Alternatively, set up a physical box with the same configuration and see if the problem exists without virtualization. Then we can at least narrow down the issue.
    The deselection of settings in the IP tab should have no effect on anything. When you save that page it just writes the settings to config.xml and nothing else. Probably you have something else happening in the background.
    I would also suggest that everyone review the system.log and the pfblockerng.log for any other clues.
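
    One rough way to see which tables pfctl is loading and how large they are is sketched below. `pfctl -s Tables` and `pfctl -t <table> -T show` are standard pfctl options; the `table_sizes` helper and the `PFCTL` override variable are made up for this example, and tables inside anchors may not be listed.

    ```sh
    # List every pf table with its entry count, largest first.
    # PFCTL can be overridden (e.g. for testing); it defaults to the real pfctl.
    table_sizes() {
        pfctl_cmd="${PFCTL:-pfctl}"
        "$pfctl_cmd" -s Tables | while read -r table; do
            # One line of output per table entry, so wc -l gives the entry count.
            count=$("$pfctl_cmd" -t "$table" -T show < /dev/null | wc -l | tr -d ' ')
            echo "$count $table"
        done | sort -rn
    }
    ```

    Run as root on the firewall with `table_sizes | head`. Tables holding tens of thousands of entries would be a reasonable first place to look when comparing 2.4.4-p3 with 2.4.5.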



  • @BBcan177 I agree. People blaming pfBlocker are missing the root cause of the problem, pfctl, not those apps/addons that feed it rules.



  • Just to add another data point : following upgrade to 2.4.5 from 2.4.4p3, I've noticed an increase in memory usage on a pfSense instance installed on a physical machine, but not any drastic increase in CPU usage. Memory usage jumped from ~7% to ~64% with no other changes bar the pfSense upgrade.

    Machine info : Intel J3160, 4GB DDR3, Dual Intel 82576EB NIC.

    Packages installed : openvpn-client-export, pfBlockerNG

    4c61e85d-a315-413e-a07e-28d0145b2e9b-image.png

    If any more info desired please just let me know.


  • Moderator

    @ScottishTom
    What version of the package? Would recommend devel and also try a reboot and see if that persists.
    Can also run these two commands to see what particular process is involved:

    ps auxwww 
    top -aSH
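
    To make the `ps` output easier to compare before and after the upgrade, it can be reduced to the largest processes by resident memory. A small sketch, assuming the usual `ps aux` column order (PID in column 2, RSS in column 6, command in column 11) and a made-up helper name `top_rss`:

    ```sh
    # Print the N processes with the largest resident set size (RSS) from
    # `ps aux`-style input on stdin, as "RSS PID COMMAND", largest first.
    # The first input line is assumed to be the column header and is skipped.
    top_rss() {
        n="${1:-10}"
        awk 'NR > 1 { print $6, $2, $11 }' | sort -rn | head -n "$n"
    }
    ```

    For example, `ps auxwww | top_rss 5` would list the five biggest memory users.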
    


  • @BBcan177 said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

    @t41k2m3
    You are running on a physical machine and it looks like you are not experiencing any issues other than higher memory usage. That can be attributed to how many entries are in DNSBL, especially with TLD enabled. I assume it was the same as before but you didn't notice it. DNSBL in Unbound will create a pointer in memory for each domain and it can eat memory. Nothing I can do about that. The upcoming Unbound python integration will make a significant improvement in memory usage tho.

    @BBcan177 you are correct as to the summary of the situation, including same DNSBL entries, TLD on, only memory usage spikes (and not some of the other issues that seem to appear in virtualized environments). Not sure I'm following the theory of the case though. Meaning, given all else is equal (same pfS settings, same packages and their settings) but for the addition of pfS 2.4.5, it would reasonably follow (in fact proven by process of elimination) that some combination thereof (pfS 2.4.5 and pfB code/settings/others) begot a context writ large favoring these types of issues on different platforms. In fairness, there may be other contributing factors than pfB, though in this particular case, that is ostensibly not the case.

    So, question is what could/should/would we do about it? Re: unbound, the python integration is listed as a new feature/change (i.e. not upcoming, but present) and the Unbound 1.9.6 seems to be compiled with support for python. If that was/is intended to be the help/fix, not sure that it is performing quite as hoped. Recognizing this is brand new and may need some burnishing, wanted to get it on the radar screen for you and pfS devs. Thanks for all your efforts.



  • @ScottishTom said in Increased Memory and CPU Spikes (causing latency/outage) with 2.4.5:

    Just to add another data point : following upgrade to 2.4.5 from 2.4.4p3, I've noticed an increase in memory usage on a pfSense instance installed on a physical machine, but not any drastic increase in CPU usage. Memory usage jumped from ~7% to ~64% with no other changes bar the pfSense upgrade.

    Machine info : Intel J3160, 4GB DDR3, Dual Intel 82576EB NIC.

    Packages installed : openvpn-client-export, pfBlockerNG

    4c61e85d-a315-413e-a07e-28d0145b2e9b-image.png

    If any more info desired please just let me know.

    This seems like virtually the same or similar setup and problem as previously described (with qualification that a process at fault was not yet identified/hypothesized).


  • Moderator

    @t41k2m3
    I posted above two commands that you can use to find what is using memory. Report back with what you find. I haven't spent much time with the release of 2.4.5 as things have been hectic. I haven't checked if the version of Unbound has changed from 2.4.3/4. That might be a reason if something has changed in the Resolver code.
    In regards to the upcoming Unbound python integration, what you see in the Resolver settings will allow for a future release to integrate with the Resolver. It's just the plumbing and nothing else. There is no Python integration released yet.



  • @BBcan177

    Hi, thanks for the prompt reply.

    Currently running version 2.1.4_21 as installed from pfSense's package manager.

    Output of ps auxwww sorted by memory usage:

    2d6a158e-ea39-4c98-b4f8-93c0ddf44ef4-image.png

    Output of top -aSH sorted by size

    37fd45f0-cf1f-4730-ae22-dfecdcb7d61a-image.png

    Reboot does not appear to change anything, will just go try the devel version. Hope that helps.


  • Moderator

    @ScottishTom
    Would recommend to uninstall pfBlockerNG and install pfBlockerNG-devel. Then see how that goes. Ensure "Keep settings" is enabled. You will need to re-enter the MaxMind key into the IP tab. Also best to reboot to clear Unbound.



  • @BBcan177

    Have done as requested, now running pfBlockerNG-devel 2.2.5_29.

    • uninstalled pfBlockerNG

    • installed pfBlockerNG-devel

    • rebooted

    • force-updated DNSBL as it was complaining about being out of date.

    • rebooted again

    Still appear to be at ~64-65% memory usage

    FWIW I'm not complaining at all personally, the software is working fine for me and I'm seeing packets being intercepted by the block lists. Just seems strange to have had an almost 10x memory usage increase.

    Appreciate the blocker software and your work on it very much, really simplifies things :)


  • Moderator

    @ScottishTom
    Thanks for reporting back. Will check it out tomorrow.



  • I'm getting the same thing with a Hyper-V host, 2 dedicated NICs, all offloading turned off.

    An interesting tidbit is that if I make any settings change and save it, it hangs for 20-30 seconds and then pins the CPU; even pings over the LAN to pfSense get no response. Then it slowly comes back.

    I have OpenVPN running but had no issues with 2.4.4. In the OpenVPN logs there is now an endless stream of: AEAD Decrypt error: bad packet ID (may be a replay): that weren't there before. The System time is correct.

    AES-NI On or off doesn't make a difference (even after restart)

    Internet is PPPoE if that makes a difference.

    Download speed is normal. Upload speed is about half of what it should be compared to 2.4.4.

    Note traffic shaper is enabled and I have some floating rules for Google Hangouts.



  • One thing to try would be to disable the Spectre/Meltdown mitigation, reboot, and see if that improves things.
    I wonder if improvements have been made to it in FreeBSD 11.3 but those improvements maybe don't play well with virtualisation.

    I'm not in a position to test at the moment, can anyone else give this idea a go?



  • @muppet I just disabled the meltdown mitigation, rebooted. Still have 100% CPU



  • I'm experiencing similar issues with 2.4.5. My CPU usage spikes to 100% anytime I access a website and causes an outage for about 20 seconds.

    Specs:
    Proxmox 6.1-8
    CPU: 8 vCPUs
    Memory: 8GB

    Packages:
    acme 0.6.5
    Avahi 2.1_1
    openvpn-client-export 1.4.20
    pfBlockerNG 2.1.4_21
    Service_Watchdog 1.8.6
    snort 3.2.9.10_2
    softflowd 1.2.6
    squid 0.4.44_15
    Status_Traffic_Totals 2.3.1

