2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl



  • Just a heads-up about a possible problem: I had version 2.4.5.a.20200110.1421 of pfSense installed as a (Generation 2) VM in Hyper-V. CPU usage would go to 100 percent and the system became very slow and unresponsive. On a couple of occasions I was able to get into the shell from the console and run top, which showed pfctl using lots of CPU time. It would eventually clear up about 10 to 15 minutes after boot, but it happened after every restart, whether the system was an upgrade or a fresh install. I have rolled back to 2.4.4_3 for now. I did notice a closed issue on this problem in the bug tracker. This has occurred with the 2.4.5 builds from at least the past four days or so, when I first gave the new snapshots a try.


  • Rebel Alliance Developer Netgate

    What packages do you have installed? Features enabled?

    I don't see any issues, open or closed, against 2.4.5 that were for high CPU in pfctl on any platform, nor any bugs relating to Hyper-V.

    When I boot a current snapshot, it goes to sitting idle as soon as everything has started, like it usually does. In the process list, pfctl is using 0.0%.

    Granted mine is in ESX, not Hyper-V, but I don't have anything running with Hyper-V to test against currently.



  • For packages, I have installed:

    iperf       3.0.2_2
    nmap        1.4.4_1
    ntopng      0.8.13_3
    pfBlockerNG 2.1.4_20
    snort       3.2.9.10
    

    I just tried installing from scratch with the latest 2.4.5 build, and got the same problem with default settings applied after the first initial setup connection to the Web interface, no other configuration restored, and no other packages installed. The only non-default I selected during install was choosing ZFS as a file system.

    When I try spinning up a test installation from scratch (but with the 2.4.4 pfSense firewall on the LAN at 192.168.1.1) and change the LAN IP address to 192.168.1.100 in the console, this problem does not reappear, and doesn't after a reboot either. When I try to restore my main firewall's configuration to this test install, but without the packages, everything succeeds, and the problem does not reappear. I have since restored everything, including packages, to this test install, and so far so good. Upgrading to the latest daily build also works without problems.

    Do you suppose this could be a problem with the hypervisor? The hypervisor is Microsoft Windows Server 2019, Hyper-V configuration version 9.0. Dynamic memory is enabled with 4096 MB initial, and two virtual switches, one WAN and one LAN are being used. The only thing I can think of that I did on the old installation's VM was upgrading the Hyper-V configuration version from 8.0 to 9.0.



  • Looks like I spoke too soon. The new test pfSense VM running 2.4.5 is now getting high CPU usage as well, but the high usage only persists for a long time when Hyper-V gives pfSense four virtual CPUs (the physical server has a single quad-core Xeon E-2124 without Hyperthreading). With only one virtual CPU, pfSense only uses a high amount of CPU when mounting the file system during startup.



  • I continue to encounter this problem with Hyper-V VM test installs running the latest RC version of 2.4.5, even with no other packages installed and nothing else configured, on both UFS and ZFS filesystems.

    I caught a glimpse of Diagnostics->System Activity during one of these instances:

    16355 root       101    0  8828K  4960K CPU1    1   0:27  96.00% /sbin/pfctl -o basic -f /tmp/rules.debug
        0 root       -92    -     0K  4928K -       0   2:08  92.97% [kernel{hvevent0}]
        0 root       -92    -     0K  4928K -       2   1:11  89.99% [kernel{hvevent2}]
    

    The rest were around zero.

    Should I limit my "production" pfSense VM (2.4.4p3 for just a home network, problem does not occur on this one) to only one virtual CPU before 2.4.5 is released? (If I give the VM four virtual CPUs, pfSense considers the four to be individual cores, not separate CPUs.)
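
    For anyone comparing System Activity snapshots like the one above, here is a minimal Python sketch that pulls the PID, WCPU and command out of top-style lines and lists the busy ones. It assumes the 11-column layout seen in the pasted output (PID, user, priority, nice, size, resident, state, CPU, time, WCPU, command). The names parse_top_lines and busy and the 50% default threshold are made up for illustration and are not part of pfSense.

```python
from typing import List, NamedTuple


class ProcLine(NamedTuple):
    pid: int
    wcpu: float      # percentage, e.g. 96.0
    command: str


def parse_top_lines(text: str) -> List[ProcLine]:
    """Parse top-style process lines; skip anything that does not fit the layout."""
    rows = []
    for line in text.splitlines():
        # Split into at most 11 fields so the command keeps its own spaces.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[0].isdigit() or not fields[9].endswith("%"):
            continue
        rows.append(ProcLine(int(fields[0]), float(fields[9].rstrip("%")), fields[10].strip()))
    return rows


def busy(rows: List[ProcLine], threshold: float = 50.0) -> List[ProcLine]:
    """Return the rows whose WCPU is at or above the (assumed) threshold."""
    return [r for r in rows if r.wcpu >= threshold]


# The three lines pasted from Diagnostics->System Activity in this post.
SAMPLE = """\
16355 root       101    0  8828K  4960K CPU1    1   0:27  96.00% /sbin/pfctl -o basic -f /tmp/rules.debug
    0 root       -92    -     0K  4928K -       0   2:08  92.97% [kernel{hvevent0}]
    0 root       -92    -     0K  4928K -       2   1:11  89.99% [kernel{hvevent2}]
"""

if __name__ == "__main__":
    for row in busy(parse_top_lines(SAMPLE)):
        print(f"{row.wcpu:6.2f}%  pid {row.pid}  {row.command}")
```

    Saved output from several boots can be fed through the same function to see whether pfctl and the hvevent kernel threads show up together each time.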



  • I do not have a lot of familiarity with Hyper-V, but I know that it can have NUMA or Virtual NUMA turned on by default, and I believe FreeBSD in general, and gateways/routers in particular, are not very happy with it. I would try disabling NUMA for that VM, ensure a 1:1 CPU ratio, and see if that makes a difference. That assumes it is even on, of course, as you seem to have a pretty good handle on your Hyper-V.



  • @jimp I noticed heavy CPU usage during boot with Hyper-V.

    The packages I have installed are:
    Acme certificate
    Avahi
    OpenVPN client

    I do have FQ-CoDel limiters set up.

    It seems to spend a lot of time at 100% CPU just after:

    load_dn_aqm dn_aqm PIE loaded




  • @jimp

    The update below seemed to fix the boot time delay:
    Current Base System: 2.4.5.r.20200211.0854
    Latest Base System: 2.4.5.r.20200212.1633

    This seems to have fixed the speed on boot. Went from about 2 minutes to 1 minute boot time.



  • Same troubles after upgrading from 2.4.4 to 2.4.5 under Windows Server 2019 Hyper-V: very long boot, and sometimes the pfctl process's CPU usage is almost 100% and the system becomes very slow and unresponsive.



  • Same issue.





  • No. It has a serious problem, even at boot. It is extremely slow and stuck for minutes on "Firewall configuration," even with a clean install on a freshly installed OS. I think this new version has a problem with Hyper-V.



  • Same issue on Hyper-V. The only workaround is to drop the VM to 1 CPU.



  • I can also confirm this on Hyper-V Server Core 2019 (Dell T20 Xeon). After upgrading my 2.4.4 VM to 2.4.5, I now have extremely high CPU at boot when all 4 of my cores are assigned to the pfSense VM. It seems to be caused by the pfctl process, and it consumes the boot process (up to 10 minutes to actually boot past configuring interface WAN > LAN etc).
    After reducing the VM to 2 cores, it still uses high CPU at boot (2 minutes to boot), but I can at least use the internet and WebGUI (CPU under 20%, which is still high for my setup, usually 1 - 3%).

    I can also see that with the new OpenVPN version, when my WAN is in use (only 4mb/sec down), the openvpn process shows 23% CPU in top, which is unusually high.

    It seems the OS is not optimised for Hyper-V usage, so I am reverting to 2.4.4 and hoping this thread on the forum will be updated by the Netgate devs.



  • @Magma82
    I have now been running pfSense 2.4.5 with 2 CPU cores under Hyper-V Server 2019 for almost 2 days, with no high CPU usage at all after disabling the pfBlockerNG GeoIP lists.





  • I am seeing this too. Running Hyper-V on a Dell R720. VM had 4 CPU assigned. No packages installed. CPU on the VM would spike to 100% for a few minutes at a time, then drop to normal briefly (no more than 30 seconds or so), then back to 100%.

    Following recommendations earlier in the thread, I dropped the VM down to 1 CPU and that made everything operate normally again, as far as I can tell. Because it's not really a busy firewall, this is no real issue for me to have 1 CPU. Therefore I don't have a really urgent need to roll back to 2.4.4. I'll stay where I am until a patch is issued.

    Sounds like the problem is specific to multiple CPUs on Hyper-V only.



  • I am seeing this too. We are running Qemu 4.1.1, Kernel 5.3 (KVM) and CPU emulation Skylake-Client.

    The problem started with 2.4.4, and an upgrade to 2.4.5 did not solve the issue.
    Workaround is to downgrade to one (1) core.



  • Netgate devs, CPU performance in Hyper-V is definitely broken in 2.4.5. Are there any Hyper-V integration tools or libraries that are perhaps missing from the OS build?

    Boot-up CPU with 4 cores assigned = a constant 100% from an early stage of the boot process, and the system is barely accessible once booted.

    OpenVPN = CPU is also considerably higher under load (as if the CPU isn't optimised for the VM).

    I have reverted to 2.4.4 and it's rock solid, with under 3% CPU in use, and it just feels a lot more optimised.

    Hardware
    stable 2.4.4-RELEASE-p3 (amd64)
    Intel(R) Xeon(R) CPU E3-1225 v3 @ 3.20GHz
    4 CPUs: 1 package(s) x 4 core(s)
    AES-NI CPU Crypto: Yes (active)
    Hardware crypto AES-CBC,AES-XTS,AES-GCM,AES-ICM



  • Same issue, Server 2019 and Hyper-V, no packages installed on custom HW (Ryzen 2700) after upgrade. Pegs CPU upon boot and is basically unusable.

    Set VM to 1 virtual processor to get it working but it is sub-optimal for OpenVPN clients. Even experimented with just assigning 2 virtual processors - it runs sluggish.

    Will look to revert to 2.4.4-p3 snapshot in the near future.

    Edit: since I had nothing to lose and this is in a test lab, I bumped up to 2.5.0 development (2.5.0.a.20200403.1017). 2.5.0 does not seem to have the Hyper-V CPU issue.



  • It's the same in a VM on vSphere. I run 32 cores on a test system and they all go to almost 100% shortly after boot.

    I noticed that the server's fans started spinning a lot harder, so I looked in the hypervisor and sure enough: almost 100% and not handling traffic at all.

    I was running 2.4.4 p3 with no issues until Suricata wouldn't start. Then I had to upgrade and it died.


  • LAYER 8

    I did a clean install on my ESXi with 4 CPUs,
    and upgraded from 2.4.4-p3 to 2.4.5 on another server running QEMU/KVM with 4 Westmere CPUs.
    Both have Suricata installed and never had such a problem. I'm unable to reproduce it in my test lab, so it must be down to some settings.



  • Same problem here too: Hyper-V 2016, version 2.4.5
    5GB RAM
    4 CPU
    pfBlockerNG

    Sits for ages on 'firewall' and also on DHCPv6 before booting; then it is really sluggish, with dropped packets galore.

    Dropped back to a single CPU and all is OK on 2.4.5.



  • Same problem: pfSense 2.4.4 installed on VMware ESXi, with Suricata, pfBlockerNG, Squid, SquidGuard and Lightsquid installed. After upgrading to 2.4.5 the latency went haywire. However, I've managed to resolve my problem: I reduced the VM from 8 vCPUs to 1 vCPU and then did the upgrade to 2.4.5. Everything worked fine except that Suricata wouldn't start, so I did a forced pkg reinstall. Everything worked after that, so I added an additional 3 vCPUs and it's been working fine ever since.



  • Same problem here but with a Proxmox VM on pfSense 2.4.5.
    2 CPU, 2 core
    8GB RAM
    NUMA disabled

    High CPU on "/sbin/pfctl -o basic -f /tmp/rules.debug" effectively killed my networks and VLANS, and both incoming WAN connections. pfSense would often crash and reboot automatically, which produces a crash report.

    Dropping to 1 CPU, 1 core fixes it but it's running hard due to my network. 2.4.4_3 ran just peachy!



  • @Uncle_Bacon Have you tried adding cpu later (after the upgrade)? I noticed that maximum vcpu is 4 before it starts going crazy.



  • @slim2016 said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

    @Uncle_Bacon Have you tried adding cpu later (after the upgrade)? I noticed that maximum vcpu is 4 before it starts going crazy.

    I have upped it to 8 so far and it runs pretty stable. Haven't noticed a crash report yet.



  • It doesn't work properly with more than one vCPU (in my experience)



  • @Cool_Corona You are right. I've just added a total of 8 vCPUs and gave it time to settle down after a boot, and it seems to stabilise itself after a short while.



  • @slim2016 The point is that it's completely unstable with more than one CPU (when it doesn't work), including dropped packets.

    It isn't acceptable to simply 'wait for it' to settle down. Also, boot times with multiple CPUs are orders of magnitude slower than they should be, which again is not acceptable for a firewall.

    If the root cause isn't determined, are you happy for the firewall to randomly drop packets and generally die?

    It's not happening for everyone, but it is a bug and it needs to be resolved.

    The silence from Netgate is deafening. I understand it's not happening on Netgate hardware. Does anyone have a subscription on a virtual machine that Netgate can address?



  • @timboau-0 I was responding to Cool_Corona



  • @timboau-0 said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

    @slim2016 The point is that it's completely unstable with more than one CPU (when it doesn't work), including dropped packets.

    It isn't acceptable to simply 'wait for it' to settle down. Also, boot times with multiple CPUs are orders of magnitude slower than they should be, which again is not acceptable for a firewall.

    If the root cause isn't determined, are you happy for the firewall to randomly drop packets and generally die?

    It's not happening for everyone, but it is a bug and it needs to be resolved.

    The silence from Netgate is deafening. I understand it's not happening on Netgate hardware. Does anyone have a subscription on a virtual machine that Netgate can address?

    It's happening on Netgate hardware as well. Those users are not fortunate enough to have the workaround of reducing the number of cores, as VM users do.

    Reducing it to 1 core and getting it up and running stably is no problem. Then add cores as you like.

    Yes, the boot time is quicker with 1 core than with 8 cores.

    Yes, I would like it to be resolved as well. I think it's a BSD issue and therefore needs to be forwarded upstream in the BSD ecosystem.

    I am running 8 cores as of now and no issues so far.



  • @slim2016 I haven't tried that. Unfortunately my backups don't run as deep as they should so I have no 2.4.4 backup. I am going to try a fresh install and restore config from 2.4.5 to see if that helps. Thank you for the suggestion. It's nice to have the ability to create/re-create as many instances of it that I want. I'll post back.



  • @Uncle_Bacon I haven't used Proxmox for many years, and when I did it was only for a short while. With ESXi you just create a snapshot before you upgrade or update, and if something goes wrong you restore the snapshot.



  • @Uncle_Bacon said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

    @slim2016 I haven't tried that. Unfortunately my backups don't run as deep as they should so I have no 2.4.4 backup. I am going to try a fresh install and restore config from 2.4.5 to see if that helps. Thank you for the suggestion. It's nice to have the ability to create/re-create as many instances of it that I want. I'll post back.

    Install everything with 1 CORE only! After you are done and the backup is on the box, then reboot, install packages, reboot and upgrade number of cores.



  • @slim2016 It's the same for Proxmox but I guess I need to get back in to the habit of doing that before any updates, especially to pfSense.

    @Cool_Corona Done and done. Back up and running and have my metric server monitoring and will notify of any issues that may arise. Fingers crossed!



  • @Uncle_Bacon Keep us updated about the stability of your system.



  • I just noticed that pfSense doesn't start properly after installing arp-watch: some of the services wouldn't start. After removing it, everything started fine.



  • So just a quick update.

    For some context, this problem originally only started when one of my WAN connections dropped and pfSense failed over to the other. So I went to do some testing. I disconnected one gateway and it switched to the other with moderate CPU use and then settled to normal levels. Upon reconnecting it, however, pfSense switched back to my main WAN and I noticed the pfctl process running at high CPU, along with all sorts of notifications about that from my metrics. Latency on all of my network connections increased tenfold as well, and barely anything was getting through the network.

    If I recall when I installed this newest update, the router ran fine for a while, at least since the start of April before this problem came up more recently.

    I suspect one of three things: a) my reinstall and restore from a 2.4.5 config was not successful, b) the issue isn't solved by only adding more CPUs/cores after the upgrade, or c) my configuration is unique/flawed.

    I'll keep posting updates as they come in.



  • pfsense environment:
    VM - hyper-v gen2 (uefi)
    zfs
    4-cpu
    6GB-RAM
    40GB-hdd
    VM on windows server 2016.

    I upgraded this pfSense from 2.4.4p2 to 2.4.5. In the upgraded VM instance, clicking "Filter Reload" causes a full outage of pfSense's OS for 20 seconds: all 4 cores spike to 100% CPU and no network activity gets through. A console session watching 'top' also locks up. After the freeze I briefly see pfctl with very high CPU, but it goes away after the first top refresh following the un-freeze.

    Now the weird part, I then reset to factory defaults to eliminate my config and packages as the cause on 2.4.5. But the issue persists, even in a default config. Clicking "Filter Reload" gives total outage and cpu spike to 100%, console goes unresponsive, all traffic to all interfaces stop.

    Now I suspect the VM/Hardware has issues with 2.4.5. So in the same VM I re-install and re-format to 2.4.5 this time from ISO instead of upgrade. I do not import config. This newly installed 2.4.5 instance of pfsense on the same vm has no issues. Reset to factory default, still no issues. Able to filter reload any time without any CPU spike, nor outage, filter reloads instantly.

    In the clean install of 2.4.5, importing the config from the "bad install" also works without issue. The freeze issue doesn't re-appear after importing config.

    Also if I factory reset the VM on 2.4.4p2 to defaults, then upgrade to 2.4.5, the issue is not present. So it implies there's something in my config that causes permanent OS damage if upgraded to 2.4.5.
    Only a format/full-reinstall of pfsense can "clear" the issue once it's present.

    I've copied this vm to a few different Hyper-V servers, Win 10 1909, Server 2016, Server 2019, the issue occurs on all of them with this VM. I don't have install media for 2.4.4p2 (not sure it ever existed), so I can't accurately recreate the scenario on arbitrary vms. But I have two unique pfsense VMs in different parts of the country that have both exhibited this issue during upgrade to 2.4.5.

    The freeze issue manifests at seemingly random times too, but can always be re-created on demand with a "Filter Reload".

    Because a clean install works perfectly, I don't think there's any fundamental issue with 2.4.5. But during the upgrade process it's as if something isn't being installed or replaced correctly, and only in some instances with a particular config item.

    I have copies of the VM pre-upgrade (40GB) I can share for debugging if needed. Upgrading this vm always results in the pfctl freeze bug.

    I've now worked-around the issue by exporting config pre-upgrade, clean install 2.4.5 into same vm, import config. No issues in 2.4.5 on these clean installs. This implies there's something being missed by the installer during upgrade.

    Related issues:
    https://forum.netgate.com/topic/151949/2-4-5-new-install-slow-to-boot-on-hyper-v-2019/16
    https://forum.netgate.com/topic/149595/2-4-5-a-20200110-1421-and-earlier-high-cpu-usage-from-pfctl/25
    https://forum.netgate.com/topic/152131/hyper-v-vm-constant-100-cpu-load-in-2-4-5/5
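
    To put numbers on the "Filter Reload" freeze described above, a rough Python sketch like the following could time a command over a few runs. It assumes Python 3 is available wherever it runs. The pfctl command line in the usage note is the one that appears in the System Activity output earlier in this thread, and treating it as equivalent to the GUI's Filter Reload is an assumption. time_command and the script name are hypothetical. Running that pfctl command really reloads the ruleset and needs root.

```python
import subprocess
import sys
import time
from typing import List, Sequence


def time_command(argv: Sequence[str], runs: int = 3) -> List[float]:
    """Run argv `runs` times and return the wall-clock seconds of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    argv = sys.argv[1:]
    if argv:
        for i, secs in enumerate(time_command(argv), start=1):
            print(f"run {i}: {secs:.2f}s")
    else:
        # No command given: do nothing destructive, just show how to call it.
        print("usage: time_reload.py COMMAND [ARGS...]")
```

    For example: python3 time_reload.py /sbin/pfctl -o basic -f /tmp/rules.debug. Comparing the timings between an upgraded install and a clean install, or between 1 and 4 vCPUs, would show whether the rule load itself is what stalls.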

