2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl
-
Same issue, Server 2019 and Hyper-V, no packages installed on custom HW (Ryzen 2700) after upgrade. Pegs CPU upon boot and is basically unusable.
Set VM to 1 virtual processor to get it working but it is sub-optimal for OpenVPN clients. Even experimented with just assigning 2 virtual processors - it runs sluggish.
Will look to revert to 2.4.4-p3 snapshot in the near future.
Edit: since I had nothing to lose and this is in a test lab, I bumped up to 2.5.0 development (2.5.0.a.20200403.1017). 2.5.0 does not seem to have the Hyper-V CPU issue.
-
It's the same in a VM on vSphere. I run 32 cores on a test system and they all go to almost 100% shortly after boot.
I noticed that the server started spinning its fans a lot harder, looked in the hypervisor, and sure enough: almost 100% and not handling traffic at all.
I was running 2.4.4-p3 with no issues until Suricata wouldn't start. Then I had to upgrade and it died.
-
I made a clean install on my ESXi with 4 CPUs
and upgraded from 2.4.4-p3 to 2.4.5 on another server with QEMU/KVM with 4 CPUs (Westmere).
Both have Suricata installed and never had such a problem. I'm unable to reproduce it in my test lab, so it must be some setting.
-
Same problem here too: Hyper-V 2016, version 2.4.5
5GB RAM
4 CPU
pfBlockerNG
Sits for ages on 'firewall' and also DHCPv6 before booting. Really sluggish, dropped packets galore.
Dropped back to single CPU and all OK on 2.4.5
-
Same problem, pfSense 2.4.4 installed on VMware ESXi. I have Suricata, pfBlockerNG, Squid, SquidGuard and Lightsquid installed. After upgrading to 2.4.5 the latency went haywire. However, I've managed to resolve my problem: I reduced 8 vCPUs to 1 vCPU, then did the upgrade to 2.4.5. Everything worked fine except Suricata wouldn't start, so I did a forced pkg reinstall. Everything worked fine after that, then I added an additional 3 vCPUs and it's been working fine ever since.
-
Same problem here but with a Proxmox VM on pfSense 2.4.5.
2 CPU, 2 core
8GB RAM
NUMA disabled
High CPU on "/sbin/pfctl -o basic -f /tmp/rules.debug" effectively killed my networks and VLANs, and both incoming WAN connections. pfSense would often crash and reboot automatically, which produces a crash report.
Dropping to 1 CPU, 1 core fixes it, but it's running hard due to my network. 2.4.4_3 ran just peachy!
-
@Uncle_Bacon Have you tried adding cpu later (after the upgrade)? I noticed that maximum vcpu is 4 before it starts going crazy.
-
@slim2016 I have upped it to 8 so far and it runs pretty stable. Haven't noticed a crash report yet.
-
It doesn't work properly with more than one vCPU (in my experience)
-
@Cool_Corona You are right, I've just added a total of 8 vCPUs and gave it time to settle down after a boot. It seems to stabilise itself after a short while.
-
@slim2016 The point is it's completely unstable with more than one CPU (when it doesn't work), including dropped packets.
It isn't acceptable to simply 'wait for it' to settle down. Also, the boot times with multiple CPUs are magnitudes slower than they should be, again not acceptable for a firewall.
If the root cause isn't determined, are you happy for the firewall to randomly drop packets and generally die?
It's not happening for everyone, but it is a bug and it needs to be resolved.
The silence from Netgate is deafening. I understand it's not happening on Netgate hardware. Does anyone have a subscription on a virtual machine that Netgate can address?
-
@timboau-0 I was responding to Cool_Corona
-
@timboau-0 It's happening on Netgate hardware as well. Those users are not fortunate enough to have the workaround of reducing the number of cores, as the VMs do.
Reducing it to 1 core and getting it up and running stable is no problem. Then add cores as you like.
Yes, the boot time is quicker with 1 core than with 8 cores.
Yes, I would like it to be resolved as well. I think it's a BSD issue and therefore needs to be forwarded upstream in the BSD ecosystem.
I am running 8 cores as of now and no issues so far.
-
@slim2016 I haven't tried that. Unfortunately my backups don't run as deep as they should so I have no 2.4.4 backup. I am going to try a fresh install and restore config from 2.4.5 to see if that helps. Thank you for the suggestion. It's nice to have the ability to create/re-create as many instances of it that I want. I'll post back.
-
@Uncle_Bacon I haven't used Proxmox for many years and when I did it was for a short while. With Esxi you just create a snapshot before you upgrade or update and if something goes wrong you just restore the snapshot.
-
@Uncle_Bacon Install everything with 1 core only! Once the install is done and the backup is restored on the box, reboot, install packages, reboot again and then increase the number of cores.
-
@slim2016 It's the same for Proxmox but I guess I need to get back in to the habit of doing that before any updates, especially to pfSense.
@Cool_Corona Done and done. Back up and running and have my metric server monitoring and will notify of any issues that may arise. Fingers crossed!
-
@Uncle_Bacon Keep us updated about the stability of your system.
-
I just noticed that pfsense doesn't start properly after installing arp-watch, some of the services wouldn't start, after removing it everything started fine.
-
So just a quick update.
For some context, this problem originally only started when one of my WAN connections dropped and pfSense failed over to the other. So I went to do some testing. I disconnected one gateway and it switched to the other with moderate CPU use and then continued on to normal levels. Upon reconnecting it, however, pfSense switched back to my main WAN and I noticed the pfctl process running at high CPU, along with all sorts of notifications about that from my metrics. Latency on all of my network connections increased tenfold as well and barely anything was getting through the network.
If I recall, when I installed this newest update the router ran fine for a while, at least since the start of April, before this problem came up more recently.
I am not overly confident that either a) my reinstall and restore from a 2.4.5 config was successful, or b) the issue is solved just by adding more CPUs/cores after the upgrade. Or my configuration is unique/flawed.
I'll keep posting updates as they come in.
-
pfsense environment:
VM - hyper-v gen2 (uefi)
zfs
4-cpu
6GB-RAM
40GB-hdd
VM on Windows Server 2016.
I upgraded this pfSense 2.4.4p2 to 2.4.5. In the upgraded VM instance, clicking "Filter Reload" causes a full outage of pfSense's OS for 20 seconds: a 100% CPU spike on all 4 cores, and no network activity gets through. A console session watching 'top' also locks up; after the freeze I briefly see pfctl with very high CPU, but it goes away after the first top refresh following the un-freeze.
Now the weird part, I then reset to factory defaults to eliminate my config and packages as the cause on 2.4.5. But the issue persists, even in a default config. Clicking "Filter Reload" gives total outage and cpu spike to 100%, console goes unresponsive, all traffic to all interfaces stop.
Now I suspect the VM/Hardware has issues with 2.4.5. So in the same VM I re-install and re-format to 2.4.5 this time from ISO instead of upgrade. I do not import config. This newly installed 2.4.5 instance of pfsense on the same vm has no issues. Reset to factory default, still no issues. Able to filter reload any time without any CPU spike, nor outage, filter reloads instantly.
In the clean install of 2.4.5, importing the config from the "bad install" also works without issue. The freeze issue doesn't re-appear after importing config.
Also if I factory reset the VM on 2.4.4p2 to defaults, then upgrade to 2.4.5, the issue is not present. So it implies there's something in my config that causes permanent OS damage if upgraded to 2.4.5.
Only a format/full-reinstall of pfSense can "clear" the issue once it's present.
I've copied this VM to a few different Hyper-V servers (Win 10 1909, Server 2016, Server 2019), and the issue occurs on all of them with this VM. I don't have install media for 2.4.4p2 (not sure it ever existed), so I can't accurately recreate the scenario on arbitrary VMs. But I have two unique pfSense VMs in different parts of the country that have both exhibited this issue during upgrade to 2.4.5.
The freeze issue manifests at seemingly random times too, but can always be re-created on demand with a "Filter Reload".
Because a clean install works perfect, I don't think there's any fundamental issue with 2.4.5. But during the upgrade process, it's like something isn't being installed/replaced correctly, but only in some instances with a particular config item.
I have copies of the VM pre-upgrade (40GB) I can share for debugging if needed. Upgrading this vm always results in the pfctl freeze bug.
I've now worked-around the issue by exporting config pre-upgrade, clean install 2.4.5 into same vm, import config. No issues in 2.4.5 on these clean installs. This implies there's something being missed by the installer during upgrade.
Related issues:
https://forum.netgate.com/topic/151949/2-4-5-new-install-slow-to-boot-on-hyper-v-2019/16
https://forum.netgate.com/topic/149595/2-4-5-a-20200110-1421-and-earlier-high-cpu-usage-from-pfctl/25
https://forum.netgate.com/topic/152131/hyper-v-vm-constant-100-cpu-load-in-2-4-5/5
-
I can't get 2.4.5 installed in a VM.
It always says no boot loader.
To get pfSense going, I needed to use 2.4.4p3 and then upgrade.
I haven't tried installing it on a 1-core VM yet. Only 32 cores, as it was before.
-
@Cool_Corona If using Hyper-V, disable "Secure Boot" to fix that issue, or switch to a Gen1 VM. The 2.4.5 ISO boots fine in a Gen2 Hyper-V VM when Secure Boot is disabled.
-
I run it on ESXi.... :)
-
Great suggestion, a fresh install of 2.4.5 on my Proxmox VM 2 sockets, 2 cores 8GB RAM. Should have tried it sooner.
Initial tests (filter reload, manual drop of gateways for failover) have yielded positive results. CPU spiked, pfctl did its thing and returned to running levels. There was no detectable loss of connection anywhere on the network, a seamless swap.
It would seem @carl2187 is on to something with the problem only existing from an upgrade and not a fresh install. Plus, my config was restored from an upgraded 2.4.4p3 to 2.4.5 system (as my non-upgraded 2.4.4 backups were quite dated). Boot speed appears to be comparable if not the same to 2.4.4 as well.
I'm going to run this for now, continue with load tests and see what happens.
-
@Uncle_Bacon Good to hear you're getting going again with a clean install. It seems like a clean install is the only way to go right now for 2.4.5.
There's something really wrong somewhere on the file-system or boot config post-upgrade to 2.4.5 that doesn't occur in a clean install of 2.4.5.
I have an upgraded-broken 2.4.5 and a clean install of 2.4.5 where both have now been factory reset. Upgraded VM has the issue, clean install does not have the issue, even after the factory reset. I'm going to try to compare file system contents and boot loader settings to see if something didn't actually get upgraded during the 2.4.4->2.4.5 upgrade process.
Just this morning I got a report of "intermittent" outages in a branch office that was upgraded to 2.4.5 about a week ago. I jumped on the GUI, did a filter reload, and lost all connection for about 20 seconds. That's all I need to see to know this is the botched upgrade condition. So I exported the config, re-installed clean 2.4.5, imported the config, and now things are perfect. I can't cause the issue on this VM anymore. Same VM, same pfSense version, same config, but upgrade to 2.4.5 = bad, clean install of 2.4.5 = good.
So I've done lots of tests, and now the first "production" issue I had to help with had exactly the same fix as the labs I've been testing in. Upgrades to 2.4.5 can cause full CPU/outages. Clean installs of 2.4.5 seem to work great with any imported config.
It's important for anyone else troubleshooting this issue to know that "clean install" really means full format and re-install from ISO. A factory reset after the issue is present is NOT enough to clear out this bug. Even a default config will have this problem after the upgrade and factory reset. It seems something is different between upgraded instances and clean installs of 2.4.5 at the file system or bootloader level.
In attempting to accurately re-create the scenario that leads to this problem, I found out I can't actually download old versions of pfSense. It really is too bad Netgate stopped allowing downloads of previous version ISOs; it makes troubleshooting this type of upgrade issue in a lab very difficult. The community has effectively been prevented from helping to troubleshoot the underlying issue here.
https://forum.netgate.com/topic/135395/where-to-download-old-versions
-
@carl2187 I think this might be a false flag because I did a fresh install on hyper-v and that was when I noticed the long reboot time started.
My old install, which was based on upgrading to the RCs and then finally the Release version, had fewer problems.
My fresh install hangs at “Mounting filesystems . . .” for 30 seconds. Net boot time averages about 2 minutes.
-
We need to dig deeper. ESXi, Hyper-V and Proxmox...
How do they differ in regards to this??
-
@IsaacFL In my use case, I don't consider 2 minutes to be very long at all. I don't get the 30 second mounting time like you. Most of my boot time is used by the firewall but I do have many rules on many VLANs. I timed it with a stopwatch and from power on to being able to login to the GUI was 1:20, and to boot complete was 2:01.
Specs on the host for reference:
CPU(s): 40 x Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz (2 Sockets)
RAM: 192GB
Kernel Version: Linux 5.3.18-2-pve (modified Ubuntu)
PVE Manager: pve-manager/6.1-8/806edfe1
-
Mounting filesystem has had 30 sec delay for many versions, especially if using zfs. This is not the issue I'm worried about nor reporting on and testing.
To find out if this upgrade bug affects you, go run a constant ping (-t) to your pfsense ip, then do a filter reload. If you get 20 seconds of no response, you have the upgrade bug.
This upgrade bug has been seen on Hyper-V and Proxmox, but it may not be specific to any one hypervisor at all. It may affect physical hardware too. Too early to say.
But my experience is that at 1 min 4 sec after boot the filesystem mount is done, and bootup is totally done at about 1:55. This is consistent across many of my deployed systems across many versions of pfSense, all using ZFS.
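The ping test described above can be made a little more objective by saving the ping output and measuring the longest run of missing replies. The following is only a sketch: `longest_gap` is a hypothetical helper name, 192.168.1.1 is a placeholder for your pfSense address, and the parser assumes Unix-style reply lines containing `icmp_seq=N` (the `-t` form mentioned above is the Windows syntax, whose output has no sequence numbers).

```shell
#!/bin/sh
# Report the longest run of consecutive missing ICMP replies in saved
# ping output. Assumes Unix-style reply lines containing "icmp_seq=N".
# Losses after the final reply are not counted.
longest_gap() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # "icmp_seq=" is 9 characters; the rest of the match is the number
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (seen && seq - prev - 1 > gap) gap = seq - prev - 1
            prev = seq
            seen = 1
        }
        END { print gap + 0 }
    '
}

# Example use (placeholder address and log path):
#   ping 192.168.1.1 | tee /tmp/ping.log    # do the filter reload, then Ctrl-C
#   longest_gap < /tmp/ping.log
```

With the default one-second ping interval, a result of around 20 would correspond to the roughly 20-second outage described above.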
-
We're narrowing in on what the problem is. It's easiest to trigger with pfctl loading large tables (not flushing or updating, but adding), though it does not actually seem to be pf itself to blame. Current theory is a problem with allocation of large chunks of kernel memory, but we're still debugging.
On an affected system you can trigger it manually by doing this:
Flush a large table first:
# pfctl -t bogonsv6 -T flush
Load the table contents (this is where the huge delay happens):
# pfctl -t bogonsv6 -T add -f /etc/bogonsv6
Compare with updating an existing table which is also fast by running the same command again:
# pfctl -t bogonsv6 -T add -f /etc/bogonsv6
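To compare the three steps above on different systems, it may help to time each one. This is a minimal sketch, not an official diagnostic: `time_step` and `repro` are hypothetical helper names, the timing has one-second resolution, and the pfctl invocations are exactly the ones listed above. Run `repro` as root, and expect the outage described in this thread on an affected system.

```shell
#!/bin/sh
# Run a command and print how many whole seconds it took.
time_step() {
    label=$1
    shift
    start=$(date +%s)
    if "$@"; then status=0; else status=$?; fi
    end=$(date +%s)
    echo "$label: $((end - start))s (exit $status)"
}

# Flush, load, then re-add the same table, timing each step.
# Defaults match the commands above; both arguments are optional.
repro() {
    table=${1:-bogonsv6}
    file=${2:-/etc/bogonsv6}
    time_step "flush" pfctl -t "$table" -T flush
    time_step "add into empty table" pfctl -t "$table" -T add -f "$file"
    time_step "add again (update)" pfctl -t "$table" -T add -f "$file"
}

# Example use (as root, on a test system):
#   repro
```

On an unaffected system all three steps should report roughly the same short time; on an affected one the second step is where the delay is expected to show up.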
Different systems are impacted in different ways. Hyper-V seems to be the worst, with the whole system becoming unresponsive for 30-60s even at the console. Proxmox is similar but not quite as bad. Some systems only see high pfctl usage and some slowness but are otherwise responsive. All that said, most of the systems in our lab show no symptoms whatsoever, which is why it's been difficult to track down.
-
Tested it on my ESXi but no problem.
Tested on Hyper-V under Windows 10: I have no problem with 1 CPU, but with 2 CPUs or more it completely freezes the VM.
The boot process is slow overall. It gets stuck for some time at "configuring firewall...", eventually completes, but then freezes again when it reaches the console menu.
-
@kiokoman Multiple CPUs appear to be one major contributing factor, but not all multi-core/multi-CPU configurations are affected. I don't think anyone has seen it on a single core system.
-
Same here with Hyper-V, server 2019.
6 cores, jimp's commands kill my firewall for 30-45 seconds.
-
on my "bad" upgraded to 2.4.5 systems, the @jimp commands listed do indeed cause the problem to manifest right away, same as doing "filter reload" from the gui. 20-30 second full lockup.
Re-installing a clean build of 2.4.5 on the SAME "bad" system fixes the issue entirely. @jimp commands no longer cause any trouble, and the "filter reload" doesn't cause any trouble.
So this bug really isn't a matter of hyper-v, or proxmox, or CPU counts and voodoo. This is something that is different at the kernel/filesystem level of an upgrade vs a clean install. Config doesn't matter either, upgraded with the problem, then factory reset, the problem still exists. Clean install and format, no problem, import config, still no problem.
Clean installs do not have this issue, this indicates that fundamentally 2.4.5 is working great on all systems when it is installed "correctly". At this time the only SURE way to get a "correct" 2.4.5 install is to do a clean install and not an upgrade.
The fact that "Different systems are impacted in different ways" is a false path, because the bug shouldn't exist on any system to begin with, and even "affected systems" like hyper-v, are actually NOT affected at all when a clean-format install of 2.4.5 is done.
-
And that assertion is demonstrably incorrect as the only place I can replicate the bug is on a fresh installation in Hyper-V.
-
If you are testing with bogons, the difference is probably that on a fresh install the bogon lists aren't populated yet until the first bogon update. Manually update bogons and try that again.
-
Sounds good, will try that and see. I hope you're wrong, because you're implying my currently "good" 2.4.5 systems will essentially self-destruct once they update their bogon list. :)
-
So I've tried on my fresh 2.4.5 with config restored from old "affected" 2.4.5 install. Those commands produce an increase in CPU certainly but no loss of connection/complete lockup. System returns to normal within seconds. The table I used has just shy of 130,000 addresses.
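Since table size seems to matter here, it can be useful to report how many entries a table file actually holds when comparing systems. A small sketch, assuming the file has one address or network per line with optional `#` comment lines; `count_entries` is a hypothetical helper name.

```shell
#!/bin/sh
# Count the entries in a pf table file, skipping blank lines and
# lines that start with "#".
count_entries() {
    awk '!/^[ \t]*(#|$)/ { n++ } END { print n + 0 }' "$1"
}

# Example use:
#   count_entries /etc/bogonsv6
# The currently loaded table can be counted with:
#   pfctl -t bogonsv6 -T show | wc -l
```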
-
@jimp Uh oh, looks like you're spot on in your analysis. Manually updating the bogon list has now brought the bug into my once-working clean install 2.4.5 environments.
So my statement of "only a clean install doesn't have the bug" is true, BUT only until the bogon list updates automatically ;)
The bogon file is already present after an upgrade, so that explains the false trail that I was on.
Good news is this lends a workaround to anyone with this problem: delete the bogon file for now and you'll be OK for a little while again. You could get really aggressive and disable the bogon update script if really necessary for production right now.
Thanks for setting us up on the right track @jimp and hopefully we can find a fix for this pfctl table flush/add problem.
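For anyone trying the workaround mentioned above, here is a cautious variant as a sketch only. Instead of deleting the bogon file outright, it keeps a copy and leaves an empty file at the original path. That choice rests on an assumption that the ruleset still references the path and might fail to load if the file were missing; it has not been confirmed in this thread. `set_aside` is a hypothetical helper name, and `/etc/bogonsv6` is the path from the commands earlier in the thread.

```shell
#!/bin/sh
# Keep a copy of a table file as "<file>.disabled" and truncate the
# original so the path still exists but holds no entries.
set_aside() {
    file=$1
    [ -f "$file" ] || { echo "no such file: $file" >&2; return 1; }
    cp -p "$file" "$file.disabled" || return 1
    : > "$file"
}

# Example use (as root, on a test system first):
#   set_aside /etc/bogonsv6
# To undo:
#   mv /etc/bogonsv6.disabled /etc/bogonsv6
```

As noted above, the next automatic bogon update would presumably repopulate the file, so this only buys time.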