Netgate Discussion Forum

2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl
• Uncle_Bacon @carl2187

Great suggestion; I did a fresh install of 2.4.5 on my Proxmox VM (2 sockets, 2 cores, 8 GB RAM). I should have tried it sooner.

Initial tests (filter reload, manual drop of gateways for failover) have yielded positive results. CPU spiked, pfctl did its thing, and everything returned to normal levels. There was no detectable loss of connection anywhere on the network; a seamless swap.

It would seem @carl2187 is on to something with the problem only existing after an upgrade and not a fresh install. Plus, my config was restored from a system upgraded from 2.4.4p3 to 2.4.5 (as my non-upgraded 2.4.4 backups were quite dated). Boot speed appears to be comparable to, if not the same as, 2.4.4 as well.

I'm going to run this for now, continue with load tests, and see what happens.

• carl2187 @Uncle_Bacon

@Uncle_Bacon Good to hear you're up and running again with a clean install. It seems like a clean install is the only way to go right now for 2.4.5.

There's something really wrong somewhere in the file system or boot config after an upgrade to 2.4.5 that doesn't occur in a clean install of 2.4.5.

I have an upgraded (broken) 2.4.5 and a clean install of 2.4.5, and both have now been factory reset. The upgraded VM has the issue; the clean install does not, even after the factory reset. I'm going to compare file system contents and boot loader settings to see whether something didn't actually get upgraded during the 2.4.4 -> 2.4.5 upgrade process.
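
For anyone who wants to run the same comparison, a rough sketch of what I plan to diff between the upgraded and clean VMs; these are all stock FreeBSD/pfSense commands, nothing exotic:

uname -a                                # kernel build string
freebsd-version -ku                     # kernel vs. userland versions
cat /boot/loader.conf                   # boot loader tunables
sysctl -a > /tmp/sysctl-upgraded.txt    # dump tunables on each box, then diff the two files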

Just this morning I got a report of "intermittent" outages at a branch office that was upgraded to 2.4.5 about a week ago. I jumped on the GUI, did a filter reload, and lost all connectivity for about 20 seconds; that's all I needed to see to know this is the botched-upgrade condition. So I exported the config, re-installed clean 2.4.5, and imported the config, and now things are perfect. I can't trigger the issue on this VM anymore. Same VM, same pfSense version, same config, but an upgrade to 2.4.5 = bad; a clean install of 2.4.5 = good.

So I've done lots of tests, and now the first "production" issue I had to help with was fixed exactly the same way as in the labs I've been testing. Upgrades to 2.4.5 can cause full CPU usage and outages; clean installs of 2.4.5 seem to work great with any imported config.

It's important for anyone else troubleshooting this issue to know that "clean install" really means a full format and re-install from the ISO. A factory reset after the issue is present is NOT enough to clear out this bug; even the default config will have this problem after an upgrade plus factory reset. Something seems to be different between upgraded instances and clean installs of 2.4.5 at the file system or bootloader level.

In attempting to accurately re-create the scenario that leads to this problem, I found that I can't actually download old versions of pfSense. It really is too bad Netgate stopped allowing downloads of previous-version ISOs; it makes troubleshooting this type of upgrade issue in a lab very difficult, and it effectively prevents the community from helping to troubleshoot the underlying issue here.
https://forum.netgate.com/topic/135395/where-to-download-old-versions

• IsaacFL @carl2187

@carl2187 I think this might be a false lead, because I did a fresh install on Hyper-V, and that was when I noticed the long reboot times started.

My old install, which was upgraded through the RCs and then finally to the release version, had fewer problems.

My fresh install hangs at "Mounting filesystems..." for 30 seconds. Total boot time averages about 2 minutes.

• Cool_Corona @IsaacFL

@IsaacFL We need to dig deeper. ESXi, Hyper-V, and Proxmox... how do they differ in regard to this?

• Uncle_Bacon @IsaacFL

@IsaacFL In my use case, I don't consider 2 minutes to be very long at all. I don't get the 30-second mounting delay like you do. Most of my boot time is spent configuring the firewall, but I do have many rules on many VLANs. I timed it with a stopwatch: from power-on to being able to log in to the GUI was 1:20, and boot was complete at 2:01.

Specs on the host for reference:
CPU(s): 40 x Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz (2 Sockets)
RAM: 192GB
Kernel Version: Linux 5.3.18-2-pve (modified Ubuntu)
PVE Manager: pve-manager/6.1-8/806edfe1

• carl2187

The "Mounting filesystems" step has had a 30-second delay for many versions, especially when using ZFS. That is not the issue I'm worried about, reporting on, and testing here.

To find out whether this upgrade bug affects you, run a continuous ping against your pfSense IP, then do a filter reload. If you get about 20 seconds of no response, you have the upgrade bug.
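
For example, something like this works on my setups (192.168.1.1 stands in for your firewall's LAN IP). From a Windows client, start a continuous ping:

ping -t 192.168.1.1

Then trigger the reload on pfSense, either from Status > Filter Reload in the GUI, or from the shell (this script exists on my 2.4.x boxes):

/etc/rc.filter_configure

Watch the ping window for the gap in replies.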

This upgrade bug has been seen on Hyper-V and Proxmox, so it may not be specific to any one hypervisor at all. It may affect physical hardware too; it's too early to say.

But in my experience, the filesystem mount is done at 1 min 4 sec after boot, and bootup is totally done at about 1:55. This is consistent across many of my deployed systems and many versions of pfSense, all using ZFS.

• jimp Rebel Alliance Developer Netgate

We're narrowing in on what the problem is. It's easiest to trigger with pfctl loading large tables (not flushing or updating, but adding), though pf itself does not actually seem to be to blame. The current theory is a problem with allocation of large chunks of kernel memory, but we're still debugging.

On an affected system you can trigger it manually by doing this:

Flush a large table first:

# pfctl -t bogonsv6 -T flush

Load the table contents (this is where the huge delay happens):

# pfctl -t bogonsv6 -T add -f /etc/bogonsv6

Compare with updating an existing table, which is fast, by running the same command again:

# pfctl -t bogonsv6 -T add -f /etc/bogonsv6

Different systems are impacted in different ways. Hyper-V seems to be the worst, with the whole system becoming unresponsive for 30-60 seconds, even at the console. Proxmox is similar but not quite as bad. Some systems only see high pfctl usage and some slowness but are otherwise responsive. All that said, most of the systems in our lab show no symptoms whatsoever, which is why it's been difficult to track down.
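
If you want to put a number on the delay rather than eyeball it, one option is to wrap the slow add in the stock FreeBSD time utility:

# pfctl -t bogonsv6 -T flush
# /usr/bin/time pfctl -t bogonsv6 -T add -f /etc/bogonsv6

The "real" figure it prints is roughly the length of the stall.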


• kiokoman LAYER 8

Tested it on my ESXi: no problem. Tested on Hyper-V under Windows 10: no problem with 1 CPU, but with 2 or more CPUs it completely freezes the VM. The boot process is slow overall; it gets stuck for a while at "Configuring firewall...", and although it eventually completes, it freezes again when it reaches the console menu.


• jimp Rebel Alliance Developer Netgate @kiokoman

@kiokoman said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

Tested it on my ESXi: no problem. Tested on Hyper-V under Windows 10: no problem with 1 CPU, but with 2 or more CPUs it completely freezes the VM. The boot process is slow overall; it gets stuck for a while at "Configuring firewall...", and although it eventually completes, it freezes again when it reaches the console menu.

Multiple CPUs appear to be one major contributing factor, but not all multi-core/multi-CPU configurations are affected. I don't think anyone has seen it on a single-core system.


• maverick_slo

Same here with Hyper-V on Server 2019. 6 cores; @jimp's commands kill my firewall for 30-45 seconds.

• carl2187

On my "bad" upgraded-to-2.4.5 systems, the commands @jimp listed do indeed cause the problem to manifest right away, the same as doing a filter reload from the GUI: a 20-30 second full lockup.

Re-installing a clean build of 2.4.5 on the SAME "bad" system fixes the issue entirely. @jimp's commands no longer cause any trouble, and neither does a filter reload.

So this bug really isn't a matter of Hyper-V, or Proxmox, or CPU counts and voodoo. Something is different at the kernel/filesystem level between an upgrade and a clean install. Config doesn't matter either: upgraded with the problem, then factory reset, and the problem still exists. Clean install with a full format: no problem; import the config: still no problem.

Clean installs do not have this issue, which indicates that fundamentally 2.4.5 works great on all systems when it is installed "correctly". At this time the only SURE way to get a "correct" 2.4.5 install is a clean install, not an upgrade.

The observation that "different systems are impacted in different ways" is a false path, because the bug shouldn't exist on any system to begin with, and even "affected" platforms like Hyper-V are NOT affected at all once a clean-format install of 2.4.5 is done.

• jimp Rebel Alliance Developer Netgate

And that assertion is demonstrably incorrect, as the only place I can replicate the bug is on a fresh installation in Hyper-V.


• jimp Rebel Alliance Developer Netgate

If you are testing with bogons, the difference is probably that on a fresh install the bogon lists aren't populated until the first bogon update runs. Manually update bogons and try that again.


• carl2187 @jimp

@jimp Sounds good, I'll try that and see. I hope you're wrong, because you're implying my currently "good" 2.4.5 systems will essentially self-destruct once they update their bogon list. :)

• Uncle_Bacon

So I've tried it on my fresh 2.4.5 with the config restored from the old "affected" 2.4.5 install. Those commands certainly produce an increase in CPU, but no loss of connection or complete lockup; the system returns to normal within seconds. The table I used has just shy of 130,000 addresses.
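
(For anyone who wants to verify their own table size, a quick sketch: pfctl can dump a table's contents, and counting the lines gives the address count, e.g. for the bogon table used above:)

# pfctl -t bogonsv6 -T show | wc -l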

• carl2187

@jimp Uh oh, looks like you're spot on in your analysis. Manually updating the bogon list has now brought the bug into my once-working clean-install 2.4.5 environments.

So my statement that "only a clean install doesn't have the bug" is true, BUT only until the bogon list updates automatically. ;)

The bogon file is already present after an upgrade, which explains the false trail I was on.

The good news is this suggests a workaround for anyone with the problem: delete the bogon file for now and you'll be OK for a little while again. You could get really aggressive and disable the bogon update script if it's really necessary for production right now.
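
A rough sketch of that workaround from the shell (bogonsv6 is the file from @jimp's commands above; the flush clears the loaded table, and truncating the on-disk list keeps the next filter reload fast):

# pfctl -t bogonsv6 -T flush
# : > /etc/bogonsv6

On my boxes the periodic update is a cron entry running rc.update_bogons.sh; to find it before disabling it:

# grep bogons /etc/crontab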

Thanks for setting us on the right track, @jimp; hopefully we can find a fix for this pfctl table flush/add problem.

• jimp Rebel Alliance Developer Netgate @carl2187

@carl2187 said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

The good news is this suggests a workaround for anyone with the problem: delete the bogon file for now and you'll be OK for a little while again. You could get really aggressive and disable the bogon update script if it's really necessary for production right now.

Or just disable bogon blocking on all interfaces. Though a fair number of people are not using bogons but rather pfBlockerNG features that use large tables, and those would need to be disabled instead.
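
(If you're not sure which interfaces have bogon blocking enabled, one rough way to check from the shell is to count the blockbogons flags in the config; on my systems each interface with the box checked contributes one match, assuming the default config location:)

# grep -c blockbogons /conf/config.xml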


• carl2187

I found that VirtualBox 6.1.6 VMs have the issue as well. Clean install of 2.4.5 into a VirtualBox VM with 4 cores and 6 GB of RAM: perfect at first. Then manually update the bogons from the shell; reloading the filter, or using @jimp's commands from the shell to drop and reload the bogon table, results in a CPU spike and a full outage of about 1 minute.

So this VirtualBox pfSense instance has the issue even worse than the Hyper-V VMs on the exact same underlying physical hardware I was using to test Hyper-V.

Physical hardware I've done all my testing on, for both VirtualBox and Hyper-V: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz, Lenovo W540 laptop. The VM settings in VirtualBox took all the defaults for the "FreeBSD x64" template except changing from 1 CPU to 4 CPUs.

I've now seen the (quickly repeatable) issue on Hyper-V 2016, Hyper-V 2019, Windows 10 1909 Hyper-V, and now VirtualBox running on Win10 1909.

To repro yourself for testing: clean install 2.4.5, then make sure you have the big bogon file downloaded first. Go to the pfSense shell and run:

/etc/rc.update_bogons.sh 1

Then flush and re-add the bogon table in the firewall:

pfctl -t bogonsv6 -T flush
pfctl -t bogonsv6 -T add -f /etc/bogonsv6

pfSense is now locked up for a bit while it processes the bogon file: network traffic to/from/through the firewall stalls for about 20 seconds on Hyper-V and about 1 minute on VirtualBox, the VM's CPU goes to 100%, and the console goes unresponsive.
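
To put numbers on the stall, a small sketch I've been using from another host on the LAN (192.168.1.1 is a placeholder for the firewall IP; on Windows a plain "ping -t" plus a clock works just as well):

ping 192.168.1.1 | while read reply; do echo "$(date +%T) $reply"; done

The timestamps make the outage window obvious in the scrollback.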

• carl2187

VMware Workstation 15 has exactly the same issue and timing characteristics as Hyper-V: about 20 seconds of full CPU and a full network outage when doing the bogon download, flush, and add commands from the shell:

/etc/rc.update_bogons.sh 1
<usually see the issue here, depending on whether the bogon file has been downloaded already or not>
pfctl -t bogonsv6 -T flush
pfctl -t bogonsv6 -T add -f /etc/bogonsv6
<always see the issue here, assuming the update_bogons script was able to complete successfully>

(Make sure you see 111672 addresses deleted/added when running those commands; otherwise your bogon list is still empty and the bug won't manifest.)
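
For reference, when the list is actually populated, the output looks roughly like this on my systems (the exact count tracks the bogon data current at the time):

# pfctl -t bogonsv6 -T flush
111672 addresses deleted.
# pfctl -t bogonsv6 -T add -f /etc/bogonsv6
111672/111672 addresses added.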

So far I've reproduced the issue repeatably on each of these hypervisors, using a clean install of pfSense 2.4.5 from the ISO with 4 CPUs and 6 GB RAM: Hyper-V (2016, 2019, Win10 1909), VirtualBox 6.1.6, and VMware Workstation 15.

• carl2187

Just tested VMware ESXi: only one packet lost during the "add" command, so a 1-2 second outage on VMware ESXi 7.0.0.

pfctl -t bogonsv6 -T add -f /etc/bogonsv6

Results in:

Hyper-V 2016: 20 sec outage
Hyper-V 2019: 20 sec outage
Hyper-V Win10-1909: 20 sec outage
VMware Workstation 15: 20 sec outage
VirtualBox 6.1.6: 50 sec outage
VMware ESXi 7.0.0: 1 sec outage

I have an old Netgate SG-1000; it reloads the bogonsv6 table in about 1 second, without any downtime or lost packets. So all the virtualized environments seem to have at least 1 second of downtime, ranging up from there, while the tiny CPU in the SG-1000 handles it without any outage.
