Netgate Discussion Forum

    2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl

    • C
      carl2187
      last edited by

      I found that VirtualBox 6.1.6 VMs have the issue as well. A clean install of 2.4.5 into a VirtualBox VM with 4 cores and 6 GB of RAM runs perfectly at first. After manually updating the bogons from the shell, either reloading the filter or using @jimp's commands from the shell to drop and reload the bogon table results in a CPU spike and a full outage for about 1 minute.

      So this virtualbox pfsense instance has the issue even worse than Hyper-v vms on the exact same underlying physical hardware that I was using to test with Hyper-v.

      Physical hardware I've done all my testing on, for both VirtualBox and Hyper-V:
      Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz, lenovo w540 laptop. The vm settings in virtualbox took all the defaults for "freebsd x64" template except changing from 1 cpu to 4 cpu.

      I've seen the (now quickly repeatable) issue on hyper-v versions 2016, 2019, and Windows 10 1909 hyper-v, and now Virtualbox running on win10 1909.

      To repro yourself for testing:
      Clean install 2.4.5
      Make sure the big bogon file is downloaded first. Go to the pfSense shell and run:
      /etc/rc.update_bogons.sh 1

      then flush and then add the bogon table from the firewall:
      pfctl -t bogonsv6 -T flush
      pfctl -t bogonsv6 -T add -f /etc/bogonsv6

      pfsense is now locked up for a bit while it processes the bogon file, network traffic stalls to/from/through the firewall for about 20 seconds on hyper-v, about 1 minute on Virtualbox, cpu of the VM goes to 100%, console goes unresponsive.
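
The flush-and-add steps above can be wrapped in a small function that also reports how long they took. This is only a sketch: the `reload_table` name and the `PFCTL` override variable are illustrative and not part of pfSense, and the elapsed time is wall-clock seconds, so it will include any period during which the console was frozen.

```sh
#!/bin/sh
# PFCTL is an illustrative override so the function can be dry-run with a stub.
PFCTL="${PFCTL:-pfctl}"

# reload_table TABLE FILE: flush TABLE, re-add the entries from FILE,
# and print how many whole seconds the two steps took.
reload_table() {
    table="$1"; file="$2"
    start=$(date +%s)
    "$PFCTL" -t "$table" -T flush || return 1
    "$PFCTL" -t "$table" -T add -f "$file" || return 1
    end=$(date +%s)
    echo "reload of $table took $((end - start))s"
}

# Example (on the firewall): reload_table bogonsv6 /etc/bogonsv6
```

Running it from a console session rather than over SSH through the firewall may make the stall easier to observe, since the SSH session itself can freeze during the outage.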

      • C
        carl2187
        last edited by

        VMware Workstation 15 has the exact same issue and timing characteristics as Hyper-V: about 20 seconds of full CPU and a full network outage when running the bogon download, flush, and add commands from the shell:

        /etc/rc.update_bogons.sh 1
        <usually see the issue here, depending on if the bogon file has been downloaded already or not>
        pfctl -t bogonsv6 -T flush
        pfctl -t bogonsv6 -T add -f /etc/bogonsv6
        <always see the issue here, assuming the update_bogon script was able to complete successfully>

        (Make sure you see 111672 addresses deleted/added when running those commands; otherwise your bogon list is still empty and the bug won't manifest.)

        So far, each of these hypervisors repeatably shows the issue with a clean install of pfSense 2.4.5 from ISO, 4 CPUs, and 6 GB of RAM: Hyper-V (2016, 2019, Win10 1909), VirtualBox 6.1.6, VMware Workstation 15.

        • C
          carl2187
          last edited by

          just tested vmware esxi, only got one packet lost during the "add" command, so 1-2 seconds of outage on vmware esxi 7.0.0

          pfctl -t bogonsv6 -T add -f /etc/bogonsv6

          Results in:

          Hyper-v 2016: 20 sec outage
          Hyper-v 2019: 20 sec outage
          Hyper-v win10-1909: 20 sec outage
          VMWare Workstation 15: 20 sec outage
          Virtualbox 6.1.6: 50 sec outage
          VMWare esxi 7.0.0: 1 sec outage

          I have an old Netgate SG-1000, and it reloads the bogonsv6 table in about 1 second, without any downtime or lost packets. So all virtualized environments seem to have at least 1 second of downtime, ranging up from there. The tiny CPU in the SG-1000 handles it without any outage.
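
Outage figures like the ones above can be estimated by counting lost pings from a LAN host while the table reloads. The sketch below assumes ping's default one-second interval, so each lost packet is roughly one second of outage; `lost_packets` is a hypothetical helper name and 192.0.2.1 is a placeholder address.

```sh
#!/bin/sh
# lost_packets: read ping output on stdin and print transmitted minus
# received, taken from the "packets transmitted" summary line.
lost_packets() {
    awk '/packets transmitted/ { print $1 - $4 }'
}

# Example, from a LAN host while the table reloads on the firewall:
#   ping -c 60 192.0.2.1 | lost_packets
```

The count is only a coarse measure: a stall shorter than the ping interval can slip between two probes and show up as zero loss.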

          • jimpJ
            jimp Rebel Alliance Developer Netgate
            last edited by

            Less about CPU power, more about CPU count. Knock any of the VMs down to a single core and they probably won't show the same symptoms.

            Remember: Upvote with the 👍 button for any user/post you find to be helpful, informative, or deserving of recognition!

            Need help fast? Netgate Global Support!

            Do not Chat/PM for help!

            • jimpJ
              jimp Rebel Alliance Developer Netgate
              last edited by

              We have identified the cause of the problem: it is a change made in FreeBSD for a PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230619

              On a test kernel with r345177 reverted, there is no delay, lock, or other disruption on a multi-core Hyper-V VM:

              : pfctl -t bogonsv6 -T flush
              111171 addresses deleted.
              : time pfctl -t bogonsv6 -T add -f /etc/bogonsv6
              111171/111171 addresses added.
              0.149u 0.196s 0:00.34 97.0%	373+192k 0+0io 0pf+0w
              : pfctl -t bogonsv6 -T flush
              111171 addresses deleted.
              : time pfctl -t bogonsv6 -T add -f /etc/bogonsv6
              111171/111171 addresses added.
              0.175u 0.199s 0:00.37 97.2%	365+188k 0+0io 0pf+0w
              

              On a stock 2.4.5 kernel that same system experienced a 60-second lock where the console and everything else was unresponsive.

              We're still assessing the next steps.

              • luckman212L
                luckman212 LAYER 8 @jimp
                last edited by

                @jimp This is wonderful news! Good luck on the endeavor. Can I ask 2 questions?

                1. is there any way to remotely downgrade to 2.4.4 from 2.4.5? I think I have a remote SG3100 hitting this issue and it was on 2.4.2 earlier today, I upgraded it... it's 20 miles away :(

                2. in case (1) is not possible, is this bug also present in current 2.5.0 builds?

                sorry if these q's are already answered but I'm on mobile and so haven't read the whole thread (came here via a reddit link)

                • RicoR
                  Rico LAYER 8 Rebel Alliance
                  last edited by

                  AFAIK there is no way to downgrade online. The only thing you can do is reflash 2.4.4 with usb thumb.

                  -Rico

                  • T
                    teiva @andrew_241
                    last edited by teiva

                    @andrew_241 Not sure if you saw the post about a FreeBSD bug affecting this version, but since I dropped my vCPU count from 4 to 1, the server is no longer locking up like it was before, although the firewall is busier than usual when loading or doing a filter reload. Anyway, just thought I'd let you know.

                    • T
                      teiva @luckman212
                      last edited by teiva

                      @luckman212 said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

                      if these q's are already answered but I'm on mobile and so haven't read the whole thread (came here via a reddit link)

                      This is great news. Dropping to 1vCPU has temporarily mitigated my issue.

                      • jimpJ
                        jimp Rebel Alliance Developer Netgate @luckman212
                        last edited by

                        @luckman212 said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

                        is there any way to remotely downgrade to 2.4.4 from 2.4.5? I think I have a remote SG3100 hitting this issue and it was on 2.4.2 earlier today, I upgraded it... it's 20 miles away :(

                        No

                        in case (1) is not possible, is this bug also present in current 2.5.0 builds?

                        We haven't tested 2.5.0, but I don't think it is affected. That could change, though, as we're getting the 2.5.0 builds up onto stable/12 and the bug may be there.

                        • luckman212L
                          luckman212 LAYER 8 @jimp
                          last edited by luckman212

                          @jimp Thank you again. Reading through redmine #10414 it seems like the temporary workaround is:

                          • set System > Advanced > Firewall & NAT > Firewall Maximum Table Entries to a value below 65535, e.g. 65000
                          • disable Block bogon networks on all interfaces

                          The thing is, I've done both of those things on my only 2.4.5 system (a remote SG-3100) and I believe I am still hitting this problem.
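
To confirm that a changed table limit is actually in effect, the current value can be read from `pfctl -sm`. The helper below is a sketch: `table_entries_limit` is a hypothetical name, and it assumes the usual output layout in which a line starts with `table-entries` and ends with the numeric limit.

```sh
#!/bin/sh
# table_entries_limit: read `pfctl -sm` output on stdin and print the
# table-entries hard limit (the last field of the matching line).
table_entries_limit() {
    awk '$1 == "table-entries" { print $NF }'
}

# Example on the firewall:
#   pfctl -sm | table_entries_limit
#   pfctl -t bogonsv6 -T show | wc -l    # entries currently loaded
```

Comparing the two numbers shows whether the bogonsv6 table is still being loaded under the configured limit.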

                          Take a look at this gateway monitoring graph: I've never seen spikes like this! They're almost all exactly 20 minutes apart. I checked /etc/crontab for any possible jobs that might be running on 20-minute intervals (found nothing). I also searched the filesystem for any references to 1200 seconds and found just one, in /usr/local/www/interfaces_bridge_edit.php stating "...the timeout of address cache entries [..] default is 1200 seconds". Don't know if that's anything.

                          I've had multiple conversations with the ISP and they are assuring me the problem is "on my end", of course. I'd normally set up some Wireshark captures between the ISP equipment and pfSense in this type of situation, but since I'm remote that isn't possible.

                          It seems like people are also reporting success on virtual machines by setting CPU cores to 1. Is there any boot flag that we can set here to disable SMP e.g. kern.smp.disabled=1 or hint.lapic.1.disable=1 or is that not necessary?

                          update: see below -- disabling SMP seems to have helped.

                          • jimpJ
                            jimp Rebel Alliance Developer Netgate
                            last edited by

                            Not just bogons but anything that loads large tables. It could be a URL table alias, pfBlockerNG, or something else.
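
Since any large table can trigger the stall, it may help to list every table with its entry count. This is a rough sketch: `table_sizes` and the `PFCTL` override are illustrative names, and it only covers tables in the main ruleset, not ones defined inside anchors.

```sh
#!/bin/sh
# PFCTL is an illustrative override so the function can be dry-run with a stub.
PFCTL="${PFCTL:-pfctl}"

# table_sizes: print "<count> <table>" for every pf table, largest first.
table_sizes() {
    "$PFCTL" -sT | while read -r table; do
        # One address per output line; strip the padding some wc builds add.
        count=$("$PFCTL" -t "$table" -T show </dev/null | wc -l | tr -d ' ')
        echo "$count $table"
    done | sort -rn
}

# Example on the firewall: table_sizes | head
```

Tables with tens of thousands of entries at the top of the list are the likely candidates to disable or shrink first.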

                            • luckman212L
                              luckman212 LAYER 8 @jimp
                              last edited by

                              @jimp said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

                              anything that loads large tables. It could be a URL table alias, pfBlockerNG

                              This unit doesn't have any aliases defined, and pfBNG is not installed (no packages installed actually).

                              • A
                                akm22562 @luckman212
                                last edited by

                                @luckman212 In my case, I had CARP configured with bogons blocked on the WAN interface.

                                I can't afford to disable CARP. Disabling bogons was a big help.

                                Anyway, just my $0.02.

                                • C
                                  carl2187 @jimp
                                  last edited by

                                  @jimp amazing detective work to have already isolated it down to a specific upstream change in freebsd!

                                  Please let the community know if there's anything we can do to help: test, build kernels, etc.

                                  Thanks for all you do!

                                  • luckman212L
                                    luckman212 LAYER 8
                                    last edited by luckman212

                                    Seems like the fix for this will land in 2.4.5-p1 which is coming soon. But, this person was desperate. So as a test, I put the following in their /boot/loader.conf.local:

                                    kern.smp.disabled=1
                                    

                                    After rebooting, the problem is gone. It's only been an hour, but not a single hiccup so far (🤞 fingers crossed). This is on an SG-3100.
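
For anyone applying the same workaround over SSH, the line can be added without creating duplicates on repeated runs. A minimal sketch, where `ensure_line` is a hypothetical helper and not a pfSense tool:

```sh
#!/bin/sh
# ensure_line FILE LINE: append LINE to FILE unless an identical line
# is already present (the file is created if it does not exist).
ensure_line() {
    file="$1"; line="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example on the firewall, followed by a reboot:
#   ensure_line /boot/loader.conf.local 'kern.smp.disabled=1'
# After the reboot, `sysctl kern.smp.disabled` should report 1.
```

Removing the line and rebooting again undoes the change, which is worth remembering once a fixed release is installed.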

                                    • Cool_CoronaC
                                      Cool_Corona
                                      last edited by Cool_Corona

                                      I can confirm this fixes the issue completely!

                                      It's not enough to edit System -> Tunables.

                                      You have to edit /boot/loader.conf.local manually.

                                      It's the same as limiting a VM to only 1 core.

                                      • nzkiwi68N
                                        nzkiwi68 @luckman212
                                        last edited by

                                        kern.smp.disabled=1
                                        

                                        After rebooting, the problem is gone. It's only been an hour, but not a single hiccup so far (🤞 fingers crossed). This is on an SG-3100.

                                        But a quick look online suggests to me that this disables all multi-CPU support. On busy systems this could be a problem, since it turns your multi-core system (for example an XG-1537 with 8 cores plus hyper-threading) into a single-CPU system!

                                        I recommend staying on 2.4.4-p3 if you can, or limping along on 2.4.5 if you have already upgraded.

                                        For me, I shall wait for the official patch / release.

                                        • luckman212L
                                          luckman212 LAYER 8 @nzkiwi68
                                          last edited by

                                          @nzkiwi68 said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

                                          this is disabling all multi CPU support [..] I recommend if you can wait on 2.4.4-p3 or limp along on 2.4.5 if you can.

                                          100% good advice. In this case I had to upgrade due to another problem, so I was "stuck" on 2.4.5 with a remote system and had no other option. When 2.4.5-p1 / 2.5.0 come out this should not be needed. Losing 1 core is a fair trade for regaining the stability.

                                          • Cool_CoronaC
                                            Cool_Corona @luckman212
                                            last edited by

                                            @luckman212

                                            Agreed. Limiting to 1 core is a viable option on a home network or on a small B2B setup. A busy connection running IDS/IPS would be running at full load and not have spare resources left.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.