2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl

carl2187

On my "bad" systems that were upgraded to 2.4.5, the commands @jimp listed do indeed trigger the problem right away, same as doing a "filter reload" from the GUI: a 20-30 second full lockup.

Re-installing a clean build of 2.4.5 on the SAME "bad" system fixes the issue entirely. @jimp commands no longer cause any trouble, and the "filter reload" doesn't cause any trouble.

So this bug really isn't a matter of Hyper-V, or Proxmox, or CPU counts and voodoo. Something is different at the kernel/filesystem level between an upgrade and a clean install. Config doesn't matter either: a system upgraded with the problem still has it after a factory reset, while a clean format-and-install has no problem, and still has none after importing the config.

Clean installs do not have this issue, which indicates that 2.4.5 fundamentally works great on all systems when it is installed "correctly". At this time the only SURE way to get a "correct" 2.4.5 install is to do a clean install, not an upgrade.

The fact that "Different systems are impacted in different ways" is a false path, because the bug shouldn't exist on any system to begin with, and even "affected systems" like hyper-v, are actually NOT affected at all when a clean-format install of 2.4.5 is done.

jimp

And that assertion is demonstrably incorrect as the only place I can replicate the bug is on a fresh installation in Hyper-V.

jimp

If you are testing with bogons, the difference is probably that on a fresh install the bogon lists aren't populated yet until the first bogon update. Manually update bogons and try that again.

carl2187

@jimp

Sounds good, will try that and see. I hope you're wrong, because you're implying my currently "good" 2.4.5 systems will essentially self-destruct once they update their bogon list. :)

Uncle_Bacon

So I've tried on my fresh 2.4.5 with config restored from old "affected" 2.4.5 install. Those commands produce an increase in CPU certainly but no loss of connection/complete lockup. System returns to normal within seconds. The table I used has just shy of 130,000 addresses.

carl2187

@jimp uh oh, looks like you're spot on in your analysis. Manually updating the bogon list has now brought the bug into my once-working clean-install 2.4.5 environments.

so my statement of "only a clean install doesn't have the bug" is true, BUT only until the bogon list updates automatically ;)

The bogon file is present already after an upgrade, so that explains the false trail that I was on.

The good news is that this gives a workaround to anyone with this problem: delete the bogon file for now and you'll be OK for a little while again. You could get really aggressive and disable the bogon update script if really necessary for production right now.

Thanks for setting us up on the right track @jimp and hopefully we can find a fix for this pfctl table flush/add problem.

jimp

@carl2187 said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

Good news is this lends a workaround to anyone with this problem, delete the bogon file for now and you'll be ok for a little while again, could get really aggressive and disable the update bogon script if really necessary for production right now

Or just disable bogon blocking on all interfaces. Though a fair amount of people are not using bogons but pfBlockerNG features which use large tables, and those would need to be disabled instead.

carl2187

I found that VirtualBox 6.1.6 VMs have the issue as well. A clean install of 2.4.5 into a VirtualBox VM with 4 cores and 6GB of RAM is perfect at first. After manually updating the bogons from the shell, reloading the filter or using @jimp's commands from the shell to drop and reload the bogon table results in a CPU spike and a full outage for about 1 minute.

So this virtualbox pfsense instance has the issue even worse than Hyper-v vms on the exact same underlying physical hardware that I was using to test with Hyper-v.

Physical hardware I've done all my testing on, for both VirtualBox and Hyper-V:
Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz, lenovo w540 laptop. The vm settings in virtualbox took all the defaults for "freebsd x64" template except changing from 1 cpu to 4 cpu.

I've seen the (now quickly repeatable) issue on hyper-v versions 2016, 2019, and Windows 10 1909 hyper-v, and now Virtualbox running on win10 1909.

To repro yourself for testing:
Clean install 2.4.5
Make sure you have the big bogon file downloaded first: go to the pfSense shell and run:
/etc/rc.update_bogons.sh 1

then flush and then add the bogon table from the firewall:
pfctl -t bogonsv6 -T flush
pfctl -t bogonsv6 -T add -f /etc/bogonsv6

pfSense is now locked up for a bit while it processes the bogon file: network traffic to/from/through the firewall stalls for about 20 seconds on Hyper-V and about 1 minute on VirtualBox, CPU of the VM goes to 100%, and the console goes unresponsive.
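
A minimal sketch for timing the repro above (not part of the original steps). It wraps the same two pfctl commands in a function and reports wall-clock seconds. The names `count_entries` and `reload_table` and the `TABLE`/`FILE` variables are made up here for illustration, and the entry count assumes one address per line, with any comment or blank lines skipped.

```shell
#!/bin/sh
# Sketch only: time a flush/add cycle of a pf table on the firewall itself.
TABLE="${TABLE:-bogonsv6}"
FILE="${FILE:-/etc/bogonsv6}"

# Count lines in a table file that are neither blank nor comments.
count_entries() {
    grep -cv -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Flush the table, reload it from the file, and print the elapsed
# wall-clock time (whole seconds only, since date +%s is used).
reload_table() {
    echo "$FILE has $(count_entries "$FILE") entries"
    start=$(date +%s)
    pfctl -t "$TABLE" -T flush
    pfctl -t "$TABLE" -T add -f "$FILE"
    end=$(date +%s)
    echo "reload of $TABLE took $((end - start))s"
}
```

Nothing runs until `reload_table` is called as root from the pfSense shell. Running a ping against the firewall from another machine at the same time would show whether traffic stalls during the reload.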

carl2187

VMware Workstation 15 has the exact same issue and timing characteristics as Hyper-V: about 20 seconds of full CPU and full network outage when running the bogon download, flush, and add commands from the shell:

/etc/rc.update_bogons.sh 1
<usually see the issue here, depending on if the bogon file has been downloaded already or not>
pfctl -t bogonsv6 -T flush
pfctl -t bogonsv6 -T add -f /etc/bogonsv6
<always see the issue here, assuming the update_bogon script was able to complete successfully>

(make sure you see 111672 addresses deleted/added when running those commands, otherwise your bogon list is still empty and the bug won't manifest)

So far, each of these hypervisors repeatably shows the issue with a clean install of pfSense 2.4.5 from ISO and 4 CPUs / 6GB RAM: Hyper-V (2016, 2019, win10-1909), VirtualBox 6.1.6, VMware Workstation 15.

carl2187

Just tested VMware ESXi 7.0.0: only one packet was lost during the "add" command, so 1-2 seconds of outage.

pfctl -t bogonsv6 -T add -f /etc/bogonsv6

Results in:

Hyper-v 2016: 20 sec outage
Hyper-v 2019: 20 sec outage
Hyper-v win10-1909: 20 sec outage
VMWare Workstation 15: 20 sec outage
Virtualbox 6.1.6: 50 sec outage
VMWare esxi 7.0.0: 1 sec outage

I have an old Netgate SG-1000; it reloads the bogonsv6 table in about 1 second, without any downtime or lost packets. So all virtualized environments seem to have at least 1 second of downtime, ranging up from there. The tiny CPU in the SG-1000 handles it without any outage.

jimp

Less about CPU power, more about CPU count. Knock any of the VMs down to a single core and they probably won't show the same symptoms.

jimp

We have identified the cause of the problem: it is a change made in FreeBSD for a PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=230619

On a test kernel with r345177 reverted, there is no delay, lock, or other disruption on a multi-core Hyper-V VM:

: pfctl -t bogonsv6 -T flush
111171 addresses deleted.
: time pfctl -t bogonsv6 -T add -f /etc/bogonsv6
111171/111171 addresses added.
0.149u 0.196s 0:00.34 97.0%	373+192k 0+0io 0pf+0w
: pfctl -t bogonsv6 -T flush
111171 addresses deleted.
: time pfctl -t bogonsv6 -T add -f /etc/bogonsv6
111171/111171 addresses added.
0.175u 0.199s 0:00.37 97.2%	365+188k 0+0io 0pf+0w

On a stock 2.4.5 kernel that same system experienced a 60-second lock where the console and everything else was unresponsive.

We're still assessing the next steps.

luckman212

@jimp This is wonderful news! Good luck on the endeavor. Can I ask 2 questions?

1. Is there any way to remotely downgrade to 2.4.4 from 2.4.5? I think I have a remote SG3100 hitting this issue and it was on 2.4.2 earlier today, I upgraded it... it's 20 miles away :(
2. In case (1) is not possible, is this bug also present in current 2.5.0 builds?

sorry if these q's are already answered but I'm on mobile and so haven't read the whole thread (came here via a reddit link)

Rico

AFAIK there is no way to downgrade online. The only thing you can do is reflash 2.4.4 from a USB thumb drive.

-Rico

teiva

@andrew_241 Not sure if you saw the post about a FreeBSD bug affecting this version, but I've dropped my vCPU count from 4 to 1, and although the firewall is busier than usual when loading or doing a filter reload, the server is not locking up anymore like it was before. Anyway, just thought I'd let you know.

teiva

@luckman212 said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

if these q's are already answered but I'm on mobile and so haven't read the whole thread (came here via a reddit link)

This is great news. Dropping to 1vCPU has temporarily mitigated my issue.

jimp

@luckman212 said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

is there any way to remotely downgrade to 2.4.4 from 2.4.5? I think I have a remote SG3100 hitting this issue and it was on 2.4.2 earlier today, I upgraded it... it's 20 miles away :(

No

in case (1) is not possible, is this bug also present in current 2.5.0 builds?

We haven't tested 2.5.0, but I don't think it is. That could change, though, as we're getting the 2.5.0 builds up onto stable/12 and it may be there.

luckman212

@jimp Thank you again. Reading through redmine #10414 it seems like the temporary workaround is:

set System > Advanced > Firewall & NAT > Firewall Maximum Table Entries to <65535 — e.g. 65000
disable Block bogon networks on all interfaces
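
To confirm the first setting actually took effect, one option is to read the limit pf is enforcing. This is a sketch only: `pfctl -sm` prints pf's memory limits, and the `table_entry_limit` helper is a made-up name that assumes the limit is the last field on the `table-entries` line.

```shell
#!/bin/sh
# Sketch only: extract the table-entries hard limit from `pfctl -sm` output.
# Reads the output on stdin so the parsing can be tried without pfctl.
table_entry_limit() {
    awk '$1 == "table-entries" { print $NF }'
}

# Intended use on the firewall (as root):
#   pfctl -sm | table_entry_limit
```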

The thing is, I've done both of those things on my only 2.4.5 system (a remote SG-3100) and I believe I am still hitting this problem.

Take a look at this gateway monitoring graph — never seen spikes like this! They're almost all exactly 20 minutes apart. I checked /etc/crontab for any possible jobs that might be running on 20 minute intervals (found nothing). I also searched the filesystem for any references to 1200 seconds and found just one, in /usr/local/www/interfaces_bridge_edit.php stating "...the timeout of address cache entries [..] default is 1200 seconds". Don't know if that's anything.

Multiple conversations with the ISP and they are assuring me the problem is "on my end" — of course. I'd normally set up some Wireshark captures between the ISP equipment and pfSense in this type of situation, but since I'm remote that isn't possible.

It seems like people are also reporting success on virtual machines by setting CPU cores to 1. Is there any boot flag that we can set here to disable SMP e.g. kern.smp.disabled=1 or hint.lapic.1.disable=1 or ~~is that not necessary~~?

update: see below -- disabling SMP seems to have helped.
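
For anyone wanting to try the same mitigation, here is a minimal sketch of setting a loader tunable persistently. The thread does not say how SMP was disabled, so this rests on two assumptions: that `kern.smp.disabled` is honored on your platform, and that `/boot/loader.conf.local` is the right file for local tunables on your install. The `set_loader_tunable` helper is a made-up name.

```shell
#!/bin/sh
# Sketch only: set key="value" in a loader.conf-style file, replacing any
# existing assignment of the same key so repeated runs do not pile up lines.
set_loader_tunable() {
    file=$1
    key=$2
    value=$3
    tmp="${file}.tmp.$$"
    if [ -f "$file" ]; then
        # Keep every line that does not start with exactly "key=".
        awk -v k="${key}=" 'index($0, k) != 1' "$file" > "$tmp"
    else
        : > "$tmp"
    fi
    printf '%s="%s"\n' "$key" "$value" >> "$tmp"
    mv "$tmp" "$file"
}

# Intended use on the firewall (as root), followed by a reboot:
#   set_loader_tunable /boot/loader.conf.local kern.smp.disabled 1
```

Remove the line again (or set it to 0) and reboot to restore multi-core operation once a fixed kernel is available.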

jimp

Not just bogons but anything that loads large tables. It could be a URL table alias, pfBlockerNG, or something else.

luckman212

@jimp said in 2.4.5.a.20200110.1421 and earlier: High CPU usage from pfctl:

anything that loads large tables. It could be a URL table alias, pfBlockerNG

This unit doesn't have any aliases defined, and pfBNG is not installed (no packages installed actually).