High CPU and load very high after updating to 2.7.1 and 2.7.2

stephenw10

What NICs are you using? What packages are you running?

What does top -HaSP actually show in 2.7.2?

Steve

cocojeff3

For the test after clearing my config and reformatting the box, I was running a base config with no customizations or packages. My NICs are listed below. I did collect the top -HaSP output; however, the output and screenshots were lost in a power outage after I had reformatted the box and restored my normal 2.7.0 config to get my systems back up and running. If that data is required to make progress on this issue, I can upgrade the box to 2.7.2 again and collect it. Please advise. From memory, I only recall php-fpm: pool nginx, as that was the only thing that stood out as odd when I compared it to my VM running 2.7.0 at the same time.

em0: <Intel(R) PRO/1000 PT 82571EB/82571GB (Quad Copper)> port 0xd880-0xd89f mem 0xfea80000-0xfea9ffff,0xfea60000-0xfea7ffff irq 16 at device 0.0 on pci3
em1: <Intel(R) PRO/1000 PT 82571EB/82571GB (Quad Copper)> port 0xdc00-0xdc1f mem 0xfeae0000-0xfeafffff,0xfeac0000-0xfeadffff irq 17 at device 0.1 on pci3
em2: <Intel(R) PRO/1000 PT 82571EB/82571GB (Quad Copper)> port 0xe880-0xe89f mem 0xfeb80000-0xfeb9ffff,0xfeb60000-0xfeb7ffff irq 17 at device 0.0 on pci4
em3: <Intel(R) PRO/1000 PT 82571EB/82571GB (Quad Copper)> port 0xec00-0xec1f mem 0xfebe0000-0xfebfffff,0xfebc0000-0xfebdffff irq 18 at device 0.1 on pci4
em4: <Intel(R) 82567LF-3 ICH10> port 0xc880-0xc89f mem 0xfe9c0000-0xfe9dffff,0xfe9fa000-0xfe9fafff irq 20 at device 25.0 on pci0

stephenw10

Hmm, nothing very exotic there. I'd expect em NICs to work fine. I have a system here using a similar CPU and NICs that doesn't show that.

The best way to solve something like this is if we can replicate it. The next best way is to get as much info as we can from the machine hitting the issue.

cocojeff3

I completed the update again and the behavior is the same as with the last upgrade. The upgrade itself completed without issue. The boot time from start to finish went from 4:10.46 on 2.7.0 to 21:50.14 on 2.7.2. I have the following packages installed:
pfSense-pkg-arpwatch
pfSense-pkg-Avahi
pfSense-pkg-Backup
pfSense-pkg-darkstat
pfSense-pkg-nmap
pfSense-pkg-pfBlockerNG-devel
pfSense-pkg-RRD_Summary
pfSense-pkg-Service_Watchdog
pfSense-pkg-Status_Traffic_Totals
pfSense-pkg-suricata
pfSense-pkg-System_Patches
The load one hour after boot is 2.34, which is almost 4 times more than what it was when running 2.7.0.
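
To put the two boot times quoted above side by side, here is a minimal sketch that converts them to seconds and computes the ratio. It assumes the figures are in minutes:seconds format (the post does not state the unit), and `to_seconds` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# to_seconds M:SS.ss -> total seconds.
# Assumption: the value is minutes and seconds separated by a colon.
to_seconds() {
    echo "$1" | awk -F: '{ printf "%.2f\n", $1 * 60 + $2 }'
}

before=$(to_seconds 4:10.46)    # boot time reported on 2.7.0
after=$(to_seconds 21:50.14)    # boot time reported on 2.7.2

# Ratio of the two boot times, rounded to one decimal place.
ratio=$(awk -v a="$after" -v b="$before" 'BEGIN { printf "%.1f", a / b }')

echo "2.7.0: ${before}s  2.7.2: ${after}s  ratio: ${ratio}x"
```

Under that assumption the boot on 2.7.2 takes roughly five times as long as on 2.7.0.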

I have included the following logs from before and after the upgrade, along with the OS boot information, in the attached zip file:
top -HaSP
systat -vmstat 1
netstat -m
systat -iostat 1

Data.zip
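
Since the first set of top output was lost in a power outage, it may help to write this kind of data straight to disk. The following is only a rough sketch, not an official pfSense procedure: `capture`, `collect_all`, `OUT_DIR` and the file names are hypothetical, and it assumes the stock FreeBSD `top` batch options (`-b`, `-d`). The two `systat` views are full-screen displays, so they are left out here; `vmstat -i` (per-device interrupt counts) and `uname -a` are additions of mine and were not part of the collected data.

```shell
#!/bin/sh
# Sketch: save one-shot diagnostics to text files so they survive a reboot.
# OUT_DIR is a hypothetical location; override it before running if needed.
OUT_DIR="${OUT_DIR:-/root/diag-$(date +%Y%m%d-%H%M%S)}"

# capture NAME CMD [ARGS...]
# Runs CMD and stores stdout and stderr in $OUT_DIR/NAME.txt.
# A missing or failing command is noted in the file instead of aborting the run.
capture() {
    name="$1"
    shift
    mkdir -p "$OUT_DIR" || return 1
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >"$OUT_DIR/$name.txt" 2>&1 || echo "exit status: $?" >>"$OUT_DIR/$name.txt"
    else
        echo "command not found: $1" >"$OUT_DIR/$name.txt"
    fi
}

collect_all() {
    # Two batch-mode displays; the second one reflects the sampling interval.
    capture top     top -HaSP -b -d 2
    capture mbufs   netstat -m
    capture intr    vmstat -i     # assumption: useful for the interrupt angle
    capture version uname -a
}

# Run automatically only on FreeBSD (pfSense); elsewhere the functions are just defined.
if [ "$(uname -s)" = "FreeBSD" ]; then
    collect_all
    echo "saved to $OUT_DIR"
fi
```

Running it once on 2.7.0 and once on 2.7.2 would give two directories that can be compared with diff or zipped up and attached.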

Squish

https://forum.netgate.com/topic/184245/high-interrupt-cpu-usage-in-v2-7-1/11

This seems to be the same issue here, where it was observed as interrupts in other hypervisors and on bare metal. I haven't been able to find any actual cause or solution. My best guess so far is that it is kernel related, something to do with interrupt moderation or device polling. I haven't tried rebuilding the kernel yet.

I did observe similar spikes when viewing the web UI as well.

Since it seems it is not related to virtualization after all, I will wait to see where best to continue this and be sure to post any of my results.

stephenw10

Hmm, some of those things like acb upload stalling make it look like a connectivity issue. Can the firewall ping out as expected?

cocojeff3

Yes, the firewall's outgoing connectivity works fine. I can ping out from the device as expected, and its services can connect to get their updates, etc.

stephenw10

Hmm, it could be the encryption part of the code, maybe using something in the new openssl version.

Does that firewall have any crypto hardware that could have been in use?

cocojeff3

No, this box does not have any crypto hardware to use.

stephenw10

Hmm, OK well if I had that box here I would start by testing a clean install with a basic config and see if that still hits it. That would narrow the issue to either something unusual in the hardware or something specific to the config.

cocojeff3

I did do that when testing last weekend, and I can confirm that with a factory default config the CPU usage and load were greater on 2.7.1 and 2.7.2. This is not an issue with the hardware or with any specific post-installation configuration. This is an issue with the base system running 2.7.1 and 2.7.2 on this hardware. Is there some log or debug level that I can get you output for that might allow you to narrow down the issue, so that I can get this box back to running at normal utilization?