100% CPU load after upgrading from 2.4.5-p1 to 2.6.0 on some firewalls

unico-dm

Hi folks. Maybe you can help us with this one. We see 100% CPU load on some upgraded firewalls. This renders them unusable and affects our network.

What are we doing? At the moment we are rolling out pfSense version 2.6.0 to all our firewall pairs. We do that as an in-place upgrade from 2.4.5-p1. Short version of the procedure: We set the target firewall to persistent maintenance mode, switch off any sync between the pairs and then start the upgrade.

What's wrong? Half a dozen upgrades went well - these firewalls are working fine and are performing. But now we've encountered strange behaviour after upgrading two firewalls. Symptoms: The affected firewalls are not reachable most of the time and we see network issues.

We suspect this is because of two things.

Most of the time CPU load is 100%. The following processes use up all CPU resources:

/usr/local/bin/dpinger
/usr/sbin/syslogd
[kernel{if_io_tqg_4}]
[intr{swi1: pfsync}]

Exception: Sometimes CPU load is OK for 2-3 hours, but then everything starts again.

Log analysis shows gateway errors. All gateways on all interfaces show high latency or even timeouts. I can also see CARP events (MASTER/BACKUP flapping even with maintenance mode on, hence the network issues). The gateways themselves are fine and reachable, so there is no reason why the firewall shouldn't be able to ping them.

What have we done so far? Set up a firewall from scratch with config.xml. With the fresh setup the firewall behaves normally. But with the imported configuration the issues start.

Next steps: We are now setting the firewalls up bit by bit to see which part of the configuration triggers the problem. But as the configuration is quite large, this is very time-consuming.
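A bit-by-bit search like this can usually be shortened by bisecting: restore half of the configuration sections, check whether the load problem appears, and keep halving. The following is only a minimal sketch of that idea. The section names and the `is_bad` check are hypothetical placeholders (in practice the check would be a manual restore-and-observe step), and it assumes a single section is responsible and that sections can be restored independently, which may not hold where rules reference aliases.

```python
from typing import Callable, List, Sequence


def find_culprit(
    sections: Sequence[str],
    is_bad: Callable[[Sequence[str]], bool],
) -> str:
    """Binary-search for the one section that triggers the problem.

    Assumption: exactly one section is responsible, and is_bad(subset)
    returns True whenever that section is part of the subset.
    """
    candidates: List[str] = list(sections)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if is_bad(half):
            # The culprit is in the half we just tested.
            candidates = half
        else:
            # The culprit must be in the untested remainder.
            candidates = candidates[len(half):]
    return candidates[0]
```

With n sections this needs roughly log2(n) restore-and-observe rounds instead of up to n.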

Maybe one of you folks can help us find the cause a bit faster. Is there any known issue that explains this behaviour? Any kind of configuration that could trigger high CPU load? Or how could we analyze further? Thanks in advance for your help! Much appreciated!

SteveITS

@unico-dm re upgrade process, yours sounds complicated. Netgate recommends upgrading the backup, failing over, then upgrading the primary, and that's what we’ve always done.
https://docs.netgate.com/pfsense/en/latest/install/upgrade-guide-ha.html

2.4 to 2.6 is a pretty big skip…

If gzip was running I’d suggest turning off log compression but that’s not usually necessary unless it’s a slow CPU.

unico-dm

@steveits

Upgrade procedure is as you've described it. Sorry I didn't make that clear.

Upgrade from 2.4.5-p1 to 2.6.0 is a big step, yep.

Processors are Intel Xeon D-1518 and Intel Xeon E5-2600 v4. Our workload usually bores the CPU, so no issue here. (Similar upgraded hardware is still bored on 2.6.0 :) Will try to disable compression to ease analysis.

I will gather more info according to https://docs.netgate.com/pfsense/en/latest/troubleshooting/high-cpu-load.html

stephenw10

Install the System Patches package. In the list of recommended patches apply the pfcounter patch for this bug.
In some situations that can use all the CPU if it gets stuck in a reload loop.

Steve

unico-dm

@stephenw10 When we installed the patch, the symptoms were completely gone. So thanks a lot for pointing us in the right direction!

So, case solved.

Additional info:

We think we know why this happens only on those two firewalls. They happen to be the ones with by far the most rules and aliases in our environment.
We couldn't pinpoint it at first because the 15-minute reload interval kept the load at maximum, so we couldn't see the interval (= the underlying mechanism) at all. But we could trigger the behavior in a "calm phase" by editing and applying a random rule (and thus triggering a reload). The load would then stay up for several minutes.
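For anyone who wants to check which of their firewalls might be similarly exposed, rule and alias counts can be compared across config.xml backups. This is a rough sketch, not a tested tool: it assumes firewall rules are stored as `<rule>` elements under `<filter>` and aliases as `<alias>` elements under `<aliases>`, directly below the root element, so verify that against your own backup first. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def count_rules_and_aliases(config_xml: str) -> Dict[str, int]:
    """Count filter rules and aliases in the text of a config.xml backup.

    Assumption: rules live at <root>/filter/rule and aliases at
    <root>/aliases/alias. Adjust the paths if your backup differs.
    """
    root = ET.fromstring(config_xml)
    return {
        "rules": len(root.findall("./filter/rule")),
        "aliases": len(root.findall("./aliases/alias")),
    }
```

Reading each backup with `open(path, encoding="utf-8").read()` and comparing the resulting counts side by side should show quickly which boxes stand out.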