Tracing cause of cpu spike - SG1100

michmoor

Hello everyone
I am trying to track down the cause of increased CPU utilization on a newly deployed SG-1100.
Checking the metrics, something changed on 7/24.

Checking system activity, I see php-fpm taking up some CPU cycles. I restarted the PHP-FPM process from the CLI, but that didn't change anything. Any ideas?

Gertjan

@michmoor

A possible reason :

You are using pfBlockerNG.
As long as no one is using an IP or DNS(BL) name that is on one of the pfBlockerNG feeds, all goes well and all is quiet.
Then a new app, person, device, whatever, hits resources that are listed in one of the pfBlockerNG lists/feeds, and now pfBlockerNG wakes up.
Normally, IP filtering is done by pf, using lists built and generated by pfBlockerNG, so very few resources are consumed.
Same thing for DNSBL: unbound does all the heavy lifting, and pfBlockerNG does nothing except update and build a new main IP and DNSBL feed/list every xx hours (or days).

But as soon as someone or something starts to hit what is listed by pfBlockerNG, pfBlockerNG starts to do 'the other job': making nice charts, graphs and other GUI show stuff. And it does so using PHP.
And that will eat CPU cycles.

Solution: disconnect your LAN (success guaranteed), or locate the device that gives pfBlockerNG a lot of work and have a chat with the user.

michmoor

@Gertjan pretty sure this doesn’t have anything to do with traffic (blocked or allowed) based on the system resources screenshot I posted above.

Gertjan

@michmoor

Very possible.
I can't make up a story about Zabbix. I don't know what that is.
All I see is a static image ^^

All the other processes are "base" pfSense, I have the same.

Btw: leaving the pfSense dashboard open in a browser also uses a lot of resources. Same reason: building stats with PHP is not a CPU-friendly activity.

Edit: no processes (green) in your graph?

Mine, for comparison (a hotel, 80+ devices connected right now):

michmoor

@Gertjan
I didn't include processes, as there was no notable change there that coincides with the increase in user and system utilization.

keyser

@michmoor Be wary of the fact that the UI on a 2100 permanently uses about 25% of the CPU as long as you have a browser open/connected and showing parts of the pfSense UI. So you might just be seeing the CPU usage of an open web session to the pfSense UI.

Same goes for the SG-1100. They both have a very low-performance ARM-based CPU, so the PHP refresh is eating that CPU power.

michmoor

@keyser said in Tracing cause of cpu spike - SG1100:

Same goes for the SG-1100. They both have a very low-performance ARM-based CPU, so the PHP refresh is eating that CPU power.

I figured that much, but I don't think that's what's happening now. It's probably a bad idea to review system activity in the GUI on these low-end devices because of the issue you mentioned, so the nginx process I see running is just me being logged in.
I went in via the CLI, and below is what I'm seeing.

What sort of things increase user and system utilization on a processor, then, as it relates to pfSense?

last pid:  9678;  load averages:  0.86,  0.80,  0.77                                                                                                                       up 3+21:14:32  12:37:25
279 threads:   4 running, 257 sleeping, 18 waiting
CPU: 40.5% user,  0.0% nice,  7.8% system,  0.4% interrupt, 51.4% idle
Mem: 110M Active, 232M Inact, 512K Laundry, 260M Wired, 330M Free
ARC: 124M Total, 48M MFU, 69M MRU, 1648K Anon, 931K Header, 4162K Other
     95M Compressed, 252M Uncompressed, 2.66:1 Ratio

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root        187 ki31     0B    32K RUN      1  76.9H  60.90% [idle{idle: cpu1}]
   11 root        187 ki31     0B    32K RUN      0  76.8H  43.54% [idle{idle: cpu0}]
51256 root         68    0    69M    39M piperd   1  18:43   1.46% /usr/local/bin/php_pfb -f /usr/local/pkg/pfblockerng/pfblockerng.inc filterlog
63294 unbound      20    0    99M    63M kqread   0   4:46   1.33% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
    0 root        -12    -     0B  1216K -        0  21:10   0.88% [kernel{z_wr_iss}]
93399 root         20    0    14M  3692K CPU0     0   0:00   0.48% top -aSH
    0 root        -16    -     0B  1216K -        0   8:35   0.46% [kernel{z_wr_int}]
   17 root        -16    -     0B    16K mmcsd    0   9:56   0.40% [mmcsd0: mmc/sd card]
   12 root        -64    -     0B   256K WAIT     1  26:50   0.34% [intr{gic0,s42: mvneta0}]
   12 root        -60    -     0B   256K WAIT     1  13:05   0.31% [intr{swi1: netisr 1}]
12058 root         20    0    21M  7600K select   1   2:56   0.30% /usr/local/sbin/bfdd -d
63294 unbound      20    0    99M    63M kqread   1   5:34   0.26% /usr/local/sbin/unbound -c /var/unbound/unbound.conf{unbound}
    0 root        -60    -     0B  1216K -        1   9:30   0.18% [kernel{wg_tqg_1}]
    6 root         -8    -     0B   736K tx->tx   0   3:21   0.16% [zfskern{txg_thread_enter}]
   12 root        -60    -     0B   256K WAIT     0   6:24   0.10% [intr{swi1: netisr 0}]
   20 root        -16    -     0B    48K psleep   1   1:29   0.10% [pagedaemon{dom0}]

keyser

@michmoor Very good question indeed.
My first attempt would be to stop one service at a time to see if one of them is indirectly causing the issue.

I assume you have tried rebooting and the problem remains?

michmoor

@keyser Yep, I rebooted and the problem is still there. I'm going to disable pfBlocker, but I can't see that being the issue.

keyser

@michmoor Just stop the services one at a time. No need to disable them.

michmoor

@keyser @Gertjan
It was a process... it was pfBlocker. pfBlocker, Zabbix and FRR are the only packages of consequence on this 1100. Once I stopped pfBlocker, CPU idle shot back up to 95%.
From the monitoring graph alone you can see that system and user utilization dropped.

So that being the case... I have never deployed anything as small as an 1100, but I want to re-enable pfBlocker again.

Any concerns? This ARM CPU is working its butt off.

keyser

@michmoor Good to know the culprit

There is definitely something wrong with your pfBlocker config - perhaps a corrupted file or list download?
pfBlocker should never use even remotely that much CPU - even on an 1100.

I have a 2100 - same ARM CPU - and pfBlocker has a fairly advanced config on my box. CPU usage is only about 5-8% for the whole system unless I really start pushing traffic through. And pfBlocker never uses any CPU to speak of unless it's updating.

Look into the error log file in pfBlocker (can be done in the UI).
If the pfBlocker error log does not reveal the error, perhaps try disabling your list feeds one at a time.

michmoor

@keyser
The error logs are empty. I think this may just be a sizing issue in the end. The SG-1100 just isn't a powerful box, and having to check a block list before creating state might be the issue.
I don't think the list count is bad...

keyser

@michmoor That is 100% NOT the issue, since pfBlocker processes are not involved in any of the firewall/state stuff - that's pfSense doing that.
pfBlocker processes are responsible for fetching and parsing lists, creating firewall aliases out of the lists, and, if configured to, creating/sorting the firewall rules to include rules using those aliases. The other pfBlocker process is responsible for scraping the pfSense firewall log file for entries created by rules that contain pfBlocker aliases, to gather the information for all the statistics reports in the pfBlocker module.

To compare, I have these lists on an SG-2100 (same CPU) that sees about 5-8% CPU utilisation all day except when updating lists (once every night):

One guess could be that you have a LOT of firewall logging on your box (hundreds of entries a second). That logging need not be related to pfBlocker rules specifically, as pfBlocker still has to scrape every log entry. That would cause the pfBlocker log scraping to consume all that CPU. If that is the case, make sure to reconfigure your firewall to log A LOT less. Your eMMC (storage) will only last maybe 4 - 8 months before being dead/entering read-only mode if you are really hitting it with logs.
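
To put a number on "hundreds of entries a second", a rough Python sketch like the one below could count log lines per second. It is only an illustration: the function names are made up, the sample lines are invented, and it assumes classic BSD-syslog lines that begin with a 15-character "Mon DD HH:MM:SS" timestamp. If the box is set to a different log format, the timestamp check would need adjusting.

```python
from collections import Counter


def entries_per_second(lines):
    """Count log lines per one-second timestamp.

    Assumption: each line starts with a BSD-syslog stamp such as
    'Jul 24 12:37:25' (15 characters). Other lines are skipped.
    """
    counts = Counter()
    for line in lines:
        stamp = line[:15]
        if len(stamp) == 15 and stamp[9] == ":" and stamp[12] == ":":
            counts[stamp] += 1
    return counts


def peak_rate(lines):
    """Return the highest number of entries seen in any single second."""
    counts = entries_per_second(lines)
    return max(counts.values(), default=0)


# Invented sample lines, only to show the expected shape of the input.
sample = [
    "Jul 24 12:37:25 fw filterlog[1234]: hypothetical entry A",
    "Jul 24 12:37:25 fw filterlog[1234]: hypothetical entry B",
    "Jul 24 12:37:26 fw filterlog[1234]: hypothetical entry C",
    "a line without a timestamp is ignored",
]

print(peak_rate(sample))
```

On a real box the input would be the firewall log itself (for example, a copy of /var/log/filter.log opened with open() and passed in as the iterable of lines). A peak in the hundreds would support the heavy-logging theory, while a handful per second would point elsewhere.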

michmoor

@keyser Hmmm, OK. Then something is indeed wrong, but I have no clue what it is.
Traffic logging has been disabled for all firewall rules.
I'm not sure what else to turn off or disable for now.

Logging is enabled here, but I don't think this would have much impact on CPU.

The py_error.log file is empty.
The error.log file only has entries from 7/4 regarding a failed download.
Anything else you think I can/should check?

keyser

@michmoor Not really, but it could be related to the Unbound DNS <-> DNSBL integration. I would probably uncheck the "Keep Settings" checkbox and then remove the package completely (with all settings).
After a reboot and confirming everything is peachy, I would install the package again - check everything - do basic config (no DNSBL or lists) - check everything - enable IP blocking - check everything - configure feeds - check everything - enable DNSBL and so on and so on.

That way you can determine at which point in the configuration process the CPU usage arises.
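
The step-by-step approach above can be sketched in a few lines of Python: note the CPU idle figure after each configuration step and pick out the first step where it drops. Everything here is illustrative. The function name, the idle readings and the 90% threshold are invented, and only the step names come from the list above.

```python
def first_regression(steps, idle_after, min_idle=90.0):
    """Return the first step whose idle reading fell below min_idle.

    steps      -- configuration steps, in the order they were applied
    idle_after -- CPU idle percentage noted after each step
    min_idle   -- hypothetical threshold for a healthy idle figure
    Returns None if every reading stayed at or above the threshold.
    """
    for step, idle in zip(steps, idle_after):
        if idle < min_idle:
            return step
    return None


steps = [
    "basic config (no DNSBL or lists)",
    "enable IP blocking",
    "configure feeds",
    "enable DNSBL",
]

# Invented readings, one per step, purely to show how the helper behaves.
idle_after = [95.0, 94.5, 94.0, 52.0]

print(first_regression(steps, idle_after))
```

Taking each reading from the CLI rather than the GUI avoids counting the CPU used by an open web session, as mentioned earlier in the thread.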

michmoor

@keyser Took the advice and re-installed pfBlocker without keeping settings. So far so good. I have no idea what was wrong with the previous configuration. I'll keep monitoring, but so far it looks good. Strange one indeed.