Netgate Discussion Forum

    High Load Average after upgrade [SOLVED]

    • C
      Cursorkeys
      last edited by

      I've just upgraded to 2.3.1 and I'm seeing high load averages on a fairly beefy server.

      I've taken some readings and the CPUs look fairly idle, so I guess it's waiting on some kind of I/O:

      top -SH

      last pid: 13233;  load averages:  1.06,  1.08,  1.06                                                                            up 12+02:56:29  11:45:44
      271 processes: 18 running, 207 sleeping, 46 waiting
      CPU:  1.8% user,  0.9% nice,  2.3% system,  0.5% interrupt, 94.6% idle
      Mem: 731M Active, 12G Inact, 1564M Wired, 1396K Cache, 1579M Buf, 1818M Free
      Swap: 32G Total, 53M Used, 32G Free

      PID USERNAME PRI NICE  SIZE    RES STATE  C  TIME    WCPU COMMAND
        11 root    155 ki31    0K  256K CPU10  10 286.8H 100.00% idle{idle: cpu10}
        11 root    155 ki31    0K  256K CPU0    0 284.0H 100.00% idle{idle: cpu0}
        11 root    155 ki31    0K  256K CPU5    5 283.4H 100.00% idle{idle: cpu5}
        11 root    155 ki31    0K  256K CPU12  12 282.7H 100.00% idle{idle: cpu12}
        11 root    155 ki31    0K  256K CPU1    1 282.5H 100.00% idle{idle: cpu1}
        11 root    155 ki31    0K  256K CPU7    7 282.3H 100.00% idle{idle: cpu7}
        11 root    155 ki31    0K  256K CPU8    8 281.2H 100.00% idle{idle: cpu8}
        11 root    155 ki31    0K  256K CPU6    6 282.1H  99.76% idle{idle: cpu6}
        11 root    155 ki31    0K  256K RUN    2 281.5H  99.66% idle{idle: cpu2}
        11 root    155 ki31    0K  256K RUN    4 283.6H  98.19% idle{idle: cpu4}
        11 root    155 ki31    0K  256K CPU14  14 279.8H  97.75% idle{idle: cpu14}
        11 root    155 ki31    0K  256K CPU13  13 283.0H  97.46% idle{idle: cpu13}
        11 root    155 ki31    0K  256K CPU15  15 284.5H  97.27% idle{idle: cpu15}
        11 root    155 ki31    0K  256K CPU9    9 281.2H  96.29% idle{idle: cpu9}
        11 root    155 ki31    0K  256K CPU3    3 281.6H  95.36% idle{idle: cpu3}
        11 root    155 ki31    0K  256K CPU11  11 282.0H  87.70% idle{idle: cpu11}
      2653 root      49  20 13280M 11296M bpf    11  67:07  15.87% snort{snort}
      1844 squid    28    0  571M  321M kqread 11 574:16  14.89% squid
      23142 clamav    28    0  496M  420M uwait  11  0:06  8.40% clamd{clamd}
          0 root    -92    -    0K  224K -      4 331:51  8.06% kernel{em1 taskq}
          0 root    -92    -    0K  224K -      14 669:43  7.76% kernel{em0 taskq}
      23142 clamav    24    0  496M  420M uwait  13  0:14  1.56% clamd{clamd}
      23142 clamav    23    0  496M  420M uwait  7  0:02  1.56% clamd{clamd}
      23142 clamav    21    0  496M  420M select 12 241:02  1.07% clamd{clamd}
      6517 root      52    0  262M 20960K accept  3  0:00  0.88% php-fpm
      24716 clamav    20    0 55376K  4476K select  6  0:00  0.78% c-icap{c-icap}
        12 root    -60    -    0K  736K WAIT    0  48:07  0.68% intr{swi4: clock}
      25828 root      20    0 21856K  3740K CPU0    0  0:04  0.49% top
      23142 clamav    22    0  496M  420M uwait  2  0:03  0.39% clamd{clamd}
        12 root    -72    -    0K  736K WAIT  10  74:00  0.00% intr{swi1: netisr 10}
          8 root    -16    -    0K    32K psleep  4  56:11  0.00% pagedaemon{pagedaemon}
          6 root    -16    -    0K    16K pftm    3  43:27  0.00% pf purge
      90087 root      20    0  236M 29284K nanslp  4  37:50  0.00% php
        20 root      16    -    0K    16K syncer 14  30:48  0.00% syncer
      80674 root      20    0 60724K  9412K kqread  4  19:11  0.00% lighttpd_pfb
        14 root      -8    -    0K    48K -      6  17:17  0.00% geom{g_up}
        19 root      20    -    0K    32K sdflus 14  13:15  0.00% bufdaemon{/ worker}
      9492 root      20    0 16676K  1860K bpf    2  9:38  0.00% filterlog
      57199 root      52  20  260M  147M urdlck 13  9:11  0.00% snort{snort}
        15 root    -16    -    0K    16K -      3  7:56  0.00% rand_harvestq
        12 root    -88    -    0K  736K WAIT    8  5:34  0.00% intr{irq28: aac0}
      74829 root      20    0 14516K  1920K select 13  5:12  0.00% syslogd
      68737 root      52    0  262M 11756K kqread  4  4:54  0.00% php-fpm
          0 root    -16    -    0K  224K swapin 15  4:12  0.00% kernel{swapper}
      12301 root      20    0 19108K  2056K nanslp  0  3:48  0.00% dpinger{dpinger}
      11892 root      20    0 19108K  2056K nanslp  1  3:36  0.00% dpinger{dpinger}
      23142 clamav    20    0  496M  420M select  9  3:19  0.00% clamd{clamd}
      47516 root      20    0 30140K 17968K select  9  3:16  0.00% ntpd{ntpd}
      30682 squid    20    0 37752K  3884K select  8  3:10  0.00% pinger
      17206 root      20    0 28608K  2844K kqread  8  3:04  0.00% nginx
      56396 squid    20    0 37752K  3892K select  5  2:50  0.00% pinger
      16877 root      20    0 28608K  2940K kqread  1  2:46  0.00% nginx
      87981 squid    20    0 37752K  3892K select 13  2:40  0.00% pinger
      72469 squid    20    0 37752K  3892K select  1  2:22  0.00% pinger
      64219 squid    20    0 37752K  3892K select  6  2:08  0.00% pinger
      43200 unbound  20    0  224M  153M kqread  5  1:59  0.00% unbound{unbound}
      72460 squid    20    0 37752K  3892K select 14  1:56  0.00% pinger
      62363 squid    20    0 37752K  3920K select 15  1:49  0.00% pinger
      55801 clamav    20    0 25292K  2936K select  4  1:40  0.00% c-icap
      65037 squid    20    0 37752K  3928K select  2  1:32  0.00% pinger

      iostat 15

      tty          aacd0              md0              cd0            cpu
      tin  tout  KB/t tps  MB/s  KB/t tps  MB/s  KB/t tps  MB/s  us ni sy in id
        0    5 61.47  7  0.42  0.00  0  0.00  0.00  0  0.00  1  1  1  0 97
        0    16 50.81  6  0.30  0.00  0  0.00  0.00  0  0.00  1  1  2  0 96
        0    5 57.38  5  0.27  0.00  0  0.00  0.00  0  0.00  1  1  2  0 96
        0    5 52.39  6  0.31  0.00  0  0.00  0.00  0  0.00  1  0  1  0 97
        0    5 60.76  5  0.28  0.00  0  0.00  0.00  0  0.00  1  1  2  0 97
        0    5 48.44  6  0.28  0.00  0  0.00  0.00  0  0.00  1  1  2  0 96
        0    5 52.83  5  0.27  0.00  0  0.00  0.00  0  0.00  1  1  2  0 96
        0    5 51.45  7  0.33  0.00  0  0.00  0.00  0  0.00  1  0  1  0 98
        0    5 54.21  5  0.24  0.00  0  0.00  0.00  0  0.00  1  1  2  0 95

      vmstat 15

      procs      memory      page                    disks    faults        cpu
      r b w    avm    fre  flt  re  pi  po    fr  sr aa0 md0  in  sy  cs us sy id
      0 0 7  30634M  1819M  1099  0  1  1  1305 1057  0  0 2260  160 2216  2  1 97
      0 0 7  30621M  1820M  3619  0  0  0  4018 296  6  0 5939 8469 16927  2  1 96
      2 0 7  30621M  1820M  2459  0  0  0  2745 313  5  0 5852 6947 16585  2  2 96
      0 0 7  30621M  1824M  3396  0  0  0  3913 316  5  0 5906 8224 16667  3  3 94
      0 0 7  30642M  1819M  3813  0  0  1  4282 316  3  0 6131 9414 17290  2  2 96

      gstat -I 15s

      dT: 15.003s  w: 15.000s
      L(q)  ops/s    r/s  kBps  ms/r    w/s  kBps  ms/w  %busy Name
          0    12      0      0    0.0    12    477    0.7    0.3| aacd0
          0      0      0      0    0.0      0      0    0.0    0.0| cd0
          0    12      0      0    0.0    12    477    0.8    0.3| aacd0s1
          0    12      0      0    0.0    12    477    0.8    0.3| aacd0s1a
          0      0      0      0    0.0      0      0    0.0    0.0| aacd0s1b
          0    12      0      0    0.0    12    477    0.8    0.3| ufsid/575956061402fb6c
          0      0      0      0    0.0      0      0    0.0    0.0| label/swap0
          0      0      0      0    0.0      0      2    0.1    0.0| md0

      netstat -Lan

      Current listen queue sizes (qlen/incqlen/maxqlen)
      Proto Listen        Local Address
      tcp4  0/0/128        127.0.0.1.3128
      tcp4  0/0/128        192.168.158.88.3128
      tcp4  0/0/128        *.1344
      tcp4  0/0/128        127.0.0.1.953
      tcp4  0/0/128        *.53
      tcp6  0/0/128        *.53
      tcp4  0/0/128        *.8443
      tcp4  0/0/128        *.8081
      tcp6  0/0/128        *.80
      tcp4  0/0/128        *.80
      tcp4  0/0/128        *.22
      tcp6  0/0/128        *.22
      Netgraph sockets
      Type  Recv-Q Send-Q Node Address  #Hooks
      ctrl      0      0 [7e]:            0
      ctrl      0      0 [48]:            0
      unix  0/0/30        /var/run/qstats
      unix  0/0/128        /var/run/clamav/clamd.sock
      unix  0/0/128        /var/run/php-fpm.socket
      unix  0/0/5          /var/run/dpinger_###ROUTER01_192.168.158.88_192.168.158.119.sock
      unix  0/0/5          /var/run/dpinger_GW_WAN_###.###.###.###_###.###.###.###.sock
      unix  0/0/4          /var/run/devd.pipe
      unix  0/0/30        /var/run/check_reload_status
      unix  0/0/4          /var/run/devd.seqpacket.pipe

      Nothing seems to be jumping out at me; can anyone with a more trained eye see what the bottleneck is?

      Edit: More interesting things in the log files:

      resolver log entries

      Jun 21 14:36:17 pfSense unbound: [43200:9] notice: sendmsg failed: No buffer space available
      ---A good run of these---

      gateway log messages

      Jun 22 09:25:57 pfSense dpinger: xxxROUTER01 192.168.158.119: sendto error: 55
      ---for all 1000 entries---

      192.168.158.119 is a router. It can be successfully pinged from the pfSense box.
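
      For anyone reading along, error 55 is ENOBUFS ('No buffer space available'), so the obvious things to check are the network memory pools and the per-interface drop counters; a quick sketch of what I mean:

      netstat -m         # mbuf/cluster usage plus any denied/delayed allocation requests
      netstat -i -d      # per-interface input/output errors and drops
      netstat -s -p ip   # IP-level counters, including packets dropped for lack of buffers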

      • C
        Cursorkeys
        last edited by

        Edit: Nope, scratch that. The WCPU-gobbling processes are indeed the virtual 'idle' processes, so none of the CPUs are actually busy at all.

        Any suggestions as to what to consider next are gladly accepted!

        Edit2:

        Seeing this while doing a test ping:

        [2.3.1-RELEASE][root@pfSense.xxxxxxxx.co.uk]/root: ping 192.168.158.119
        PING 192.168.158.119 (192.168.158.119): 56 data bytes
        64 bytes from 192.168.158.119: icmp_seq=0 ttl=64 time=35.216 ms
        64 bytes from 192.168.158.119: icmp_seq=1 ttl=64 time=22.264 ms
        64 bytes from 192.168.158.119: icmp_seq=2 ttl=64 time=28.457 ms
        64 bytes from 192.168.158.119: icmp_seq=3 ttl=64 time=48.814 ms
        64 bytes from 192.168.158.119: icmp_seq=4 ttl=64 time=33.435 ms
        64 bytes from 192.168.158.119: icmp_seq=5 ttl=64 time=35.962 ms
        64 bytes from 192.168.158.119: icmp_seq=6 ttl=64 time=14.627 ms
        64 bytes from 192.168.158.119: icmp_seq=7 ttl=64 time=38.668 ms
        ping: sendto: No buffer space available
        ping: sendto: No buffer space available
        64 bytes from 192.168.158.119: icmp_seq=10 ttl=64 time=36.417 ms
        64 bytes from 192.168.158.119: icmp_seq=11 ttl=64 time=52.092 ms
        64 bytes from 192.168.158.119: icmp_seq=12 ttl=64 time=22.819 ms
        64 bytes from 192.168.158.119: icmp_seq=13 ttl=64 time=38.137 ms
        64 bytes from 192.168.158.119: icmp_seq=14 ttl=64 time=34.070 ms
        64 bytes from 192.168.158.119: icmp_seq=15 ttl=64 time=23.527 ms
        64 bytes from 192.168.158.119: icmp_seq=16 ttl=64 time=44.394 ms
        64 bytes from 192.168.158.119: icmp_seq=17 ttl=64 time=15.814 ms
        64 bytes from 192.168.158.119: icmp_seq=18 ttl=64 time=38.633 ms
        ^C
        --- 192.168.158.119 ping statistics ---
        19 packets transmitted, 17 packets received, 10.5% packet loss
        round-trip min/avg/max/stddev = 14.627/33.138/52.092/10.356 ms

        Sooo, am I somehow running out of mbufs?

        netstat -m looks fine, though:

        [2.3.1-RELEASE][root@pfSense.xxxxxxxx.co.uk]/root: netstat -m
        2261/15079/17340 mbufs in use (current/cache/total)
        1375/8335/9710/26584 mbuf clusters in use (current/cache/total/max)
        1375/8309 mbuf+clusters out of packet secondary zone in use (current/cache)
        6/3066/3072/13291 4k (page size) jumbo clusters in use (current/cache/total/max)
        0/0/0/3938 9k jumbo clusters in use (current/cache/total/max)
        0/0/0/2215 16k jumbo clusters in use (current/cache/total/max)
        3390K/32703K/36094K bytes allocated to network (current/cache/total)
        0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
        0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
        0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
        0/0/0 requests for jumbo clusters denied (4k/9k/16k)
        0 requests for sfbufs denied
        0 requests for sfbufs delayed
        97 requests for I/O initiated by sendfile
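
        (The lines to watch there are the 'requests for mbufs denied/delayed' counters, and they all stay at zero. A quick way to keep an eye on them while reproducing the failing pings, as a sketch:)

        # print the in-use counts and any denied/delayed requests every 5 seconds
        while true; do date; netstat -m | egrep 'in use|denied|delayed'; sleep 5; done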

        Edit 3:

        Jacking up kern.ipc.somaxconn, net.local.stream.sendspace, net.local.stream.recvspace, and net.inet.udp.recvspace seems to have stopped the error 55 messages.

        Doing that has increased the load average, though; it's over 1.5 now.
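
        For reference, those runtime changes amount to something like the following, applied with sysctl so no reboot is needed (a sketch using the values I eventually settled on, listed in the next post):

        sysctl kern.ipc.somaxconn=4096
        sysctl net.local.stream.sendspace=164240
        sysctl net.local.stream.recvspace=164240
        sysctl net.inet.udp.recvspace=65536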

        • C
          Cursorkeys
          last edited by

          Right, I seem to have been able to solve this.

          loader.conf.local

          kern.ipc.nmbclusters="131072"
          net.inet.tcp.tso="0"
          kern.ipc.somaxconn="4096"
          net.inet.tcp.recvspace="65228"
          net.inet.tcp.sendspace="65228"
          hw.em.rxd="4096"
          hw.em.txd="4096"
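
          (These are loader tunables, so they only take effect after a reboot. A sketch of how to confirm the sysctl-backed ones were picked up afterwards:)

          sysctl kern.ipc.nmbclusters kern.ipc.somaxconn
          sysctl net.inet.tcp.tso net.inet.tcp.recvspace net.inet.tcp.sendspace
          kenv | grep hw.em        # the hw.em.rxd/txd ring sizes only appear in the kernel environment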

          System Tunables

          net.inet.tcp.sendbuf_max = 16777216       (set autotuning maximum to at least 16 MB)
          net.inet.tcp.recvbuf_max = 16777216       (set autotuning maximum to at least 16 MB)
          net.inet.tcp.sendbuf_auto = 1             (enable send/recv autotuning)
          net.inet.tcp.recvbuf_auto = 1             (enable send/recv autotuning)
          net.inet.tcp.sendbuf_inc = 524288         (autotuning step size)
          net.inet.tcp.recvbuf_inc = 524288         (autotuning step size)
          net.inet.tcp.slowstart_flightsize = 64    (Squid optimisation: better to increase the slow-start flightsize than to disable delayed ACKs; default = 1 --> 64, 262144/1460 = max number)
          net.inet.udp.recvspace = 65536            (optimised)
          net.local.stream.recvspace = 164240       (optimised: 10 x (MTU 16384 + 40))
          net.local.stream.sendspace = 164240       (optimised: 10 x (MTU 16384 + 40))
          kern.ipc.somaxconn = 4096                 (optimised for Squid)
          net.inet.tcp.mssdflt = 1460               (optimised; default = 512 --> 1460)
          net.inet.tcp.inflight.min = 6144          (FreeBSD manual recommends 6144)
          net.inet.tcp.local_slowstart_flightsize = 10   (loopback optimised for MTU 16384; see net.local.stream.*space)
          net.inet.tcp.nolocaltimewait = 1          (loopback optimised)
          net.inet.tcp.delayed_ack = 1              (optimised; see net.inet.tcp.slowstart_flightsize)
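
          If anyone wants to try these before committing them as System Tunables in the GUI, the same values can be set from a shell for testing; a sketch, and note that one or two of the older knobs (the slowstart/inflight ones) may not exist on every FreeBSD release, in which case sysctl will simply report an unknown OID:

          sysctl net.inet.tcp.sendbuf_max=16777216 net.inet.tcp.recvbuf_max=16777216
          sysctl net.inet.tcp.sendbuf_auto=1 net.inet.tcp.recvbuf_auto=1
          sysctl net.inet.tcp.sendbuf_inc=524288 net.inet.tcp.recvbuf_inc=524288
          sysctl net.inet.udp.recvspace=65536
          sysctl net.local.stream.recvspace=164240 net.local.stream.sendspace=164240
          sysctl kern.ipc.somaxconn=4096
          sysctl net.inet.tcp.mssdflt=1460
          sysctl net.inet.tcp.nolocaltimewait=1 net.inet.tcp.delayed_ack=1
          sysctl net.inet.tcp.slowstart_flightsize=64 net.inet.tcp.local_slowstart_flightsize=10 net.inet.tcp.inflight.min=6144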

          This now gives the following under full 20/20 Mb load, which is much more in line with what I would expect from such a powerful server:

          Load average
          0.04, 0.16, 0.19

          I must say the support here was disappointing. Could no-one suggest anything? Even stating 'this is likely a FreeBSD issue, so look there' would have been preferable to just being ignored.

          • H
            heper
            last edited by

            I have no clue what exactly the issue here is: maybe mbufs, maybe I/O, maybe NIC tuning.
            But you do realize that a load of 1.06 on a 16-core system is only around 6-7% CPU load (1.06 / 16 ≈ 0.066), if it were CPU related at all? So there is nothing to be concerned about if everything works as intended.
            The new web GUI is more resource intensive as well; does the load drop if you close the GUI?

            Squid and clamd can be resource hogs ... does removing them change anything?

            About the lack of support:
            you posted on a Friday and today is Monday ... most forum members and all of the paid staff are fairly idle over the weekends.

            • C
              Cursorkeys
              last edited by

              Thanks for the reply. I couldn't seem to pin it down with the tools I knew how to use; someone more experienced with FreeBSD could possibly have made a more useful diagnosis. All the tools showed that everything was OK. I'm still unaware of a way to check what hardware interrupts and low-level driver activity are doing on a FreeBSD system; I suspect that might have shown something unusual.
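
              (Something like vmstat -i or systat might be the right tool for that, but I haven't had a chance to dig in; a sketch of what I mean:)

              vmstat -i          # interrupt totals and rates per device/IRQ
              systat -vmstat 1   # live view of interrupts, context switches and disk activity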

              I think you are right about the NIC tuning; nothing made a big difference by itself, but after tweaking the heck out of all of those tunables I got the results above.

              I didn't check the load without the GUI; I'm sure that's contributing.

              I didn't try without Squid and clamd, but Squid did feel 'slow' before: time to initial page load on non-cached domains was on the order of 5 seconds. You can barely notice the 'waiting for' message in the browser now.
              I didn't realise when I started that the load average is calculated differently to Linux. This has made Squid perform better, though, so possibly something was overloaded in some way.

              Admittedly, I wasn't able to go about this very scientifically; I just browsed this forum and tried every tuning trick I could see that might apply to my system.

              Sorry if I seemed rude; I wasn't trying to be. My initial post was on the 22nd of June, not last Friday, so I was feeling a little bit frustrated.

              • H
                heper
                last edited by

                Oh, June ;)

                Be sure to bump a topic every week if nobody has replied to it :)
