Netgate Discussion Forum

    High load/interrupt and CPU usage

    General pfSense Questions
    • arainero

      Hello everyone,

      I logged into my pfSense box at the datacenter today and noticed higher than normal temperatures along with an unusually high load. I SSH'd into the firewall and saw that the system load was averaging around 12. The first thing I did was disable ntopng, since it was using ~20% CPU according to top. After that the load dropped to around 9 and has stayed there.

      After running top -aSH I got:

      last pid: 74909;  load averages: 10.11, 10.37,  9.96                                                                                       up 19+12:52:25  19:10:02
      256 processes: 14 running, 158 sleeping, 4 zombie, 80 waiting
      CPU:  0.7% user,  0.0% nice,  0.4% system, 65.0% interrupt, 33.9% idle
      Mem: 33M Active, 1057M Inact, 1015M Wired, 202M Cache, 898M Buf, 5580M Free
      Swap: 16G Total, 16G Free
      
        PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
         12 root     -92    -     0K  1360K WAIT    3  28.3H  46.78% [intr{irq260: t5nex0:0}]
         12 root     -92    -     0K  1360K CPU4    4  28.1H  45.90% [intr{irq265: t5nex0:0}]
         12 root     -92    -     0K  1360K WAIT    7  28.0H  43.80% [intr{irq266: t5nex0:0}]
         12 root     -92    -     0K  1360K WAIT    6  28.3H  43.16% [intr{irq261: t5nex0:0}]
         12 root     -92    -     0K  1360K WAIT    1  28.2H  43.16% [intr{irq263: t5nex0:0}]
         12 root     -92    -     0K  1360K WAIT    5  28.2H  41.70% [intr{irq264: t5nex0:0}]
         12 root     -92    -     0K  1360K WAIT    5  28.3H  38.38% [intr{irq262: t5nex0:0}]
         12 root     -92    -     0K  1360K WAIT    0  28.1H  36.47% [intr{irq259: t5nex0:0}]
         11 root     155 ki31     0K   128K CPU6    6 419.4H  36.28% [idle{idle: cpu6}]
         11 root     155 ki31     0K   128K RUN     2 418.8H  34.86% [idle{idle: cpu2}]
         11 root     155 ki31     0K   128K RUN     5 419.3H  33.98% [idle{idle: cpu5}]
         11 root     155 ki31     0K   128K CPU0    0 417.2H  33.79% [idle{idle: cpu0}]
         11 root     155 ki31     0K   128K RUN     3 419.2H  33.59% [idle{idle: cpu3}]
         11 root     155 ki31     0K   128K RUN     7 419.5H  32.37% [idle{idle: cpu7}]
         11 root     155 ki31     0K   128K RUN     1 418.8H  32.08% [idle{idle: cpu1}]
         11 root     155 ki31     0K   128K CPU4    4 419.4H  29.98% [idle{idle: cpu4}]
         12 root     -72    -     0K  1360K WAIT    3 817:35  21.68% [intr{swi1: pfsync}]
         12 root     -92    -     0K  1360K WAIT    4 212:37   8.15% [intr{irq310: igb3:que}]
         12 root     -92    -     0K  1360K WAIT    5 214:54   7.08% [intr{irq284: igb0:que}]
         12 root     -92    -     0K  1360K WAIT    3 204:46   6.40% [intr{irq291: igb1:que}]
         12 root     -92    -     0K  1360K WAIT    1 220:40   6.30% [intr{irq280: igb0:que}]
         12 root     -92    -     0K  1360K WAIT    4 206:17   6.30% [intr{irq301: igb2:que}]
         12 root     -92    -     0K  1360K WAIT    3 206:09   5.86% [intr{irq300: igb2:que}]
         12 root     -92    -     0K  1360K WAIT    2 237:13   5.66% [intr{irq299: igb2:que}]
         12 root     -92    -     0K  1360K WAIT    1 226:07   5.66% [intr{irq307: igb3:que}]
         12 root     -92    -     0K  1360K WAIT    5 209:29   5.47% [intr{irq293: igb1:que}]
         12 root     -92    -     0K  1360K WAIT    2 205:22   5.27% [intr{irq290: igb1:que}]
      
      

      It appears that t5nex0 is using a lot of CPU time. I bought the http://store.pfsense.org/C2758/ with the 10Gb Chelsio card, and t5nex0 appears to be the Chelsio card. We've had problems in the past where the fiber cable plugged into the Chelsio card would go up/down repeatedly, and the only way to resolve that was to physically unplug and replug the cable. That problem would only crop up every month or so. I ended up replacing the twinax cable and it hasn't happened since, though I can't say for certain that the new cable fixed the problem entirely. I never noticed this kind of load before, and I'm wondering whether the two are related.
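
      (For reference, the commands I used to confirm where the interrupts are coming from; the t5nex/dev node names are just what the cxgbe driver registers on this box, so adjust if yours differ.)

      # Per-IRQ totals and rates since boot -- the t5nex0 rows are the Chelsio queues
      vmstat -i | grep -E 't5nex|igb'

      # Dump the Chelsio adapter's driver sysctls (node name may vary per install)
      sysctl dev.t5nex.0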

      systat -vmstat 1

      Mem:KB    REAL            VIRTUAL                       VN PAGER   SWAP PAGER
              Tot   Share      Tot    Share    Free           in   out     in   out
      Act  193852    7628  1346892     8108 5918148  count
      All  316032   11752  1389380    35388          pages
      Proc:                                                            Interrupts
        r   p   d   s   w   Csw  Trp  Sys  Int  Sof  Flt        ioflt  286k total
        1          74      446k  10k 5970 277k  18k 7712    739 cow         uart0 4
                                                           6947 zfod        ehci0 23
       2.7%Sys  68.0%Intr  2.0%User  0.0%Nice 27.4%Idle         ozfod  1123 cpu0:timer
      |    |    |    |    |    |    |    |    |    |           %ozfod       t5nex0:evt
      =++++++++++++++++++++++++++++++++++>                      daefr  8179 t5nex0:0.0
                                             136 dtbuf     2363 prcfr  7986 t5nex0:0.1
      Namei     Name-cache   Dir-cache     85468 desvn     9434 totfr  8210 t5nex0:0.2
         Calls    hits   %    hits   %     52334 numvn          react  8378 t5nex0:0.3
          2223    2223 100                  1408 frevn          pdwak  7663 t5nex0:0.4
                                                             13 pdpgs  7918 t5nex0:0.5
      Disks   md0  ada0 pass0                                   intrn  8406 t5nex0:0.6
      KB/t   0.00  0.00  0.00                           1039488 wire   8037 t5nex0:0.7
      tps       0     0     0                             35516 act    6824 igb0:que 0
      MB/s   0.00  0.00  0.00                           1082412 inact  6668 igb0:que 1
      %busy     0     0     0                            206340 cache  6490 igb0:que 2
                                                        5711808 free   6552 igb0:que 3
                                                        1030960 buf    6708 igb0:que 4
                                                                       6774 igb0:que 5
                                                                       6638 igb0:que 6
                                                                       6676 igb0:que 7
                                                                            igb0:link
                                                                       6204 igb1:que 0
                                                                       6513 igb1:que 1
                                                                       6436 igb1:que 2
                                                                       6669 igb1:que 3
                                                                       6495 igb1:que 4
      
      

      netstat -m

      131257/16253/147510 mbufs in use (current/cache/total)
      131221/3287/134508/1000000 mbuf clusters in use (current/cache/total/max)
      131221/3122 mbuf+clusters out of packet secondary zone in use (current/cache)
      8128/751/8879/13291 4k (page size) jumbo clusters in use (current/cache/total/max)
      0/0/0/3938 9k jumbo clusters in use (current/cache/total/max)
      64/0/64/2215 16k jumbo clusters in use (current/cache/total/max)
      328792K/13641K/342433K bytes allocated to network (current/cache/total)
      0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
      0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
      0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
      0/0/0 requests for jumbo clusters denied (4k/9k/16k)
      0 requests for sfbufs denied
      0 requests for sfbufs delayed
      0 requests for I/O initiated by sendfile
      
      

      ps auxw | grep '[p]hp'

      root     7815   1.1  0.4 228324 32464  -  S     7:54PM      0:00.17 php-fpm: pool lighty (php-fpm)
      root      243   0.0  0.3 224228 21308  -  Ss   16Sep15      1:16.62 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
      root    23896   0.0  0.5 228324 38064  -  S     7:55PM      0:00.07 php-fpm: pool lighty (php-fpm)
      
      

      Some notes:
      We aren't pushing any unusual traffic today. I have not rebooted anything yet since this machine is in production; I can try a reboot after hours. Our failover firewall has no load (as expected).
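
      Also worth noting: the {swi1: pfsync} thread in the top output above is sitting around 20%, so the CARP/pfsync side may be part of the picture. A quick sanity check (nothing conclusive, just the standard commands) would be something like:

      ifconfig pfsync0    # confirm the sync interface/syncdev is up
      pfctl -s info       # state table size and insert/removal rates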

      Thanks for the help!

      • KOM

        It's your NICs interrupting the CPU.  Try going to System - Advanced - Networking.  In Network Interfaces, uncheck Enable device polling.
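
        You can also double-check from a shell whether polling is actually active on a NIC; if it is, POLLING shows up in the options= line (interface names here are just examples):

        ifconfig igb0 | grep options
        ifconfig cxl0 | grep options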

        • arainero

          Thanks for the reply.

          After going to System > Advanced > Networking, Enable device polling was already unchecked. Should I try something else?

          • KOM

            Sorry, I think I had it backwards.  If you have high interrupts, try enabling polling to see if it calms down somewhat.
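
            (If the GUI checkbox doesn't seem to change anything, polling can also be toggled per interface from the shell for drivers that support it -- the interface name below is just an example:)

            ifconfig igb0 polling     # enable device polling on igb0
            ifconfig igb0 -polling    # turn it back off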

            • heper

              i personally haven't had much luck by enabling polling. it just slowed things down in my case.

              have you tried some of the tuning tips here? http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr
              i don't own 10gbe equipment, but i don't think it can hurt to try.
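
              from memory, that page is mostly loader tunables for the cxgbe driver, e.g. in /boot/loader.conf.local (the values below are only examples, check the page and cxgbe(4) for what fits your hardware):

              # /boot/loader.conf.local -- example cxgbe tunables
              hw.cxgbe.nrxq10g="4"                  # rx queues per 10G port
              hw.cxgbe.ntxq10g="4"                  # tx queues per 10G port
              hw.cxgbe.holdoff_timer_idx_10G="2"    # interrupt holdoff timer index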

              • arainero

                @KOM:

                Sorry, I think I had it backwards.  If you have high interrupts, try enabling polling to see if it calms down somewhat.

                I enabled it and the load didn't move too much in either direction. Thank you for the suggestion.

                @heper:

                i personally haven't had much luck by enabling polling. it just slowed things down in my case.

                have you tried some of the tuning tips here? http://bsdrp.net/documentation/examples/forwarding_performance_lab_of_a_hp_proliant_dl360p_gen8_with_10-gigabit_with_10-gigabit_chelsio_t540-cr
                i don't own 10gbe equipment, but i don't think it can hurt to try.

                I will have to take a look at this. Thanks for providing it.

                On another note, the load appears to be around 5-6 now. I believe this is related to lower usage; towards the end of the business day our traffic tapers off significantly, but we only do ~300 Mbps at peak anyway.
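
                If I want to confirm the load really follows traffic, I can watch per-interface throughput in one-second samples (assuming the Chelsio port shows up as cxl0 here):

                netstat -w 1 -I cxl0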
