    Upgrade to 2.3.1 causes network performance degradation (High CPU usage by NIC)

    General pfSense Questions
    • jgm

      Hi,

      Currently we have a 2-node pfSense system working in active/passive HA. This cluster was running pfSense v2.2.3; recently we upgraded the slave node to v2.3.1-1 and forced a failover from master to slave in order to test things.

      This system runs a mix of normal pf filtering, some IPsec tunnels, and an HAProxy instance exposing a few (~10) frontends and backends.

      However, the upgraded node (when running as master) shows a clear network performance degradation: while node-1 (the one still running v2.2.3) can easily forward traffic at 250+ Mb/s, the alternate node (the one running v2.3) tops out at ~80 Mb/s.

      While diagnosing the issue we've found that the node running pfSense v2.3 has a high load under such 'low' traffic (i.e. 80 Mb/s), with high CPU usage by the network drivers, as shown below:

      [2.3.1-RELEASE][root@]/root: top -nCHSIzs1
      last pid: 28317;  load averages:  4.07,  4.23,  4.37  up 2+11:40:04    16:22:50
      311 processes: 9 running, 282 sleeping, 20 waiting

      Mem: 31M Active, 502M Inact, 385M Wired, 883M Buf, 5020M Free
      Swap:

      PID USERNAME  PRI NICE  SIZE    RES STATE  C  TIME    CPU COMMAND
          0 root      -92    -    0K  240K CPU1    1  21.8H  99.37% kernel{nfe0 taskq}
          0 root      -92    -    0K  240K CPU2    2  29.4H  73.29% kernel{em0 taskq}
          0 root      -92    -    0K  240K CPU0    0  18.6H  44.78% kernel{em1 taskq}
        12 root      -72    -    0K  336K WAIT    0  65:15  14.60% intr{swi1: netisr 0}
        438 nobody      22    0 30184K  4404K select  3 121:18  4.79% dnsmasq
      28430 root        21    0 43756K 17440K kqread  3  51:40  1.46% haproxy
        12 root      -72    -    0K  336K WAIT    0  31:51  1.37% intr{swi1: pfsync}
      90479 root        20    0 25720K  7176K select  2  23:43  0.59% openvpn
      49607 root        20    0 14516K  2320K select  0  28:31  0.29% syslogd
      30713 root        20    0 16676K  2736K bpf    0  18:55  0.10% filterlog
      28317 root        21    0 21856K  2992K CPU2    2  0:00  0.10% top

      Obviously, firewall rules, service configuration, IPsec tunnels, etc. are configured the same on both nodes, and we've compared system values (tunables, /boot/loader.conf*, and runtime sysctl values) across both nodes, so it looks like an issue related to pfSense 2.3 kernel code changes/enhancements.
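
      (For reference, a minimal sketch of one way to do that comparison from a workstation, assuming SSH access to both nodes; node1/node2 are placeholder hostnames:)

      ssh root@node1 'sysctl -a | sort' > /tmp/sysctl-2.2.3.txt   # node still on v2.2.3
      ssh root@node2 'sysctl -a | sort' > /tmp/sysctl-2.3.1.txt   # upgraded node
      diff -u /tmp/sysctl-2.2.3.txt /tmp/sysctl-2.3.1.txt | less  # review differing tunables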

      Both nodes are identical hardware, consisting of:

      FreeBSD 10.3-RELEASE-p3 #2 1988fec(RELENG_2_3_1): Wed May 25 14:14:46 CDT 2016
          root@ce23-amd64-builder:/builder/pfsense-231/tmp/obj/builder/pfsense-231/tmp/FreeBSD-src/sys/pfSense amd64
      FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
      CPU: Dual-Core AMD Opteron™ Processor 2216 (2393.69-MHz K8-class CPU)
        Origin="AuthenticAMD"  Id=0x40f12  Family=0xf  Model=0x41  Stepping=2
        Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
        Features2=0x2001<SSE3,CX16>
        AMD Features=0xea500800<SYSCALL,NX,MMX+,FFXSR,RDTSCP,LM,3DNow!+,3DNow!>
        AMD Features2=0x1f<LAHF,CMP,SVM,ExtAPIC,CR8>
        SVM: NAsids=64
      real memory  = 6442450944 (6144 MB)
      avail memory = 6194679808 (5907 MB)
      Event timer "LAPIC" quality 400
      ACPI APIC Table: <SUN    X4200 M2>
      FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
      FreeBSD/SMP: 2 package(s) x 2 core(s)
      cpu0 (BSP): APIC ID:  0
      cpu1 (AP): APIC ID:  1
      cpu2 (AP): APIC ID:  2
      cpu3 (AP): APIC ID:  3

      The network interface cards available at both nodes are as follows:
      nfe0: NVIDIA nForce4 CK804 MCP9 Networking Adapter
      nfe1: NVIDIA nForce4 CK804 MCP9 Networking Adapter
      em0: Intel(R) PRO/1000 (82546EB)
      em1: Intel(R) PRO/1000 (82546EB)
      lagg0: LACP lagg with em0, em1 & nfe0 attached.
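
      (For reference, which lagg members are currently active/collecting can be checked with ifconfig:)

      ifconfig lagg0 | grep laggport    # shows ACTIVE/COLLECTING/DISTRIBUTING flags per member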

      The pfSense version running on node-1 (the one not yet upgraded) is:
      [2.2.3-RELEASE][root@]/root: uname -a
      FreeBSD  10.1-RELEASE-p13 FreeBSD 10.1-RELEASE-p13 #0 c77d1b2(releng/10.1)-dirty: Tue Jun 23 17:00:47 CDT 2015    root@pfs22-amd64-builder:/usr/obj.amd64/usr/pfSensesrc/src/sys/pfSense_SMP.10  amd64

      The pfSense version running on node-2 (the upgraded one) is:
      [2.3.1-RELEASE][root@]/root: uname -a
      FreeBSD xxxx.aaa.com 10.3-RELEASE-p3 FreeBSD 10.3-RELEASE-p3 #2 1988fec(RELENG_2_3_1): Wed May 25 14:14:46 CDT 2016    root@ce23-amd64-builder:/builder/pfsense-231/tmp/obj/builder/pfsense-231/tmp/FreeBSD-src/sys/pfSense  amd64

      Here are some additional statistics we've collected, in case they help with diagnosis:

      Packets with errors on the em0 and em1 interfaces
      [2.3.1-RELEASE][root@]/root: sysctl dev.em.0.mac_stats. | grep -E 'buff|missed'
      dev.em.0.mac_stats.recv_no_buff: 28924720
      dev.em.0.mac_stats.missed_packets: 1109472
      [2.3.1-RELEASE][root@]/root: sysctl dev.em.1.mac_stats. | grep -E 'buff|missed'
      dev.em.1.mac_stats.recv_no_buff: 2873803
      dev.em.1.mac_stats.missed_packets: 79003
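
      (A quick way to see whether those counters keep climbing under load is to poll the same OIDs in a loop; the 10-second interval here is arbitrary:)

      while true; do
          date
          sysctl dev.em.0.mac_stats.recv_no_buff dev.em.0.mac_stats.missed_packets
          sleep 10
      done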

      Network errors
      [2.3.1-RELEASE][root@]/root: netstat -ihw 1
                  input        (Total)          output
         packets  errs idrops      bytes    packets  errs      bytes colls
             52k    30      0        32M        55k     0        36M     0
             46k     0      0        24M        48k     0        30M     0
             50k    64      0        31M        54k     0        35M     0
             45k    35      0        26M        48k     0        31M     0
             48k    19      0        28M        52k     0        33M     0
             45k     2      0        29M        48k     0        33M     0
             50k     0      0        30M        53k     0        35M     0
             50k     0      0        33M        53k     0        37M     0
             43k     9      0        28M        45k     0        32M     0
             53k    12      0        34M        56k     0        39M     0
             50k     0      0        30M        53k     0        34M     0
             44k     0      0        26M        47k     0        30M     0

      The number of interrupts is very high for the NICs
      [2.3.1-RELEASE][root@]/root: vmstat -i
      interrupt                          total      rate
      irq44: nfe1                    68575583        318
      irq4: uart0                        2298          0
      irq14: ata0                      143914          0
      irq20: ohci0                          26          0
      irq21: ehci0                          2          0
      irq22: nfe0                    225192527      1044
      irq56: em0                    121230546        562
      irq57: em1                    305005131      1414
      cpu0:timer                    242940061      1126
      irq256: mpt0                      763114          3
      cpu1:timer                      95960989        445
      cpu2:timer                    135271696        627
      cpu3:timer                    133771488        620
      Total                        1328857375      6164

      We have been investigating whether a package in pfSense 2.3.1 or FreeBSD 10.3-RELEASE-p3 could be causing the problem, but we have been unable to determine the cause. Any suggestions?

      Thanks

      • cmb

        The missed_packets and recv_no_buff counts there are huge. What were the other counters relative to those at the time? What specifically do you have set in loader.conf(.local)?

        • jgm

          Hi

          This is the loader.conf.local config:

          cat /boot/loader.conf.local
          net.inet.tcp.tso=0
          kern.ipc.nmbclusters=1000000
          legal.intel_iwi.license_ack=1
          legal.intel_ipw.license_ack=1

          # maximum number of interrupts per second on any interrupt level (vmstat -i for
          # total rate). If you still see Interrupt Storm detected messages, increase the
          # limit to a higher number and look for the culprit.  For 10gig NIC's set to
          # 9000 and use large MTU. (default 1000)
          hw.intr_storm_threshold="9000"
          #hw.em.enable_msix=0
          #hw.pci.enable_msi=0
          #hw.pci.enable_msix=0
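
          (Assuming nothing overrides them later in boot, these tunables can be checked against the running kernel with sysctl:)

          sysctl hw.intr_storm_threshold kern.ipc.nmbclusters net.inet.tcp.tso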

          Yesterday we rebooted the pfSense v2.2.3 node and put it in passive mode, but these are its counters from when we forced traffic through it (only the em0 interface is active):

          sysctl dev.em.0.mac_stats.
          dev.em.0.mac_stats.tso_ctx_fail: 0
          dev.em.0.mac_stats.tso_txd: 0
          dev.em.0.mac_stats.tx_frames_1024_1522: 155735
          dev.em.0.mac_stats.tx_frames_512_1023: 380
          dev.em.0.mac_stats.tx_frames_256_511: 2094
          dev.em.0.mac_stats.tx_frames_128_255: 315482
          dev.em.0.mac_stats.tx_frames_65_127: 1882913
          dev.em.0.mac_stats.tx_frames_64: 71453
          dev.em.0.mac_stats.mcast_pkts_txd: 6253
          dev.em.0.mac_stats.bcast_pkts_txd: 336
          dev.em.0.mac_stats.good_pkts_txd: 2428057
          dev.em.0.mac_stats.total_pkts_txd: 2441152
          dev.em.0.mac_stats.good_octets_txd: 448722915
          dev.em.0.mac_stats.good_octets_recvd: 3137400406
          dev.em.0.mac_stats.rx_frames_1024_1522: 1850689
          dev.em.0.mac_stats.rx_frames_512_1023: 259833
          dev.em.0.mac_stats.rx_frames_256_511: 111457
          dev.em.0.mac_stats.rx_frames_128_255: 83528
          dev.em.0.mac_stats.rx_frames_65_127: 865248
          dev.em.0.mac_stats.rx_frames_64: 137739
          dev.em.0.mac_stats.mcast_pkts_recvd: 1233699
          dev.em.0.mac_stats.bcast_pkts_recvd: 19355
          dev.em.0.mac_stats.good_pkts_recvd: 3308494
          dev.em.0.mac_stats.total_pkts_recvd: 3313623
          dev.em.0.mac_stats.xoff_txd: 9107
          dev.em.0.mac_stats.xoff_recvd: 0
          dev.em.0.mac_stats.xon_txd: 3988
          dev.em.0.mac_stats.xon_recvd: 0
          dev.em.0.mac_stats.coll_ext_errs: 0
          dev.em.0.mac_stats.alignment_errs: 0
          dev.em.0.mac_stats.crc_errs: 0
          dev.em.0.mac_stats.recv_errs: 0
          dev.em.0.mac_stats.recv_jabber: 0
          dev.em.0.mac_stats.recv_oversize: 0
          dev.em.0.mac_stats.recv_fragmented: 0
          dev.em.0.mac_stats.recv_undersize: 0
          dev.em.0.mac_stats.recv_no_buff: 266406
          dev.em.0.mac_stats.missed_packets: 5129
          dev.em.0.mac_stats.defer_count: 0
          dev.em.0.mac_stats.sequence_errors: 0
          dev.em.0.mac_stats.symbol_errors: 0
          dev.em.0.mac_stats.collision_count: 0
          dev.em.0.mac_stats.late_coll: 0
          dev.em.0.mac_stats.multiple_coll: 0
          dev.em.0.mac_stats.single_coll: 0
          dev.em.0.mac_stats.excess_coll: 0

          • cmb

            Wow, that's a really significant percentage of your total traffic ending up in no_buff. The next thing I'd try is taking nfe0 out of the lagg, since it's just using em0 anyway, and see what that does. Maybe even leave the lagg with just em0 as the only member. It doesn't look right that nfe0 would have the most load if em0 was the only active interface in the lagg; there shouldn't have been any work to do on nfe0 in that case.
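
            (For context, 266406 recv_no_buff against 3313623 total_pkts_recvd is roughly 8% of the packets em0 received. A rough sketch of testing the lagg change from a shell; on pfSense the permanent change normally belongs under Interfaces > (assign) > LAGGs, so the ifconfig commands below are only a temporary test:)

            ifconfig lagg0 -laggport nfe0    # drop nfe0 from the lagg
            ifconfig lagg0 -laggport em1     # optionally leave em0 as the only member
            ifconfig lagg0                   # verify the remaining laggport entries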

            • Guest

              However, the upgraded node (when running as master) shows a clear network performance degradation: while node-1 (the one still running v2.2.3) can easily forward traffic at 250+ Mb/s, the alternate node (the one running v2.3) tops out at ~80 Mb/s.

              Well, how can I put this and still be friendly? If I had bought a Windows Server 2008 machine together with its hardware and now wanted to install Windows Server 2012 R2 on it, I might well find that the hardware is no longer a good match for the newer software version. In the Windows world we know this and live with it. Why not with FreeBSD and pfSense too? As a customer and user of pfSense I would of course love to see even newer things, such as Intel QuickAssist, AES-NI support and DPDK or netmap-fwd, but then I can't really expect to avoid buying new hardware or upgrading this hardware to something close to the current state. Not nicely said, but that's the truth from my point of view.

              While diagnosing the issue we've found that the node running pfSense v2.3 has a high load under such 'low' traffic (i.e. 80 Mb/s), with high CPU usage by the network drivers, as shown below:

              Perhaps, and I only mean perhaps, they are working on newer drivers or making older drivers match current hardware better, and compared to older hardware that is then often not really a gain and things don't play well together. Perhaps you could think about a newer board, a stronger CPU or SoC, and/or more or faster RAM? I really don't know, and I am not a professional like cmb and the others, but often new hardware does the trick for many years, let's say the next 5 or 6 years.

              Any suggestions?

              To be honest with you, I would stay with the 64-bit version 2.2.6, but even that depends on the circumstances and the effects seen in each pfSense system. Some are really severe, like your 250 Mb/s vs. 80 Mb/s, but other strange points would also make me say wait for pfSense, let's say, 2.4 or higher. And if things are still not going better for you and your company by then, I would very urgently think about a hardware upgrade.
