LACP connection resulting in high CPU interrupt load



  • I'm running pfSense 2.2.4 on an x64 box (Intel Xeon E3-1225 v3 with 4 GB RAM).
    It's an appliance built for pfSense, so it's kitted out with one Intel 'em' interface and seven 'igb' interfaces.
    It's running in bridging mode with Snort.

    I've been running a similar spec box for many months with no problems.
    However, I'm running into a problem with LACP and high CPU interrupt load on this new box.
    (LACP wasn't used on the old box).

    On our WAN side we have a single 100 Mb/s connection, although we often only use 20% of it.

    On the LAN side I've configured two of the igb NICs as a LAGG in LACP mode, which connect to two ports on a Cisco switch (configured for active LACP).

    This LACP aggregation 'appears' to cause high interrupt load on the CPU.

    Here is an example of the system activity:
    last pid: 18567;  load averages:  5.31,  5.05,  5.28  up 0+02:05:29    14:07:52
    179 processes: 11 running, 117 sleeping, 51 waiting

    Mem: 581M Active, 285M Inact, 272M Wired, 1024K Cache, 327M Buf, 2714M Free
    Swap:

    PID USERNAME PRI NICE  SIZE    RES STATE  C  TIME    WCPU COMMAND
      12 root    -92    -    0K  896K CPU0    0  85:25  74.27% [intr{irq277: igb2:que}]
      12 root    -92    -    0K  896K CPU1    1  84:46  72.56% [intr{irq278: igb2:que}]
      12 root    -92    -    0K  896K CPU2    2  70:36  41.26% [intr{irq279: igb2:que}]
    20168 root    103  20  1185M  774M RUN    0  7:20  38.09% /usr/local/bin/snort -R 50239 -D -l /var/l
      11 root    155 ki31    0K    64K RUN    2  33:58  36.67% [idle{idle: cpu2}]
      12 root    -92    -    0K  896K WAIT    3  62:42  35.25% [intr{irq280: igb2:que}]
      11 root    155 ki31    0K    64K RUN    3  39:24  34.18% [idle{idle: cpu3}]
      11 root    155 ki31    0K    64K RUN    1  26:32  21.48% [idle{idle: cpu1}]
      11 root    155 ki31    0K    64K RUN    0  25:44  10.35% [idle{idle: cpu0}]
        0 root    -92    0    0K  640K -      1  10:48  9.47% [kernel{igb2 que}]
        0 root    -92    0    0K  640K -      1  11:19  9.38% [kernel{igb2 que}]
      12 root    -92    -    0K  896K WAIT    2  4:17  3.08% [intr{irq269: igb0:que}]
      12 root    -92    -    0K  896K RUN    1  4:17  3.08% [intr{irq268: igb0:que}]
      12 root    -92    -    0K  896K WAIT    0  0:13  2.39% [intr{irq287: igb4:que}]
      12 root    -92    -    0K  896K WAIT    3  4:15  1.46% [intr{irq270: igb0:que}]
      12 root    -92    -    0K  896K RUN    0  3:41  1.46% [intr{irq267: igb0:que}]
    36951 root      21    0  223M 40564K piperd  0  0:00  0.20% php-fpm: pool lighty (php-fpm)
      19 root    -16 ki-1    0K    16K pollid  3  11:35  0.00% [idlepoll]

    As you might guess igb2 is one of the ports in the LAGG (the other being igb1).
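To put a number on how lopsided the interrupt load is, here is a minimal Python sketch that sums the WCPU column per NIC for the `intr` threads in the listing above. The sample lines are copied from that output; the regular expression and the function name `interrupt_cpu_by_nic` are illustrative assumptions, not part of any pfSense tool.

```python
import re
from collections import defaultdict

# Interrupt-thread lines copied from the top output in the post above.
TOP_LINES = """
   12 root    -92    -    0K  896K CPU0    0  85:25  74.27% [intr{irq277: igb2:que}]
   12 root    -92    -    0K  896K CPU1    1  84:46  72.56% [intr{irq278: igb2:que}]
   12 root    -92    -    0K  896K CPU2    2  70:36  41.26% [intr{irq279: igb2:que}]
   12 root    -92    -    0K  896K WAIT    3  62:42  35.25% [intr{irq280: igb2:que}]
   12 root    -92    -    0K  896K WAIT    2  4:17  3.08% [intr{irq269: igb0:que}]
   12 root    -92    -    0K  896K RUN    1  4:17  3.08% [intr{irq268: igb0:que}]
   12 root    -92    -    0K  896K WAIT    0  0:13  2.39% [intr{irq287: igb4:que}]
   12 root    -92    -    0K  896K WAIT    3  4:15  1.46% [intr{irq270: igb0:que}]
   12 root    -92    -    0K  896K RUN    0  3:41  1.46% [intr{irq267: igb0:que}]
"""

# Matches "<wcpu>% [intr{irqNNN: <nic>:que}]" near the end of a line.
INTR_RE = re.compile(r"(\d+\.\d+)%\s+\[intr\{irq\d+:\s*(\w+):que\}\]")


def interrupt_cpu_by_nic(text):
    """Sum the WCPU percentage of all interrupt threads, grouped by NIC name."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = INTR_RE.search(line)
        if match:
            wcpu, nic = match.groups()
            totals[nic] += float(wcpu)
    return dict(totals)


totals = interrupt_cpu_by_nic(TOP_LINES)
for nic, pct in sorted(totals.items()):
    print(f"{nic}: {pct:.2f}% WCPU across its interrupt threads")
```

In this sample almost all of the interrupt time lands on igb2, while igb1, the other LAGG member, does not show up in the listing at all. That suggests comparing per-port traffic counters before blaming LACP itself.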

    I've tried enabling device polling, which someone mentioned in another post had helped with LACP. It increases system CPU usage slightly but changes little else.

    Checksum offloading and TCP offloading don't help (I was scraping the bottom of the barrel trying those, and I did reboot).

    I also tried adding a tunable for 'adaptive interrupt moderation', enabled with a value of 1, which was my best shot. This didn't appear to have any effect on interrupt load.

    Anything I might have missed, or is LACP a problem for CPU loads?

    I could try alternative LAGG configurations, such as Cisco's 'fec', loadbalance or roundrobin LAGG types, but I don't know if they would be any better.
    Our WAN connection is due to be upgraded to a 2 Gb/s connection at some point (using LACP), so I'd hoped to use LACP directly on pfSense.



  • Have you tried disabling Snort while this is happening?

    Also, how much traffic are you running over the LAGG at that interrupt rate?



  • Anything I might have missed, or is LACP a problem for CPU loads?

    This depends more on which CPU and NICs you are using.
    A server-grade CPU like a Xeon E3 and server-grade NICs like Intel server adapters
    should not really be the problem, as I see it.

    I could try alternative LAGG configurations, such as Cisco's 'fec', loadbalance or roundrobin LAGG types

    Can you please describe the whole chain of devices? For example:
    Internet --- ISP --- Modem --- Cisco Router --- transparent pfSense --- LAN Switch

    Using a static LAG with round robin would be a solution, but then the settings on both sides
    must match each other exactly! Just try it out.
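To make the difference between the LAGG modes concrete, here is a toy Python model of how outgoing packets end up on the member ports. It is a sketch under stated assumptions: the CRC32 flow hash stands in for whatever header hash the real driver uses, and the flows and packet counts are made up for illustration. Only the port names igb1 and igb2 come from the thread.

```python
import zlib
from collections import Counter

PORTS = ["igb1", "igb2"]  # LAGG members named in the thread


def hash_port(src, dst):
    """Pick a member port from a hash of the flow endpoints (toy stand-in hash)."""
    key = f"{src}->{dst}".encode()
    return PORTS[zlib.crc32(key) % len(PORTS)]


def round_robin_ports(n_packets):
    """Assign packets to member ports in strict rotation, ignoring flows."""
    return [PORTS[i % len(PORTS)] for i in range(n_packets)]


# Hypothetical traffic mix: one heavy flow plus a few light ones (packet counts).
flows = {
    ("10.0.0.5", "192.0.2.1"): 9000,
    ("10.0.0.6", "192.0.2.2"): 300,
    ("10.0.0.7", "192.0.2.3"): 200,
    ("10.0.0.8", "192.0.2.4"): 100,
}

# Hash-based modes (lacp, loadbalance): every packet of a flow uses one port.
hashed = Counter()
for (src, dst), packets in flows.items():
    hashed[hash_port(src, dst)] += packets

# Round robin: packets alternate between ports regardless of flow.
rr = Counter(round_robin_ports(sum(flows.values())))

print("hash-based :", dict(hashed))
print("round robin:", dict(rr))
```

With a hash-based mode, a single heavy flow stays on one member port, so one NIC can carry nearly everything. Round robin evens out the packet counts but can deliver a flow's packets out of order. Note that this only models traffic leaving pfSense; how the Cisco switch spreads traffic towards pfSense is decided by the switch's own load-balancing setting.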

    Our WAN connection is due to be upgraded to a 2 Gb/s connection at some point (using LACP), so I'd hoped to use LACP directly on pfSense.

    Are these two 1 GBit/s lines, or one 10 GBit/s line limited to 2 GBit/s?
    Or is this an MLPPP (MPLS) service from your provider?

    Normally, bonding or building LAGs on the WAN interface will not really work unless you
    place a switch in front of the pfSense box or get an MLPPP (MPLS) service from your ISP.



  • Thanks, BlueKobold

    I think you've nailed it. Our setup is:
    Internet --- ISP --- Cisco Router --- transparent pfSense --- LAN Switch

    We have a single 100Mb/s link from the Cisco router to pfSense.
    It's only the link to the LAN switches that had the LAGG, which was 2x 1Gb/s links in LACP.
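As a quick worked example of the mismatch, the Python sketch below compares the link speeds quoted in this thread. The figures (100 Mb/s WAN, 2x 1 Gb/s LAGG, about 20% typical WAN use) come from the posts; the variable names are illustrative, and the calculation only covers traffic that crosses the bridge between the WAN and LAN sides.

```python
WAN_MBPS = 100            # single link from the Cisco router to pfSense
LAGG_MEMBER_MBPS = 1000   # each LAN-side LAGG member
LAGG_MEMBERS = 2
TYPICAL_WAN_USE = 0.20    # "we often only use 20%" of the WAN link

# Nominal aggregate capacity of the LAN-side LAGG.
lagg_mbps = LAGG_MEMBER_MBPS * LAGG_MEMBERS

# Traffic crossing the bridge is limited by the slower of the two sides.
path_mbps = min(WAN_MBPS, lagg_mbps)

# Fraction of the LAGG that bridged traffic can ever occupy.
lagg_share = path_mbps / lagg_mbps

# Typical bridged load, going by the 20% figure.
typical_mbps = WAN_MBPS * TYPICAL_WAN_USE

print(f"LAGG capacity:       {lagg_mbps} Mb/s")
print(f"Bridged path limit:  {path_mbps} Mb/s ({lagg_share:.0%} of the LAGG)")
print(f"Typical bridged use: {typical_mbps:.0f} Mb/s")
```

On these numbers a single 1 Gb/s LAN port already has ten times the capacity of the WAN link, so the LAGG adds no throughput for bridged traffic until the WAN side is upgraded.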

    I'm guessing it's this mismatch that is throwing things out.
    I'll wait for the router to be upgraded first before testing matching LACP LAGG on both sides.

    I'm going to remove the LAGG group from our LAN to go down to a single NIC both sides of the pfSense bridge.



  • I think you've nailed it. Our setup is:
    Internet --- ISP --- Cisco Router --- transparent pfSense --- LAN Switch

    Running in transparent mode may be a fine thing, but bridging ports together
    often brings more than one failure or problem into play, such as:

    • port flapping
    • packet loss
    • packet drop

    We have a single 100Mb/s link from the Cisco router to pfSense.

    In my eyes, this is the bottleneck here.

    It's only the link to the LAN switches that had the LAGG, which was 2x 1Gb/s links in LACP.

    As suggested, try the round robin method to keep the pipe filled constantly.

    I'm guessing it's this mismatch that is throwing things out.
    I'll wait for the router to be upgraded first before testing matching LACP LAGG on both sides.

    That would be best in my eyes too! Or go with 10 GBit/s from the router to the pfSense box and
    10 GBit/s from pfSense to the LAN switch, which would be better still in my eyes.

    I'm going to remove the LAGG group from our LAN to go down to a single NIC both sides of the pfSense bridge.

    Ok