Periodic packet loss + constant CARP switchovers with Intel i350 NICs (igb)



  • I'm running PFSense 2.2.4 and recently upgraded to original Intel i350-T4v2 and i350-T2v2 based NICs (using the igb driver) + Core i7 + 12GB RAM in a CARP configuration. After the upgrade I noticed massive periodic packet loss every few minutes. I finally solved the problem; the solution is further down.

    The pattern of connectivity from the client looked like:
    5 minutes of routing without any packet loss
    a few seconds of massive packet loss (30%, 60%, 100%)
    8 minutes of routing without any packet loss
    a few seconds of massive packet loss (30%, 60%, 100%)
    6 minutes of routing without any packet loss
    etc. etc.

    This behavior caused the other member of my CARP cluster to fail over every time, which made things even worse, resulting in a huge number of dropped packets and VPN disconnects. Switched back to original hardware and began a long debugging session.

    I totally reinstalled PFSense on different hardware with a minimal config, disabled CARP, LACP/Lagg and had a very bare-bones setting running. The only thing I took over were the new Intel NICs, which should work without a hitch, right? They are listed in the official shop as accessories, so no problems there. Wrong.

    Problems with measuring accuracy:
    This was the first time I had to deal with a major case of packet loss, and I learned the hard way that you must disable every energy-saving feature on your Windows client before testing with Ping and Iperf. Otherwise you get erroneous readings of background-noise packet loss, which threw me off the correct trail for way too long. I was measuring the actual packet loss plus a few percent of packets lost to energy saving every other second, which masked the true problem. The solution is to simply turn on the "Maximum Performance" energy profile, which disables the green nonsense altogether. Doing this I got reproducible test results using Iperf in UDP mode, which enabled me to eliminate component after component.
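
    For anyone repeating this kind of measurement, here is a rough sketch for spotting the loss bursts in a ping log. It assumes a Unix-like client whose ping prints `icmp_seq=N` on every reply line (Windows ping output looks different), and `find_ping_gaps` is just a name made up for this example.

```shell
#!/bin/sh
# find_ping_gaps: read ping output on stdin and print each run of missing
# sequence numbers as "lost FIRST-LAST (COUNT packets)".
find_ping_gaps() {
    awk '
    {
        # Only reply lines carry an icmp_seq=N token; skip everything else.
        if (match($0, /icmp_seq=[0-9]+/)) {
            # "icmp_seq=" is 9 characters long, the digits follow it.
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (seen && seq > prev + 1)
                printf "lost %d-%d (%d packets)\n", prev + 1, seq - 1, seq - prev - 1
            prev = seq
            seen = 1
        }
    }'
}
```

    Usage would be something like `ping -c 600 192.0.2.1 | find_ping_gaps` (the address is a placeholder). Packets lost after the last reply and sequence-number wrap-around are not handled, so treat the output as a hint, not a measurement.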

    Looking for the problem:
    Surprisingly there were no missing packets counted when looking at PFSense:
    sysctl dev.igb
    -> dev.igb.0.mac_stats.missed_packets: 0
    But I definitely had massive packet loss during the short outages, pinging from inside to the internet router behind PFSense.

    So I suspected the HP Procurve 5406zl switch to be the culprit (which turned out to be false). After narrowing it down I came to the conclusion that it must indeed be the NICs in the PFSense boxes themselves. But WHY?

    I looked all over the internet and could not find any setting that would solve the problem. The last thing I looked into was how to compile the newest Intel driver into PFSense (turns out the driver included in PFSense works just fine).
    When looking at the README of the official FreeBSD igb driver from Intel I came across the following obscure setting:

    EEE
    ---
    Valid Range:  0-1
    Default Value: 1 (enabled)

    A link between two EEE-compliant devices will result in periodic bursts of
    data followed by long periods where in the link is in an idle state. This Low
    Power Idle (LPI) state is supported in both 1Gbps and 100Mbps link speeds.

    NOTE: EEE support requires autonegotiation.

    Eureka!
    https://en.wikipedia.org/wiki/Energy-Efficient_Ethernet

    PFSense didn't count any packets that were thrown away because the NIC was powered down during those events, due to power saving stuff! Argh. The packets never reached a high enough layer to trigger any logging! Had I used an older NIC not supporting EEE or a cheaper switch not supporting EEE the problem would have never come to the surface. Oh well.

    I disabled the EEE setting with the following command (Diagnostics -> Command Prompt), and the problems were gone instantly:
    sysctl dev.igb.0.eee_disabled=1
    sysctl dev.igb.1.eee_disabled=1
    sysctl dev.igb.2.eee_disabled=1
    sysctl dev.igb.3.eee_disabled=1
    sysctl dev.igb.4.eee_disabled=1
    sysctl dev.igb.5.eee_disabled=1
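
    With more ports, typing one line per interface gets tedious. A minimal sketch that generates the commands instead, assuming the igb interfaces are numbered 0 to COUNT-1 as on my boxes; `eee_disable_cmds` is a made-up helper name, and note that sysctl sets a value with name=value.

```shell
#!/bin/sh
# eee_disable_cmds COUNT: print one sysctl command per igb interface,
# numbered 0 .. COUNT-1. Nothing is executed here; this is a dry run.
eee_disable_cmds() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "sysctl dev.igb.$i.eee_disabled=1"
        i=$((i + 1))
    done
}
```

    `eee_disable_cmds 6` prints the six commands for review; piping that into `sh` on the firewall would apply them.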

    To make the setting persistent across reboots, I created the following under Advanced -> System Tunables. One entry for each interface.
    I tried putting these into /boot/loader.conf.local but was unsuccessful. You need to run the commands individually for each interface, as far as I could tell.
    dev.igb.0.eee_disabled, value=1
    dev.igb.1.eee_disabled, value=1
    dev.igb.2.eee_disabled, value=1
    dev.igb.3.eee_disabled, value=1
    dev.igb.4.eee_disabled, value=1
    dev.igb.5.eee_disabled, value=1

    You can verify current status of the setting by entering the command
    sysctl dev.igb | grep eee
    There should be some lines similar to this:
    dev.igb.0.eee_disabled: 1
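
    If you have many interfaces, a small filter can point out any port where EEE is still active. This is only a sketch: it assumes the `name: value` output format shown above, and `eee_still_enabled` is a hypothetical helper name.

```shell
#!/bin/sh
# eee_still_enabled: read "sysctl dev.igb" output on stdin and print the
# name of every eee_disabled entry whose value is not 1.
# Exit status is 1 if at least one such entry was found, 0 otherwise.
eee_still_enabled() {
    awk -F': ' '
        /\.eee_disabled: / && $2 != 1 { print $1; bad = 1 }
        END { exit bad }
    '
}
```

    Run it as `sysctl dev.igb | eee_still_enabled`; no output means every listed interface has EEE disabled.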

    After this I had no more trouble, and even LAGG + CARP worked as expected.

    @PFSense team: please update your NIC tuning guide for igb. This setting is CRUCIAL to get PFSense running with i350-T4 NICs + a switch supporting EEE. You should also add that it is mandatory to set nmbclusters on systems with a large number of CPU cores, or else random interfaces will not initialize on startup (I had this problem as well).

    These are the settings I added to /boot/loader.conf.local for your information. The legal stuff is just cosmetic, as far as I can tell. Without it you get a nasty looking error message in the system logs on each boot, telling you to set these settings to accept the Intel EULA.

    #Mandatory for big number of CPU cores + Intel i350 NICs
    kern.ipc.nmbclusters=1000000

    #suppress Intel license related error messages on bootup
    legal.intel_ipw.license_ack=1
    legal.intel_iwi.license_ack=1

    This information would have saved me 20+ hours of work. Mainly because I'm dumb and don't know a thing about measuring packet loss in a scientific way, but still: I hope somebody else finds this useful.



  • I had to disable EEE on my switch because I would get a single frame error every time traffic started up again. I'm not sure if it's the fault of my i350-T2 or my HP 1810-24Gv2, but seeing errors was annoying. The switch already has a great idle power consumption, I'm not too concerned with the slight difference.



  • You can use cron instead of making changes in /boot/loader.conf.local



  • Note the command for checking EEE status should be

    
    sysctl dev.igb | grep eee
    
    

    Thanx Steve
    https://forum.pfsense.org/index.php?topic=132528.msg756102#msg756102

    /Bingo


  • Netgate

    I'm running PFSense 2.2.4

    Please upgrade to 2.4.0 and see if whatever problems you think you are having are still present. Many, many igb CARP installations do not exhibit anything like this.



  • You can use cron instead of making changes in /boot/loader.conf.local

    After a small update everything might keep running fine, but during an upgrade almost everything gets rewritten, and then all your
    tweaks, tunings and custom setups are gone. It may sound strange, but this is the best way to be sure that these
    settings will survive the next upgrade.



  • The original post is from 2015 ;)


  • Netgate

    ugh necro.