Netgate Discussion Forum

    Periodic packet loss + constant CARP switchovers with Intel NICS i350 (igb)

      DerBachmannRocker

      I'm running PFSense 2.2.4 and recently upgraded to genuine Intel i350-T4v2 and i350-T2v2 based NICs (using the igb driver) on a Core i7 with 12 GB RAM in a CARP configuration. After the upgrade I noticed massive periodic packet loss every few minutes. I finally solved the problem; the solution is further down.

      The pattern of connectivity from the client looked like:
      5 minutes of routing without any packet loss
      a few seconds of massive packet loss (30%, 60%, 100%)
      8 minutes of routing without any packet loss
      a few seconds of massive packet loss (30%, 60%, 100%)
      6 minutes of routing without any packet loss
      etc. etc.

      This behavior caused the other member of my CARP cluster to fail over every time, which made things even worse, resulting in a huge number of dropped packets and VPN disconnects. I switched back to the original hardware and began a long debugging session.

      I reinstalled PFSense from scratch on different hardware with a minimal config, disabled CARP and LACP/LAGG, and ran a very bare-bones setup. The only things I carried over were the new Intel NICs, which should work without a hitch, right? They are listed in the official shop as accessories, so no problems there. Wrong.

      Problems with measuring accuracy:
      This was the first time I had to deal with a major case of packet loss, and I learned the hard way that you must disable every energy-saving feature on your Windows client before testing with ping and iperf. Otherwise you get spurious background packet loss, which threw me off the correct trail for far too long. I was measuring the actual packet loss plus a few percent of packets lost every few seconds to energy saving, which masked the true problem. The solution is simply to turn on the "Maximum Performance" energy profile, which disables the green nonsense altogether. With that in place I got reproducible test results using iperf in UDP mode, which let me eliminate component after component.
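A timestamped probe log makes the periodic pattern much easier to see than watching ping scroll by. The following is only a minimal sketch, assuming a Unix-like test client rather than the Windows client used above; `probe_once` is a hypothetical helper name and `192.0.2.1` is a placeholder address.

```shell
#!/bin/sh
# Run the given command once and print "<epoch-seconds> ok" if it
# succeeded or "<epoch-seconds> lost" if it failed.
probe_once() {
    if "$@" >/dev/null 2>&1; then
        echo "$(date +%s) ok"
    else
        echo "$(date +%s) lost"
    fi
}

# Example loop, commented out so the file can be sourced safely:
# while :; do probe_once ping -c 1 192.0.2.1; sleep 1; done >> probe.log
```

Grepping the log for `lost` then shows when each burst started and how long it lasted. How long a single failed ping takes to return depends on the platform's ping timeout, so the spacing between log lines is not exactly one second during an outage.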

      Looking for the problem:
      Surprisingly there were no missing packets counted when looking at PFSense:
      sysctl dev.igb
      -> dev.igb.0.mac_stats.missed_packets: 0
      But I definitely had massive packet loss during the short outages when pinging from the inside to the internet router behind PFSense.
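To watch that counter across all ports at once, the per-interface values can be summed. This is a rough sketch that assumes the `name: value` output format shown above; `total_missed` is a hypothetical helper name. As this thread shows, a total of zero does not rule out loss that never reaches the counter.

```shell
#!/bin/sh
# Sum every mac_stats.missed_packets value found on stdin.
# Expects lines in the form "dev.igb.0.mac_stats.missed_packets: 0".
total_missed() {
    awk -F': ' '/mac_stats\.missed_packets/ { sum += $2 } END { print sum + 0 }'
}

# Example, commented out because it needs a box with igb interfaces:
# sysctl dev.igb | total_missed
```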

      So I suspected the HP Procurve 5406zl switch to be the culprit (which turned out to be false). After narrowing it down I came to the conclusion that it must indeed be the NICs in the PFSense boxes themselves. But WHY?

      I looked all over the internet and could not find any setting that would solve the problem. The last thing I looked into was how to compile the newest Intel driver into PFSense (it turns out the driver included in PFSense works just fine).
      When looking at the README of the official FreeBSD igb driver from Intel I came across the following obscure setting:

      EEE
      ---
      Valid Range:  0-1
      Default Value: 1 (enabled)

      A link between two EEE-compliant devices will result in periodic bursts of
      data followed by long periods where in the link is in an idle state. This Low
      Power Idle (LPI) state is supported in both 1Gbps and 100Mbps link speeds.

      NOTE: EEE support requires autonegotiation.

      Eureka!
      https://en.wikipedia.org/wiki/Energy-Efficient_Ethernet

      PFSense didn't count any packets that were thrown away because the NIC was powered down during those events, due to power saving stuff! Argh. The packets never reached a high enough layer to trigger any logging! Had I used an older NIC not supporting EEE or a cheaper switch not supporting EEE the problem would have never come to the surface. Oh well.

      I disabled the EEE setting with the following command (Diagnostics -> Command Prompt), and the problems were gone instantly:
      sysctl dev.igb.0.eee_disabled=1
      sysctl dev.igb.1.eee_disabled=1
      sysctl dev.igb.2.eee_disabled=1
      sysctl dev.igb.3.eee_disabled=1
      sysctl dev.igb.4.eee_disabled=1
      sysctl dev.igb.5.eee_disabled=1
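On a box with a different number of ports, the same commands can be generated from the sysctl tree instead of typed by hand. This is a minimal sketch, assuming the `dev.igb.N.eee_disabled` names used in this post; `eee_disable_cmds` is a hypothetical helper name, not something shipped with pfSense.

```shell
#!/bin/sh
# Read "sysctl dev.igb" output on stdin and print one
# "sysctl <oid>=1" command per eee_disabled entry found.
eee_disable_cmds() {
    awk -F': ' '/^dev\.igb\.[0-9]+\.eee_disabled:/ { print "sysctl " $1 "=1" }'
}

# Examples, commented out because they need a box with igb interfaces:
# sysctl dev.igb | eee_disable_cmds        # review the commands first
# sysctl dev.igb | eee_disable_cmds | sh   # then apply them
```

Printing the commands before piping them to `sh` keeps the change reviewable, which seems prudent on a production firewall.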

      To make the setting persistent across reboots, I created the following entries under System -> Advanced -> System Tunables, one for each interface.
      I tried putting these into /boot/loader.conf.local but was unsuccessful. As far as I could tell, the setting has to be applied to each interface individually.
      dev.igb.0.eee_disabled, value=1
      dev.igb.1.eee_disabled, value=1
      dev.igb.2.eee_disabled, value=1
      dev.igb.3.eee_disabled, value=1
      dev.igb.4.eee_disabled, value=1
      dev.igb.5.eee_disabled, value=1

      You can verify the current status of the setting by entering the command
      sysctl dev.igb | grep eee
      There should be some lines similar to this:
      dev.igb.0.eee_disabled: 1
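If there are many ports, it may be easier to list only the interfaces where EEE is still active. This is a small sketch under the assumption that the output looks like the line above; `eee_still_enabled` is a hypothetical helper name.

```shell
#!/bin/sh
# Read sysctl output on stdin and print every eee_disabled OID whose
# value is not 1. Exits non-zero if at least one such OID was found.
eee_still_enabled() {
    awk -F': ' '/\.eee_disabled:/ && $2 != 1 { print $1; bad = 1 } END { exit bad + 0 }'
}

# Example, commented out because it needs a box with igb interfaces:
# sysctl dev.igb | eee_still_enabled || echo "EEE is still on for the OIDs above"
```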

      After this I had no more trouble and even LAGG + CARP worked as expected.

      @PFSense team: please update your NIC tuning guide for igb. This setting is CRUCIAL to get PFSense running with i350-T4 NICs and a switch supporting EEE. You should also add that it is mandatory to set nmbclusters on systems with a large number of CPU cores, or else random interfaces will not initialize on startup (I had this problem as well).

      These are the settings I added to /boot/loader.conf.local, for your information. The legal entries are just cosmetic, as far as I can tell. Without them you get a nasty-looking error message in the system logs on each boot, telling you to set them to accept the Intel EULA.

      #Mandatory for big number of CPU cores + Intel i350 NICs
      kern.ipc.nmbclusters=1000000

      #suppress Intel license related error messages on bootup
      legal.intel_ipw.license_ack=1
      legal.intel_iwi.license_ack=1
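When such entries are added by script rather than by hand, it helps to avoid writing the same line twice on repeated runs. This is a hedged sketch: `set_loader_tunable` is a hypothetical helper, and it writes the quoted `name="value"` form, which is the documented loader.conf format; the unquoted lines above are what I actually used.

```shell
#!/bin/sh
# Set name="value" in a loader.conf-style file, replacing any existing
# assignment of the same name and leaving all other lines untouched.
set_loader_tunable() {
    file=$1; name=$2; value=$3
    touch "$file"
    # Keep every line that does not start with the literal text "<name>=".
    awk -v n="$name" 'index($0, n "=") != 1' "$file" > "$file.tmp"
    printf '%s="%s"\n' "$name" "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Example, commented out because it edits a boot configuration file:
# set_loader_tunable /boot/loader.conf.local kern.ipc.nmbclusters 1000000
```

Loader tunables are only read at boot, so a reboot is needed before `sysctl kern.ipc.nmbclusters` reports the new value.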

      This information would have saved me 20+ hours of work. Mainly because I'm dumb and don't know a thing about measuring packet loss in a scientific way, but still: I hope somebody else finds this useful.

        Harvy66

        I had to disable EEE on my switch because I would get a single frame error every time traffic started up again. I'm not sure if it's the fault of my i350-T2 or my HP 1810-24Gv2, but seeing errors was annoying. The switch already has great idle power consumption, so I'm not too concerned about the slight difference.

          minhneo

          You can use cron instead of making changes to /boot/loader.conf.local

            bingo600

            Note the command for checking EEE status should be

            
            sysctl dev.igb | grep eee
            
            

            Thanx Steve
            https://forum.pfsense.org/index.php?topic=132528.msg756102#msg756102

            /Bingo

            If you find my answer useful - Please give the post a 👍 - "thumbs up"

            pfSense+ 23.05.1 (ZFS)

            QOTOM-Q355G4 Quad Lan.
            CPU  : Core i5 5250U, Ram : 8GB Kingston DDR3LV 1600
            LAN  : 4 x Intel 211, Disk  : 240G SAMSUNG MZ7L3240HCHQ SSD

              Derelict LAYER 8 Netgate

              I'm running PFSense 2.2.4

              Please upgrade to 2.4.0 and see if whatever problems you think you are having are still present. Many, many igb CARP installations do not exhibit anything like this.

              Chattanooga, Tennessee, USA
              A comprehensive network diagram is worth 10,000 words and 15 conference calls.
              DO NOT set a source address/port in a port forward or firewall rule unless you KNOW you need it!
              Do Not Chat For Help! NO_WAN_EGRESS(TM)

                Guest

                You can use cron instead of making changes to /boot/loader.conf.local

                If you update something it might keep running well, but during an upgrade almost everything is written anew, and then all your
                tweaks, tunings and custom setups are gone. So it may sound strange, but this is the best way to be sure that these
                settings will survive the next upgrade.

                  Grimson Banned

                  The original post is from 2015 ;)

                    Derelict LAYER 8 Netgate

                    ugh necro.

                    Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.