Netgate Discussion Forum

    Primary igb driver crash - manually disable carp on the primary under high load

      adam65535

      The crash only ever occurs on the primary (in about 80% of attempts), even if I swap the hard drives so that the primary and secondary hardware trade places, so I am pretty sure it is not a bad network card, motherboard, etc.  If I disable CARP on the primary while under light load, it works fine with no crash.  If I then disable CARP on the secondary while CARP is still disabled on the primary, and then re-enable CARP on the primary so that it becomes master again, that also works fine with no crash.  This is very repeatable and happens with pfSense 2.0.3 and 2.1.1-PRERELEASE (build of Feb 19th).

      Judging from the crash dumps, the crash always happens in the igb driver.

      If I just unplug a network cable on the primary, it does not crash either (even under high load).  The master keeps running and becomes backup, and the slave takes over and becomes master, just as you would expect.  The primary only crashes when I disable CARP during the iperf test, which is how I force the primary to become backup.

      I have two Dell R320 firewalls in a cluster that are identical.
      2 quad port Intel gigabit ET2 network cards in each firewall (8 ports each firewall)
      Onboard unsupported Broadcom NICs disabled in BIOS
      Hyperthreading disabled in BIOS
      4 cores

      I am running a production firewall config, with the iperf3 test traffic passing through the firewall: a client on the outside goes through the built-in load balancer (relayd) to a server running iperf3 on the inside.

      Test command line on the client and server using port 4444:
      Server# iperf3 -p 4444 -s -4
      Client# iperf3 -c 192.168.1.10 -p 4444 -t 500

      loader.conf.local:
      kern.ipc.nmbclusters="131072"
      hw.igb.num_queues=1
      hw.igb.txd="4096"
      hw.igb.rxd="4096"
      hw.igb.rx_process_limit="1000"

      #netstat -m (while under iperf test on primary before disabling carp)
      20615/2050/22665 mbufs in use (current/cache/total)
      20507/1723/22230/131072 mbuf clusters in use (current/cache/total/max)
      20505/1255 mbuf+clusters out of packet secondary zone in use (current/cache)
      0/104/104/65536 4k (page size) jumbo clusters in use (current/cache/total/max)
      0/0/0/32768 9k jumbo clusters in use (current/cache/total/max)
      0/0/0/16384 16k jumbo clusters in use (current/cache/total/max)
      46173K/4374K/50547K bytes allocated to network (current/cache/total)
      0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
      0/0/0 requests for jumbo clusters denied (4k/9k/16k)
      0/0/0 sfbufs in use (current/peak/max)
      0 requests for sfbufs denied
      0 requests for sfbufs delayed
      0 requests for I/O initiated by sendfile
      0 calls to protocol drain routines
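A quick way to read the cluster line above is as a percentage of the configured maximum. The sketch below assumes `netstat -m` style text on stdin; the function name `cluster_usage` is made up for illustration. With the figures posted here it shows the pool is far from exhausted, which fits the zero "denied" counters.

```shell
#!/bin/sh
# Print mbuf cluster usage as a share of kern.ipc.nmbclusters,
# reading `netstat -m` style output on stdin.
cluster_usage() {
    awk '/mbuf clusters in use/ {
        split($1, f, "/")    # fields are current/cache/total/max
        printf "%d/%d clusters (%.1f%%)\n", f[3], f[4], 100 * f[3] / f[4]
    }'
}

# Sample line copied from the output above.
echo "20507/1723/22230/131072 mbuf clusters in use (current/cache/total/max)" | cluster_usage
```

On the firewall itself this would be run as `netstat -m | cluster_usage`.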

      #vmstat -i (while under iperf test on primary before disabling carp)
      interrupt                          total      rate
      irq20: atapci0                    48357          5
      irq22: ehci1                      18467          2
      irq23: ehci0                      27944          3
      cpu0: timer                    18377583      1999
      irq256: igb0:que 0              3388510        368
      irq257: igb0:link                      4          0
      irq258: igb1:que 0              3394503        369
      irq259: igb1:link                      2          0
      irq260: igb2:que 0                30518          3
      irq261: igb2:link                      2          0
      irq262: igb3:que 0                23436          2
      irq263: igb3:link                      2          0
      irq264: igb4:que 0                626670        68
      irq265: igb4:link                      2          0
      cpu3: timer                    18377412      1999
      cpu2: timer                    18377397      1999
      cpu1: timer                    18377405      1999
      Total                          81068214      8822
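To compare interrupt load per interface between runs, the queue lines can be pulled out of `vmstat -i` with a small filter. This is only a sketch: `queue_rates` is a hypothetical helper name, and it assumes the column layout shown above. Note that the rate column appears to be an average since boot (the timer totals divided by 1999 give roughly the uptime), so it understates the rate during the iperf run.

```shell
#!/bin/sh
# Sum the per-queue interrupt rate for each igb interface,
# reading `vmstat -i` style output on stdin.
queue_rates() {
    awk '$2 ~ /^igb[0-9]+:que$/ {
        name = $2
        sub(/:que$/, "", name)   # "igb0:que" -> "igb0"
        rate[name] += $NF        # last column is the rate
    }
    END { for (n in rate) print n, rate[n] }' | sort
}

# Sample lines copied from the output above.
queue_rates <<'EOF'
irq256: igb0:que 0              3388510        368
irq257: igb0:link                      4          0
irq258: igb1:que 0              3394503        369
EOF
```

With hw.igb.num_queues=1 there is one queue line per interface, so the sum is trivial here; it only matters if more queues are enabled later.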

      top -P (while under iperf test on primary before disabling carp)
      last pid: 55340;  load averages:  0.07,  0.02,  0.00    up 0+02:34:29  19:49:38
      42 processes:  1 running, 41 sleeping
      CPU 0:  0.0% user,  0.0% nice,  0.0% system, 48.5% interrupt, 51.5% idle
      CPU 1:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
      CPU 2:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
      CPU 3:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
      Mem: 64M Active, 24M Inact, 153M Wired, 220K Cache, 26M Buf, 3645M Free
      Swap: 8192M Total, 8192M Free

      PID USERNAME    THR PRI NICE  SIZE    RES STATE  C  TIME  WCPU COMMAND
        265 root          1  76  20  6908K  1388K kqread  3  0:55  0.00% check_rel
      72286 root          1  58    0  151M 45284K accept  1  0:09  0.00% php
      26203 root          1  44    0  5784K  1468K select  2  0:02  0.00% apinger
      85021 root          1  76  20  8296K  1732K wait    1  0:01  0.00% sh
      69713 _relayd        1  44    0 12320K  3828K kqread  2  0:01  0.00% relayd
      63080 root          1  44    0 26268K  7140K kqread  1  0:01  0.00% lighttpd
      …

      Crash dump:
      http://textdump.net/read/4300/

        stephenw10 Netgate Administrator

        The new igb(4) drivers have been backed out of 2.1.1 because they were unreliable for a lot of people. The snapshot from the 19th may or may not have included the unstable drivers, so the first thing I suggest is updating to a newer snapshot.
        https://forum.pfsense.org/index.php/topic,72763.0.html

        Alternatively, go to the 2.1 release, which had the stable driver.

        Are you not using that specifically because of the issue with AltQ?

        Steve

          adam65535

          Thanks for the response.  I know for sure that the old 2.1 drivers are back in the Feb 19th 2.1.1-PRERELEASE build because the driver obeys the hw.igb.num_queues setting.  The updated driver that was in a few pre-releases before the 19th did not obey it.  As I mentioned… this happens even with 2.0.3.

          Someone doing regular testing, or running in production where the system is not stressed, would probably never see the problem.  They would have to be under heavy load and manually disable CARP on the primary.  I can only get it to crash by disabling CARP while the system is under high stress from iperf.  The previous complaints about the updated drivers were that igb would crash with just an iperf test.  I suspect this issue is different, or maybe the old driver has the same issue but it only shows itself under the condition I described.

            stephenw10 Netgate Administrator

            Ah, well it was worth asking but it seems you're aware of that. 2.0.3 was still built on FreeBSD 8.1 so it's not a fair comparison with 2.1.
            Do you have independent pfSync interfaces?

            Steve

              adam65535

              Yea.  I have a dedicated pfsync interface.

              I am going to check for new firmware and install any updates on the Dell R320s and the network cards one more time (this was last done about 2 or 3 months ago), just in case something has changed recently.

              I expect to have paid support with pfsense.org soon, but I know the pfSense devs don't write the drivers, so I am not sure what they can do.  If it is a deep driver bug, I assume Intel would have to fix it (I thought I read that Intel developed the driver, but I am not sure).

              It would be interesting to see the list of commands that get executed when CARP is disabled, so I could run them manually one at a time, see whether I can get it to crash, and find out which command triggers it.
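One way to approach that, sketched under assumptions: the exact sequence pfSense runs when CARP is disabled is not confirmed anywhere in this thread, so the steps below (bring each CARP VIP interface down, then set the FreeBSD sysctl `net.inet.carp.allow` to 0) are only a guess at candidate steps. The interface names `vip1 vip2` and the helper names are hypothetical; the real VIP interfaces would come from `ifconfig`. By default the script only prints what it would run.

```shell
#!/bin/sh
# Run candidate "disable CARP" steps one at a time so that a crash
# can be tied to a single step. DRY_RUN=1 (the default) only prints.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical VIP interface names; replace with the real ones.
CARP_IFS="${CARP_IFS:-vip1 vip2}"

run_step() {
    echo "step: $*"
    if [ "$DRY_RUN" != "1" ]; then
        sync                 # flush disks first in case this step panics the box
        "$@"
        printf 'press Enter for the next step: '
        read -r reply
    fi
}

disable_carp_stepwise() {
    # Assumed sequence, not taken from the pfSense source.
    for ifn in $CARP_IFS; do
        run_step ifconfig "$ifn" down
    done
    run_step sysctl net.inet.carp.allow=0
}

disable_carp_stepwise
```

Running it with `DRY_RUN=0` while the iperf3 test is active would execute each step and pause in between, so the last "step:" line printed on the console before a panic points at the trigger.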

                dotdash

                Is there a specific reason you need to disable CARP under high load?

                  adam65535

                  It is not a need… It is more a concern about the uncertainty around the issue.  I am a little worried that something might come up where I do need to switch to the backup manually, without physical access to unplug a cable as a workaround.  I worry more that the problem might show up in a way I have not tested yet, since I do not understand its real cause.

                  I have some i350 cards coming in to test in place of the quad-port Intel ET2 cards, but since they use the same igb driver I suspect I will have the same issue.

                  EDIT:  Keep in mind that high load, as I defined it above, is only a single TCP connection carrying about 600 Mbit/s of traffic, which could happen fairly often depending on what is being transferred between interfaces at the time (backups, file shares, deployments to production servers, etc.).  It is manually disabling CARP that would be infrequent, of course.

                  Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.