Primary igb driver crash - manually disabling CARP on the primary under high load



  • The crash always occurs on the primary (about 80% of the time when I run the test), even if I swap the hard drives so that the primary and secondary hardware trade roles, so I am fairly sure it is not a bad network card, motherboard, etc.  If I disable CARP on the primary under light load it works fine with no crash.  If I then run the same test and disable CARP on the secondary while the primary's CARP is disabled, and then re-enable CARP on the primary so that it becomes master again, it also works fine with no crash.  This is very repeatable and happens with both pfSense 2.0.3 and 2.1.1-PRERELEASE (build of Feb 19th).

    From the crash dumps, it appears the crash always happens in the igb driver.

    If I just unplug a network cable on the primary it does not crash that way either, even under high load.  The master stays running and becomes backup, and the slave takes over and becomes master, just as you would expect.  It is only when I disable CARP (during the iperf test) that the primary crashes; disabling CARP is how I force the primary to become backup.
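
    For reference, here is one way to watch the state transition from the console during these tests (a minimal sketch assuming the 2.1-era carp(4) pseudo-interfaces; carp0 is illustrative, substitute the actual interface):
    # while true; do ifconfig carp0 | grep 'carp:'; sleep 1; done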

    I have two identical Dell R320 firewalls in a cluster.
    2 quad-port Intel Gigabit ET2 network cards in each firewall (8 ports per firewall)
    Onboard (unsupported) Broadcom NICs disabled in the BIOS
    Hyperthreading disabled in the BIOS
    4 cores

    I am running my production firewall config with an iperf3 test through the firewall: a client on the outside goes through the built-in load balancer (relayd) to a server running iperf3 on the inside.

    Test command lines on the client and the server, using port 4444:
    Server# iperf3 -p 4444 -s -4
    Client# iperf3 -c 192.168.1.10 -p 4444 -t 500

    loader.conf.local:
    kern.ipc.nmbclusters="131072"
    hw.igb.num_queues=1
    hw.igb.txd="4096"
    hw.igb.rxd="4096"
    hw.igb.rx_process_limit="1000"
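
    To confirm those tunables took effect after a reboot, they can be read back with sysctl (assuming the stock FreeBSD igb(4) driver exposes them read-only under the hw.igb tree, which I believe it does):
    # sysctl kern.ipc.nmbclusters
    # sysctl -a | grep '^hw.igb'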

    # netstat -m (on the primary, under the iperf test, before disabling CARP)
    20615/2050/22665 mbufs in use (current/cache/total)
    20507/1723/22230/131072 mbuf clusters in use (current/cache/total/max)
    20505/1255 mbuf+clusters out of packet secondary zone in use (current/cache)
    0/104/104/65536 4k (page size) jumbo clusters in use (current/cache/total/max)
    0/0/0/32768 9k jumbo clusters in use (current/cache/total/max)
    0/0/0/16384 16k jumbo clusters in use (current/cache/total/max)
    46173K/4374K/50547K bytes allocated to network (current/cache/total)
    0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
    0/0/0 requests for jumbo clusters denied (4k/9k/16k)
    0/0/0 sfbufs in use (current/peak/max)
    0 requests for sfbufs denied
    0 requests for sfbufs delayed
    0 requests for I/O initiated by sendfile
    0 calls to protocol drain routines

    # vmstat -i (on the primary, under the iperf test, before disabling CARP)
    interrupt                          total      rate
    irq20: atapci0                    48357          5
    irq22: ehci1                      18467          2
    irq23: ehci0                      27944          3
    cpu0: timer                    18377583      1999
    irq256: igb0:que 0              3388510        368
    irq257: igb0:link                      4          0
    irq258: igb1:que 0              3394503        369
    irq259: igb1:link                      2          0
    irq260: igb2:que 0                30518          3
    irq261: igb2:link                      2          0
    irq262: igb3:que 0                23436          2
    irq263: igb3:link                      2          0
    irq264: igb4:que 0                626670        68
    irq265: igb4:link                      2          0
    cpu3: timer                    18377412      1999
    cpu2: timer                    18377397      1999
    cpu1: timer                    18377405      1999
    Total                          81068214      8822

    top -P (on the primary, under the iperf test, before disabling CARP)
    last pid: 55340;  load averages:  0.07,  0.02,  0.00    up 0+02:34:29  19:49:38
    42 processes:  1 running, 41 sleeping
    CPU 0:  0.0% user,  0.0% nice,  0.0% system, 48.5% interrupt, 51.5% idle
    CPU 1:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
    CPU 2:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
    CPU 3:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
    Mem: 64M Active, 24M Inact, 153M Wired, 220K Cache, 26M Buf, 3645M Free
    Swap: 8192M Total, 8192M Free

    PID USERNAME    THR PRI NICE  SIZE    RES STATE  C  TIME  WCPU COMMAND
      265 root          1  76  20  6908K  1388K kqread  3  0:55  0.00% check_rel
    72286 root          1  58    0  151M 45284K accept  1  0:09  0.00% php
    26203 root          1  44    0  5784K  1468K select  2  0:02  0.00% apinger
    85021 root          1  76  20  8296K  1732K wait    1  0:01  0.00% sh
    69713 _relayd        1  44    0 12320K  3828K kqread  2  0:01  0.00% relayd
    63080 root          1  44    0 26268K  7140K kqread  1  0:01  0.00% lighttpd

    Crash dump:
    http://textdump.net/read/4300/


  • Netgate Administrator

    The new igb(4) drivers have been backed out of 2.1.1 because they were unreliable for a lot of people. The Feb 19th snapshot may or may not have had the unstable drivers in it, so the first thing I suggest is updating to a new snapshot.
    https://forum.pfsense.org/index.php/topic,72763.0.html

    Alternatively, go to the 2.1 release, which had the stable driver.

    Are you not using that specifically because of the issue with AltQ?

    Steve



  • Thanks for the response.  I know for sure that the old 2.1 drivers are back in the Feb 19th 2.1.1-PRERELEASE build, because the driver obeys the hw.igb.num_queues setting; the updated driver that was in a few pre-releases before the 19th ignored it.  As I mentioned, this happens even with 2.0.3.

    If someone is doing ordinary testing, or running in production where the system is not stressed, I suspect they would never see the problem.  They would have to be under heavy load and manually disable CARP on the primary.  I can only get it to crash by disabling CARP while the system is under high stress from iperf.  The previous complaints about the updated drivers were that they would crash igb with just an iperf test, so I suspect this issue is different, or perhaps the old driver has the same bug but only shows it under the conditions I mentioned.
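
    A more direct way to confirm which driver version a given build is running (rather than inferring it from tunable behavior) is the device description sysctl; the unit number 0 below is just the first igb port:
    # sysctl dev.igb.0.%desc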


  • Netgate Administrator

    Ah, well, it was worth asking, but it seems you're aware of that. 2.0.3 was still built on FreeBSD 8.1, so it's not a fair comparison with 2.1.
    Do you have independent pfSync interfaces?

    Steve



  • Yeah, I have a dedicated pfSync interface.
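
    If it helps, the assignment can be double-checked from the shell (a sketch assuming pfsync(4) reports its syncdev in ifconfig output):
    # ifconfig pfsync0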

    I am going to check for new firmware and install any updates on the Dell R320s and the network cards one more time (it was last done about 2 or 3 months ago), just in case something changed recently.

    I expect to have paid support with pfsense.org going soon, but I know the pfSense devs don't write the drivers, so I am not sure what they can do.  If it is a deep driver bug, I assume Intel would have to fix it (I thought I read that Intel developed the driver, but I am not sure).

    It would be interesting to see the list of commands that get executed when CARP is disabled, so I could try them one at a time manually, see whether I can reproduce the crash, and know which command triggered it.  A rough sketch of what that manual test might look like is below.
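
    These steps are my guesses at what disabling CARP amounts to on the 2.1-era carp(4) pseudo-interfaces, not a confirmed list of what the GUI runs (carp0 is illustrative):
    # ifconfig carp0          (note the carp: MASTER/BACKUP state line)
    # sysctl net.inet.carp.allow=0          (stop CARP processing system-wide)
    # ifconfig carp0 down          (take a single carp interface down by hand)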



  • Is there a specific reason you need to disable CARP under high load?



  • It is not a need… it is more about the uncertainty surrounding the issue.  I am a little concerned that something might come up where I need to manually switch to the backup without having physical access to unplug a cable as a workaround.  I worry more that the problem might show up in a way that I have not tested yet, since I do not understand the real cause of the problem.
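
    One possibility would be to demote the primary remotely instead of disabling CARP outright (untested against this crash, and carp0/advskew 240 are illustrative values): raising the advertisement skew should let the secondary win the election and preempt, assuming net.inet.carp.preempt is enabled, which I believe pfSense sets.
    # ifconfig carp0 advskew 240          (demote: the secondary should take over as master)
    # ifconfig carp0 advskew 0          (restore the primary's original skew)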

    I have some i350 cards coming in to test in place of the quad-port Intel ET2 cards, but since they use the same igb driver I suspect I will have the same issue.

    EDIT:  Keep in mind that "high load" as I defined it above is only a single TCP connection pushing 600 Mbit of traffic, which could happen fairly often depending on what is transferred between interfaces (backups, file shares, deployments to production servers, etc.).  Manually disabling CARP would of course be infrequent.