Intel X520-DA2, kernel: CRITICAL: ECC ERROR!! Please Reboot!!



  • Any advice on this error I’m getting with pfSense 2.1-RC2 amd64? (Yes, yes, stable is out, I know.)

    I installed pfSense on an IBM server with an Intel X520-DA2. When I connected it to a Dell 10GbE switch using an SFP+ DAC cable, the link flapped up and down repeatedly, and the console printed this message:

    Sep 10 16:16:06 check_reload_status: Linkup starting ix1
    Sep 10 16:16:06 kernel: ix1:
    Sep 10 16:16:06 kernel: ix1: link state changed to DOWN
    Sep 10 16:16:06 kernel:
    Sep 10 16:16:06 kernel: CRITICAL: ECC ERROR!! Please Reboot!!
    Sep 10 16:16:06 kernel: ix1:
    Sep 10 16:16:06 kernel: CRITICAL: ECC ERROR!! Please Reboot!!
    Sep 10 16:16:06 kernel:

    What to do?



  • I’ve seen that too on one of our test systems. Haven’t seen it on any production systems yet. I’m not sure what it means, haven’t had time to look into it in much depth yet. -RELEASE wouldn’t be any different, the driver didn’t change since pre-RC2.


  • Banned

    Well, this is the relevant part of the source code if it helps… 😛

    
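    /* ECC bit set in the interrupt cause value: log the error, then write the bit back to the EICR register */
    if (reg_eicr & IXGBE_EICR_ECC) {
            device_printf(adapter->dev, "\nCRITICAL: ECC ERROR!! "
                "Please Reboot!!\n");
            IXGBE_WRITE_REG(hw, IXGBE_EICR, IXGBE_EICR_ECC);
    } else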
    
    


  • have you tried rebooting?



  • @cmb:

    I’ve seen that too on one of our test systems. Haven’t seen it on any production systems yet. I’m not sure what it means, haven’t had time to look into it in much depth yet. -RELEASE wouldn’t be any different, the driver didn’t change since pre-RC2.

    Even though it doesn’t help me, your post gave me more information than Google. 🙂

    I have commercial support, so I’ll ask “the guys” and see what they have to say.



  • @gonzopancho:

    have you tried rebooting?

    Haven’t we all?



  • I have two pfSense test units with 10GBASE-LR Intel cards. Neither had this error on 2.0.3. Both started showing it after the 2.1-RELEASE upgrade.


  • Developer Netgate Administrator

    For those seeing this error, do the NICs actually function when they hit this error? Or do they stop passing traffic?



  • Hi,

    The NICs are working, but there is some problem with MBUF reuse. I have 2 firewalls with X520-DA2 and X540-T2 cards that crash every 8 hours after MBUFs hit 100%.
    I have 512000 MBUFs defined on each system. I need to find a solution for this problem ASAP.
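
One way to watch how close a box is to exhausting its mbuf clusters is to parse the `mbuf clusters in use` line that `netstat -m` prints on FreeBSD. The sketch below is illustrative only: the function name `mbuf_cluster_usage` is made up for this example, and it assumes the `current/cache/total/max` layout of that line, reporting total against max as a whole-number percentage.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative, not part of pfSense or FreeBSD).
# Reads `netstat -m` output on stdin and prints mbuf cluster usage in percent.
# Assumes a line shaped like:
#   1024/1286/2310/25600 mbuf clusters in use (current/cache/total/max)
mbuf_cluster_usage() {
    awk '/mbuf clusters in use/ {
        split($1, f, "/")                 # f[3] = total, f[4] = max
        if (f[4] > 0) printf "%d\n", f[3] * 100 / f[4]
    }'
}

# Example on a FreeBSD/pfSense shell (not run here):
#   netstat -m | mbuf_cluster_usage
```

Running this from cron or a loop and logging the value would show whether usage climbs steadily toward 100% before a crash.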



  • Hmm, I also get this error with a X520-DA2 on an IBM box after updating to 2.1.
    I upgraded from an -RC0 build from around July, which predates the last ixgbe update before -RELEASE shipped.
    I have the VLAN fix as per wiki documentation.

    I’ll have some chances to beat that thing a little bit and see how the MBUFs develop.
    Otherwise I’ll have to presto go back to 2.1-RC0 which was rock-solid (sorry to say I wasn’t able to give -RC1/2 any beating) 😕

    Update: I might give the .ko module from FreeBSD 8.3 or 8.4 a try - as workaround - since my RC0 at least shipped the version that was (almost) equivalent to what went into 8.4.



  • @MatSim:

    Hmm, I also get this error with a X520-DA2 on an IBM box after updating to 2.1.
    I upgraded from a -RC0 around July, that was before the last ixgbe Update before shipping -RELEASE
    I have the VLAN fix as per wiki documentation.

    I’ll have some chances to beat that thing a little bit and see how the MBUFs develop.
    Otherwise I’ll have to presto go back to 2.1-RC0 which was rock-solid (sorry to say I wasn’t able to give -RC1/2 any beating) 😕

    Update: I might give the .ko module from FreeBSD 8.3 or 8.4 a try - as workaround - since my RC0 at least shipped the version that was (almost) equivalent to what went into 8.4.

    Can you upload your .ko module? I’ll try to see whether the mbuf problem exists with it or not.



  • Hi all

    While it’s certainly not great to see that this issue slipped into 2.1-RELEASE, I’ve heard from the ESF crew that they are actively working on a fix. If you want to join forces, offer them some of your support hours so they can work on this on a paid basis. Anyhow, here are the results of my not-so-quick fiddling:

    • The plain, but old ixgbe 2.4.5 from 8.3-RELEASE works with the X520-DA2 I have here.

    • An experiment with 8.3’s source tree plus cherry-picks of 8.4’s ixgbe panicked at about the point where the boot command disables the VLAN hardware filter (which has been needed for VLANs since RC0)

    You can get the modules that I tried on my box here: http://id.gymkl.ch/pfsense/ixgbe
    That said: none of this is endorsed by the ESF crew or anyone else (not even myself). It may or may not cause problems on your systems. Nonetheless, if you are brave, give it a try as a workaround.

    
    # On the pfSense shell
    cd /boot/kernel
    fetch http://id.gymkl.ch/pfsense/ixgbe/ixgbe2.4.5-fbsd8.3-amd64.ko
    chmod 555 ixgbe2.4.5-fbsd8.3-amd64.ko
    
    

    Afterwards you add the following line to your /boot/loader.conf.local and then reboot the system:

    
    ixgbe2.4.5-fbsd8.3-amd64_load="YES"
    
    

    You should see the ixgbe 2.4.5 version being loaded in /var/log/dmesg.boot afterwards.
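
A quick way to confirm which driver version attached is to pull the version string out of the probe lines in the boot log. This is a rough sketch, assuming the probe line looks like `ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.4.5> port ...`; the function name `ixgbe_versions` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative). Prints "<interface> <version>"
# for every ixgbe probe line found in the given boot log.
# Assumes probe lines of the form:
#   ix0: <... Network Driver, Version - 2.4.5> port ...
ixgbe_versions() {
    log="${1:-/var/log/dmesg.boot}"
    sed -n 's/^\(ix[0-9][0-9]*\): <.*Version - \([^>]*\)>.*/\1 \2/p' "$log"
}

# Example on the pfSense shell (not run here):
#   ixgbe_versions
```

If the replacement module loaded, each `ix` interface should be listed with 2.4.5 rather than the version bundled with the kernel.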



  • hi,

    As I remember, ixgbe version 2.4.5 has problems with VLANs. Can someone check?



  • I have a couple of VLANs on one of my ix interfaces, and yes, they work without hiccups here for now. Fingers crossed.

    You can easily check if VLANs are passing through by firing up a ‘tcpdump -i ix<number>_vlan<id>’. If you only see traffic on ix<number> but not on the VLAN interface, then you may want to give the VLAN hw filter setting from the wiki documentation a try. I currently have that set from -RC0 times.

    Remember, though, that this was not required until some builds brought newer ixgbe drivers with the late BETA and RC builds. I have also followed Intel’s ixgbe instructions for loader.conf(.local) regarding the mbufs, where they recommend a larger nmbclusters value than for the 1GbE models. The num_queues option was set to 4 since this system has 4 cores / 4 threads, so it can use up to 4 queues and no more (which could otherwise exhaust CPU capacity).

    
    kern.ipc.nmbclusters=262144
    kern.ipc.nmbjumbop=262144
    hw.ixgbe.num_queues="4"
    ixgbe2.4.5-fbsd8.3-amd64_load="YES"
    


  • hi,

    2.4.5 doesn’t support one of my 10G cards :’( Can someone compile 2.5.1 or 2.5.8?



  • Try 2.5.0, which is uploaded alongside, but as I said:

    It made my box panic. It may work for you, though, so give it a try.

    It is line by line the same code as is in 8.4 and 8-STABLE.



  • I’ll try to set up my own build environment for 2.1 and build it myself.


  • Administrator

    I built a new 2.5.15 driver, with some fixes. Could you try it and let me know how it goes?

    You can get it at http://files.pfsense.org/garga/ixgbe_modules/2.1/

    Just put the ixgbe.ko in /boot/modules and add the following line to /boot/loader.conf.local:

    ixgbe_load="YES"

    Best regards



  • @Renato:

    I built a new 2.5.15 driver, with some fixes. Could you try it and let me know how it goes?

    You can get it at http://files.pfsense.org/garga/ixgbe_modules/2.1/

    Just put the ixgbe.ko at /boot/modules and add the following line to /boot/loader.conf.local

    ixgbe_load="YES"

    Best regards

    Can you also post the fixes?



  • @Renato:

    I built a new 2.5.15 driver, with some fixes. Could you try it and let me know how it goes?

    You can get it at http://files.pfsense.org/garga/ixgbe_modules/2.1/

    Just put the ixgbe.ko at /boot/modules and add the following line to /boot/loader.conf.local

    ixgbe_load="YES"

    Best regards

    system crashed  :’(

    crashreport.txt



  • Thanks Renato. wladikz wasn’t really lucky, though; it seems that either there are different X520-DA2 revisions in the wild or he has another ix card.

    I’ll give that module a try tomorrow. At least 2.4.5, if it recognizes the card, seems to work for “older” cards for now.



  • Hi Renato, thank you for your efforts, but unfortunately I have the same issue as wladikz and had to revert to 2.4.5:

    • The system boots and loads 2.5.15
    • It seems to get as far as NIC initialization (post-kernel stuff, pfSense output) and then panics

    -> I’ve uploaded the crash log through the web UI in case that helps. Additionally, to help identify potentially different boards, here are the pciconf and dmesg.boot outputs:

    
    ix0@pci0:17:0:0:        class=0x020000 card=0x7a128086 chip=0x10fb8086 rev=0x01 hdr=0x00
        class      = network
        subclass   = ethernet
    ix1@pci0:17:0:1:        class=0x020000 card=0x7a128086 chip=0x10fb8086 rev=0x01 hdr=0x00
        class      = network
        subclass   = ethernet
    
    

    dmesg.boot

    
    ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.4.5> port 0x2fc0-0x2fdf mem 0x91200000-0x912fffff,0x910fc000-0x910fffff irq 32 at device 0.0 on pci17
    ix0: Using MSIX interrupts with 5 vectors
    ix0: [ITHREAD]
    ix0: [ITHREAD]
    ix0: [ITHREAD]
    ix0: [ITHREAD]
    ix0: [ITHREAD]
    ix0: PCI Express Bus: Speed 5.0Gb/s Width x8
    ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.4.5> port 0x2fe0-0x2fff mem 0x91100000-0x911fffff,0x910f8000-0x910fbfff irq 36 at device 0.1 on pci17
    ix1: Using MSIX interrupts with 5 vectors
    ix1: [ITHREAD]
    ix1: [ITHREAD]
    ix1: [ITHREAD]
    ix1: [ITHREAD]
    ix1: [ITHREAD]
    ix1: PCI Express Bus: Speed 5.0Gb/s Width x8
    

  • Administrator

    Hello guys,

    Since I don’t have the hardware I cannot test the patches, but I found 2 merge issues that resulted in duplicated code. So I did the merge again, revision by revision, and built a module for each of the following versions:

    2.5.0-8
    2.5.7
    2.5.8
    2.5.13
    2.5.15

    This 2.5.15 is different from the old one I posted here a few days ago.

    I built versions >= 2.5.13 with IXGBE_LEGACY_TX option defined to make it work with ALTQ.

    You can find all these versions at:

    http://files.pfsense.org/garga/ixgbe_modules/2.1/

    Please let me know the results when you are able to test it.



  • Hi Renato, awesome.

    I can’t promise an answer before weekend, but I hope to report back as quickly as possible.
    I will focus on the newer revisions since they will likely support the newer chips found on systems like the one wladikz has.



  • Hi Renato,

    I’ll try to check the modules tomorrow. I have a big problem with mbuf reuse and the system crashing every 9-12 hours, so I’ll check the modules ASAP.

    Renato, can you put the sources of the compiled modules in the same location as the modules?



  • Hi renato and wladikz

    OK, I took some time on Saturday to give your ixgbe amd64 builds some beatings:

    • 2.5.15 (round 2): Panics as before, no chance to log in.

    • 2.5.13: The first module that doesn’t panic the system before login. Traffic passes at first, but when I went to check other services from interconnected servers (after roughly 1 minute), it panicked again.

    • 2.5.8: Crashes, similar to 2.5.15. Honestly, I didn’t look closely at when it did and went on to 2.5.7.

    • 2.5.7: Same as 2.5.13. It loads, I can fire up a tcpdump and stop it, then I go and check some services, and boom: it panics.

    • 2.5.0-8: OK, I crossed my fingers and hoped for the best with this last one: same as 2.5.7 et al.

    With 2.5.0 I actually tried to log in with a device via the wifi VLAN. The AAA server (not on pfSense) allowed access and the client was waiting for a DHCP address from pfSense. That’s the moment I realized it had just crashed.

    The method used (just in case needed to reproduce situation) was:

    • loader.conf.local tunings for ixgbe/igb set to what I previously posted
    • Default in loader.conf.local is to load the old 2.4.5 module
    • I booted the system, dropped to the loader prompt, and did: 1) unload, 2) load /boot/kernel/kernel, 3) load /boot/kernel/ixgbe<version>.ko and finally 4) boot

    This way I did not always have to copy back and forth modules and always was able to boot back into something working in case it crashed.
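
    Spelled out, the loader-prompt sequence from the list above looks roughly like this (`<version>` remains a placeholder for whichever module file is being tested):

```
unload
load /boot/kernel/kernel
load /boot/kernel/ixgbe<version>.ko
boot
```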

    @Renato: In case it would be helpful, I could provide you with remote KVM access to the box.



  • I had some spare motivation to make another round  😎

    The variables changed are:

    • Removed the shellcmd “ifconfig ix0 -vlanhwfilter” (which only started to be required with 2.5.0-8 in RC)

    • Added hw.intr_storm_threshold=10000 to the system tunables, as suggested by the wiki and the Intel driver doc

    I hope that removing the shellcmd takes one thing out of the equation, since this command seems to be executed late in the startup process.

    • Baseline is 2.4.5 (my .ko): System loads and seems to work as before. I have VLANs working and no ECC error.

    • 2.5.15: System boots up with no panic until after login, so I have to guess the earlier panics had to do with the previous shellcmd. I get traffic on ix0, but not on ix0_vlan111. Executing the previously enabled shellcmd makes VLANs work again. Testing with a wireless client: it gets a connection and a DHCP lease from dhcpd, but the connection is dead thereafter because the server has panicked.

    • 2.5.0-8 (yours): Same behaviour as 2.5.15 in terms of bootup and the VLAN issue. System boots up with no panic at first. I was at least able to browse to one page before it panicked too.

    If wladikz or others get something working without VLANs, that could be an indication that VLANs with the newer ixgbe driver cause issues. I hope this helps a little further. I consider myself and my employer lucky to currently have the option of using the older driver to get the cards recognized and VLANs working.  😕



  • Hi All,

    I’m currently trying to move one of my firewalls onto ESXi, and I’ll check whether it works better or not.

    01-Oct-2013 01:00 GMT+3
    Looks much better. MBUFs are not growing, no errors 🙂 I’ll give it 2-3 days to run before migrating the second cluster node to the same platform.



  • Hi All,

    Current status of my system:
    1. pfSense 2.1-RELEASE running on ESXi 5.1 Update 1 with two dual-port Intel 10G cards and one dual-port Intel 1G card
        - Intel Corporation 82599EB 10-Gigabit SFP+ Network Connection (X520-DA2)
        - Intel Corporation Ethernet Controller X540-AT2
        - Intel Corporation I350 Gigabit Network Connection
    2. The system is connected to 162 subnets on 136 VLANs.
    3. Average throughput over the last 24 hours is 1.8 Gb/s.

    I don’t see any packet loss or disconnects. I see only one problem that could affect my network: when we run stress tests (up to 5.5 Gb/s of traffic), the driver for the VMXNET3 adapter is single-queue. As far as I checked, FreeBSD 9 has a multi-queue VMXNET3 driver (maybe I’ll try to port it to 8.3).

    Pros: the system runs stably. Every virtual switch is connected with 2 links to different switches.
    Cons: firewall throughput is limited to 10 Gb/s (no LAGGs).

    If anyone wants to know how to configure such a cluster, I can publish a step-by-step guide.


  • Banned

    That would be nice if you wanted to do that.

    Why not team the adapters and get better throughput? Have you tried the Intel drivers in ESXi?



  • Awesome!

    +1 for the guide 🙂



  • @Supermule:

    That would be nice if you wanted to do that.

    Why not team the adapters and get better throughput? Have you tried the Intel drivers in ESXi?

    I use the free ESXi hypervisor. The first problem is that the free ESXi server doesn’t have a teaming option. The second problem is that you can’t create a LAGG from the guest. If you know how to do teaming on free ESXi, I’ll be happy to try it.

    I’ll write the guide about my configuration tomorrow.


  • Banned

    Thanks 🙂 I run Enterprise Plus on vCenter, so I’m a little spoiled 🙂

    Long time since I ran a free version.



  • @Supermule:

    Thanks 🙂 I run Enterprise Plus on Vcenter so a little spoiled 🙂

    Long time since I ran a free version.

    An Enterprise Plus license costs around 6k per socket. The license price for two dual-socket servers is very close to that of a Dell SonicWALL appliance :). I’m trying to build a cheap solution. And even with an Enterprise Plus license you can’t create a LAGG on the guest.


  • Banned

    Of course you can!

    You just team them on the vSwitch.



  • Do you have any howtos?


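  • Banned

    Yes. http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004088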

  • @Supermule:

    Yes. http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1004088

    This guide is about host link aggregation. I need link aggregation on the guest side.


  • Banned

    You asked for teaming on ESXi… not on pfSense.

    Host link aggregation presents multiple physical adapters as one to the guest.

    Have you tried separate physical adapters and LAGGing them in pfSense? Presenting more than one to the guest on ESXi?



  • Yes. It’s possible only if I use PCI pass-through. I’m going to check the option of CentOS + KVM + Open vSwitch as the hypervisor.


 
