Netgate Discussion Forum

    MCA memory errors - which DIMM is failing?

    Hardware
    mca dimm ecc mcelog
      xyzzyz

      Every few months, I see a handful of kernel messages like these:

      MCA: Bank 10, Status 0x8c000040000800c1
      MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
      MCA: Vendor "GenuineIntel", ID 0x50663, APIC ID 0
      MCA: CPU 0 COR (1) MS channel 1 memory error
      MCA: Address 0x17aeccf40
      MCA: Misc 0x90000400040008c
      

      How do I determine which DIMM is having the issues?

      This is on a Supermicro X10SDV-TP8F with an Intel D-1518 proc and two 8GB ECC DIMMs.

      I heard that mcelog --ascii might help, so I ran the kernel messages through it, and this is what it decoded them to:

      Hardware event. This is not a software error.
      CPU 0 BANK 10 
      MISC 90000400040008c ADDR 17aeccf40 
      MCG status:
      MemCtrl: Corrected patrol scrub error
      STATUS 8c000040000800c1 MCGSTATUS 0
      MCGCAP 1000c16 APICID 0 SOCKETID 0 
      CPUID Vendor Intel Family 6 Model 86 Step 3
      

      However I'm still not seeing which DIMM is at fault.

      Any ideas? (Thanks in advance!)

        Impatient

        I am not sure, but that looks more like a memory controller error than a problem with the memory itself.

        You can always run memtest if you can boot a compatible OS from a different drive.

          stephenw10 Netgate Administrator

          IPMI logs show anything?

            xyzzyz @Impatient

            @impatient - This box is in production. Since there are two DIMMs, I could take one out at a time and test it on a different box with MemTest86. However, since the error messages contain so much detail, I was hoping there would be some way to identify which DIMM is failing.
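
To show how much of that detail can be unpacked by hand, here is a minimal sketch that decodes the architectural fields of the Status value from the kernel message. The bit positions follow the IA32_MCi_STATUS layout described in the Intel Software Developer's Manual; the function and table names (`decode_mca_status`, `MEMORY_TXN_TYPES`) are made up for illustration and are not part of mcelog or any other tool. It is only an assumption-laden helper for reading the log, not a replacement for mcelog.

```python
# Illustrative decoder for the architectural part of an MCA bank status word.
# Names are hypothetical; bit positions follow the IA32_MCi_STATUS layout
# in the Intel SDM (machine-check architecture chapter).

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEMORY_TXN_TYPES = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_mca_status(status: int) -> dict:
    """Split a 64-bit MCA status value into its architectural fields."""
    fields = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC
        "misc_valid": bool((status >> 59) & 1),   # MISCV
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        # Bits 52:38, meaningful only when the CPU supports
        # corrected-error counting.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }
    code = fields["mca_error_code"]
    # Memory controller errors have the form 000F 0000 1MMM CCCC;
    # bit 12 (F) is masked out before matching.
    if (code & 0xEF80) == 0x0080:
        fields["memory_error"] = MEMORY_TXN_TYPES.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        # CCCC == 1111 means the channel is not specified.
        fields["channel"] = None if channel == 0xF else channel
    return fields


if __name__ == "__main__":
    for name, value in decode_mca_status(0x8C000040000800C1).items():
        print(f"{name}: {value}")
```

For the status above, this gives a corrected error count of 1, a memory scrubbing error, and channel 1, which lines up with the kernel's "COR (1) MS channel 1 memory error" and mcelog's "Corrected patrol scrub error". The channel number is as far as the architectural fields go; how a channel maps to a physical DIMM slot depends on the board and does not appear in any of the output above.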

              xyzzyz @stephenw10

              @stephenw10 - Good idea! I'm assuming you're referring to the "Health Event Log" within Supermicro's IPMI software?

              Unfortunately, I just checked it and it only has 10 "AC Power On - Assertion" events. Oddly enough, the most recent one is from Feb 2020 and this box has definitely been powered off/on since then.

                stephenw10 Netgate Administrator

                Yeah, that's what I was suggesting. That can often show errors of that kind with more useful output.

                I'm not sure you can see there which DIMM might be responsible, though. Not unless it specifically shows a DIMM slot, and your error output does not.

                Steve

                Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.