Netgate Discussion Forum

MCA memory errors - which DIMM is failing?

Hardware
Tags: mca, dimm, ecc, mcelog
6 Posts 3 Posters 1.3k Views
    xyzzyz
    last edited by Jun 13, 2021, 2:39 AM

    Every few months, I see a handful of kernel messages like these:

    MCA: Bank 10, Status 0x8c000040000800c1
    MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
    MCA: Vendor "GenuineIntel", ID 0x50663, APIC ID 0
    MCA: CPU 0 COR (1) MS channel 1 memory error
    MCA: Address 0x17aeccf40
    MCA: Misc 0x90000400040008c
    

    How do I determine which DIMM is having the issues?

    This is on a Supermicro X10SDV-TP8F with an Intel D-1518 proc and two 8GB ECC DIMMs.

    I heard that mcelog --ascii might help, so I ran the kernel messages through it. This is what mcelog decoded them to:

    Hardware event. This is not a software error.
    CPU 0 BANK 10 
    MISC 90000400040008c ADDR 17aeccf40 
    MCG status:
    MemCtrl: Corrected patrol scrub error
    STATUS 8c000040000800c1 MCGSTATUS 0
    MCGCAP 1000c16 APICID 0 SOCKETID 0 
    CPUID Vendor Intel Family 6 Model 86 Step 3
    

    However, I'm still not seeing which DIMM is at fault.

    Any ideas? (Thanks in advance!)

      Impatient
      last edited by Jun 13, 2021, 3:02 AM

      I am not sure, but that looks more like a memory controller error than a problem with the memory itself.

      You can always use memtest if you can load a compatible OS on a different drive.

        stephenw10 Netgate Administrator
        last edited by Jun 13, 2021, 11:45 AM

        IPMI logs show anything?

          xyzzyz @Impatient
          last edited by Jun 13, 2021, 4:17 PM

          @impatient - This box is in production, but since there are two DIMMs, I could take one out at a time and test it in a different box with MemTest86. Still, since the error messages contain so much detail, I was hoping there would be some way to identify the DIMM from them.

            xyzzyz @stephenw10
            last edited by Jun 13, 2021, 4:21 PM

            @stephenw10 - Good idea! I'm assuming you're referring to the "Health Event Log" within Supermicro's IPMI software?

            Unfortunately, I just checked it and it only has 10 "AC Power On - Assertion" events. Oddly enough, the most recent one is from Feb 2020 and this box has definitely been powered off/on since then.

              stephenw10 Netgate Administrator
              last edited by Jun 13, 2021, 11:06 PM

              Yeah, that's what I was suggesting. That can often show errors of that kind with more useful output.

              I'm not sure you can see which DIMM might be responsible there, though, unless the log specifically shows a DIMM slot. Your error output does not.

              Steve
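
For reference, the Status value from the kernel messages in the first post can be decoded by hand. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS layout described in the Intel Software Developer's Manual (flag bits 63:57, corrected error count in bits 52:38, MCA error code in bits 15:0). The function name `decode_mci_status` and the dictionary keys are hypothetical, chosen only for this illustration. The sketch does not decode the Misc value, whose layout is model-specific, and it does not map a channel to a physical slot, since that mapping depends on the board.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed per the Intel SDM).
FLAG_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}

# MMM field of a memory controller compound error code.
MEM_TRANSACTION = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its architectural fields."""
    mca_code = status & 0xFFFF
    result = {
        "flags": [n for n, bit in FLAG_BITS.items() if (status >> bit) & 1],
        "corrected_count": (status >> 38) & 0x7FFF,    # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": mca_code,                          # bits 15:0
    }
    # Memory controller errors have the form 000F 0000 1MMM CCCC,
    # so bit 12 (F) is ignored when matching the pattern.
    if (mca_code & 0xEF80) == 0x0080:
        channel = mca_code & 0xF
        result["transaction"] = MEM_TRANSACTION.get((mca_code >> 4) & 0x7, "reserved")
        # CCCC == 1111 means the channel is not specified.
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    for key, value in decode_mci_status(0x8C000040000800C1).items():
        print(key, value)
```

For the value above, this yields a corrected count of 1, a memory scrubbing error, and channel 1, which agrees with the kernel's "COR (1) MS channel 1 memory error" line and with mcelog's "Corrected patrol scrub error". If the two DIMMs sit on different memory channels, the channel number may narrow down the suspect, but which slot corresponds to channel 1 would have to be confirmed against the motherboard manual.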

              Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.