MCA memory errors - which DIMM is failing?
-
Every few months, I see a handful of kernel messages like these:
MCA: Bank 10, Status 0x8c000040000800c1
MCA: Global Cap 0x0000000001000c16, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x50663, APIC ID 0
MCA: CPU 0 COR (1) MS channel 1 memory error
MCA: Address 0x17aeccf40
MCA: Misc 0x90000400040008c
How do I determine which DIMM is having the issues?
This is on a Supermicro X10SDV-TP8F with an Intel D-1518 proc and two 8GB ECC DIMMs.
I heard that mcelog --ascii might help so I ran the kernel messages through it and this is what mcelog decoded it to:
Hardware event. This is not a software error.
CPU 0 BANK 10
MISC 90000400040008c ADDR 17aeccf40
MCG status:
MemCtrl: Corrected patrol scrub error
STATUS 8c000040000800c1 MCGSTATUS 0
MCGCAP 1000c16 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 86 Step 3
However I'm still not seeing which DIMM is at fault.
Any ideas? (Thanks in advance!)
-
I am not sure, but that looks more like a memory controller error than a problem with the memory itself.
You can always use memtest if you can load a compatible OS on a different drive.
-
IPMI logs show anything?
-
@impatient - This box is in production. However, since there are two DIMMs, I could always take one out at a time and test it on a different box with MemTest86. However, since the error messages contain so much detail, I was hoping there would be some way to identify which DIMM.
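For reference, here is a minimal Python sketch of how the Status value from the kernel messages breaks down into fields. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM; the function name and dictionary keys are made up for illustration and are not part of any tool. It only decodes what the register itself carries, so it can recover the channel but not a DIMM slot.

```python
# Sketch only: assumes the architectural IA32_MCi_STATUS bit layout (Intel SDM).
# decode_mci_status and its keys are hypothetical names used for illustration.
STATUS = 0x8c000040000800c1  # "Status" value from the Bank 10 kernel message

# MMM field of a memory controller error code (0000 0000 1MMM CCCC)
MEM_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}

def decode_mci_status(status):
    mca_code = status & 0xFFFF  # bits 15:0, architectural MCA error code
    fields = {
        "valid": bool((status >> 63) & 1),        # bit 63, VAL
        "overflow": bool((status >> 62) & 1),     # bit 62, OVER
        "uncorrected": bool((status >> 61) & 1),  # bit 61, UC
        "misc_valid": bool((status >> 59) & 1),   # bit 59, MISCV
        "addr_valid": bool((status >> 58) & 1),   # bit 58, ADDRV
        "corrected_count": (status >> 38) & 0x7FFF,     # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": mca_code,
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        fields["memory_op"] = MEM_OPS.get((mca_code >> 4) & 0x7, "reserved")
        fields["channel"] = mca_code & 0xF  # 0xF would mean "channel not specified"
    return fields

if __name__ == "__main__":
    for name, value in decode_mci_status(STATUS).items():
        print(f"{name}: {value}")
```

For this Status value the sketch reports a valid, corrected (not uncorrected) scrub error on channel 1 with a corrected-error count of 1, which agrees with the "COR (1) MS channel 1 memory error" line. Which physical slot sits on channel 1 is board-specific and is not encoded in the register, so the motherboard manual would still be needed for that step.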
-
@stephenw10 - Good idea! I'm assuming you're referring to the "Health Event Log" within Supermicro's IPMI software?
Unfortunately, I just checked it and it only has 10 "AC Power On - Assertion" events. Oddly enough, the most recent one is from Feb 2020 and this box has definitely been powered off/on since then.
-
Yeah, that's what I was suggesting. That can often show errors of that kind with more useful output.
I'm not sure you'd be able to see which DIMM is responsible there, though. Not unless the log specifically names a DIMM slot, and your error output does not.
Steve