First hard crash in years on pfSense

keyser

Hmm, I upgraded my 6100 to 23.01 about 10 days ago, and the upgrade went completely as expected - no issues (It was repaved with ZFS@22.05).
I have not installed any patches yet, but I did set vfs.zfs.arc.max to 256MB to keep the ARC from eating all my memory during a tftp job and the nightly cron job.

But today - suddenly out of nowhere it crashed and rebooted.

When I logged in there was a crash report with the info:
Unrecoverable machine check exception

The msgbuf file (below) has very few lines since the upgrade completed booting about 10 days ago (I included the last 2 lines from then).
The newer lines mostly seem to be ZFS related. There are a few promiscuous mode messages, but I think they could be from when I started NtopNG a few times:

<118>Netgate pfSense Plus 23.01-RELEASE amd64 Fri Feb 10 20:06:33 UTC 2023
<118>Bootup complete
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
<6>ix2: promiscuous mode enabled
<6>ix2.3: promiscuous mode enabled
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
<6>ix2: promiscuous mode disabled
<6>ix2.3: promiscuous mode disabled
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
len 4 vecnum: 126 sizeof (zfs_cmd_t) 4528
<6>ix2: promiscuous mode enabled
<6>ix2.3: promiscuous mode enabled
<6>ix2: promiscuous mode disabled
<6>ix2.3: promiscuous mode disabled
MCA: Bank 5, Status 0xba00000028000402
MCA: Bank 5, Status 0xba00000028000402
MCA: Bank 5, Status 0xba00000028000402
MCA: Bank 5, Status 0xba00000028000402
MCA: Global Cap 0x0000000000000c09, Status 0x0000000000000004
MCA: Global Cap 0x0000000000000c09, Status 0x0000000000000004
MCA: Vendor "GenuineIntel", ID 0x506f1, APIC ID 16
MCA: CPU 2 UNCOR EN PCC internal error 2
MCA: Misc 0x0
panic: Unrecoverable machine check exception
cpuid = 2
time = 1677754223
KDB: enter: panic
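
For reference, the "time" value in the panic output is a Unix epoch timestamp. A minimal Python sketch for turning it into a readable UTC time (the helper name is just for illustration):

```python
from datetime import datetime, timezone


def panic_time_utc(epoch_seconds: int) -> str:
    """Convert the kernel's 'time = ...' epoch value to an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


# Value taken from the "time = 1677754223" line in the msgbuf above.
print(panic_time_utc(1677754223))  # 2023-03-02T10:50:23+00:00
```

That puts the panic on 2 March 2023, which fits with the upgrade about 10 days earlier.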

Any ideas? Is the zfs.arc.max tunable causing problems?

bmeeks

MCA means "Machine Check Architecture". That indicates a hardware-based error. Internal error 2 seems to match this description I found for MCA errors on Intel hardware: Parity error in internal microcode ROM. The error message indicates the problem occurred in CPU 2.

Information about MCA errors appears scarce on the web. I did find this Intel document: https://xem.github.io/minix86/manual/intel-x86-and-64-manual-vol3/o_fe12b1e2a880e0ce-526.html.

keyser

@bmeeks Thanks.

Wtf... It's only a year and a half old 6100....

Strange, however, that it came so soon after the 23.01 update. I hope this is not turning out to be a dead unit. Will keep this thread updated if it returns.

stephenw10

Yes, you should certainly open a ticket with us for this. You should not be seeing that error.
https://www.netgate.com/tac-support-request

Steve

keyser

@stephenw10 Thanks. Ticket opened at: 1474872417

Hoping to get clarification on whether this is hardware or whether it could be 23.01.
It seems very coincidental that it came a few days after the upgrade.

dalicollins

@keyser Got the same exact error here. Just updated to 23.01 two days ago. Put the internet down from 12:47AM to 8:19 this morning.

Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
8 CPUs : 1 package(s) x 4 core(s) x 2 hardware threads
AES-NI CPU Crypto: Yes (active)
QAT Crypto: No

All my tunables are default.

keyser

@dalicollins Support said it was hardware, but I have my doubts. It has not happened since, and my box was 100% rock stable up until the upgrade to 23.01.
So until further notice (I will update this thread), I believe this is a carry-over issue from the upgrade process, and that the full reboot from the crash has cleared “the issue”.
If it happens again I can fall back to my 22.05 boot environment and see if that is stable (confirming it's a 23.01 issue).

stephenw10

Whilst MCA errors are almost always hardware they can be triggered by a new version trying to use some feature the old one did not. On a 6100 that should not be the case unless you've added an expansion card? On a whitebox i7 it could be any number of things. The actual MCA error shown might narrow that down.

Steve

dalicollins

@stephenw10 Unfortunately, I did not get a crash report, just saw these errors on the console.

stephenw10

Well if you see it again note the MCA errors to reference later.

keyser

@dalicollins Well, my 6100 just crashed hard again on 23.01 (second crash). Once again it is an MCA, but the crash report is much longer and much more detailed this time.

Netgate Support is unfortunately not a help as my box is two years old. To them it’s dead hardware if it MCA’s. No analysis needed, buy a new box.

I still doubt it, as it is far too unlikely a coincidence for this to start right after 23.01, but since it has taken first 8 and now 14 days between crashes, it will take a while to prove wrong.

I’ll revert my boot environment to 22.05 tonight, and then we will see.

QUESTION: Could 23.01 have a problem with either the NVMe SSD I installed or the SFP modules I’m using, that causes it to MCA at some point? If so, any way to tell from the crashdump that I saved?

stephenw10

What's the actual MCA error you're seeing?
You can decode it to some extent using mcelog. Unfortunately the first log you had is not helpful:

admin@FreeBSD-14:~ $ mcelog --no-dmi --ascii --file mcalog1.txt 
Hardware event. This is not a software error.
CPU 0 BANK 5 
MISC 0 
MCG status:
MISC format 0 value 0
STATUS ba00000028000402 MCGSTATUS 0
APICID 0 SOCKETID 0 
(Fields were incomplete)
Hardware event. This is not a software error.
CPU 0 BANK 5 
MISC 0 
MCG status:
MISC format 0 value 0
STATUS ba00000028000402 MCGSTATUS 0
APICID 0 SOCKETID 0 
(Fields were incomplete)
Hardware event. This is not a software error.
CPU 0 BANK 5 
MISC 0 
MCG status:
MISC format 0 value 0
STATUS ba00000028000402 MCGSTATUS 0
APICID 0 SOCKETID 0 
(Fields were incomplete)
Hardware event. This is not a software error.
CPU 2 BANK 5 
MISC 0 
MCG status:MCIP 
STATUS ba00000028000402 MCGSTATUS 4
MCGCAP c09 APICID 10 SOCKETID 0 
CPUID Vendor Intel Family 6 Model 95 Step 1
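
The status word from the log can also be picked apart by hand. The sketch below is a minimal Python decoder, assuming the IA32_MCi_STATUS bit layout and the CPUID signature layout described in the Intel SDM (Vol. 3 and Vol. 2); the function names are made up for illustration, and this is not what mcelog or the kernel actually runs. Under that assumption, the low 16 bits (0x0402) appear to land in the "internal unclassified" range (0000 01xx xxxx xxxx), which would explain the "internal error 2" on the console, and the CPUID value 0x506f1 decodes to the same Family 6 Model 95 Step 1 that mcelog reports.

```python
# Flag bits of IA32_MCi_STATUS (assumed from the Intel SDM; verify for your CPU).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCA bank status word into flags and error codes."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    mca_code = status & 0xFFFF                # architectural MCA error code
    model_specific = (status >> 16) & 0xFFFF  # model-specific error code
    # Pattern 0000 01xx xxxx xxxx: internal timer / internal unclassified range.
    internal = (mca_code & 0xFC00) == 0x0400
    return {
        "flags": flags,
        "mca_code": mca_code,
        "model_specific": model_specific,
        "internal_error": (mca_code & 0x03FF) if internal else None,
    }


def decode_cpuid_signature(eax: int) -> tuple:
    """Return (family, model, stepping) from a CPUID leaf 1 EAX signature."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    family = base_family + ext_family if base_family == 0xF else base_family
    model = (ext_model << 4) | base_model if base_family in (0x6, 0xF) else base_model
    return family, model, stepping


# Values copied from the msgbuf earlier in the thread.
print(decode_mci_status(0xBA00000028000402))
print(decode_cpuid_signature(0x506F1))
```

For the values in this thread the flags come out as VAL, UC, EN, MISCV and PCC, which lines up with the "UNCOR EN PCC" in the panic output. The model-specific half (0x2800) is not interpreted here, since its meaning depends on the CPU model.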

keyser

@stephenw10 The MCA error is exactly the same as the first time at the end of msgbuf.txt.

But there is WAY more data in ddb.txt this time around.

stephenw10

Ah well the backtrace and/or message buffer might reveal something.

keyser

@stephenw10 Could I ask you to have a look at it? You strike me as the most knowledgeable person on this subject :-)

Is there a way to PM you the entire crashdump directly? There is a lot of text, and vetting it for IPs and other secrets will take a long time....

On another note: If the Error 2 actually implies a parity error in the microcode ROM... Did the CPU microcode get updated with 23.01? Can it be re-flashed to perhaps solve read issues (decay)?
Could there be something in the line of a complete power-off needed to "reset" something? I haven't had it completely powered off yet since the update.

stephenw10

Sure, if you upload it here I can review it:
https://nc.netgate.com/nextcloud/index.php/s/6AQJnY7bDd9KZym

The CPU microcode gets updated at boot by pfSense. If that was actually using bad code I'd expect to see significantly more issues.

keyser

@stephenw10 Thank you Stephen. The files are uploaded now.
I’m very grateful, and very curious whether you can see anything more specific than just defective hardware.

I’ll still be reverting to 22.05 for now - I just want to make sure/know if this is related to the 23.01 install.

stephenw10

Mmm, unfortunately there's nothing further shown there. Everything looks fine until it throws the MCA errors which also ties in with a hardware issue.
How often does it panic like that?

I suspect it will also panic in 22.05. I would be very interested to find out.

Steve

keyser

@stephenw10 Thank you for taking a look.
It has panicked twice. The first time was about 7 days after going to 23.01, and the second about 14 days after that.
It was 100% stable on 22.01/22.05 for 1½ years before going to 23.01.

stephenw10

Well if it is stable in 22.05 again that would be a very interesting result. I suspect it was just coincidence though.