First hard crash in years on pfSense
-
-
It's possible. You may be the only user with that combination of hardware. Though the error doesn't indicate that directly.
-
We have seen bad SFP modules put a NIC into a state that requires a full power cycle to clear. Not on a 6100 though as far as I know.
-
-
@stephenw10 Thanks. Based on that, my current course of action now is as follows:
1: Keep it running on 22.05 as it is now. I'll consider 6 weeks without a crash a success, based on the 8- and 14-day intervals between MCA crashes on 23.01.
-
If it crashes on 22.05 as well, one more test will be a full power-off for a while and then resuming 23.01. If it fails again, that points to dead hardware.
-
If it does not crash on 22.05, I'll revert both my SFPs to an RJ45 connection, using my switch for fiber termination in a closed untagged VLAN, and resume testing 23.01.
Does that not sound like the most conclusive way to go from here?
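As a rough sanity check on the 6-week criterion, here is a minimal sketch of how unlikely a crash-free run of that length would be if the fault were still present. It assumes (this is an assumption, not something established in the thread) that crashes arrive at random with a constant rate, so the time between crashes is exponentially distributed with a mean equal to the average of the two observed intervals. Two data points is very little to go on, so treat the number as an order-of-magnitude guide only. The function name `prob_no_crash` is hypothetical.

```python
import math


def prob_no_crash(days, intervals):
    """Chance of seeing no crash for `days` days.

    Assumes a constant crash rate (memoryless model), with the mean
    time between crashes estimated from the observed intervals.
    """
    mean_interval = sum(intervals) / len(intervals)
    return math.exp(-days / mean_interval)


# Intervals between MCA crashes on 23.01, as reported above (in days).
observed_intervals = [8, 14]

# Probability that a still-faulty box survives the 6-week (42-day) test.
p = prob_no_crash(42, observed_intervals)
print(f"Chance of 42 crash-free days if the fault persists: {p:.1%}")
```

Under that assumption the chance comes out at roughly 2%, so a clean 6-week run on 22.05 would be reasonably strong evidence that 22.05 does not trigger the fault.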
-
-
Yes, that would be a great test if you can do it.
-
@stephenw10 Status report: The current uptime on 22.05 without issues is now 27 days.
-
@keyser Take 23.01 and run it in a VM on the same hardware.
Then you will know for sure. I bet you it doesn't crash with MCA errors.
I bet you it's driver related in 23.01.
-
Yes, we have seen new drivers enable some piece of hardware that then triggers MCA errors. So although it is a hardware problem it's only a problem if you enable that hardware.
But that's unlikely in the 6100 because there are so many out there running 23.01. Unless there is any additional hardware in it. And the only things I see in the boot log are USB devices which shouldn't be capable of this.
-
@stephenw10 Yeah, that's my thinking as well. I am using a couple of SFP transceivers (one of which is a BiDi), and I have installed my own NVMe SSD. I have also connected a USB serial port cable, which gives me a console backdoor to my Raspberry Pi in case it goes down. Lastly, there is an Eaton USB UPS connected via NUT.
If it stays solid on 22.05, then apart from the full power-off I planned earlier, should I remove, for example, the USB-to-serial cable? Which of my "anomalies" do you consider most likely to cause a driver issue that can MCA the box?
-
Do you have the new BlinkBoot version installed? Not that I'm aware of anything in it that would affect this.
Of those things the NVMe drive is most likely to cause a problem since it's a PCIe device. No USB device should be able to cause that sort of error IMO. But it's easy to remove them as a test.
Steve
-
@stephenw10 I do. As far as I remember, I had that installed for quite a while before the 23.01 upgrade.
Regarding the SSD: if it stays solid and passes my 6-week period on 22.05, I’ll try the full power off/on and remove the serial port cable to give 23.01 a new spin. If it fails again, I’ll look into running for a period without the SSD.
Thanks for sharing :-)
-
There's a new BlinkBoot version, CORDOBA-03.00.00.03t, that was just released in a new Netgate Firmware Update package. I don't believe it will do anything here but it would be an easy test. The updated package is only in 23.01 though.
-
@stephenw10 Noted
-
@stephenw10 Well, my six-week test period has now concluded, and the box has been completely stable on 22.05 throughout.
So tomorrow I’ll give it a full power-off, disconnect my USB serial port device, and let it boot on 23.01 again.
Here’s crossing my fingers that this will cut it. Otherwise I’ll have to start testing without my SSD and my SFP optics.
I get that FreeBSD 14 may theoretically use some region of memory or cache, or a new instruction, that 12.3 does not, and thus hit an actual hardware error that 12.3 just never triggers. But I find that pretty unlikely...
-
Yeah, it does seem unlikely. Mostly because there are thousands of 6100s running 23.01 and not hitting it.
-
@stephenw10 Well, 10 days in and it crashed again on 23.01…
When I booted it 10 days ago I made sure it had a full power-off, and my USB serial port device is not plugged in.
It’s still an MCA error, but it is really strange that it always takes about 8-12 days to crash, and that it is 100% stable on 22.05.
What’s the best course of action now? Test with no SFPs, or remove my SSD and install/boot from eMMC?
Or should I wait and try a full repave with 23.05 once released?
-
Hmm, it would be good to test 23.05 but it would not surprise me at all if the issue exists there too.
It must be some hardware difference so, yes, if you can I would test without the SFP modules.
-
@stephenw10 Just curious, but does 23.05 contain a newer FreeBSD 14 kernel and driver/module versions (fixes) than 23.01?
Or does 23.05 only contain the fixes Netgate has made to various services and components?
-
It has newer drivers. It's built on a newer FreeBSD head snap.
-
@stephenw10 Okay - I’ll make sure to give 23.05 a spin first to see if that changes anything.
-
@stephenw10 At the risk of jinxing the situation, it seems 23.05 makes a difference. It has been completely stable for 23 days now with no issues since the upgrade, and before that it would never go more than about 14 days without a crash.
I continue to be 100% certain that even though the crashes in 23.01 reported defective hardware, that is not the case.
22.05 never crashed on me, and the crashes started the day I upgraded to 23.01. Returning to 22.05 made it stable again.
So far 23.05 seems to make it stable again. (Fingers crossed)
-
Hmm, bizarre.
Really the only way I can see that happening is something in 23.01 that's tickling some hardware device that's marginal on your particular 6100, and that is now removed from 23.05.
That's a lot of things that have to line up... so perhaps I'm overlooking something.