6100 Failed eMMC replaced with NVMe, but now the unit no longer reboots
-
Hmm, it is shutting down correctly there then. Or at least appears to be.
Does it boot back from there if you power it up by pressing the PWR button?
-
No activity when trying to restart by pressing the power button. Just stays in the current state.
-
Hmm, that is odd. Momentary press? 1s press? 3s press? All of them?
-
Tried all the sequences of presses; a long press did restart the unit, but it still isn't ideal to have someone go into a remote server room and do this after each reboot.
One of the units had stopped putting out any console output, so we took the next, more drastic step and removed the eMMC from the board.
Since doing this, the unit boots into the external storage and reboots fine, so it does seem like the issue is with the eMMC. We are going to repeat the process on the other unit to see if it also solves its rebooting issue.
My only real concern is that, since doing this, almost all our 6100s in the field are reporting their eMMC is at end of life; some units are less than 6 months old and sit in satellite offices which don't do a lot of traffic. I just hope that we aren't going to have a cascade of failing eMMC drives.
-
If it's less than 6 months old it should be in warranty, so it would be eligible to be replaced.
That's good info on removing the eMMC though.
-
Yup, I have 2 of these now, both with Innodisk 128GB NVMe drives. If we have to physically remove the eMMC... that's not cool. There should be a way to disable it via firmware or something, BIOS maybe (if an interface to change the config exists on these). The issue in general is pretty messed up though...
-
My only real concern is that, since doing this, almost all our 6100s in the field are reporting their eMMC is at end of life; some units are less than 6 months old and sit in satellite offices which don't do a lot of traffic. I just hope that we aren't going to have a cascade of failing eMMC drives.
Sorry to hear that you are going through this! I am in the same situation - 6 devices with failed eMMC storage and another 10 of 30 that are at or over 100% estimated wear.
It is mind-blowing that you have units failing at less than 6 months - I guess I should consider myself lucky that mine seem to last 18-24 months before they start dying.
I also have 2 units with the same no-power-on issue. With one, I had just finished installing pfSense to a USB stick and it had booted and I was logged in, but on the next reboot it just stopped responding completely - no console anymore. I wonder if removing the eMMC would get it working again.
Are you running any packages on the failed units? Most of mine are remote/small office units that just run Zabbix, and yet they have impending eMMC failure.
Check out my thread for more information and discussion about failing storage in Netgate devices.
-
@stbellcom Run these commands in the Command Prompt or console to check the eMMC health of your devices:
pkg install -y mmc-utils; rehash
mmc extcsd read /dev/mmcsd0rpmb | egrep 'LIFE|EOL'
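For reference, the output of that egrep should look roughly like this (example lines from a low-wear device; your values will differ):
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01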
The Type_A and Type_B wear values are hex; convert to decimal and multiply by 10 to get the approximate wear percentage.
The Pre-EOL value is 0x01 while under 80% of the reserve blocks are consumed, 0x02 for over 80%, and 0x03 for over 90%.
https://docs.netgate.com/pfsense/en/latest/troubleshooting/disk-lifetime.html
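To make the math concrete (my own reading of the steps above, so double-check against the doc): a Type_A/Type_B value of 0x0b is 11 in decimal, i.e. roughly 110% of the rated write endurance already used; 0x05 would be about 50%, and 0x01 means no more than about 10%. A Pre-EOL of 0x02 or 0x03 means the spare block pool itself is 80% or 90%+ consumed, which is the point where the drive is close to end of life.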
-
We are running pfBlockerNG, which is known to cause a lot of writes, but I am running a non-standard setup: the MaxMind DB lists are generated externally and pulled down weekly, and most of the logging is disabled.
We only use it to maintain geo-blocking lists, no other features.
The unit mentioned above at the remote office has no incoming forwards, so it doesn't have pfBlockerNG installed, yet it still reports 0x0b. It does have an IPsec tunnel back to the main office.
The two units that have failed so far are both at the sites with the highest traffic; they average around 5 TB a month.
Almost all our units show 0x0b on Type_A and Type_B, with only one showing 0x05. For interest's sake I grabbed a brand-new unit out of the box and it showed 0x01, which is what I would expect.
-
@stbellcom Yikes, that 0x0b wear is scary. How many units do you have? Most of our units that failed or showed high wear only run Zabbix Agent and Zabbix Proxy.