Another Netgate with storage failure, 6 in total so far
-
This morning, I was surprised to find that my threads in /r/pfsense and /r/netgate have been deleted. Fortunately, I still have screenshots of an interesting post from kphillips-netgate (@kphillips), in which he says:
...I think it's important to err on the side of letting people discuss things without overbearing moderation unless it becomes necessary...
Interesting. Can you find out why both Reddit threads were deleted and who made that decision?
...[I] process RMA support tickets for devices every day...
...Beware of confirmation bias...
These are good points to keep in mind, as they tie into his next statement:
I haven't seen any particularly unusual numbers of RMAs for any particularly product in our lineup.
What specifically is considered to be an RMA? Is this all claims for RMA, or only claims that have been accepted under warranty? What is the ratio of approved RMAs to RMAs denied because they are past the 1-year warranty? It appears that a significant number of posts are about devices that failed after the 1-year warranty, and many users have had their warranty claim rejected or did not bother contacting Netgate since they were out of warranty. This would significantly alter the number and ratio of RMA claims.
...I'm not trying to admonish or belittle anybody here...
...sometime's it's totally by accident and I'm not trying to "blame shift" here...
I am glad to hear that you personally are not trying to admonish, belittle, or blame shift, but...
...You will only see people who ran into issues...
It seems that others do not share your view of giving the user the benefit of the doubt. In all the posts made by users who encountered storage failure, what we do see is Netgate consistently blaming the user for causing the storage failure, and never apologizing or showing empathy.
We have a page outlining many of the packages that need an SSD here (linked to the page).
I have repeatedly mentioned that this page is NOT linked anywhere. The only way you can be aware of its existence is by searching or following a link from someone. It is impossible for a purchaser to be aware of a) storage wear caused by high writes, b) what packages and settings can cause high writes, and c) the decision criteria for choosing a MAX model. If anyone can show me a direct link to the "Supported pfSense Plus Packages" page on the Netgate website, I will be forever humbled.
I run a 6100 as my edge for work at Netgate with on eMMC (no NVME SSD installed) and it gets worked.....HARD. It's my only 6100 I have and I use it for new release building, bug testing, package testing, and much more. It has been in continuous operation for about 3 years with little to know downtime. [screenshots showing eMMC health at 0x05, 0x06, 0x01, and a manufacturing date of 06/2022]
These VERY interesting data points are provided to us by a Netgate employee. Remember what another Netgate employee @jwt said in post #34?
...the principle difference between eMMC and NVMe or SSD device is the amount of flash present on a typical eMMC .vs SSD or NVMe drive...
Let us analyze these statements further.
jwt asserts that there is no significant difference between eMMC, SSD, or NVMe other than the total amount of flash. Okay, so then eMMC is not a technologically inferior storage medium.
Now, kphillips helpfully provides some evidence of how he has a 6100 with eMMC storage and he has worked it "HARD" for 3 years and yet the storage wear is only at 50-60%.
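For anyone who wants to sanity-check health values like these, here is a rough sketch of how I understand the JEDEC eMMC 5.x health fields to be encoded: the two life-time estimates are reported in 10% bands, and pre-EOL info is a separate status. Which screenshot value belongs to which field is my assumption (type A, type B, pre-EOL, in that order), and `projected_total_years` is a hypothetical helper that assumes wear continues at a constant rate, which real workloads may not do.

```python
# Decode eMMC health values as defined for the Extended CSD register
# (JEDEC eMMC 5.0 and later). Assumption: the screenshot values 0x05, 0x06,
# and 0x01 are life-time estimate type A, type B, and pre-EOL info.

PRE_EOL = {
    0x01: "normal",
    0x02: "warning (80% of reserved blocks consumed)",
    0x03: "urgent",
}


def life_used_range(value):
    """Return (low, high) percent of rated life used for a life-time estimate."""
    if 0x01 <= value <= 0x0A:
        return ((value - 1) * 10, value * 10)
    if value == 0x0B:
        return (100, None)  # estimated life exceeded
    return None  # 0x00 (not defined) or a reserved value


def projected_total_years(value, years_in_service):
    """Hypothetical helper: total years to reach 100% at a constant wear rate.

    Returns (pessimistic, optimistic), taken from the top and bottom of the
    10% band, or None when no projection is possible.
    """
    band = life_used_range(value)
    if band is None or band[1] is None:
        return None
    low, high = band
    pessimistic = years_in_service * 100 / high
    optimistic = years_in_service * 100 / low if low > 0 else None
    return (pessimistic, optimistic)


if __name__ == "__main__":
    for name, raw in (("type A", 0x05), ("type B", 0x06)):
        print(name, hex(raw), life_used_range(raw), projected_total_years(raw, 3))
    print("pre-EOL:", PRE_EOL.get(0x01, "undefined"))
```

Under those assumptions, a value of 0x06 after 3 years would project to roughly 5 to 6 years of total life at the same write rate. That is only a linear estimate, not a prediction for any particular device.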
This is where things start getting confusing...
If the statements made by jwt are correct and the data supplied by kphillips is true and representative of the durability of Netgate devices with eMMC storage, then why are so many users experiencing eMMC failure in under 3 years? And why is there a presumption that the user is at fault for causing the failure? And why does Netgate never express concern about these "rare" incidents? If kphillips were to create a post about his 6100 dying under these conditions, I wonder what kind of response he would receive? Is kphillips trying to prove that the 16GB of onboard eMMC storage actually is fine when "worked HARD"?
On the other hand, if kphillips' data is simply anecdotal and does not represent what a user can expect from the onboard eMMC storage, then it is irrelevant to this discussion.
So, which is it?
Is eMMC endurance similar to that of SSD/NVMe, so that Netgate devices with eMMC should be durable enough to handle package usage, as kphillips demonstrated? If so, the failed devices were simply equipped with faulty eMMC, the users did nothing wrong, and it is simply tough luck that the failure occurred after the 1-year warranty expired.
--OR--
Is the 16GB of onboard eMMC storage a known weak point that needs to be clearly identified and the user provided with abundant warnings on the product page, on the package manager page, whenever installing a package, and on the settings page of each package?
-
@stephenw10 There have been a few posts now on here and Reddit about successfully reviving a dead Netgate by removing the eMMC. I have a 2-year-old 4100 here that had the eMMC die, and I got it working on a USB flash drive only for it to go completely dead after restoring the config.
I have not had time to dig into it further, but if it truly is dead then I will try removing the eMMC chips to see if that revives it.
-
Here are screenshots of @kphillips' post for reference:
-
@andrew_cb I reached out to kphillips-netgate last night on Reddit and suggested we have a call to discuss this situation further, clarify the issues, and hopefully identify solutions. He asked for clarification of what threads I was referring to, and I sent him links to this thread and two others.
He has not gotten back to me yet, so I will update here when/if I receive any further responses from him.
-
Mmm, the problem with escalating things in this way is that it suppresses actual useful posts. It moves from a technical discussion to a marketing/legal matter where I (and others) can no longer comment.
-
@andrew_cb Holy crap, it worked for me! Yanked (read: carefully removed using appropriate rework methodology) the Kingston eMMC out of my bricked 4100 that wouldn't POST, and lo and behold, I've got a console back and have booted the USB installer!
-
I'll continue to monitor and report internally about any situations I see crop up that might be trends or pattern.
Are all the posts about eMMC failure over the last few years, along with the explicit requests and suggestions for improved messaging, not enough to indicate a trend or pattern with regard to eMMC failure? If the issue truly is misuse by the user, then why has nothing been done to better educate purchasers and users before they do things that could result in accelerated eMMC wear? Better education and messaging would likely eliminate or significantly reduce the frequency of eMMC failure.
Similarly, @stephenw10 and others have posted hundreds of responses in which they advise users to reduce logging (including disabling the DEFAULT logging rules) and use ramdisks.
Why have these commonly suggested changes not been incorporated into the default settings for pfSense, or at least recommended (such as in the setup wizard)? Just what does Netgate actually consider to be a trend or pattern that needs to be actioned?
Despite being incomplete and not linked anywhere, the "Supported pfSense Plus Packages" page seems to be a "gotcha" shield to deflect any and all failures onto the user.
-
@arri Wow that is cool! I am glad to hear that it worked for you!
I will report back when I get around to trying this on the dead 4100 I have here.
-
@arri said in Another Netgate with storage failure, 6 in total so far:
Yanked the Kingston eMMC out of my bricked 4100 that wouldn't post and lo and behold I've got a console back
Nice! I assume when you say 'yanked' you mean carefully removed with SMT tools?
-
@stephenw10 I can understand that. I know you try to be polite and helpful, and I know that others and I appreciate your contributions.
It is unfortunate that the situation has escalated to this point. I feel that this could have been avoided if Netgate had simply responded to some of the questions directed at them.
In my November 2024 thread "Concerns and feedback about storage lifetime wearout on Netgate devices", I gave feedback on my experience with storage wear-related issues and provided several suggestions for technical and educational improvements. That post seems to have gone unnoticed.
The February 2024 thread "eMMC Write endurance" raised many good points and questions, but it too seems to have gone unnoticed.
This brings us to this thread, where I again attempted to raise the issue of eMMC storage failures. I initially tried to build a stronger case for how and why Netgate needs to better educate buyers during the purchasing process and better inform users before they make changes that could affect the lifetime of their device. I also suggested GUI changes that could reduce the chance of activating non-recommended settings and help users better monitor storage wear, along with technical changes for reducing storage wear.
Despite Netgate responding that "you have it" [our attention], "thank you for your suggestions and input. We will consider them", and "Some good points have been raised along with actionable suggestions to mitigate the issue. Thanks for the constructive feedback - the issue has our attention," nothing further has been done, and there has been no further response.
Meanwhile, users (including myself) continue to experience failure on a daily basis, and not even some simple wording on a few web pages has been updated to help inform potential purchasers on how to determine if the BASE or MAX version is right for their needs. Someone is probably purchasing a BASE model right now and is unaware of the potential pitfalls that await them.
-
@SteveITS said in Another Netgate with storage failure, 6 in total so far:
It is also usually much smaller.
But this does not explain why most Netgate appliances have such small eMMC sizes, seemingly limited to the lower-end segment like cheap hardware, even though they are not cheap hardware. The only assumption I can make is that the hardware was developed much earlier than it was sold, or that some local retailers are restricted to whatever stock they had.
Nevertheless, the problem is generally solvable, but for some reason, it is not sufficiently covered. Perhaps this is because it was assumed that the devices are purchased by people who understand what eMMC is, that the number of write cycles is limited, and that the overall storage capacity is not very large? I don't know.
-
The number of responses to my Reddit threads from users who were completely unaware of storage health issues and the ones who discovered their device was worn or at risk of imminent failure highlights that more education and awareness are desperately needed.
What started as a simple request has now turned into this, with no resolution in sight.
-
@stephenw10 Yes, it was irresponsible of me to imply anything other than that, lest someone actually do so literally. I applaud the engineer who laid out the board, as it was about as trivial as possible to remove.
Just finished installing 24.11 onto the NVMe which is fastened with an M2.5 instead of M2 like everyone else. At least the dang thing is included in the 4100, it's not in the 4200 for some inexplicable reason.
Looking good!
-
@arri The 4200 using an M2.5 screw confused us too. I think we ended up ordering a package of them from Amazon.
-
Yes, in the past, in desperate times, I have resorted to physical violence against ICs. And it has worked! But I would never recommend that. I'm pretty sure I got extremely lucky.
-
@stephenw10 I had to replace the EEPROM chips in an old Camaro computer after it was bricked by a bad flash. That was a nervous experience with a heat gun!