Another Netgate with storage failure, 6 in total so far
-
@andrew_cb I reached out to kphillips-netgate last night on Reddit and suggested we have a call to discuss this situation further, clarify the issues, and hopefully identify solutions. He asked for clarification of what threads I was referring to, and I sent him links to this thread and two others.
He has not gotten back to me yet, so I will update here when/if I receive any further responses from him.
-
Mmm, the problem with escalating things in this way is that it suppresses actual useful posts. It moves from a technical discussion to a marketing/legal matter where I (and others) can no longer comment.
-
@andrew_cb Holy crap, it worked for me! Yanked (read: carefully removed using appropriate rework methodology) the Kingston eMMC out of my bricked 4100 that wouldn't POST and lo and behold I've got a console back and have booted the USB installer!
-
I'll continue to monitor and report internally about any situations I see crop up that might be trends or pattern.
Are all the posts about eMMC failure over the last few years, along with the explicit requests/suggestions for improved messaging, not enough to indicate any trend or pattern with regard to eMMC failure? If the issue truly is misuse by the user, then why has nothing been done to better educate purchasers and users before they do things that could result in accelerated eMMC wear? Better education and messaging would likely eliminate or significantly reduce the frequency of eMMC failure.
Similarly, @stephenw10 and others have posted hundreds of responses in which they advise users to reduce logging (including disabling the DEFAULT logging rules) and use ramdisks.
Why have these common suggested changes not been incorporated into the default settings for pfSense or at least recommended (such as in the setup wizard)? Just what does Netgate actually consider to be a trend or pattern that needs to be actioned?
Despite being incomplete and not linked anywhere, the "Supported pfSense Plus Packages" page seems to be a "gotcha" shield to deflect any and all failures onto the user.
-
@arri Wow that is cool! I am glad to hear that it worked for you!
I will report back when I get around to trying this on the dead 4100 I have here.
-
@arri said in Another Netgate with storage failure, 6 in total so far:
Yanked the Kingston eMMC out of my bricked 4100 that wouldn't post and lo and behold I've got a console back
Nice! I assume when you say 'yanked' you mean carefully removed with SMT tools?
-
@stephenw10 I can understand that. I know you try to be polite and helpful, and I know that others and I appreciate your contributions.
It is unfortunate that the situation has escalated to this point. I feel that this could have been avoided if Netgate had simply responded to some of the questions directed at them.
In my November 2024 thread Concerns and feedback about storage lifetime wearout on Netgate devices, I gave feedback on my experience with storage wear-related issues and provided several suggestions for technical and educational improvements. That post seems to have gone unnoticed.
The February 2024 thread eMMC Write endurance raised many good points and questions, but it too seems to have gone unnoticed.
This brings us to this thread, where I again attempted to raise the issue of eMMC storage failures. I initially tried to build a stronger case for how and why Netgate needs to better educate buyers during the purchasing process and better inform users before they make changes that could affect the lifetime of their device. I also suggested GUI changes that could reduce the chance of activating non-recommended settings and help users better monitor storage wear, as well as technical changes for reducing storage wear.
Despite Netgate responding that "you have it" [our attention], "thank you for your suggestions and input. We will consider them", and "Some good points have been raised along with actionable suggestions to mitigate the issue. Thanks for the constructive feedback - the issue has our attention," nothing further has been done, and there has been no further response.
Meanwhile, users (including myself) continue to experience failure on a daily basis, and not even some simple wording on a few web pages has been updated to help inform potential purchasers on how to determine if the BASE or MAX version is right for their needs. Someone is probably purchasing a BASE model right now and unaware of the potential pitfalls that await them.
-
@SteveITS said in Another Netgate with storage failure, 6 in total so far:
It is also usually much smaller.
But this does not explain why most Netgate appliances have such small eMMC sizes, seemingly limited to the lower-end segment, like cheap hardware—though they are not. The only assumption I can make is that the hardware was developed much earlier than it was sold, or that some local retailers are restricted to whatever stock they had.
Nevertheless, the problem is generally solvable, but for some reason, it is not sufficiently covered. Perhaps this is because it was assumed that the devices are purchased by people who understand what eMMC is, that the number of write cycles is limited, and that the overall storage capacity is not very large? I don't know.
-
The number of responses to my Reddit threads from users who were completely unaware of storage health issues and the ones who discovered their device was worn or at risk of imminent failure highlights that more education and awareness are desperately needed.
What started as a simple request has now turned into this, with no resolution in sight.
-
@stephenw10 Yes, it was irresponsible of me to imply anything other than careful removal, lest someone actually do so literally. I applaud the engineer who laid out the board, as it was about as trivial as possible to remove.
Just finished installing 24.11 onto the NVMe which is fastened with an M2.5 instead of M2 like everyone else. At least the dang thing is included in the 4100, it's not in the 4200 for some inexplicable reason.
Looking good!
-
@arri The 4200 using an M2.5 screw confused us too. I think we ended up ordering a package of them from Amazon.
-
Yes, in the past, in desperate times, I have resorted to physical violence against ICs. And it has worked! But I would never recommend that. I'm pretty sure I got extremely lucky.
-
@stephenw10 I had to replace the EEPROM chips in an old Camaro computer after it was bricked by a bad flash. That was a nervous experience with a heat gun!
-
@andrew_cb Got mine done today. Went pretty well, no show stoppers. A wee bit stressful when you do this kind of work so infrequently, so thanks for all the guidance. Here's to another 5+ years of (hopefully) flawless use.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
The number of responses to my Reddit threads from users who were completely unaware of storage health issues and the ones who discovered their device was worn or at risk of imminent failure highlights that more education and awareness are desperately needed.
We're talking about a complex network device, so it would be reasonable to assume that people buying it have some understanding of what they're purchasing. However, it seems that the topic of storage has somehow passed by a significant portion of users.
Back in 2009, I bought my first SSD for $900 and had quite a few issues with it, even though it was an Intel drive with pure SLC (X25-E). Probably thanks to that experience, I now understand how things work and remain a fan of various flash memory-based devices to this day.
As for warning users now, I believe I was actually the first to suggest doing that earlier in this thread.
As a permanent solution, at least for Netgate devices, there should be a health monitoring system that includes all types of warnings about the eMMC's condition—if the flash itself supports it. If not, then it would probably be best to strictly recommend that users utilize RAM disks or even automatically enable them during installation if the device has less than 16GB of storage.
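To illustrate what such a warning check could look like, here is a rough Python sketch. It assumes the flash exposes the health bytes defined in the JEDEC eMMC 5.0+ EXT_CSD register (PRE_EOL_INFO and DEVICE_LIFE_TIME_EST_TYP_A/B). The function names and the alert threshold are hypothetical, and reading the raw bytes from the device is deliberately left out, so treat it as an idea rather than how pfSense does or would do it.

```python
# Sketch only: interprets the eMMC health bytes defined by JEDEC eMMC 5.0+
# (EXT_CSD PRE_EOL_INFO and DEVICE_LIFE_TIME_EST_TYP_A/B).
# How the raw bytes are read from the device is NOT shown here.

PRE_EOL = {0x01: "normal", 0x02: "warning", 0x03: "urgent"}


def life_used_percent(est: int) -> str:
    """Map a DEVICE_LIFE_TIME_EST value to the wear range it encodes."""
    if 0x01 <= est <= 0x0A:
        return f"{(est - 1) * 10}-{est * 10}% of rated life used"
    if est == 0x0B:
        return "rated life exceeded"
    return "not reported"


def health_warning(pre_eol: int, est_a: int, est_b: int, threshold: int = 0x08):
    """Return a warning string, or None when the flash looks healthy.

    threshold is a hypothetical alert level (0x08 means 70-80% used).
    """
    worst = max(est_a, est_b)
    state = PRE_EOL.get(pre_eol, "not reported")
    if state in ("warning", "urgent") or worst >= threshold:
        return f"eMMC wear: {life_used_percent(worst)}, pre-EOL state: {state}"
    return None
```

A dashboard widget or a periodic job could call `health_warning()` with the three bytes and show the returned string whenever it is not `None`. If the flash reports zeros for all fields, the sketch returns `None`, which is the case where falling back to a RAM disk recommendation would make sense.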
-
@w0w
I appreciate your input. My comments below are not targeted at you specifically, but I believe they are helpful for illustrating why I disagree with any "should have known" arguments.
it would be reasonable to assume that people buying it have some understanding of what they're purchasing. However, it seems that the topic of storage has somehow passed by a significant portion of users.
I disagree that it is a reasonable assumption to make. I have been working with firewalls for 20 years and have never had to consider the type of storage medium used. I also believe the purchaser's knowledge of storage types should be irrelevant in this matter.
Looking at the product page for the 6100, the two choices are as follows:
BASE 8GB Memory 16GB Storage
MAX 8GB Memory 128GB Storage
Further down, the storage options are clarified:
Storage: 16 GB eMMC (or optional 128 GB NVMe M.2 SSD)
and
Storage 16 GB eMMC (onboard - soldered) upgradeable to 128 GB NVMe M.2 SSD with 6100 Max
That is all the store page says with regard to storage.
The rest of the page is filled with performance ratings and all the great things that pfSense can do when using various packages.
Not including the header and footer, there are 1333 words on the page.
411 words, or about 31%, are about all the capabilities and benefits of pfSense. A mere 32 words, or 2%, are in the sentences related to storage.
There is absolutely nothing on the page that:
- Indicates that there are any differences between eMMC and regular SSD storage
- Indicates that some features/packages require an SSD and are not recommended for use with eMMC storage
- Gives endurance ratings for the eMMC and SSD storage to highlight the difference between them
- Provides the purchaser with additional information to help inform and guide their purchasing decision
Would you agree that if the choice of which type of storage to get is so critical, it should be significantly more prominent on the page?
We're talking about a complex network device
A major reason for purchasing a pre-built firewall from a vendor is to avoid the hassle and deep knowledge involved with building a custom device. Firewalls are a commodity item nowadays, and other firewall vendors can do IDS and IPS for years without storage failures. I have seen many 10+ year old Sonicwall and Sophos firewalls do this without any issues.
If we revisit jwt's statements regarding storage media:
- The principal difference between eMMC and NVMe or SSD device is the amount of flash present on a typical eMMC vs. SSD or NVMe drive.
- Larger devices have more sectors and as a direct result, can engage "wear leveling" algorithms in the controller to spread the erase cycles across more sectors.
- Larger devices also cost more, due largely to market dynamics.
- Used within its limitations, eMMC is a good solution. Your phone likely has eMMC inside it. Many network devices, even from companies such as Cisco and HP/Juniper have eMMC inside them for storage.
- our [high] level of effort and engagement with Silicom
Which we can reduce down to:
- No major difference between eMMC and NVMe storage other than capacity
- Larger storage devices can wear-level better
- Larger storage devices cost more
- Netgate works closely with Silicom on the hardware that is used in their devices
Taking the above into consideration, in the absence of any stated warnings, cautions, limitations, recommendations, or disclaimers, a purchaser should be able to trust that what the vendor is offering is capable of performing the advertised functions.
Why should a purchaser or user be concerned about the difference when Netgate themselves is arguing that eMMC storage is just as good as NVMe storage and makes no effort to distinguish the two other than capacity?
The product page of the 1100 describes it as
the ideal microdevice for the home and small office network
It does not sound like the target market for the 1100 is people with many years of storage technology and Unix filesystem knowledge.
Yet the 1100, which is only available with eMMC storage and cannot be upgraded to an SSD, lists all the exact same pfSense features as the 8300 MAX. But how can that be? Is it possible that there are some inaccuracies or that important information has been forgotten on the product pages?
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
I disagree that it is a reasonable assumption to make. I have been working with firewalls for 20 years and have never had to consider the type of storage medium used. I also believe the purchaser's knowledge of storage types should be irrelevant in this matter.
I don't have extensive experience with various firewalls, but I've come across cases on Reddit where Sophos internal storage failed, and even on forums, there were reports of failures with Cisco's FTD. I don't know the failure rate of such devices, but their price range is significantly higher. I'm not justifying anyone, but shit happens.
It also probably depends on usage conditions, settings, and many other factors.
Larger devices have more sectors and as a direct result, can engage "wear leveling" algorithms in the controller to spread the erase cycles across more sectors.
I would also note that if the minimum eMMC size were 16GB, we probably wouldn't be having this discussion right now.
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
Used within its limitations, eMMC is a good solution. Your phone likely has eMMC inside it.
Actually eMMC is going away from phones. UFS3.1 is a next level. But this is a bit off topic.
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
The product page of the 1100 describes it as
the ideal microdevice for the home and small office network
It does not sound like the target market for the 1100 is people with many years of storage technology and Unix filesystem knowledge.
Yet the 1100, which is only available with eMMC storage and cannot be upgraded to an SSD, lists all the exact same pfSense features as the 8300 MAX. But how can that be? Is it possible that there are some inaccuracies or that important information has been forgotten on the product pages?
You can include it in the product description, but that falls under marketing.
And today's marketing trend is: never tell the customer something they didn't ask about.
Documentation, however, should probably contain footnotes and explanations. Or, as I already mentioned, perhaps every setting or checkbox that could potentially generate a large number of logs should have a footnote or a note for users explaining the consequences.
-
@w0w said in
I would also note that if the minimum eMMC size were 16GB, we probably wouldn't be having this discussion right now.
I think you meant to say "if the minimum eMMC size were NOT 16GB, we probably wouldn't be having this discussion right now."
And I agree - our 7100s, which come with 32GB of eMMC, seem to last about twice as long as our 4100s and 6100s. Silicom offers larger eMMC sizes on several models, so just increasing the minimum eMMC to 32 or 64GB would likely significantly reduce this problem.
Actually eMMC is going away from phones. UFS3.1 is a next level. But this is a bit off topic.
That is interesting to know!
You can include it in the product description, but that falls under marketing.
And today's marketing trend is: never tell the customer something they didn't ask about.
This is the #1 issue that is causing this whole problem. A lack of any useful information, but when the storage fails, everyone is quick to blame the user for not knowing.
Documentation, however, should probably contain footnotes and explanations. Or, as I already mentioned, perhaps every setting or checkbox that could potentially generate a large number of logs should have a footnote or a note for users explaining the consequences.
I completely agree. I think both you and I have mentioned this several times.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
I think you meant to say "if the minimum eMMC size were NOT 16GB
The 1100 and 2100 base units have 8 GB.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
I think you meant to say "if the minimum eMMC size were NOT 16GB, we probably wouldn't be having this discussion right now."
Exactly!
I would even rephrase it to say that 32GB would likely be the minimum sufficient for something else to fail first, such as the power supply.
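As a back-of-the-envelope illustration of why doubling the eMMC size should roughly double its life, here is a small Python sketch. It assumes ideal wear leveling, and every number in it (P/E cycles, daily writes, write amplification) is a made-up placeholder, not a datasheet value for any Netgate device.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles, daily_writes_gb, write_amplification):
    """Rough flash lifetime estimate, assuming ideal wear leveling.

    Total host data the flash can absorb is approximated as
    capacity * P/E cycles / write amplification.
    """
    total_host_writes_gb = capacity_gb * pe_cycles / write_amplification
    return total_host_writes_gb / daily_writes_gb / 365


# Hypothetical workload: identical for both sizes, only capacity differs.
workload = dict(pe_cycles=3000, daily_writes_gb=10, write_amplification=4)

years_16 = estimated_lifetime_years(16, **workload)
years_32 = estimated_lifetime_years(32, **workload)
print(f"16GB: {years_16:.1f} years, 32GB: {years_32:.1f} years")
```

With the same workload the estimate scales linearly with capacity, which lines up with the 32GB units outlasting the 16GB ones by about a factor of two. Real devices will differ, since write amplification and the flash's rated cycles vary by part and by usage.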