Another Netgate with storage failure, 6 in total so far
-
@stephenw10 said in Another Netgate with storage failure, 6 in total so far:
My edge device here is a 3100
This one does not use ZFS, does it?
And I also noticed that pfSense very often displays the actual size of /tmp and /var incorrectly.
@SteveITS said in Another Netgate with storage failure, 6 in total so far:
Recent versions of pfSense don’t allocate the RAM disk space until it’s used, so it’s more flexible.
Yep, but for some reason (like a huge syslog file), I have run out of space several times.
-
And I want to repeat once again: the problem is not whether the RAM disk is enabled, whether to enable it, or how to do it. The problem is that disk wear goes unnoticed by the user, and they only start paying attention when the device has already died or is in a critical "almost dead" state.
So maybe, I don’t know, it's worth updating the documentation and, through some kind of newsletter, news post, or blog, recommending that users perform checks and follow the recommendations in the updated documentation?
-
@w0w said in Another Netgate with storage failure, 6 in total so far:
or is in a critical "almost dead" state
If only this were true. Unfortunately, there is no system in place for tracking the wear state that I'm aware of. On a stock appliance, the only warning is the failure itself. The only tools I'm aware of for checking the state require proactive installation by the user from the command line.
Since this appears to be a common problem, it's strange to me that mmc-utils isn't included on at least the base appliances. I would have appreciated bars in the System Information dashboard showing the eMMC Life Time Estimations and Pre EOL states. Once in place, a selectable threshold value to trigger a notification would be nice too.
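To make the threshold idea concrete, here is a minimal sketch of how such a check could work. It assumes the JEDEC eMMC 5.0 encoding of the health registers (life time codes 0x01 to 0x0A meaning 10% steps of used life, 0x0B meaning the estimate is exceeded; Pre EOL 0x01 normal, 0x02 warning, 0x03 urgent). The function names and the default threshold of 70% are hypothetical, and nothing like this exists in pfSense as far as this thread shows.

```python
from typing import Optional, Tuple

# Pre EOL states as defined for the eMMC 5.0 PRE_EOL_INFO register.
PRE_EOL_STATES = {0x01: "normal", 0x02: "warning", 0x03: "urgent"}


def life_used_range(code: int) -> Optional[Tuple[int, int]]:
    """Map a Life Time Estimation code to a (low, high) percent-used range.

    Returns None when the code is 0x00 (not defined) or outside the
    range described by the standard.
    """
    if 0x01 <= code <= 0x0A:
        return ((code - 1) * 10, code * 10)
    if code == 0x0B:
        # The device reports that it has exceeded its estimated life time.
        return (100, 100)
    return None


def should_notify(est_a: int, est_b: int, pre_eol: int,
                  threshold_pct: int = 70) -> bool:
    """Return True when wear reaches the (hypothetical) alert threshold.

    The worse of the two estimates is compared by the upper bound of its
    range; a Pre EOL state of warning or urgent always triggers an alert.
    """
    eol_alert = PRE_EOL_STATES.get(pre_eol) in ("warning", "urgent")
    used = life_used_range(max(est_a, est_b))
    if used is None:
        return eol_alert
    return eol_alert or used[1] >= threshold_pct
```

A dashboard widget could draw its bars from `life_used_range` and raise a notification whenever `should_notify` returns True.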
-
Working backward from an eMMC failure, which forced me to read "Troubleshooting Disk Writes" more closely, it's obvious in hindsight why my base model 4100s are dying.
That article clearly warns against installing write-heavy packages such as pfBlockerNG, Snort, Suricata, HAProxy, nmap, darkstat, and other monitoring packages. It also says "the package list at Package List also notes when specific packages require or work better with an SSD or HDD." Recognizing the difference between eMMC, SSD and HDD is all well and good; however, warning that a package will potentially harm eMMC might be more effective at discouraging idiots like me from buying base models in the first place and/or installing such packages inappropriately.
Finally, if such a warning, or the existing wording on the web-based package list, were also included in the actual package manager, where most people decide to install these packages, it might be considerably more effective in preventing accelerated eMMC wear.
-
@arri Sorry to hear that your 4100 died.
I have already made the same suggestions as you. Just some warnings and links in a few places (like the package manager and log settings) would help users avoid getting into situations that can cause excess writing.
Storage failures are a frequent occurrence, and including mmc-utils was requested over 3 years ago. In all the new daily threads about storage failure, the user is blamed, yet they are not provided with any tools for monitoring the storage.
It is puzzling why mmc-utils has not been included in the base install and why SMART and eMMC monitoring are not running by default.
-
@andrew_cb
It’s interesting how the thread went silent from the Netgate team. Maybe they’re still looking into it?
The mmc-utils package is only available in Plus... so users of CE have absolutely no way to monitor their eMMC health. Apparently, monitoring your eMMC health is a special privilege? Maybe a way of discouraging the use of CE?
https://docs.netgate.com/pfsense/en/latest/troubleshooting/disk-lifetime.html
This package is currently only available on pfSense Plus software and does not have a GUI component. It must be run from an SSH or console shell prompt.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
The mmc-utils package is only available in Plus... so users of CE have absolutely no way to monitor their eMMC health. Apparently, monitoring your eMMC health is a special privilege? Maybe a way of discouraging the use of CE?
https://docs.netgate.com/pfsense/en/latest/troubleshooting/disk-lifetime.html
This package is currently only available on pfSense Plus software and does not have a GUI component. It must be run from an SSH or console shell prompt.
Well, in Netgate's defense, I suspect the number of pfSense CE users running on eMMC is minuscule. Most whitebox hardware is most likely going to have either an SSD or a spinning disk. I believe eMMC is much more prevalent in the Netgate appliances, and since anyone purchasing a Netgate appliance gets pfSense Plus, it's more logical to include the utility there. Maybe I missed it, but I don't recall seeing a single post from a CE user who has experienced failed eMMC. It would be trivial to add the utility to the CE package repo, but I suspect it would not be widely used there.
-
Some more recent threads about storage failure.
Overall, storage failures seem to be most common on the 4100, possibly because it is the most popular model?
https://www.reddit.com/r/PFSENSE/comments/1ilhit2/my_netgate_4100_is_defect/
https://www.reddit.com/r/PFSENSE/comments/1ikprzt/4100_disassembly/
https://www.reddit.com/r/PFSENSE/comments/1ie17xz/ideas_for_an_eol_4100/
https://forum.netgate.com/topic/196253/sg-1100-storage-health-questions
-
Hmm, not sure why the pkg isn't in the CE repo. I guess there wasn't much call for it at the time. Seems like we could add that pretty easily. Let me see....
-
@stephenw10 It would be great if you can get mmc-utils added to the CE repo!
-
@w0w I share your frustration. One minute a user's Netgate is working, then it just dies. Then they try to reinstall pfSense and the installer says no disks were found...
Those are great suggestions on how to spread awareness. This issue has been brought up many times before but it never goes anywhere, so hopefully we can bring about some change and prevent this from happening to others.
-
@bmeeks It's possible that not many are using CE on a whitebox with eMMC, but I have seen threads about it on Reddit. I think Protectli, Firewalla, and Topton also use eMMC in some of their models, but I am not positive. Several models list 16 or 32GB storage, which is often eMMC.
-
I also want to mention the repair options. I'm not sure if it's possible to replace the eMMC chip with a larger one without modifying the BIOS, but I'm almost certain that you can replace it with the same model or a full equivalent.
Of course, this depends on the country and the price charged for the work. Again, whether the technician is truly a professional or just incompetent remains a question... But this option definitely exists.
-
A thread from 2022 has resurfaced and it is eerily similar to the discussion happening now in 2025:
- The expected lifetime of 16 and 32GB eMMC storage at various average write rates.
- The increased wear from running popular IDS and IPS packages.
- Request for adding mmc-utils to the base pfSense image (including a Redmine).
- Users already experiencing storage wearout.
- Suggestions to use ramdisks and disable logging of default rules.
- The effects of ZFS vs UFS on storage wear.
- TRIM appears to be disabled.
- Requests/suggestion to include storage considerations on the product pages.
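As a rough illustration of the first point in that list, the expected lifetime at a given average write rate can be sketched with simple arithmetic. The endurance figure (3000 program/erase cycles) and the write amplification factor (2.0) below are assumptions chosen for illustration, not specifications of any Netgate part, and `estimate_years` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400


def estimate_years(capacity_gb: float, write_kb_per_s: float,
                   pe_cycles: int = 3000,
                   write_amplification: float = 2.0) -> float:
    """Estimate years until the flash reaches its rated endurance.

    capacity_gb          -- usable flash capacity in (decimal) gigabytes
    write_kb_per_s       -- average host write rate in kilobytes per second
    pe_cycles            -- ASSUMED program/erase cycles per cell
    write_amplification  -- ASSUMED ratio of flash writes to host writes
    """
    if write_kb_per_s <= 0:
        raise ValueError("write rate must be positive")
    # Total host data that can be written before the cells wear out.
    total_host_writes_gb = capacity_gb * pe_cycles / write_amplification
    # Convert KB/s to GB/day using decimal units.
    gb_per_day = write_kb_per_s * SECONDS_PER_DAY / 1_000_000
    return total_host_writes_gb / gb_per_day / 365


if __name__ == "__main__":
    for capacity_gb in (16, 32):
        for rate in (50, 100, 500):
            years = estimate_years(capacity_gb, rate)
            print(f"{capacity_gb} GB at {rate} KB/s: {years:.1f} years")
```

Under these assumptions the estimate scales linearly with capacity and inversely with write rate, which is why doubling the storage or halving the logging has such a visible effect on lifespan.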
I cannot understand why Netgate did not investigate or take any action on these issues in 2022, 2023, or 2024.
@dugeem checked 3 devices and noted:
eMMC drives generally support TRIM, but in all cases it was disabled.
@jwt said
TRIM (or an equivalent such as DISCARD) are required by JEDEC standards as far back as 2010.
So there seems to be a discrepancy in whether TRIM support is actually enabled and working or not.
Further, the JEDEC eMMC v5.0 standard which enables eMMC health reporting is from 2013 and is supported by many Netgate devices, so it is confusing why it is not supported by the 4200 that was released in 2024.
@Cabledude asked in 2024:
Would the 128GB SSD benefit (have extended life) if RAM disk is used?
@stephenw10 responded:
Yes. But the write cycle life on any recent SSD is likely to outlive the usefulness of the device anyway. So I'd question the value in doing so.
If a 128GB SSD "is likely to outlive the usefulness of the device", then what is the implication for the lifespan of 16GB eMMC storage?
I am not sure what conclusion to draw other than that, beginning in 2022, Netgate knew or should have known that 16GB of eMMC storage was insufficient for running anything other than the most basic of configurations (and even then, it is necessary to disable most of the default logging and possibly use ramdisks).
@keyser's words from 2022 seem tragically prophetic:
This is going to become a netgate scandal
I think it officially has now.
-
I've been having some issues with my 6100 locking up and becoming unresponsive. I reported the issue to Netgate TAC, who didn't provide any useful feedback. While searching Reddit for support, I read about the eMMC failures on the 6100, so I went and checked mine:
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x0b
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x0b
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
Yikes! Got an Intel Optane 16GB in there now after a lot of pain with the installer not working with my PPPoE internet service.
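For anyone who wants to collect these values automatically, here is a minimal sketch that pulls the three health registers out of text shaped like the lines quoted above. It assumes the output keeps the `[REGISTER_NAME]: 0xNN` layout shown in this post; `parse_extcsd_health` is a hypothetical helper, not part of mmc-utils or pfSense.

```python
import re
from typing import Dict

# Matches "[EXT_CSD_SOMETHING]: 0x0b" anywhere in a line.
HEALTH_LINE = re.compile(r"\[(EXT_CSD_[A-Z_]+)\]:\s*(0x[0-9a-fA-F]+)")

# Sample copied from the post above.
SAMPLE = """\
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x0b
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x0b
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
"""


def parse_extcsd_health(text: str) -> Dict[str, int]:
    """Return a mapping of register name to integer value."""
    return {name: int(value, 16) for name, value in HEALTH_LINE.findall(text)}


if __name__ == "__main__":
    for register, value in parse_extcsd_health(SAMPLE).items():
        print(f"{register} = {value:#04x}")
```

Under the JEDEC eMMC 5.0 encoding, a life time value of 0x0b means the device has exceeded its estimated life time, while a Pre EOL value of 0x01 is still reported as normal.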
I have to say this really seems like planned obsolescence on the part of Netgate. Why sell a device with hardware that cannot support its operation beyond a couple of years of normal use?
Why doesn't your TAC team identify this as an issue?
Why are you trying to dissuade customers from implementing a fix?
When are you going to compensate customers for the damages?
@punting_packets said in Another Netgate with storage failure, 6 in total so far:
my 6100 locking up and becoming unresponsive
It's usually pretty obvious if the boot drive fails. Just becoming unresponsive but rebooting back to normal operation is not what I would expect to see. So you may not be seeing a failing drive there, even though the estimated wear levels are high.
Drive failures usually throw a lot of drive/controller errors. Even if the logging stops, the console will be filled with errors. If you can, checking the console in the hung situation should confirm that.
@stephenw10 Thanks for the response. The 6100 simply stopped forwarding traffic, but the console was still responsive. There was nothing in the logs other than a lot of failed PPPoE sessions, and the only way to restore service was a reboot. I might be conflating the wear on the eMMC with other issues; only time will tell :-)
-
@stephenw10 Drive errors are common when the storage is failing, but not always.
I have been troubleshooting a 7100 for the past few days where the internet connection was basically unusable despite being on 1Gb fibre and 300Mbps cable connections. I disabled nearly all of the port forwards and services, but the CPU load was constantly high and the GUI was very sluggish to navigate, even though there was less than 5Mbps of network traffic. The gateway monitors were constantly in warning status with latency over 100ms and 30% packet loss. There were no storage-related errors anywhere. I was still able to make configuration changes, and the unit was able to reboot.
I checked the eMMC health and found it was at 0x0a (100%).
I went onsite yesterday and installed a 250GB WD Blue SSD, and the device is working great now.
To be fair, it is a 7100 and 5-6 years old, so the storage failure is less surprising, but the lack of any alerting is the biggest problem.
Before becoming aware of storage wear out, we had other devices that appeared to be working but stopped responding when a config change was made. During troubleshooting we would find that the device was no longer detecting the onboard eMMC storage.
-
Hmm, I'd be surprised if low throughput like that was caused by a drive issue. I assume you had to reinstall to the new SSD. Had you tried reinstalling to the eMMC?
But it could have been indirectly caused by high CPU usage that itself was caused by some access issue. Though I would expect to be able to see that in the usage or errors logged.