Another Netgate with storage failure, 6 in total so far
-
@andrew_cb
Brutal… hard to ignore your data points. Good job on providing context.
-
Just scrolling through the Official Netgate Hardware forum turns up these definite storage failures (and there are even more threads that might be storage-related):
- 4 days ago - 6100 with failed eMMC
- 6 days ago - 4200 with failed eMMC
- 8 days ago - failed NVMe on a 6100 MAX
- 14 days ago - 2100 MAX reporting 48% health
- 67 days ago - 1100 with failed eMMC
In this thread @SteveITS lists suggestions for reducing storage wear that mirror what is being said by both Netgate staff and other users:
- https://www.netgate.com/supported-pfsense-plus-packages lists which packages "require" or recommend SSD over eMMC <- Many packages do not specify that they require/recommend SSD
- turn off logging of the default block rules <- why is this on by default if it can be problematic?
- turn off logging of the bogon rules <- again, why is this on by default?
- turn off Suricata logging of HTTP requests <- there is NO documentation for configuring Suricata
- turn off pfBlocker DNSBL logging <- this is not mentioned on the pfBlocker setup page
- create a "don't log" rule for IGMP <- this started occurring in 24.03 due to correcting a logging bug. Redmine and Forum discussion. Again, this can create a lot of logging, so why is it enabled by default?
- don't view the dashboard 24x7 (each widget logs the web server request to update the widget) <- Along with similar suggestions to disable various RRD graphs, this is just getting silly. How can anyone possibly know this will cause an issue?
- use RAM disk <- this requires additional planning and setup to compensate for the loss of persistent logging, and also consumes memory.
Curiously, the Hardware Sizing document does not mention storage at all. It even specifically mentions Snort and Suricata, but says nothing about storage. This seems like a logical place to mention storage write and storage space usage considerations, but unfortunately, it is another missed opportunity.
Now, let us look at the sacred Supported pfSense Plus Packages page. Only HAProxy and NtopNG say "Requires SSD/HDD", and Snort and Suricata say "SSD/HDD strongly recommended".
This would imply that the other packages are safe to use with the onboard eMMC storage, right?
Just to be sure, let us look at the pfBlockerNG documentation page:
Hmm, not much detail there and certainly no mention of storage issues.
What about Status Traffic Totals? Nothing there either.
Maybe some other popular packages will say something.
Arpwatch? Not listed.
Zabbix? Not listed.
The switch to ZFS could very well be causing accelerated eMMC wear out, which might explain why this issue seems to have become much more common in the past 2-3 years. We have SG-3100 units that are still running with no issues, possibly because they only support UFS. We had a 7100 fail to boot due to a corrupted filesystem that required using the serial console to repair. After that, we reinstalled all other UFS devices with ZFS.
Again, if I buy a truck that clearly states it can haul 20,000 lbs as a standard feature, I should be able to install a trailer hitch and go. I should not have to worry about upgrading the engine, braking system, fuel pump, transmission, or suspension to haul the advertised 20,000 lbs!
I don't understand Netgate's and some community members' attitude on this issue: somehow people are using their Netgate device wrong by trying to utilize the advertised features, and they should just accept these failures and install an SSD or buy a new device.
I can understand this from CE users on third-party hardware who aren't paying Netgate anything, but anyone who purchases a device from Netgate surely must expect more than the sudden death in 1-2 years of devices that cost several hundred dollars (or even thousands) each.
The oft-repeated suggestion to "support the project" does not apply here, as no amount of pfSense licenses or TAC subscriptions will solve the inherent eMMC limitations of white-labelled Silicom hardware.
For all the pfSense power users here, how can we get Netgate's attention and bring about some kind of change?
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
For all the pfSense power users here, how can we get Netgate's attention and bring about some kind of change?
I'm going to paraphrase a bit from where I heard this statement, but essentially it goes: "It's hard to tell someone they are doing something wrong when they are making money."
I would bet Vegas money that the Base version of the SKUs is very profitable compared to the Max. I'm also willing to bet they are aware of the eMMC flaws. At the end of the day (granted, I'm cynical by nature), I don't think this will move the needle much. Netgate has offered eMMC storage for a very long time. I do believe a disclaimer is needed to assist those making a purchase decision.
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
I don't understand Netgate's and some community members' attitude on this issue: somehow people are using their Netgate device wrong by trying to utilize the advertised features, and they should just accept these failures and install an SSD or buy a new device.
Agree with you here as well. The suggestions essentially boil down to "don't use the software as intended". I can't really add much to your analysis and your grievance, but I do hope that 2025 produces some changes.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
I can understand this from CE users on third-party hardware who aren't
.... aware of this situation, as most, may I say nearly all, people early in the pfSense adoption process use a VM or some "saved from the landfill" PC, slide in an extra network card, install pfSense, and before you know it, it's years later.
Because of this, they, the CE users, can't be hit by this issue: they don't use a Netgate appliance, so most probably no eMMC.
And before you think otherwise: I'm not 'against' you; I'd say you've made some very valid points.
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
how can we get Netgate's attention ...
It's just me, "yet another user", saying this, but I'm pretty sure your posts have been read by 'them'.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
How can anyone possibly know this will cause an issue?
I was just listing "lower the amount of disk writing" suggestions.
To play devil's advocate I would suggest none of these things "cause" premature wear, at least by themselves. ZFS wasn't a feature, or at least, not the default, when the 2100 and I think 1100 were released. So it could well be a combination of all these things interacting with new defaults.
Personally I don't think it would have occurred to me to keep the dashboard visible all day until I saw posts about it, in a thread about the web server logs. Perhaps it can have a checkbox to auto-update in the background like the traffic graphs do.
I would guess the logging is on by default because it avoids/answers a lot of "why can't I connect" questions. Package documentation I would think is up to the individual package maintainers, and often done via forum post. Some of the doc pages are pretty outdated.
An SSD is also significantly faster in terms of saving, upgrading, etc. since I/O is faster.
The amount of disk space used by pfSense is typically relatively small so size isn't really a factor unless downloading large lists or data like the UT1 list which is over 1 GB to extract, when it updates. A larger SSD though would have more writing capacity, I'd expect, due to more unused sectors.
I don't know that anyone here is trying to dispute your POV, or your frustration. In terms of contacting Netgate, other than the replies above, if you're a partner you have contact info. If not then you could try sales or support, I don't know. It sounds like an SSD would fit more for your usage scenarios, so I guess the question/goal is to help others or new customers who don't know about wear issues.
-
I raised this internally.
-
@michmoor said in Another Netgate with storage failure, 6 in total so far:
Netgate has offered eMMC storage for a very long time
Added to that:
AFAIK, pfSense becomes more popular every day.
Something tells me that the split of pfSense into "CE" and "pfSense Plus" also has something to do with the selling of Netgate appliances (I would have done the same thing).
Moore's law is also valid for our data storage needs.
More stats, more data (this also gives that nice feeling you are doing something about "security"), more CPU power, faster bandwidth, more data, more stats and so on ...
So more and more appliances out there ....
I know they know: that's why the MAX versions exist (for 2 or 3 years already?). And nobody was asking for another NAS in the house.
-
@SteveITS That wasn't directed at you, I was just quoting the content of your post as it is a good example of the suggestions that are frequently proposed. I've read many of your posts and you seem to know a thing or two about pfSense
-
@andrew_cb Yeah no prob, just trying to clarify. We've been using pfSense from around the m0n0wall days and we have been a partner for quite a while, so have some history.
-
@michmoor I would think the MAX versions would be much more profitable than the BASE versions - $100 USD for a 128GB NVMe when brand-name models are available on Amazon for under $30 USD, probably under $20 in bulk. If we had known of the issues, we would have purchased 40 MAX units without hesitation, which would be another $3200 profit for Netgate.
-
Thanks everyone for your feedback and support.
I am partly just screaming into the void, but I hope this thread helps others make a more informed purchasing decision using information that, until now, has been scattered all over Reddit and this forum.
I also want to hold Netgate accountable by creating a comprehensive discussion on the root cause and potential solutions to eMMC storage failures on BASE devices, as their response is to just ignore the root cause of all the eMMC failures and simply dismiss them as "user error."
If Netgate really believes that its hardware is "enterprise-ready," then it should investigate the storage failures either through improved messaging and documentation or by improving the BASE hardware to significantly reduce the chance of failure.
-
@stephenw10 said in Another Netgate with storage failure, 6 in total so far:
I raised this internally.
Thank you! I am refreshing this page every 5 seconds to see when you respond (hmm, will that deplete my write cycles? )
-
Ha, well it might. What's the rated number of key presses on F5?
-
@SteveITS said in Another Netgate with storage failure, 6 in total so far:
The amount of disk space used by pfSense is typically relatively small so size isn't really a factor unless downloading large lists or data like the UT1 list which is over 1 GB to extract, when it updates. A larger SSD though would have more writing capacity, I'd expect, due to more unused sectors.
I think this is a critical part of the equation. Settings and packages that log more than the expected baseline, combined with the behavior of ZFS, lead to more storage writes, which go to a small storage device with TRIM disabled/not supported, and a limited number of spare blocks. This quickly exhausts the approximately 3000 write cycles of the eMMC storage. By my calculations, 5TB is the approximate limit of data that can be written before an 8GB eMMC dies.
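The calculation behind the 5TB figure is not shown in the post, so the following is only a rough sketch of how such an estimate could be framed, not the poster's actual method. It assumes decimal units, ideal wear leveling, and a single write amplification factor (WAF) that lumps together filesystem and controller overhead. The function names are made up for illustration. With 8GB and 3000 cycles the raw flash budget works out to 24TB, so a 5TB host-write limit would correspond to a WAF of about 4.8.

```python
def estimated_host_tb(capacity_gb: float, pe_cycles: int, waf: float) -> float:
    """Host data (decimal TB) writable before the rated P/E cycles are used up.

    Assumes ideal wear leveling: every block gets erased equally often.
    waf is the write amplification factor (flash writes / host writes).
    """
    if waf <= 0:
        raise ValueError("waf must be positive")
    raw_flash_budget_gb = capacity_gb * pe_cycles
    return raw_flash_budget_gb / waf / 1000


def implied_waf(capacity_gb: float, pe_cycles: int, host_tb: float) -> float:
    """WAF that a claimed host-write limit would imply for a given device."""
    if host_tb <= 0:
        raise ValueError("host_tb must be positive")
    return capacity_gb * pe_cycles / (host_tb * 1000)


# Figures taken from the post: 8GB eMMC, roughly 3000 cycles, 5TB limit.
raw_tb = estimated_host_tb(8, 3000, waf=1.0)
waf_for_5tb = implied_waf(8, 3000, host_tb=5)
print(f"raw flash budget: {raw_tb:.1f} TB")
print(f"WAF implied by a 5 TB limit: {waf_for_5tb:.1f}")
```

Whether a WAF near 5 is realistic for ZFS on a small eMMC is exactly the open question; measured values from real devices would be needed to settle it.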
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
@stephenw10 said in Another Netgate with storage failure, 6 in total so far:
I raised this internally.
Thank you! I am refreshing this page every 5 seconds to see when you respond (hmm, will that deplete my write cycles? )
You can probably stop now. As @stephenw10 said, he raised this earlier.
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
Thanks everyone for your feedback and support.
I am partly just screaming into the void, but I hope this thread helps others make a more informed purchasing decision using information that, until now, has been scattered all over Reddit and this forum.
I also want to hold Netgate accountable by creating a comprehensive discussion on the root cause and potential solutions to eMMC storage failures on BASE devices, as their response is to just ignore the root cause of all the eMMC failures and simply dismiss them as "user error."
First we're going to want to establish that you understand that all flash wears. Every erase cycle causes wear. Due to the way flash works, an entire sector must be erased before any data can be written. This is true for both eMMC and the flash used on SSD and NVMe drives.
This is unlike "spinning media", where each write does not cause measurable wear; with flash (including eMMC), it does.
Other than the attached bus and controller, the principal difference between eMMC and an NVMe or SSD device is the amount of flash present on a typical eMMC vs. an SSD or NVMe drive. Larger devices have more sectors and, as a direct result, can engage "wear leveling" algorithms in the controller to spread the erase cycles across more sectors.
Larger devices also cost more, due largely to market dynamics.
Someone is sure to point out the performance differences, but this is largely due to the bus used (MMC/SD vs. PCIe or SCSI/SAS) as well as more sophisticated devices using multiple flash parts in parallel for I/O.
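To put rough numbers on the capacity point, here is a minimal sketch of how expected lifetime scales with device size when the same workload is spread over more flash. The 3000-cycle figure comes from elsewhere in this thread; the 10 GB/day host write rate and the WAF of 4 are purely hypothetical placeholders, not measurements from any Netgate device, and the model assumes perfect wear leveling and the same cycle rating for every device.

```python
def years_until_worn(capacity_gb: float, pe_cycles: int,
                     daily_host_writes_gb: float, waf: float) -> float:
    """Years until the rated P/E cycles are consumed, assuming ideal wear leveling."""
    total_flash_writes_gb = capacity_gb * pe_cycles      # lifetime flash budget
    daily_flash_writes_gb = daily_host_writes_gb * waf   # what actually hits the flash
    return total_flash_writes_gb / daily_flash_writes_gb / 365


# Hypothetical workload: 10 GB/day of host writes with a WAF of 4.
for capacity_gb in (8, 32, 128):
    years = years_until_worn(capacity_gb, 3000, daily_host_writes_gb=10, waf=4)
    print(f"{capacity_gb:>4} GB device: about {years:.1f} years")
```

Under these assumptions lifetime grows linearly with capacity, so a 128GB device lasts 16 times longer than an 8GB one for the same writes. Real devices will deviate from this, since cycle ratings and controller behavior differ between parts.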
If Netgate really believes that its hardware is "enterprise-ready," then it should investigate the storage failures either through improved messaging and documentation or by improving the BASE hardware to significantly reduce the chance of failure.
When I search for "enterprise-ready" on store.netgate.com, the only two devices that come up are the 8300 Base and 8300 Max. Neither has an eMMC.
The only device in the Netgate catalog that does not have an SSD/NVMe option is the 1100. Every other device (2100, 4200, 6100) has an option for NVMe storage or only comes with NVMe storage (8200, 8300).
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
Just scrolling through the Official Netgate Hardware forum turns up these definite storage failures (and there are even more threads that might be storage-related):
- 4 days ago - 6100 with failed eMMC
- 6 days ago - 4200 with failed eMMC
- 8 days ago - failed NVMe on a 6100 MAX
- 14 days ago - 2100 MAX reporting 48% health
What's the problem here, other than that at one point they said "42%" when the figure is 48%?
There is some discussion of what this metric means here:
https://superuser.com/questions/1504792/percentage-lifetime-used-on-my-ssd-90-is-that-good-or-bad
- 67 days ago - 1100 with failed eMMC
In this thread @SteveITS lists suggestions for reducing storage wear that mirror what is being said by both Netgate staff and other users:
- https://www.netgate.com/supported-pfsense-plus-packages lists which packages "require" or recommend SSD over eMMC <- Many packages do not specify that they require/recommend SSD
- turn off logging of the default block rules <- why is this on by default if it can be problematic?
- turn off logging of the bogon rules <- again, why is this on by default?
- turn off Suricata logging of HTTP requests <- there is NO documentation for configuring Suricata
- turn off pfBlocker DNSBL logging <- this is not mentioned on the pfBlocker setup page
- create a "don't log" rule for IGMP <- this started occurring in 24.03 due to correcting a logging bug. Redmine and Forum discussion. Again, this can create a lot of logging, so why is it enabled by default?
- don't view the dashboard 24x7 (each widget logs the web server request to update the widget) <- Along with similar suggestions to disable various RRD graphs, this is just getting silly. How can anyone possibly know this will cause an issue?
- use RAM disk <- this requires additional planning and setup to compensate for the loss of persistent logging, and also consumes memory.
Thank you for your suggestions and input. We will consider them.
Curiously, the Hardware Sizing document does not mention storage at all. It even specifically mentions Snort and Suricata, but says nothing about storage. This seems like a logical place to mention storage write and storage space usage considerations, but unfortunately, it is another missed opportunity.
Again, thank you for your suggestions and input. We will consider them.
Now, let us look at the sacred
I am unaware of any package which is marked as religious rather than secular.
Supported pfSense Plus Packages page. Only HAProxy and NtopNG say "Requires SSD/HDD", and Snort and Suricata say "SSD/HDD strongly recommended".
This would imply that the other packages are safe to use with the onboard eMMC storage, right?
In a word: No. There is no statement about safety on that page. In fact, neither the word "safety" nor the word "safe" occurs on that page.
Just to be sure, let us look at the pfBlockerNG documentation page:
Hmm, not much detail there and certainly no mention of storage issues.
What about Status Traffic Totals? Nothing there either.
Maybe some other popular packages will say something.
Arpwatch? Not listed.
Zabbix? Not listed.
Once again, thank you for your suggestions and input. We will consider them.
Even more useful would be Redmine bug reports.
The switch to ZFS could very well be causing accelerated eMMC wear out, which might explain why this issue seems to have become much more common in the past 2-3 years. We have SG-3100 units that are still running with no issues, possibly because they only support UFS. We had a 7100 fail to boot due to a corrupted filesystem that required using the serial console to repair. After that, we reinstalled all other UFS devices with ZFS.
Again, if I buy a truck that clearly states it can haul 20,000 lbs as a standard feature, I should be able to install a trailer hitch and go.
Not if the trailer hitch you install isn't rated for 10 tons.
I should not have to worry about upgrading the engine, braking system, fuel pump, transmission, or suspension to haul the advertised 20,000 lbs!
I don't understand Netgate's and some community members' attitude on this issue: somehow people are using their Netgate device wrong by trying to utilize the advertised features, and they should just accept these failures and install an SSD or buy a new device.
I don't want to quote Steve Jobs, but... you're holding it wrong.
Used within its limitations, eMMC is a good solution. Your phone likely has eMMC inside it. Many network devices, even from companies such as Cisco and HP/Juniper, have eMMC inside them for storage.
I can understand this from CE users on third-party hardware who aren't paying Netgate anything, but anyone who purchases a device from Netgate surely must expect more than the sudden death in 1-2 years of devices that cost several hundred dollars (or even thousands) each.
Please show me the Netgate device that came (only) with eMMC that cost(s) "thousands of dollars".
The oft-repeated suggestion to "support the project" does not apply here, as no amount of pfSense licenses or TAC subscriptions will solve the inherent eMMC limitations of white-labelled Silicom hardware.
I suggest you tone down rhetoric such as this. "White-labeled" does a disservice to our level of effort and engagement with Silicom. You suggest that all we do is slap a label on the box. Are you aware that Silicom also builds (quite similar) devices for Dell and others? Are you suggesting that these are also "white label"? Does Apple using Foxconn to build iPhones mean that iPhones are also "white label"?
For all the pfSense power users here, how can we get Netgate's attention and bring about some kind of change?
Again, you have it.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
@SteveITS said in Another Netgate with storage failure, 6 in total so far:
The amount of disk space used by pfSense is typically relatively small so size isn't really a factor unless downloading large lists or data like the UT1 list which is over 1 GB to extract, when it updates. A larger SSD though would have more writing capacity, I'd expect, due to more unused sectors.
I think this is a critical part of the equation. Settings and packages that log more than the expected baseline, combined with the behavior of ZFS, lead to more storage writes, which go to a small storage device with TRIM disabled/not supported,
TRIM (or an equivalent such as DISCARD) is required by JEDEC standards as far back as 2010.
and a limited number of spare blocks. This quickly exhausts the approximately 3000 write cycles of the eMMC storage. By my calculations, 5TB is the approximate limit of data that can be written before an 8GB eMMC dies.
Yet you do not show these calculations, such that we may assess their accuracy. Specifically, I'm curious what you used for a Write Amplification Factor and how you determined same.
I actually came back to this thread to see if you had responded, but apparently not.