Another Netgate with storage failure, 6 in total so far

keyser

@andrew_cb I fully agree with and understand your situation.
I luckily discovered the issue 3 years back before my devices died of wear-out, and installed an SSD myself.
I created a thread (https://forum.netgate.com/topic/170128/emmc-write-endurance) on this forum, clearly identifying the potential problem and encouraging people to dial down the write intensity of packages and firewall rules. At that time pfBlockerNG had an issue causing it to write in an endless loop, so the figures were really bad at the time. But even after that was corrected, it is still a BIG problem on basic installs.

But Netgate kept the eMMC models around and have still not opted into setting up RAM disk as default on those devices (which is needed now).
So I have been expecting this to turn into a bigger problem at some point.

Not that it helps you or other customers that have dead devices, but I fully agree with you, and you have my sympathy with your current situation :-(

mcury

I got an SG-4100 (not the MAX) and the first thing I did was install an NVMe drive.
Since I was a SG-3100 user for a long time, I was already aware of the eMMC lifespan, but I'm pretty sure that new users won't be aware of this.

One suggestion to Netgate would be to give the user more options in the shop, with a warning and a link to the docs.

Cheaper variant: SG-4200 with eMMC storage (Read about eMMC lifespan here).
20 bucks more expensive than cheaper variant: SG-4200 with a 128GB nvme (not enterprise nvme).
SG-4200 MAX (enterprise nvme).

This would help users during the variant selection, more options for buyers and a warning so users can be prepared in case they get the emmc only variant.

Gertjan

@keyser said in Another Netgate with storage failure, 6 in total so far:

I luckily discovered the issue 3 years back before my devices died of wear-out, and installed an SSD myself.
I created a thread (https://forum.netgate.com/topic/170128/emmc-write-endurance) on this forum, clearly identifying

That was one of the forum posts I read and relied on when deciding which 4100 to get.
The elephant mentioned over there (== ZFS) wasn't listed here as a package. I found out what 'ZFS' does for a living ...... and I had my answer straight away.

@andrew_cb : Great write-up. It will give future potential Netgate appliance buyers very useful info (if they look for it ...).

bmeeks

@Gertjan said in Another Netgate with storage failure, 6 in total so far:

The elephant mentioned over there (== ZFS) wasn't listed here as a package. I found out what 'ZFS' does for a living ...... and I had my answer straight away.

I believe ZFS is most definitely a strong underlying root cause of the increased wear. It does quite a bit of background disk writes as part of its resiliency processing. Add on heavy logging with a package or two and you can greatly accelerate the wear.

I'm still running UFS on the two Netgate devices I manage. I just have them each on a UPS.

keyser

@bmeeks said in Another Netgate with storage failure, 6 in total so far:

@Gertjan said in Another Netgate with storage failure, 6 in total so far:

I believe ZFS is most definitely a strong underlying root cause of the increased wear. It does quite a bit of background disk writes as part of its resiliency processing. Add on heavy logging with a package or two and you can greatly accelerate the wear.

I'm still running UFS on the two Netgate devices I manage. I just have them each on a UPS.

It definitely is, since ZFS's write algorithm is both time and allocation triggered. It will always allocate new blocks rather than overwrite used blocks. This causes SSDs to rewrite far more block pages - that would otherwise be considered "static" - over time because of the way they do wear leveling. It's not a HUGE issue, but specifically for lots of logging it will up the write amplification quite noticeably.
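
As a rough illustration of why a trickle of small log writes is the worst case, here is a toy copy-on-write model in Python. It is only a sketch under stated assumptions, not ZFS's actual allocator or any SSD's wear leveling: the 4 KiB block size, the three metadata blocks rewritten per flush, and the function name are all hypothetical values picked for the example.

```python
import math

def toy_cow_amplification(logical_bytes, block_size=4096, metadata_blocks=3):
    """Physical/logical byte ratio for one flush in a toy copy-on-write model.

    Assumptions (hypothetical, for illustration only): dirty data is rounded
    up to whole blocks, and every flush also writes a fixed number of fresh
    metadata blocks because nothing is updated in place.
    """
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    data_blocks = math.ceil(logical_bytes / block_size)
    physical_bytes = (data_blocks + metadata_blocks) * block_size
    return physical_bytes / logical_bytes

# One 200-byte log line per flush versus a 1 MiB bulk write per flush.
small = toy_cow_amplification(200)
bulk = toy_cow_amplification(1024 * 1024)
print(f"200 B per flush: {small:.1f}x")
print(f"1 MiB per flush: {bulk:.2f}x")
```

In this model the single log line costs far more physical writes per logical byte than the bulk write does, which is the general shape of the logging problem described above. The real numbers depend on the pool layout and the flash controller.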

However - given HOW prone pfsense boxes are to boot failures on UFS after power outages/hard shutdowns, it's a tradeoff WELL WORTH making. Then comes all the other features like boot environments, optional mirroring and fault handling in upgrades.... I see no setups where I would not opt for ZFS and then either get an SSD or enable RAMDISK.

stephenw10

Running UFS with ramdisks enabled reduces drive writes to near zero and I have yet to see a UFS corruption issue with that.

But it also restricts what you can run, especially on smaller systems without RAM to spare. And you do lose some logs etc. in the event of a reboot, which can make troubleshooting tricky.

But on older systems running from SD card or (gasp) CF it's the only real option IMO.

michmoor

@andrew_cb
Brutal…hard to ignore your data points. Good job on providing context.

andrew_cb

Just scrolling through the Official Netgate Hardware forum has these definite storage failures (and there are even more threads that might be storage-related):

4 days ago - 6100 with failed eMMC
6 days ago - 4200 with failed eMMC
8 days ago - failed NVMe on a 6100 MAX
14 days ago - 2100 MAX reporting 48% health
67 days ago - 1100 with failed eMMC

In this thread @SteveITS lists suggestions for reducing storage wear that mirror what is being said by both Netgate staff and other users:

https://www.netgate.com/supported-pfsense-plus-packages lists which packages "require" or recommend SSD over eMMC <- Many packages do not specify that they require/recommend SSD
turn off logging of the default block rules <- why is this on by default if it can be problematic?
turn off logging of the bogon rules <- again, why is this on by default?
turn off Suricata logging of HTTP requests <- there is NO documentation for configuring Suricata
turn off pfBlocker DNSBL logging <- this is not mentioned on the pfBlocker setup page
create a "don't log" rule for IGMP <- this started occurring in 24.03 due to correcting a logging bug. Redmine and Forum discussion. Again, this can create a lot of logging, so why is it enabled by default?
don't view the dashboard 24x7 (each widget logs the web server request to update the widget) <- Along with similar suggestions to disable various RRD graphs, this is just getting silly. How can anyone possibly know this will cause an issue?
use RAM disk <- this requires additional planning and setup to compensate for the loss of persistent logging, and also consumes memory.

Curiously, the Hardware Sizing document does not mention storage at all. It even specifically mentions Snort and Suricata, but says nothing about storage. This seems like a logical place to mention storage write and storage space usage considerations, but unfortunately, it is another missed opportunity.

Now, let us look at the sacred Supported pfSense Plus Packages page. Only HAProxy and NtopNG say "Requires SSD/HDD", and Snort and Suricata say "SSD/HDD strongly recommended".
This would imply that the other packages are safe to use with the onboard eMMC storage, right?

Just to be sure, let us look at the pfBlockerNG documentation page:
Hmm, not much detail there and certainly no mention of storage issues.

What about Status Traffic Totals? Nothing there either.

Maybe some other popular packages will say something.
Arpwatch? Not listed.
Zabbix? Not listed.

The switch to ZFS could very well be causing accelerated eMMC wear out, which might explain why this issue seems to have become much more common in the past 2-3 years. We have SG-3100 units that are still running with no issues, possibly because they only support UFS. We had a 7100 fail to boot due to a corrupted filesystem that required using the serial console to repair. After that, we reinstalled all other UFS devices with ZFS.

Again, if I buy a truck that clearly states it can haul 20,000 lbs as a standard feature, I should be able to install a trailer hitch and go. I should not have to worry about upgrading the engine, braking system, fuel pump, transmission, or suspension to haul the advertised 20,000 lbs!

I don't understand Netgate's and some community members' attitude on this issue: somehow people are using their Netgate device wrong by trying to utilize the advertised features, and they should just accept these failures and install an SSD or buy a new device.

I can understand this from CE users on third-party hardware who aren't paying Netgate anything, but anyone who purchases a device from Netgate surely must expect more than the sudden death in 1-2 years of devices that cost several hundred dollars (or even thousands) each.

The oft-repeated suggestion to "support the project" does not apply here, as no amount of pfSense licenses or TAC subscriptions will solve the inherent eMMC limitations of white-labelled Silicom hardware.

For all the pfSense power users here, how can we get Netgate's attention and bring about some kind of change?

michmoor

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

For all the pfSense power users here, how can we get Netgate's attention and bring about some kind of change?

I'm going to paraphrase a bit from where I heard this statement, but essentially it goes "It's hard to tell someone they are doing something wrong when they are making money".
I would bet Vegas money that the Base version of the SKUs is very profitable compared to the Max. I'm also willing to bet they are aware of the eMMC flaws. At the end of the day (granted, I'm cynical by nature), I don't think this will move the needle much. Netgate has offered eMMC storage for a very long time. I do believe a disclaimer is needed to assist those making a purchase decision.

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

I don't understand Netgate's and some community members' attitude on this issue: somehow people are using their Netgate device wrong by trying to utilize the advertised features, and they should just accept these failures and install an SSD or buy a new device.

Agree with you here as well. The suggestions essentially boil down to "don't use the software as intended". I can't really add much to your analysis and your grievance, but I do hope that 2025 produces some changes.

Gertjan

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

I can understand this from CE users on third-party hardware who aren't

.... aware of this situation, since most, may I say nearly all, users in the early pfSense adoption process start out with a VM or some "saved from the landfill" PC, slide in an extra network card, install pfSense, and before you know it, it's years later.
Because of this, the CE users can't be hit by this issue: they don't use a Netgate appliance, so most probably no eMMC.

And before you think : Not 'against' you, I'd say you've made some very valid points.

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

how can we get Netgate's attention ...

It's just me, "yet another user", saying this, but I'm pretty sure your posts have been read by 'them'.

SteveITS

@andrew_cb said in Another Netgate with storage failure, 6 in total so far:

How can anyone possibly know this will cause an issue?

I was just listing "lower the amount of disk writing" suggestions.

To play devil's advocate I would suggest none of these things "cause" premature wear, at least by themselves. ZFS wasn't a feature, or at least, not the default, when the 2100 and I think 1100 were released. So it could well be a combination of all these things interacting with new defaults.

Personally I don't think it would have occurred to me to keep the dashboard visible all day until I saw posts about it, in a thread about the web server logs. Perhaps it can have a checkbox to auto-update in the background like the traffic graphs do.

I would guess the logging is on by default because it avoids/answers a lot of "why can't I connect" questions. Package documentation I would think is up to the individual package maintainers, and often done via forum post. Some of the doc pages are pretty outdated.

An SSD is also significantly faster in terms of saving, upgrading, etc. since I/O is faster.

The amount of disk space used by pfSense is typically relatively small so size isn't really a factor unless downloading large lists or data like the UT1 list which is over 1 GB to extract, when it updates. A larger SSD though would have more writing capacity, I'd expect, due to more unused sectors.

I don't know that anyone here is trying to dispute your POV, or your frustration. In terms of contacting Netgate, other than the replies above, if you're a partner you have contact info. If not then you could try sales or support, I don't know. It sounds like an SSD would fit more for your usage scenarios, so I guess the question/goal is to help others or new customers who don't know about wear issues.

stephenw10

I raised this internally.

Gertjan

@michmoor said in Another Netgate with storage failure, 6 in total so far:

Netgate has offered eMMC storage for a very long time

Added to that :
Afaik : pfSense becomes more popular every day.
Something tells me that the fact that pfSense was split into "CE" and "pfSense Plus" also has something to do with the selling of Netgate appliances (I would have done the same thing).
Moore's law is also valid for our data storage needs.
More stats, more data (this gives also that nice feeling you are doing something about "security"), more CPU power, faster bandwidth, more data, more stats and so on ...

So more and more appliances out there ....
I know they know: that's why the MAX versions exist (for 2 or 3 years already?). And nobody was asking for another NAS in the house.

andrew_cb

@SteveITS That wasn't directed at you, I was just quoting the content of your post as it is a good example of the suggestions that are frequently proposed. I've read many of your posts and you seem to know a thing or two about pfSense

SteveITS

@andrew_cb Yeah no prob, just trying to clarify. We've been using pfSense from around the m0n0wall days and we have been a partner for quite a while, so have some history.

andrew_cb

@michmoor I would think the MAX versions would be much more profitable than the BASE versions - $100 USD for a 128GB NVMe when brand-name models are available on Amazon for under $30 USD, probably under $20 in bulk. If we had known of the issues, we would have purchased 40 MAX units without hesitation, which would be another $3200 profit for Netgate.

andrew_cb

Thanks everyone for your feedback and support.

I am partly just screaming into the void, but I hope this thread helps others make a more informed purchasing decision using information that, until now, has been scattered all over Reddit and this forum.

I also want to hold Netgate accountable by creating a comprehensive discussion on the root cause and potential solutions to eMMC storage failures on BASE devices, as their response is to just ignore the root cause of all the eMMC failures and simply dismiss them as "user error."

If Netgate really believes that its hardware is "enterprise-ready," then it should investigate the storage failures and address them, either through improved messaging and documentation or by improving the BASE hardware to significantly reduce the chance of failure.

andrew_cb

@stephenw10 said in Another Netgate with storage failure, 6 in total so far:

I raised this internally.

Thank you! I am refreshing this page every 5 seconds to see when you respond (hmm, will that deplete my write cycles?)

stephenw10

Ha well it might. What's the rated number of key presses on F5?

andrew_cb

@SteveITS said in Another Netgate with storage failure, 6 in total so far:

The amount of disk space used by pfSense is typically relatively small so size isn't really a factor unless downloading large lists or data like the UT1 list which is over 1 GB to extract, when it updates. A larger SSD though would have more writing capacity, I'd expect, due to more unused sectors.

I think this is a critical part of the equation. Settings and packages that log more than the expected baseline, combined with the behavior of ZFS, lead to more storage writes, which go to a small storage device with TRIM disabled/not supported, and a limited number of spare blocks. This quickly exhausts the approximately 3000 write cycles of the eMMC storage. By my calculations, 5TB is the approximate limit of data that can be written before an 8GB eMMC dies.
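
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of one way to get a figure in that range. It is not necessarily the calculation behind the 5TB estimate above: the write amplification factor of 4.8 is purely an assumption chosen to show what factor would turn the raw 8GB x 3000 cycles = 24TB into roughly 5TB, and the daily write rates and function names are hypothetical.

```python
def endurance_tb(capacity_gb, pe_cycles, write_amplification=1.0):
    """Host data (decimal TB) writable before the rated P/E cycles are used up.

    write_amplification is the assumed ratio of flash writes to host writes;
    1.0 means every host byte costs exactly one flash byte.
    """
    raw_tb = capacity_gb * pe_cycles / 1000  # decimal GB -> TB
    return raw_tb / write_amplification

def lifetime_years(endurance_tb_value, host_writes_gb_per_day):
    """Years until the endurance budget is spent at a constant daily write rate."""
    return endurance_tb_value * 1000 / host_writes_gb_per_day / 365

raw = endurance_tb(8, 3000)  # no amplification: 24 TB
effective = endurance_tb(8, 3000, write_amplification=4.8)  # assumed factor
print(f"raw: {raw:.0f} TB, with assumed amplification: {effective:.1f} TB")

# Hypothetical daily host write rates, not measurements from any device.
for gb_per_day in (2, 5, 10):
    years = lifetime_years(effective, gb_per_day)
    print(f"{gb_per_day} GB/day -> {years:.1f} years")
```

The same arithmetic also shows why a larger device lasts longer at the same write rate: the budget scales linearly with capacity.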