Another Netgate with storage failure, 6 in total so far
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
Would it be correct to assume that you learned these issues the hard way and have experienced storage failures in the past before switching to only MAX versions?
I have been a pfSense user for quite some time. I am on the forums here and on Reddit. The countless tales of unreliable eMMC storage are as old as time, so once I went the MSP route I knew, based on other users' experiences, what not to do.
Should there be a warning in the marketing? I don't know... eMMC may work really well depending on the deployment. Arista 7050CX3 switches have eMMC storage. Enterprise-grade vendor putting in crappy storage. Then again, there isn't heavy writing to the storage on a switch, but I am just trying to illustrate that putting these parts in a networking device isn't uncommon. As I mentioned, I have shoved 1100s in a corner at a cafe and had no issues for years. I also tune the logging down significantly.
-
@andrew_cb it’s not a product page but I think you’re asking for https://www.netgate.com/supported-pfsense-plus-packages
FWIW I don’t recall that we’ve ever had storage failure at any of our clients. Obviously, situations/setups can differ.
Also maybe useful for readers:
https://docs.netgate.com/pfsense/en/latest/troubleshooting/disk-lifetime.html
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
My main gripe is the complete lack of information, warnings, or disclaimers prior to purchasing and during general usage, and there is no way for a reasonable person to know about the risks with the onboard eMMC storage until it is too late.
Very true.
To reuse someone else's words (not yours): within 10 years, it will be common knowledge that eMMC isn't the best choice for very write-active OSes. eMMC will probably join Realtek on the "don't use these - period" list.
Btw, what is general usage?
In the past, for a firewall, it was this (example):
Using username "root".
Authenticating with public key "rsa-key-20230516-pfsense"
Passphrase for key "rsa-key-20230516-pfsense":
Netgate 4100 - Serial: 2014221462 - Netgate Device ID: e57dfbeef5a2527a
*** Welcome to Netgate pfSense Plus 24.11-RELEASE (amd64) on pfSense ***
Current Boot Environment: 24.11-Release
Next Boot Environment: 24.11-Release
WAN (wan) -> ix3 -> v4/DHCP4: 192.168.10.4/24
 v6/DHCP6: 2a01:beef:907:a600:92ec:77ff:fe29:392a/64
LAN (lan) -> igc0 -> v4: 192.168.1.1/24
 v6/t6: 2a01:beef:907:a6eb:92ec:77ff:fe29:392c/64
IDRAC (opt1) -> igc2 -> v4: 192.168.100.1/24
PORTAL (opt2) -> igc1 -> v4: 192.168.2.1/24
VPNS (opt3) -> ovpns1 -> v4: 192.168.3.1/24
 0) Logout / Disconnect SSH             9) pfTop
 1) Assign Interfaces                  10) Filter Logs
 2) Set interface(s) IP address        11) Restart GUI
 3) Reset admin account and password   12) PHP shell + Netgate pfSense Plus tools
 4) Reset to factory defaults          13) Update from console
 5) Reboot system                      14) Disable Secure Shell (sshd)
 6) Halt system                        15) Restore recent configuration
 7) Ping host                          16) Restart PHP-FPM
 8) Shell
Enter an option:
and from then on the system was idle - doing close to nothing (edit : wrong : it makes stats in the background)
These days, the new normal (example) :
and some of us wonder who picked the colors ... me, I wonder where and how all this info is stored.
Before, with our extremely dumb ISP router with 16 Kbytes of (BIOS?) VRAM, I didn't bother. That router contained no settings, or very few, and what the heck, the ISP replaces it after one phone call.
But I didn't have these sophisticated stats either.
It all boils down to: what, where, and when is all this backed up? Where is it stored?
Look for it, and you'll see the time stamps of all those files, their sizes ... and then you start to get it: "that is the price to pay" these days. Useful, less useful, or pure gadget, it all needs megabytes to store its stuff.
That stuff gets rewritten. All the time.
And yeah, after being here on this forum for a while, you know this:
You want a Netgate device with hot-swappable dual (because RAID 1!) old-iron Seagate red-label platter-based drives ... like my NAS. No newtech SSD drives which force me to count write cycles.
Ok, it will consume 0,10 kWh for sure.
So, ok, plan B: where is it stored? And can I replace it without doing SMD-like soldering?
Maybe the real question: is it repairable?
(edit: I also want a double power unit, as on all my servers - re-edit: wait: dual HA WAN/LAN?!)
Anyway, @andrew_cb, keep us posted. As I said elsewhere: you can probably add to or replace that broken eMMC. You'd wind up with a MAX, so you can put your 4100 back in business and come back in ... 10 years?
-
-
Besides the other points raised about eMMC wear caused by logging and/or the data archiving of the fancy Dashboard graph widgets, another thing to consider is that if you have a ZFS install you are automatically going to experience much more background disk writes from ZFS as compared to the old UFS.
ZFS makes regular writes to the disk as part of its normal operation. And it makes quite a bit more of those than UFS does. That's where the resiliency of ZFS comes from. But that resiliency has a price on eMMC or cheaper SSD devices. Just ZFS by itself probably adds a small incremental boost to eMMC wear, but combine that with extensive application and graph widget logging and things can escalate fast.
-
@bmeeks said in Another Netgate with storage failure, 6 in total so far:
Besides the other points raised about eMMC wear caused by logging and/or the data archiving of the fancy Dashboard graph widgets, another thing to consider is that if you have a ZFS install you are automatically going to experience much more background disk writes from ZFS as compared to the old UFS.
ZFS makes regular writes to the disk as part of its normal operation. And it makes quite a bit more of those than UFS does. That's where the resiliency of ZFS comes from. But that resiliency has a price on eMMC or cheaper SSD devices. Just ZFS by itself probably adds a small incremental boost to eMMC wear, but combine that with extensive application and graph widget logging and things can escalate fast.
So just enabling some basic logging/dashboard features, combined with ZFS (which has been the default filesystem for a while now) is enough to shorten the lifespan of a Netgate with eMMC storage?
How is someone supposed to know about these issues? (@bmeeks I know you're not a Netgate employee)
Reading through an example of eMMC longevity calculations from here yields some concerning numbers:
Workload description:
84% sequential write, 16% random write
Chunk size IO distribution: 30%: 4KB, 27%: 16KB, 42%: mix of 8KB, 32KB-256KB, 1%: 512KB
eMMC cache on
Specific eMMC device specs (from datasheet):
MLC device physical capacity = 0.0074 (TB) for 8GB device
Endurance cycle = 3000 for MLC
Write Amplification Factor (WAF):
WAF = 4.5 (estimated from the workload description above with simulation)
TBW = physical capacity * endurance cycle / WAF
TBW = 0.0074 (TB) * 3000 (cycles) / 4.5 (WAF) ~= 5.0 TBW
5.0 Terabytes is the total amount of data written to the device during its lifetime of use, depending on the workload.
Taking the 5.0TB writable and calculating for 3 years of life gives us:
5.0TB / 1095 days = 4.57GB per day.
4.57GB / 24 hours = 190MB per hour.
190MB / 60 minutes = 3.17MB per minute.
3.17MB / 60 seconds = 53KB per second.
So you must write no more than an average of 53KB per second in order to get 3 years of life, 79KB per second to last 2 years, and 159KB per second to last 1 year.
Being generous and increasing the numbers 10-fold still only leaves a maximum of 530KB per second to last 3 years.
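The arithmetic above can be re-run as a short Python sketch. The inputs (0.0074 TB physical capacity, 3000 cycles, WAF 4.5, and the rounded 5.0 TBW) are taken from the quoted example, not from any Netgate datasheet, and the function names are only illustrative. Because the write budget scales with 1/years, a 2-year target comes out at roughly 79 KB/s rather than double the 3-year figure.

```python
# Sketch of the endurance arithmetic from the quoted example.
# All inputs are the example's values, not measurements of a specific device.

SECONDS_PER_DAY = 24 * 60 * 60


def tbw_terabytes(physical_capacity_tb: float, endurance_cycles: int, waf: float) -> float:
    """TBW = physical capacity * endurance cycles / write amplification factor."""
    return physical_capacity_tb * endurance_cycles / waf


def write_budget_kb_per_s(tbw_tb: float, years: float) -> float:
    """Average write rate (decimal KB/s) that consumes tbw_tb in exactly `years` years."""
    total_bytes = tbw_tb * 1e12
    seconds = years * 365 * SECONDS_PER_DAY
    return total_bytes / seconds / 1e3


# About 4.93 TB, which the quoted example rounds up to 5.0 TBW.
tbw = tbw_terabytes(0.0074, 3000, 4.5)
print(f"TBW from the formula: {tbw:.2f} TB")

# Use the rounded 5.0 TBW figure, as the post does.
for years in (3, 2, 1):
    print(f"{years} year(s): {write_budget_kb_per_s(5.0, years):.0f} KB/s")
```

With the example's figures this prints 53, 79, and 159 KB/s for 3, 2, and 1 years respectively.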
The purpose of buying a device instead of building your own is that the manufacturer is supposed to take care of choosing the correct components so that you do not have to. If the expectation is that a user must spend countless hours and years of testing and research in order to understand how to get the device to work properly, then it is far easier and more cost-effective to simply purchase a competitor's device and just pay the yearly fees.
Now I may sound bitter (and I am at the moment), but I genuinely want to provide feedback to Netgate that they can use to make changes so that others do not experience the same challenges that I am currently experiencing with device failures (and the grim prospect that more will likely fail).
With all the complexities of changing the behavior of pfSense and various packages, I believe the easiest way to avoid future problems is to either change to a more robust storage medium in the BASE versions or at least make the limitations of eMMC storage abundantly clear.
The product pages make no differentiation between the BASE and MAX versions other than increased capacity. eMMC and NVMe are mentioned, but I suspect that very few people are aware of the critical differences. After all, if Netgate chose eMMC and no further details are given, then it must not be important, right? A blurb that explains eMMC vs NVMe, along with a table listing use cases/packages where the MAX version is recommended, would be a huge benefit to both users and Netgate and a great way to upsell the MAX version.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
With all the complexities of changing the behavior of pfSense and various packages, I believe the easiest way to avoid future problems is to either change to a more robust storage medium in the BASE versions or at least make the limitations of eMMC storage abundantly clear.
I definitely agree with you here. eMMC technology was initially way cheaper than NVMe drives, and that likely drove the decision to use that more than any other factor. I think the disparity has decreased some with the proliferation of NVMe choices now, and NVMe seems a more solid and long-term reliable solution.
-
I agree with all your points, to be frank. I personally hold the belief that offering a BASE version of any model up to the 6100 is silly, especially from the price perspective, where it's a ~100 bucks difference. For an extra $100 you get better storage. If the price difference is negligible, then why offer the BASE to begin with?
In my opinion, you should pay more for better performance in terms of CPU or memory. Look at the 4200 as an example. BASE and MAX are exactly the same in terms of specs except storage. I find it hard to believe the NVMe is an additional $100, but even if true, I still have the same performance. I can see why there are people who would select the BASE SKU, but Netgate is doing more harm than good when a cheap unreliable drive goes bad in their own flagship product, which happens quite often if forum posts are to be believed. Just offer the MAX version, which in reality is the BASE version. That's it. One SKU per product. They do it with the 8200.
Depending on the deployment, I would go white box. Grab the pfSense+ license, which is very cost effective, and deploy a Dell or HP 1RU system, which would be far more reliable and more robust.
-
Imagine you work for a busy company that is trying a different brand of delivery trucks for its fleet that operates 24/7. The delivery trucks work well and employees like them, so the company buys many more over the next few years. At first, there were no available options, but the brochure for recent models lists two axle options: BASE or MAX. The only difference mentioned is that the BASE axle is 6-lug and the MAX axle is 8-lug.
Recently, the factory-equipped axles have begun failing much sooner than those of other truck brands and previous delivery trucks you have owned. Now, a sixth truck is stuck at the side of the road waiting for a tow, missing another important delivery and losing your customer's confidence.
You begin researching axle failures and this truck model. You find that the BASE model axles are similar to the axles used on passenger vehicles. They are fine for driving around town with light loads, but highway driving, adding additional equipment, or carrying additional weight causes the axles to wear internally and fail prematurely. Changing to the MAX axle requires a truck to be out of service for 2 days. You find that other companies using these trucks are having the same problem, and the manufacturer even recommends removing the spare tire, jack, radio, passenger seat, bumpers, and mud flaps to reduce weight and extend the life of the BASE axle.
You might think, "Ridiculous!" One of the main reasons we bought these trucks is that they support a wide variety of aftermarket accessories that allow the trucks to be customized to significantly improve their functionality, as shown in the glossy brochures.
You wonder, "Since these are sold as delivery trucks, why would the manufacturer use inferior axles without making it clear that the MAX axle option is what most customers will need to use the trucks for anything but the lightest of tasks?"
The MAX model was a $1000 option that seemed unnecessary at the time, but each premature axle failure costs you $10,000 in towing charges, labor, equipment rentals, customer goodwill, and vehicle repairs. Replacing the axles before they fail will cost $7000 of lost revenue, parts, and labor per truck. You look out your office at the 40 trucks in your loading yard and wonder how you will get through this situation.
-
So, things keep getting worse. I put together some scripts to run mmc and parse the health data. I will just let the data speak for itself:
Of 33 devices, 10 are over 100% Type A wear, and 8 are over 100% Type B wear.
Strangely, all are reporting Pre-EOL of 0x01.
That's a failure rate of 30%, and if we include the 6 devices that have already failed, that brings it up to a 40% failure rate.
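For anyone wanting to do something similar, here is a minimal Python sketch of the parsing step. It is not the script used above. It assumes the health lines look like what mmc-utils prints for an EXT_CSD read (field names such as EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A), and the sample text is made up for illustration. The decoding follows the JEDEC eMMC 5.x convention as I understand it: lifetime values 0x01 to 0x0A mean 0-10% through 90-100% used, 0x0B means the estimated lifetime is exceeded, and Pre-EOL 0x01/0x02/0x03 mean normal/warning/urgent. Pre-EOL is based on reserved-block consumption rather than on the wear estimate, which may be why it can still read 0x01 on a device that reports over 100% wear.

```python
import re

# Pre-EOL codes per the JEDEC eMMC 5.x EXT_CSD definition (reserved-block status).
PRE_EOL = {0x01: "normal", 0x02: "warning", 0x03: "urgent"}

# Assumed line layout: "... [EXT_CSD_<FIELD>]: 0xNN"
FIELD_RE = re.compile(
    r"\[EXT_CSD_(DEVICE_LIFE_TIME_EST_TYP_A|DEVICE_LIFE_TIME_EST_TYP_B|PRE_EOL_INFO)\]"
    r":\s*(0x[0-9a-fA-F]+)"
)


def life_used_range(code: int) -> str:
    """Map a DEVICE_LIFE_TIME_EST value to the wear range it represents."""
    if 0x01 <= code <= 0x0A:
        return f"{(code - 1) * 10}-{code * 10}% used"
    if code == 0x0B:
        return "exceeded estimated lifetime"
    return "undefined"


def parse_health(text: str) -> dict:
    """Extract the three health fields from mmc-style output and decode them."""
    fields = {name: int(value, 16) for name, value in FIELD_RE.findall(text)}
    return {
        "type_a": life_used_range(fields.get("DEVICE_LIFE_TIME_EST_TYP_A", 0)),
        "type_b": life_used_range(fields.get("DEVICE_LIFE_TIME_EST_TYP_B", 0)),
        "pre_eol": PRE_EOL.get(fields.get("PRE_EOL_INFO", 0), "undefined"),
    }


# Hypothetical sample, not output captured from a real device.
sample = """\
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x0b
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x05
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
"""
print(parse_health(sample))
```

Feeding it the saved output from each device gives one small dict per unit, which is easy to sort by wear.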
There are hundreds of discussions about storage failures in Netgate devices. It seems that most are personal users who are willing to accept this and install an SSD, but for a business with dozens of devices, this is simply unacceptable for a 2-year-old device.
Okay, what if a user wants to install an SSD in their existing 4100 or 6100? No problem, right, since the 6100 product page clearly states:
Physical Expansion Card Slots: 2x m.2 (Key-B slot) with dual-SIM (LTE, Wi-Fi, or NVMe) (PCIe, USB 2.0, USB 3.0)
But wait, there are NO published instructions for installing an SSD, and Netgate staff say it is not possible/supported/recommended!
The warranty is only 1 year. Does Netgate even track the failure rate of devices after the warranty runs out? Clearly, a 30-40% failure rate cannot be normal or acceptable.
The 6100 has only been out for 3.5 years, and the 4100 has only been out for 3 years. Why are so many users experiencing storage failures? There are even posts of 4200s with eMMC failure - these are 9 months old at the most (and have no way to monitor the storage)!
Either:
- Users are using the product wrong (according to what?).
- The BASE version is inferior and not capable of doing what is advertised.
Using the information that is reasonably prominent to a purchaser, the only conclusion is that "The BASE version is inferior and not capable of doing what is advertised." and further, is unfit for anything other than using the default settings.
Multiple years of experience with hardware failures using pfSense, notes tucked away in package documentation, documentation unintuitively named "troubleshooting", and eMMC vs SSD differences are not information that a regular purchaser would be aware of.
The whole point of buying a device from Netgate is to AVOID having to meticulously research hardware specs, particularly for obscure things like eMMC storage device lifetime which is not generally available.
It is very misleading to offer the BASE version when it can only do 10% of the advertised features.
The 8200 and 8300 only come in a single version and only have NVMe storage. Why is this? They run the same software, the same packages, the same default config. Are they doing something special that requires more than eMMC can handle?
So, to summarize:
- There is no mention or warning about the limitations of eMMC storage in the BASE version.
- The product page makes no recommendation to get the MAX version to use the advertised features.
- The product page misleadingly states "No artificial limits or add-ons required to make your system fully functional"; this does not apply to the BASE version, since anything more than the default configuration risks premature storage failure.
- The product pages make no mention that the BASE version cannot be upgraded to the MAX version. When the return period runs out 30 days after purchase, the user is stuck with an expensive device that cannot be upgraded and cannot be used to its full, advertised potential.
- Failure rates of the BASE versions can be 30-40% or more, depending on the packages used.
How can I contact someone at Netgate to discuss this further? I think this is a serious issue and the product pages and documentation need to be updated to clearly distinguish the limitations of the BASE versions and prevent further confusion and premature device failure.
-
@andrew_cb I fully agree with and understand your situation.
I luckily discovered the issue 3 years back before my devices died of wear-out, and installed an SSD myself.
I created a thread (https://forum.netgate.com/topic/170128/emmc-write-endurance) on this forum, clearly identifying the potential problem and encouraging people to dial down the write intensity of packages and firewall rules. At that time pfBlockerNG had an issue causing it to write in an endless loop, so the figures were really bad at the time. But even after that was corrected, it is still a BIG problem on basic installs.
But Netgate kept the eMMC models around and has still not opted to set up a RAM disk as the default on those devices (which is needed now).
So I have been expecting this to turn into a bigger problem at some point.
Not that it helps you or other customers that have dead devices, but I fully agree with you, and you have my sympathy with your current situation :-(
-
I got an SG-4100 (not the MAX) and the first thing I did was install an NVMe drive.
Since I was an SG-3100 user for a long time, I was already aware of the eMMC lifespan, but I'm pretty sure that new users won't be aware of this.
One suggestion to Netgate would be to give the user more options in the shop, with a warning and a link to the docs:
- Cheaper variant: SG-4200 with eMMC storage (Read about eMMC lifespan here).
- 20 bucks more expensive than the cheaper variant: SG-4200 with a 128GB NVMe (not enterprise NVMe).
- SG-4200 MAX (enterprise NVMe).
This would help users during variant selection: more options for buyers, and a warning so users can be prepared in case they get the eMMC-only variant.
-
@keyser said in Another Netgate with storage failure, 6 in total so far:
I luckily discovered the issue 3 years back before my devices died of wear-out, and installed an SSD myself.
I created a thread (https://forum.netgate.com/topic/170128/emmc-write-endurance) on this forum, clearly identifying
That was one of the forum posts I read and used when I had to decide which 4100 to get.
The elephant mentioned over there (== ZFS) wasn't listed here as a package. I found out what 'ZFS' does for a living ...... and I had my answer straight away.
@andrew_cb: Great write-up. It will give future potential Netgate appliance buyers very useful info (if they look for it ...).
-
@Gertjan said in Another Netgate with storage failure, 6 in total so far:
The elephant mentioned over there (== ZFS) wasn't listed here as a package. I found out what 'ZFS' does for a living ...... and I had my answer straight away.
I believe ZFS is most definitely a strong underlying root cause of the increased wear. It does quite a bit of background disk writes as part of its resiliency processing. Add on heavy logging with a package or two and you can greatly accelerate the wear.
I'm still running UFS on the two Netgate devices I manage. I just have them each on a UPS.
-
@bmeeks said in Another Netgate with storage failure, 6 in total so far:
@Gertjan said in Another Netgate with storage failure, 6 in total so far:
I believe ZFS is most definitely a strong underlying root cause of the increased wear. It does quite a bit of background disk writes as part of its resiliency processing. Add on heavy logging with a package or two and you can greatly accelerate the wear.
I'm still running UFS on the two Netgate devices I manage. I just have them each on a UPS.
It definitely is since ZFS's write algorithm is both time and allocation triggered. It will always allocate new blocks rather than used blocks for writes. This causes SSDs to rewrite far more blockpages - that would otherwise be considered "static" - over time because of the way they do wear leveling. It's not a HUGE issue, but specifically for lots of logging it will up the write amplification quite noticeably.
However - given HOW prone pfSense boxes are to boot failures on UFS after power outages/hard shutdowns, it's a tradeoff WELL WORTH making. Then come all the other features like boot environments, optional mirroring and fault handling in upgrades.... I see no setups where I would not opt for ZFS and then either get an SSD or enable RAM disks.
-
Running UFS with ramdisks enabled reduces drive write to near zero and I have yet to see a UFS corruption issue with that.
But it also restricts what you can run, especially on smaller systems without RAM to spare. And you do lose some logs etc. in the event of a reboot, which can make troubleshooting tricky.
But on older systems running from SD card or (gasp) CF it's the only real option IMO.
-
@andrew_cb
Brutal…hard to ignore your data points. Good job on providing context.
-
Just scrolling through the Official Netgate Hardware forum has these definite storage failures (and there are even more threads that might be storage-related):
- 4 days ago - 6100 with failed eMMC
- 6 days ago - 4200 with failed eMMC
- 8 days ago - failed NVMe on a 6100 MAX
- 14 days ago - 2100 MAX reporting 48% health
- 67 days ago - 1100 with failed eMMC
In this thread @SteveITS lists suggestions for reducing storage wear that mirror what is being said by both Netgate staff and other users:
- https://www.netgate.com/supported-pfsense-plus-packages lists which packages "require" or recommend SSD over eMMC <- Many packages do not specify that they require/recommend SSD
- turn off logging of the default block rules <- why is this on by default if it can be problematic?
- turn off logging of the bogon rules <- again, why is this on by default?
- turn off Suricata logging of HTTP requests <- there is NO documentation for configuring Suricata
- turn off pfBlocker DNSBL logging <- this is not mentioned on the pfBlocker setup page
- create a "don't log" rule for IGMP <- this started occurring in 24.03 due to correcting a logging bug. Redmine and Forum discussion. Again, this can create a lot of logging, so why is it enabled by default?
- don't view the dashboard 24x7 (each widget logs the web server request to update the widget) <- Along with similar suggestions to disable various RRD graphs, this is just getting silly. How can anyone possibly know this will cause an issue?
- use RAM disk <- this requires additional planning and setup to compensate for the loss of persistent logging, and also consumes memory.
Curiously, the Hardware Sizing document does not mention storage at all. It even specifically mentions Snort and Suricata, but says nothing about storage. This seems like a logical place to mention storage write and storage space usage considerations, but unfortunately, it is another missed opportunity.
Now, let us look at the sacred Supported pfSense Plus Packages page. Only HAProxy and NtopNG say "Requires SSD/HDD", and Snort and Suricata say "SSD/HDD strongly recommended".
This would imply that the other packages are safe to use with the onboard eMMC storage, right?
Just to be sure, let us look at the pfBlockerNG documentation page:
Hmm, not much detail there and certainly no mention of storage issues.
What about Status Traffic Totals? Nothing there either.
Maybe some other popular packages will say something.
Arpwatch? Not listed.
Zabbix? Not listed.
The switch to ZFS could very well be causing accelerated eMMC wear-out, which might explain why this issue seems to have become much more common in the past 2-3 years. We have SG-3100s that are still running with no issues, possibly because they only support UFS. We had a 7100 fail to boot due to a corrupted filesystem that required using the serial console to repair. After that, we reinstalled all other UFS devices with ZFS.
Again, if I buy a truck that clearly states it can haul 20,000 lbs as standard feature, I should be able to install a trailer hitch and go. I should not have to worry about upgrading the engine, braking system, fuel pump, transmission, or suspension to haul the advertised 20,000 lbs!
I don't understand Netgate's and some community members' attitude on this issue: somehow people are using their Netgate device wrong by trying to utilize the advertised features, and they should just accept these failures and install an SSD or buy a new device.
I can understand this from CE users on third-party hardware who aren't paying Netgate anything, but anyone who purchases a device from Netgate surely must expect more than the sudden death in 1-2 years of devices that cost several hundred dollars (or even thousands) each.
The oft-repeated suggestion to "support the project" does not apply here, as no amount of pfSense licenses or TAC subscriptions will solve the inherent eMMC limitations of white-labelled Silicom hardware.
For all the pfSense power users here, how can we get Netgate's attention and bring about some kind of change?
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
For all the pfSense power users here, how can we get Netgate's attention and bring about some kind of change?
I'm going to paraphrase a bit from where I heard this statement, but essentially it goes: "It's hard to tell someone they are doing something wrong when they are making money".
I would bet Vegas money that the BASE version of the SKUs is very profitable compared to the MAX. I'm also willing to bet they are aware of the eMMC flaws. At the end of the day (granted, I'm cynical by nature), I don't think this will move the needle much. Netgate has offered eMMC storage for a very long time. I do believe a disclaimer is needed to assist those making a purchase decision.
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
I don't understand Netgate's and some community members' attitude on this issue: somehow people are using their Netgate device wrong by trying to utilize the advertised features, and they should just accept these failures and install an SSD or buy a new device.
Agree with you here as well. The suggestions essentially boil down to "don't use the software as intended". I can't really add much to your analysis and your grievance, but I do hope that 2025 produces some changes.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
I can understand this from CE users on third-party hardware who aren't
.... aware of this situation, as most of them (may I say nearly all?) start out, early in the pfSense adoption process, with a VM or some "saved from the landfill" PC, slide in an extra network card, install pfSense, and before you know it, it's years later.
Because of this, they, the CE users, can't be hit by this issue: they don't use a Netgate appliance, so most probably no eMMC.
And before you think otherwise: I'm not 'against' you; I'd say you've made some very valid points.
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
how can we get Netgate's attention ...
It's just me, "yet another user" saying, but I'm pretty sure your posts have been read by 'them'.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
How can anyone possibly know this will cause an issue?
I was just listing "lower the amount of disk writing" suggestions.
To play devil's advocate I would suggest none of these things "cause" premature wear, at least by themselves. ZFS wasn't a feature, or at least, not the default, when the 2100 and I think 1100 were released. So it could well be a combination of all these things interacting with new defaults.
Personally I don't think it would have occurred to me to keep the dashboard visible all day until I saw posts about it, in a thread about the web server logs. Perhaps it can have a checkbox to auto-update in the background like the traffic graphs do.
I would guess the logging is on by default because it avoids/answers a lot of "why can't I connect" questions. Package documentation I would think is up to the individual package maintainers, and often done via forum post. Some of the doc pages are pretty outdated.
An SSD is also significantly faster in terms of saving, upgrading, etc. since I/O is faster.
The amount of disk space used by pfSense is typically relatively small so size isn't really a factor unless downloading large lists or data like the UT1 list which is over 1 GB to extract, when it updates. A larger SSD though would have more writing capacity, I'd expect, due to more unused sectors.
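As a rough illustration of the capacity point, here is a Python sketch that reuses the TBW formula quoted earlier in the thread (capacity * endurance cycles / WAF) to estimate lifetime at a fixed write rate. The 3000-cycle and 4.5 WAF figures come from that quoted example. The 500 KB/s write rate and the 128 GB size are hypothetical, and a real SSD will have different flash and endurance ratings, so treat the numbers as a scaling argument only.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def years_of_life(capacity_gb: float, endurance_cycles: int, waf: float,
                  write_kb_per_s: float) -> float:
    """Years until the TBW budget is used up at a constant average write rate."""
    tbw_bytes = capacity_gb * 1e9 * endurance_cycles / waf
    bytes_per_year = write_kb_per_s * 1e3 * SECONDS_PER_YEAR
    return tbw_bytes / bytes_per_year


# Hypothetical comparison: same cycles and WAF, only the capacity differs.
for capacity_gb in (8, 128):
    life = years_of_life(capacity_gb, endurance_cycles=3000, waf=4.5, write_kb_per_s=500)
    print(f"{capacity_gb} GB: about {life:.1f} years at 500 KB/s")
```

Under these assumptions the lifetime scales linearly with capacity, so the 128 GB case lasts 16 times as long as the 8 GB case.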
I don't know that anyone here is trying to dispute your POV, or your frustration. In terms of contacting Netgate, other than the replies above, if you're a partner you have contact info. If not then you could try sales or support, I don't know. It sounds like an SSD would fit more for your usage scenarios, so I guess the question/goal is to help others or new customers who don't know about wear issues.