Another Netgate with storage failure, 6 in total so far
-
Yesterday, I had ANOTHER 4100 die, just 2 years old. I tried to salvage the situation by remotely walking the customer through installing pfSense onto a USB drive, but unfortunately, after I logged into the fresh installation and restored the configuration, the firewall refused to boot or even power on. There is no console activity and it just sits with a flashing orange LED. This is the second unit to die completely and refuse to power on.
I have tried to defend other pfSense issues as anomalies, but these storage and device failures are now a complete disaster.
No other firewall brand stipulates that storage must be treated gently for it to last.
This limitation might not be so bad if it was clearly visible and disclosed before purchase, but I challenge anyone to find any mention of the limitations/dangers of eMMC storage on the product pages or anywhere other than the 2 troubleshooting articles.
Let's take a look at the 6100 product page:
https://shop.netgate.com/products/6100-base-pfsense
LOW TOTAL COST OF OWNERSHIP
- No artificial limits or add-ons required to make your system fully functional.
- This system is designed for a long deployment lifetime.
GROWS WITH YOU
- From firewall to Unified Threat Management, get all the security features you need to protect your home or business.
- Flexible configuration and support for multi-WAN, high availability, VPN, load balancing, reporting and monitoring, etc.
- Add optional packages such as Snort or Suricata for IDS/IPS and network security monitoring.
EASY GUI MANAGEMENT
- Manage pfSense Plus software settings through our web-based GUI.
- No fumbling with a command-line interface or typing arcane commands.
The only noticeable difference between the BASE and MAX versions is the addition of a 128GB SSD for $100. The product pages mention all the great things that can be done natively and with packages, but they do not mention any storage concerns.
The product pages need a section that describes the limitations of the onboard eMMC storage with links to the "troubleshooting" documents, and advises getting the MAX version for anything other than basic, out-of-the-box usage.
Further, packages and any other system features, such as RRD graphs and logs, should include text warning of the storage issues, and contain links to the documentation. The current situation leaves a high risk that a user will make a few simple changes and unknowingly turn their Netgate appliance into a ticking bomb.
To make things worse, the default configuration does not automatically perform disk health monitoring nor does it place the SMART widget on the dashboard. Monitoring eMMC storage requires using the CLI to install a package and run commands manually. Even worse, the storage of the 4200 cannot be monitored at all!
I have read recommendations from Netgate staff to disable logging the default firewall ruleset to reduce storage wear.
Enabling RAM disks could help alleviate these issues, but then all logging is lost on reboot. This would be a possible solution if the general system log was kept on disk for troubleshooting and security monitoring, but some things (like ARP watch, gateway issues etc) can still flood the log and cause disk writes.
We like pfSense, and the Netgate hardware is fine, but the Achilles heel is the eMMC storage, which is simply unfit for purpose. There are many posts online and here in the forums of people with similar issues.
For a business-class device, the onboard storage device and limitations do not make sense.
My management team is concerned and we are looking at solutions for our entire fleet of 45 Netgates before more fail and cause disruption to our customers.
If there is something we are overlooking, I would be happy to hear any suggestions.
-
Including the eMMC monitoring package with pfSense was requested 3 years ago but so far still has not been done:
-
@andrew_cb I agree with you. I make it a point to only deploy MAX versions of the Netgate due to the storage. The lowest spec i would go is the 4200 and must be MAX.
On top of the issues you mentioned, you have to take care about the amount of logging that you do. In my case, every single rule created is logged. That's the policy. My 6100s would've died a year into service, but because they are running SSDs, I am 2 years in without any issues. If you are logging heavily or just have lots of I/O to your drive for whatever reason, selecting a Netgate with eMMC is going to cause you lots of pain. The only exception I make is the 1100. That's a very cheap device you throw into a closet somewhere in a cafe, not in a datacenter, so the risks I take with it are worth it.
I think it's a huge flaw to advertise devices with eMMC storage. The standard should be SSD drives, full stop.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
Enabling RAM disks could help alleviate these issues, but then all logging is lost on reboot. This would be a possible solution if the general system log was kept on disk for troubleshooting and security monitoring, but some things (like ARP watch, gateway issues etc) can still flood the log and cause disk writes.
You should be sending logs to a remote syslog server to be fair.
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
Even worse, the storage of the 4200 cannot be monitored at all!
If this is for a business deployment then not selecting this model is a no brainer. If there are no tools to monitor the health of the device including SNMP related monitoring for it, then don't deploy it. You are putting clients at risk putting an unreliable and un-monitorable solution in their environment.
-
You should be sending logs to a remote syslog server to be fair.
I definitely agree, but again, there is no indication that just the regular device logs are a threat to longevity. It would also require more infrastructure setup than comparable devices. For example, Sonicwall, Sophos, Fortinet, Meraki, etc devices can do the same kind of logging for 10 years without an issue. Filling up the storage space is one thing, but having it outright die in 2-3 years due to logging is ridiculous.
If this is for a business deployment then not selecting this model is a no brainer. If there are no tools to monitor the health of the device including SNMP related monitoring for it, then don't deploy it. You are putting clients at risk putting an unreliable and un-monitorable solution in their environment.
The product page has no mention of the inability to monitor the 4200's eMMC storage, which is a loss of functionality compared to the 4100 and 6100 it replaces. Unfortunately, we've already purchased and deployed several 4200 (including 3 to replace failed 4100). I completely agree about not deploying devices that cannot be monitored. We have Zabbix on all pfSense firewalls and it works great.
I agree with you. I make it a point to only deploy MAX versions of the Netgate due to the storage. The lowest spec i would go is the 4200 and must be MAX.
On top of the issues you mentioned, you have to take care about the amount of logging that you do... My 6100s would've died a year into service but because they are running SSDs, i am 2 years in and without any issues. If you are logging heavily or just have lots of I/O to your drive for whatever reason, selecting a Netgate with eMMC is going to cause you lots of pain.
Would it be correct to assume that you learned these issues the hard way and have experienced storage failures in the past before switching to only MAX versions?
My main gripe is the complete lack of information, warnings, or disclaimers prior to purchasing and during general usage. There is no way for a reasonable person to know about the risks with the onboard eMMC storage until it is too late.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
Would it be correct to assume that you learned these issues the hard way and have experienced storage failures in the past before switching to only MAX versions?
I have been a pfSense user for quite some time. I am on the forums here and on Reddit. Tales of unreliable eMMC storage are as old as time, so once I went the MSP route I already knew from other users' experiences what not to do.
Should there be a warning in the marketing? I don't know...eMMC may work really well depending on the deployment. Arista 7050CX3 switches have eMMC storage. Enterprise-grade vendor putting in crappy storage. Then again, there isn't heavy writing to the storage on a switch but I am just trying to illustrate to you that putting these parts in a networking device isn't uncommon. As i mentioned, i have shoved 1100s in a corner at a cafe and no issues for years. I also tune the logging down significantly.
-
@andrew_cb it’s not a product page but I think you’re asking for https://www.netgate.com/supported-pfsense-plus-packages
FWIW I don’t recall that we’ve ever had storage failure at any of our clients. Obviously, situations/setups can differ.
Also maybe useful for readers:
https://docs.netgate.com/pfsense/en/latest/troubleshooting/disk-lifetime.html
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
My main gripe is the complete lack of information, warnings, or disclaimers prior to purchasing and during general usage. and there is no way for a reasonable person to know about the risks with the onboard eMMC storage until it is too late.
Very true.
To reuse (not your) words: within 10 years, it will be known that eMMC isn't the best choice for very write-active OSes. eMMC will probably join Realtek on the "don't use these - period" list.
Btw, what is general usage ?
In the past, for a firewall, it was this (example) :
Using username "root".
Authenticating with public key "rsa-key-20230516-pfsense"
Passphrase for key "rsa-key-20230516-pfsense":
Netgate 4100 - Serial: 2014221462 - Netgate Device ID: e57dfbeef5a2527a
*** Welcome to Netgate pfSense Plus 24.11-RELEASE (amd64) on pfSense ***
Current Boot Environment: 24.11-Release
Next Boot Environment: 24.11-Release
 WAN (wan)       -> ix3    -> v4/DHCP4: 192.168.10.4/24
                              v6/DHCP6: 2a01:beef:907:a600:92ec:77ff:fe29:392a/64
 LAN (lan)       -> igc0   -> v4: 192.168.1.1/24
                              v6/t6: 2a01:beef:907:a6eb:92ec:77ff:fe29:392c/64
 IDRAC (opt1)    -> igc2   -> v4: 192.168.100.1/24
 PORTAL (opt2)   -> igc1   -> v4: 192.168.2.1/24
 VPNS (opt3)     -> ovpns1 -> v4: 192.168.3.1/24
 0) Logout / Disconnect SSH             9) pfTop
 1) Assign Interfaces                  10) Filter Logs
 2) Set interface(s) IP address        11) Restart GUI
 3) Reset admin account and password   12) PHP shell + Netgate pfSense Plus tools
 4) Reset to factory defaults          13) Update from console
 5) Reboot system                      14) Disable Secure Shell (sshd)
 6) Halt system                        15) Restore recent configuration
 7) Ping host                          16) Restart PHP-FPM
 8) Shell
Enter an option:
and from then on the system was idle - doing close to nothing (edit : wrong : it makes stats in the background)
These days, the new normal (example) :
and some of us wonder who picked the colors ... me, I wonder where and how all this info is stored.
Before, with our extreme dumb ISP router with 16 Kbytes of (bios ?) vram, I didn't bother. That router didn't contain any or very few settings and what the heck, the ISP replaces it after one phone call.
But I didn't have these sophisticated stats either.
It all boils down to : what, where, and when is all this backed up ? Where is it stored ?
Look for it, and you'll see the time stamps of all those files, their sizes ... and then you start to get it : "that is the price to pay" these days : useful, less useful, or pure gadget, it all needs megabytes to store its stuff.
That stuff gets rewritten. All the time.
And yeah, being here on this forum for a while, you know this :
You want a Netgate device with hot swap-able dual (because raid 1 !) old iron seagate red label plate based drives .... Like my NAS. No SSD newtech drives which forces me to count write cycles.
Ok, it will consume 0,10 Kwh for sure.
So, ok, plan B : where is it stored ? And can I replace it without doing SMD like soldering ?
Maybe the real question : is it repairable ?
(edit : I also want a double power unit - as all my servers - re edit : wait : Dual HA WAN/LAN ?!).
Anyway, @andrew_cb, keep us posted. As I said elsewhere : you can probably add/replace that broken eMMC. You wind up having the MAX, so you can throw your 4100 back in business and come back in ... 10 years ?
-
Besides the other points raised about eMMC wear caused by logging and/or the data archiving of the fancy Dashboard graph widgets, another thing to consider is that if you have a ZFS install you are automatically going to experience much more background disk writes from ZFS as compared to the old UFS.
ZFS makes regular writes to the disk as part of its normal operation. And it makes quite a bit more of those than UFS does. That's where the resiliency of ZFS comes from. But that resiliency has a price on eMMC or cheaper SSD devices. Just ZFS by itself probably adds a small incremental boost to eMMC wear, but combine that with extensive application and graph widget logging and things can escalate fast.
-
@bmeeks said in Another Netgate with storage failure, 6 in total so far:
Besides the other points raised about eMMC wear caused by logging and/or the data archiving of the fancy Dashboard graph widgets, another thing to consider is that if you have a ZFS install you are automatically going to experience much more background disk writes from ZFS as compared to the old UFS.
ZFS makes regular writes to the disk as part of its normal operation. And it makes quite a bit more of those than UFS does. That's where the resiliency of ZFS comes from. But that resiliency has a price on eMMC or cheaper SSD devices. Just ZFS by itself probably adds a small incremental boost to eMMC wear, but combine that with extensive application and graph widget logging and things can escalate fast.
So just enabling some basic logging/dashboard features, combined with ZFS (which has been the default filesystem for a while now) is enough to shorten the lifespan of a Netgate with eMMC storage?
How is someone supposed to know about these issues? (@bmeeks I know you're not a Netgate employee)
Reading through an example of eMMC longevity calculations from here yields some concerning numbers:
Workload description: 84% Sequential write, 16% Random write
Chunk Size IOs Distribution: 30%: 4KB, 27%: 16KB, 42%: Mix of 8KB, 32KB-256KB, 1%: 512KB
eMMC Cache on
Specific eMMC device specs (from datasheet):
MLC device physical capacity = 0.0074(TB) for 8GB device
endurance cycle = 3000 for MLC
Write Amplification Factor (WAF):
WAF = 4.5 (estimated from the workload description above with simulation)
TBW = physical capacity * endurance cycle / WAF
TBW = 0.0074(TB) * 3000(cycles) / 4.5(WAF) ~= 5.0 TBW
5.0 Terabytes is the total amount of data written to the device during its lifetime of use, depending on the workload.
Taking the 5.0TB writable and calculating for 3 years of life gives us:
5.0TB / 1095 days = 4.57GB per day.
4.57GB / 24 hours = 190MB per hour.
190MB / 60 minutes = 3.17MB per minute.
3.17MB per minute = 53KB per second.
So you must write no more than an average of 53KB per second in order to get 3 years of life. Writing at an average of 79KB per second wears the device out in 2 years, and 159KB per second in 1 year.
Being generous and increasing the numbers 10-fold still only leaves a maximum of 530KB per second to last 3 years.
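For anyone who wants to redo this arithmetic with their own numbers, here is a minimal Python sketch of the same calculation. The function names are made up for illustration, the inputs are the datasheet figures quoted above, and decimal units (1 TB = 10^9 KB) and 365-day years are assumptions. Real eMMC parts and workloads will differ, so treat the output as a rough budget only.

```python
"""Rough eMMC write-budget estimate based on the TBW formula quoted above."""

SECONDS_PER_DAY = 24 * 60 * 60


def tbw_terabytes(capacity_tb: float, endurance_cycles: int, waf: float) -> float:
    """TBW = physical capacity * endurance cycles / write amplification factor."""
    return capacity_tb * endurance_cycles / waf


def write_budget_kb_per_s(tbw_tb: float, target_years: float) -> float:
    """Average write rate (decimal KB/s) that consumes tbw_tb in target_years."""
    total_kb = tbw_tb * 1e9  # assumption: decimal units, 1 TB = 10**9 KB
    seconds = target_years * 365 * SECONDS_PER_DAY  # assumption: 365-day years
    return total_kb / seconds


if __name__ == "__main__":
    # Datasheet example values from the quoted calculation (8GB MLC device).
    tbw = tbw_terabytes(capacity_tb=0.0074, endurance_cycles=3000, waf=4.5)
    print(f"Estimated TBW: {tbw:.2f} TB")

    # The post rounds TBW up to 5.0 TB, so the same figure is used here.
    for years in (3, 2, 1):
        rate = write_budget_kb_per_s(5.0, years)
        print(f"Target life {years} year(s): about {rate:.0f} KB/s average")
```

With the 5.0 TB figure this works out to roughly 53 KB/s for 3 years, 79 KB/s for 2 years, and 159 KB/s for 1 year. The budget scales inversely with the target lifetime, so halving the desired life doubles the allowed write rate.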
The purpose of buying a device instead of building your own is that the manufacturer is supposed to take care of choosing the correct components so that you do not have to. If the expectation is that a user must spend countless hours and years of testing and research in order to understand how to get the device to work properly, then it is far easier and more cost-effective to simply purchase a competitor's device and just pay the yearly fees.
Now I may sound bitter (and I am at the moment), but I genuinely want to provide feedback to Netgate that they can use to make changes so that others do not experience the same challenges that I am currently experiencing with device failures (and the grim prospect that more will likely fail).
With all the complexities of changing the behavior of pfSense and various packages, I believe the easiest way to avoid future problems is to either change to a more robust storage medium in the BASE versions or at least make the limitations of eMMC storage abundantly clear.
The product pages make no differentiation between the BASE and MAX versions other than increased capacity. eMMC and NVMe are mentioned, but I suspect that very few people are aware of the critical differences. After all, if Netgate chose eMMC and no further details are given, then it must not be important, right? A blurb that explains eMMC vs NVME, along with a table listing use cases/packages where the MAX version is recommended, would be a huge benefit to both users and Netgate and a great way to upsell the MAX version.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
With all the complexities of changing the behavior of pfSense and various packages, I believe the easiest way to avoid future problems is to either change to a more robust storage medium in the BASE versions or at least make the limitations of eMMC storage abundantly clear.
I definitely agree with you here. eMMC technology was initially way cheaper than NVMe drives, and that likely drove the decision to use that more than any other factor. I think the disparity has decreased some with the proliferation of NVMe choices now, and NVMe seems a more solid and long-term reliable solution.
-
I agree with all your points, to be frank. I personally hold the belief that offering a BASE version of any model up to the 6100 is silly, especially from a price perspective, where it's a ~100 bucks difference. For an extra $100 you get better storage. If the price difference is negligible, then why offer the BASE to begin with?
In my opinion, you should pay more for better performance in terms of CPU or memory. Look at the 4200 as an example. BASE and MAX are exactly the same in terms of specs except storage. I find it hard to believe the NVMe is an additional $100, but even if true, I still have the same performance. I can see why there are people who would select the BASE SKU, but Netgate is doing more harm than good when a cheap unreliable drive goes bad in their own flagship product, which happens quite often if forum posts are to be believed. Just offer the MAX version, which in reality is the BASE version. That's it. One SKU per product. They do it with the 8200.
Depending on the deployment, I would go white box. Grab the pfSense+ license, which is very cost effective, and deploy a Dell or HP 1RU system, which would be far more reliable and more robust.
-
Imagine you work for a busy company that is trying a different brand of delivery trucks for its fleet that operates 24/7. The delivery trucks work well and employees like them, so the company buys many more over the next few years. At first, there were no available options, but the brochure for recent models lists two axle options: BASE or MAX. The only difference mentioned is that the BASE axle is 6-lug and the MAX axle is 8-lug.
Recently, the factory-equipped axles have begun failing much sooner than those of other truck brands and previous delivery trucks you have owned. Now, a sixth truck is stuck at the side of the road waiting for a tow, missing another important delivery and losing your customer's confidence.
You begin researching axle failures and this truck model. You find that the BASE model axles are similar to the axles used on passenger vehicles. They are fine for driving around town with light loads, but highway driving, adding additional equipment, or carrying additional weight causes the axles to wear internally and fail prematurely. Changing to the MAX axle requires a truck to be out of service for 2 days. You find that other companies using these trucks are having the same problem, and the manufacturer even recommends removing the spare tire, jack, radio, passenger seat, bumpers, and mud flaps to reduce weight and extend the life of the BASE axle.
You might think, "Ridiculous!" One of the main reasons we bought these trucks is that they support a wide variety of aftermarket accessories that allow the trucks to be customized to significantly improve their functionality, as shown in the glossy brochures.
You wonder, "Since these are sold as delivery trucks, why would the manufacturer use inferior axles without making it clear that the MAX axle option is what most customers will need to use the trucks for anything but the lightest of tasks?"
The MAX model was a $1000 option that seemed unnecessary at the time, but each premature axle failure costs you $10,000 in towing charges, labor, equipment rentals, customer goodwill, and vehicle repairs. Replacing the axles before they fail will cost $7000 of lost revenue, parts, and labor per truck. You look out your office at the 40 trucks in your loading yard and wonder how you will get through this situation.
-
So, things keep getting worse. I put together some scripts to run mmc and parse the health data. I will just let the data speak for itself:
Of 33 devices, 10 are over 100% Type A wear, and 8 are over 100% Type B wear.
Strangely, all are reporting Pre-EOL of 0x01.
That's a failure rate of 30% (10 of 33), and if we include the 6 devices that have already failed, that brings it up to a 40% failure rate (16 of 39).
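For anyone wanting to do the same kind of check, here is a rough Python sketch of parsing this health data. It is not the script used for the numbers above. It assumes the text comes from `mmc extcsd read` (mmc-utils) and that the fields are printed with their EXT_CSD register names as in the sample below; the sample values are invented. The decoding follows the JEDEC eMMC convention: life time estimates 0x01 to 0x0A are 10% bands, 0x0B means the estimated lifetime is exceeded, and Pre-EOL is 0x01 normal, 0x02 warning, 0x03 urgent. The `failure_rate` helper just reproduces the fleet arithmetic.

```python
import re

# Matches the three health fields by their EXT_CSD register names.
# Assumption: line format as printed by mmc-utils' `mmc extcsd read`.
FIELD_RE = re.compile(
    r"\[EXT_CSD_(DEVICE_LIFE_TIME_EST_TYP_A|DEVICE_LIFE_TIME_EST_TYP_B|PRE_EOL_INFO)\]"
    r":\s*(0x[0-9a-fA-F]+)"
)

PRE_EOL_STATES = {0x01: "normal", 0x02: "warning", 0x03: "urgent"}


def life_used(code: int) -> str:
    """Decode a life time estimate: 0x01..0x0A are 10% bands, 0x0B is past rated life."""
    if 0x01 <= code <= 0x0A:
        return f"{(code - 1) * 10}-{code * 10}% used"
    if code == 0x0B:
        return "over 100% used"
    return "undefined"


def parse_health(text: str) -> dict:
    """Extract Type A / Type B wear and Pre-EOL state from extcsd output text."""
    raw = {name: int(value, 16) for name, value in FIELD_RE.findall(text)}
    return {
        "type_a": life_used(raw.get("DEVICE_LIFE_TIME_EST_TYP_A", 0)),
        "type_b": life_used(raw.get("DEVICE_LIFE_TIME_EST_TYP_B", 0)),
        "pre_eol": PRE_EOL_STATES.get(raw.get("PRE_EOL_INFO", 0), "undefined"),
    }


def failure_rate(worn: int, monitored: int, already_failed: int = 0) -> float:
    """Share of devices that are worn out or dead, counting already-failed units."""
    return (worn + already_failed) / (monitored + already_failed)


# Hypothetical sample output; the values are invented for illustration.
SAMPLE_OUTPUT = """\
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x0b
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x05
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
"""

if __name__ == "__main__":
    print(parse_health(SAMPLE_OUTPUT))
    print(f"Worn out among monitored devices: {failure_rate(10, 33):.0%}")
    print(f"Including failed devices: {failure_rate(10, 33, already_failed=6):.0%}")
```

Note that Pre-EOL tracks consumed reserved blocks rather than erase cycles, which may be why a device can report over 100% wear while Pre-EOL still reads 0x01.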
There are hundreds of discussions about storage failures in Netgate devices. It seems that most are personal users who are willing to accept this and install an SSD, but for a business with dozens of devices, this is simply unacceptable for a 2-year-old device.
Okay, what if a user wants to install an SSD in their existing 4100 or 6100? No problem, right, since the 6100 product page clearly states:
Physical Expansion Card Slots: 2x m.2 (Key-B slot) with dual-SIM (LTE, Wi-Fi, or NVMe) (PCIe, USB 2.0, USB 3.0)
But wait, there are NO published instructions for installing an SSD, and Netgate staff say it is not possible/supported/recommended!
The warranty is only 1 year. Does Netgate even track the failure rate of devices after the warranty runs out? Clearly, a 30-40% failure rate cannot be normal or acceptable.
The 6100 has only been out for 3.5 years, and the 4100 has only been out for 3 years. Why are so many users experiencing storage failures? There are even posts of 4200's with eMMC failure - these are 9 months old at the most (and have no way to monitor the storage)!
Either:
- Users are using the product wrong (according to what?).
- The BASE version is inferior and not capable of doing what is advertised.
Using the information that is reasonably prominent to a purchaser, the only conclusion is that "The BASE version is inferior and not capable of doing what is advertised." and further, is unfit for anything other than using the default settings.
Multiple years of experience with hardware failures using pfSense, notes tucked away in package documentation, documentation unintuitively named "troubleshooting," and eMMC vs SSD differences are not information that a regular purchaser would be aware of.
The whole point of buying a device from Netgate is to AVOID having to meticulously research hardware specs, particularly for obscure things like eMMC storage device lifetime which is not generally available.
It is very misleading to offer the BASE version when it can only do 10% of the advertised features.
The 8200 and 8300 only come in a single version and only have NVMe storage. Why is this? They run the same software, the same packages, the same default config. Are they doing something special that requires more than eMMC can handle?
So, to summarize:
- There is no mention or warning about the limitations of eMMC storage in the BASE version.
- The product page makes no recommendation to get the MAX version to use the advertised features.
- The product page misleadingly states "No artificial limits or add-ons required to make your system fully functional"; this does not apply to the BASE version, since anything more than the default configuration risks premature storage failure.
- The product pages make no mention that the BASE version cannot be upgraded to the MAX version. When the return period runs out 30 days after purchase, the user is stuck with an expensive device that cannot be upgraded and cannot be used to its full, advertised potential.
- Failure rates of the BASE versions can be 30-40% or more, depending on the packages used.
How can I contact someone at Netgate to discuss this further? I think this is a serious issue and the product pages and documentation need to be updated to clearly distinguish the limitations of the BASE versions and prevent further confusion and premature device failure.
-
@andrew_cb I fully agree with and understand your situation.
I luckily discovered the issue 3 years back before my devices died of wear-out, and installed an SSD myself.
I created a thread (https://forum.netgate.com/topic/170128/emmc-write-endurance) on this forum, clearly identifying the potential problem and encouraging people to dial down the write intensity of packages and firewall rules. At that time pfBlockerNG had an issue causing it to write in an endless loop, so the figures were really bad at the time. But even after that was corrected, it is still a BIG problem on basic installs.
But Netgate kept the eMMC models around and has still not opted to set up RAM disks by default on those devices (which is needed now).
So I have been expecting this to turn into a bigger problem at some point.
Not that it helps you or other customers that have dead devices, but I fully agree with you, and you have my sympathy with your current situation :-(
-
I got an SG-4100 (not the MAX) and the first thing I did was install an NVMe drive.
Since I was an SG-3100 user for a long time, I was already aware of the eMMC lifespan, but I'm pretty sure that new users won't be aware of this.
One suggestion to Netgate would be to give the user more options in the shop, with a warning and a link to the docs:
- Cheaper variant: SG-4200 with eMMC storage (Read about eMMC lifespan here).
- 20 bucks more expensive than the cheaper variant: SG-4200 with a 128GB NVMe (not enterprise NVMe).
- SG-4200 MAX (enterprise NVMe).
This would help users during variant selection: more options for buyers, and a warning so users can be prepared in case they get the eMMC-only variant.
-
@keyser said in Another Netgate with storage failure, 6 in total so far:
I luckily discovered the issue 3 years back before my devices died of wear-out, and installed an SSD myself.
I created a thread (https://forum.netgate.com/topic/170128/emmc-write-endurance) on this forum, clearly identifying
That was one of the forum posts I read and used when I had to decide which 4100 to take.
The elephant mentioned over there (== ZFS) wasn't listed here as a package. I found out what 'ZFS' does for a living ...... and I had my answer straight away.
@andrew_cb : Great write-up. It will give future potential Netgate appliance buyers very useful info (if they look for it ...).
-
@Gertjan said in Another Netgate with storage failure, 6 in total so far:
The elephant mentioned overthere (== ZFS) wasn't listed here as a package. I found out what 'ZFS' does for a living ...... and I had my answer straight away.
I believe ZFS is most definitely a strong underlying root cause of the increased wear. It does quite a bit of background disk writes as part of its resiliency processing. Add on heavy logging with a package or two and you can greatly accelerate the wear.
I'm still running UFS on the two Netgate devices I manage. I just have them each on a UPS.
-
@bmeeks said in Another Netgate with storage failure, 6 in total so far:
@Gertjan said in Another Netgate with storage failure, 6 in total so far:
I believe ZFS is most definitely a strong underlying root cause of the increased wear. It does quite a bit of background disk writes as part of its resiliency processing. Add on heavy logging with a package or two and you can greatly accelerate the wear.
I'm still running UFS on the two Netgate devices I manage. I just have them each on a UPS.
It definitely is since ZFS's write algorithm is both time and allocation triggered. It will always allocate new blocks rather than used blocks for writes. This causes SSDs to rewrite far more blockpages - that would otherwise be considered "static" - over time because of the way they do wear leveling. It's not a HUGE issue, but specifically for lots of logging it will up the write amplification quite noticeably.
However - given HOW prone pfSense boxes are to boot failures on UFS after power outages/hard shutdowns, it's a tradeoff WELL WORTH making. Then come all the other features like boot environments, optional mirroring and fault handling in upgrades.... I see no setups where I would not opt for ZFS and then either get an SSD or enable RAMDISK.
-
Running UFS with ramdisks enabled reduces drive write to near zero and I have yet to see a UFS corruption issue with that.
But it also restricts what you can run, especially on smaller systems without RAM to spare. And you do lose some logs etc in the event of a reboot, which can make troubleshooting tricky.
But on older systems running from SD card or (gasp) CF it's the only real option IMO.