Another Netgate with storage failure, 6 in total so far
-
It has been nearly two months since Netgate acknowledged the issue, and there have been no changes.
[literally responding from a Starbucks in So Colorado at 5:30 on a Friday]
You are wrong, there are a lot of changes in-progress, but I’m not getting into this with you, here, right now.
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
just like you cannot sell a manual transmission vehicle without an instrument panel and then say it is the user's fault when the engine blows up.
I have several old Toyota trucks with manual transmissions that do not have tachometers.
-
@jwt said in Another Netgate with storage failure, 6 in total so far:
I have several old Toyota trucks with manual transmissions that do not have tachometers.
A 22R requires some patience and effort to get it revving high enough to blow... not to mention the noise it makes will provide some obvious auditory clues.
Most 22RE are limited to 5800 RPM.
In both cases, I am pretty sure that Toyota has the maximum safe rev limit specified in the owner's manual.
Are 40-year-old Toyota trucks the best comparison for Netgate firewalls? Both offer legacy/carb and EFI options, but the evidence suggests that Toyota is the more reliable of the two.
-
[Now close to New Mexico]
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
@jwt said in Another Netgate with storage failure, 6 in total so far:
I have several old Toyota trucks with manual transmissions that do not have tachometers.
A 22R requires some patience and effort to get it revving high enough to blow... not to mention the noise it makes will provide some obvious auditory clues.
Most 22RE are limited to 5800 RPM.
In both cases, I am pretty sure that Toyota has the maximum safe rev limit specified in the owner's manual.
Nothing I own has a 22R. Some have F or 2F engines. One has a 3F (belongs to the CEO, technically), but that one has an auto, and a tach.
You were talking about instrumentation on the dash, now you’re bailing for the owner’s manual.
Are 40-year-old Toyota trucks the best comparison for Netgate firewalls? Both offer legacy/carb and EFI options, but the evidence suggests that Toyota is the more reliable of the two.
Go ahead, show us all the proof in this FJ40 manual
https://www.slideshare.net/slideshow/toyota-land-cruiser-owners-manual-1968-1971-fj40-fj43-fj45-pdf/269783892
Seems like you still don’t know what you’re talking about, and are only here to fight.
-
@jwt said in Another Netgate with storage failure, 6 in total so far:
It has been nearly two months since Netgate acknowledged the issue, and there have been no changes.
[literally responding from a Starbucks in So Colorado at 5:30 on a Friday]
You are wrong, there are a lot of changes in-progress, but I’m not getting into this with you, here, right now.
Nice of you to chime in with vague and unverifiable claims.
Perhaps your time would have been better spent doing something to help protect your 'valued' customers from premature, catastrophic failures of your hardware, such as adding warnings and/or disclaimers to the product pages in the Netgate store?
-
@jwt said in Another Netgate with storage failure, 6 in total so far:
Seems like you still don’t know what you’re talking about, and are only here to fight.
You are right - I am here fighting to protect current and prospective Netgate customers against sudden, premature failure of Netgate firewalls that they are spending their hard-earned money on. My posts on Reddit and this forum have helped many users avoid disaster, particularly those who became aware that their Netgate firewall was terminally damaged and at risk of imminent failure. How many users have you helped avoid storage failure today, this week, this month, this year?
You seem passionate about discussing old Toyotas - unfortunately, you do not seem to share the same passion for educating and protecting Netgate customers from storage failure.
I will concede that the 1968-1971 Toyota FJ40 did not come with a tachometer and the owner's manual does not specify a maximum RPM for the engine. Your example of a 55-year-old vehicle proves that everything else posted in the nearly 200 comments on this thread is wrong. Clearly, nothing is wrong, and all the premature failures of Netgate devices must be some sort of shared mass-hallucination.
On the subject of Toyota Land Cruisers, do you ever go out cruising with fellow enthusiasts such as Netgate's marketing director or gonzopancho from Reddit? Gonzopancho also lives in Colorado, and he is the co-owner of Netgate and head of engineering. You and gonzopancho have very similar writing styles. Boy, it sure would be a surprise if you turned out to be him!
Maybe you guys can meet up sometime and discuss solutions to improve storage health monitoring and lifetime, and ways to better educate customers on the storage limitations of the BASE/eMMC versions of Netgate firewalls?
-
@andrew_cb while gonzo isn’t exactly dead, I try to not let him out of his cage. Everyone prefers this.
On the subject of old Toyotas, you happened to wander into something else where TJ (whom you mention by reference) and I both have a lot of experience. TJ went to work at HP in their non-volatile storage group out of B-school. All in Colorado.
I for one, am back home in Tejas.
As I said upthread, there are changes underway. Thanks for being part of the conversation and community.
-
@jwt said in Another Netgate with storage failure, 6 in total so far:
As I said upthread, there are changes underway. Thanks for being part of the conversation and community.
I look forward to seeing the changes and hope they can help prevent more unexpected storage failures.
Will these upcoming changes be in 25.03, or will they be another 6 months out until the next release?
Can you confirm that the changes will include:
- Monitoring and reporting of the onboard storage included and enabled by default in pfSense;
- Notice on the product pages in the store about the usage limitations of the Base/eMMC storage;
- Clearer warnings in the package documentation about the risks of increased storage wear;
- Warnings about storage wear when installing packages in pfSense;
- Changes to the default pfSense logging settings to match what is commonly recommended for reducing storage wear (disabling default logging etc.);
-
@andrew_cb said in Another Netgate with storage failure, 6 in total so far:
Will these upcoming changes be in 25.03, or will they be another 6 months out until the next release?
Can you confirm that the changes will include:
...
I want to be very clear: I don't speak for Netgate in any way, shape or form. However, honestly, I think you are pushing a little too hard.
Most of the writes come from installed packages. You are forcing a situation in which Netgate may retreat into a corner and say that installation of packages invalidates the hardware warranty. I doubt this is what you, or anyone else, wants.
Maybe take a breath, let it lie for a while, and see what comes.
-
My reading is that the write amplification caused by changing the file system was not accompanied by an increase in the write endurance of the non-volatile memory.
Given that this is purely within Netgate's design decisions, I think it likely they will fix it going forward. They may seek to limit warranty repair costs, though, but that's a marketing decision of reputation damage vs. upfront cost.
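As a rough illustration of the endurance point above, here is a back-of-envelope sketch in Python. Every number in it (capacity, program/erase cycles, daily host writes, write amplification factors) is a hypothetical placeholder chosen for easy arithmetic, not a measured or published figure for any Netgate device or eMMC part.

```python
def emmc_lifetime_days(capacity_gb, pe_cycles, host_writes_gb_per_day, write_amplification):
    """Estimate days until the rated flash endurance is consumed.

    Rated endurance is approximated as capacity * P/E cycles, and the
    flash sees host writes multiplied by the write amplification factor.
    """
    if host_writes_gb_per_day <= 0 or write_amplification <= 0:
        raise ValueError("write rate and amplification must be positive")
    total_endurance_gb = capacity_gb * pe_cycles
    flash_writes_gb_per_day = host_writes_gb_per_day * write_amplification
    return total_endurance_gb / flash_writes_gb_per_day


# Hypothetical device: 16 GB part rated for 3000 P/E cycles, 20 GB/day of host writes.
for waf in (2, 8):
    days = emmc_lifetime_days(16, 3000, 20, waf)
    print(f"write amplification {waf}: about {days:.0f} days")
```

The point of the sketch is only the proportionality: if the amplification factor rises and the endurance rating does not, the estimated lifetime shrinks by the same factor.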
-
@dennypage I understand what you are saying, but I believe that clarity and urgency are needed in this situation.
Having a gray area of "it's not supported but people can get their hardware replaced if it dies under warranty" is not a sustainable situation. If Netgate were to declare that installing packages voids the hardware warranty (whether on Base and/or Max versions), then that is a clear line that users can follow. It would then be less likely that a user would purchase a Netgate device and use it in a manner that would cause the onboard storage to fail before the 1-year warranty elapses, and if that were to happen, there would at least be a clear disclaimer to point to.
Netgate has not released any RMA figures, so it is hard to know the real scope of the problem. Plus, many failures happen after the 1-year warranty, so the failure rate is likely to be severely understated. Do you suspect that many devices fail during the warranty period due to the use of packages?
If the store pages simply had an info bubble that recommended getting the Max version for most use cases, then people would likely buy the Max version - there would be no reason to complain, and storage failures would be rare. Masking the problem behind tribal knowledge and a secret decoder ring is not helping anybody.
Just today I spoke with a user with a 4100 with dead eMMC, and another user that discovered the storage on their 2100 is critically worn. Both were unaware of the issue until they read my posts. Concerns about eMMC storage wearout were raised over 3 years ago. Simply waiting quietly has not improved the situation. How many more devices will be unknowingly purchased and how many more will fail while we continue waiting?
Ask yourself how is it that myself and others are putting in so much effort to spread awareness and help users when so far Netgate's position on the matter is "you're using it wrong."
We should expect Netgate to do more to rectify the problem, not less.
-
FWIW, I carelessly burned through the eMMC on my own 6100. After installing an NVMe drive, I spent some time diving into disk writes to discover where the writes originated from. On my system, it turned out that over 90% of the writes resulted from package operations. Yes, over 90%, and this is what killed my eMMC. Ultimately, I felt that I was responsible for my own decisions in this regard. You may feel differently.
-
@dennypage said in Another Netgate with storage failure, 6 in total so far:
On my system, it turned out that over 90% of the writes resulted from package operations.
Would you be so kind and name the packages?
-
@dennypage said in Another Netgate with storage failure, 6 in total so far:
Ultimately, I felt that I was responsible for my own decisions in this regard.
This is true IF you were aware that using packages could have such a negative impact on the drive, and then you decided to do it anyway.
I bought the 6100 base because I knew I didn’t need a lot of storage space, I thought that was the differentiating factor. I didn’t realize I was getting a neutered device.
-
@fireodo said in Another Netgate with storage failure, 6 in total so far:
Would you be so kind and name the packages?
Probably something on this list ("Storage Requirements" column):
@SteveITS said in Another Netgate with storage failure, 6 in total so far:
Some packages (https://www.netgate.com/supported-pfsense-plus-packages),
-
@fireodo said in Another Netgate with storage failure, 6 in total so far:
Would you be so kind and name the packages?
I can (and will) only speak to the packages that I wrote and/or maintain [Avahi, lldpd, mDNS Bridge, ntopng, nut, and the coming ANDwatch]. Of those, the only one that I would say is a problem would be ntopng, and I recommend using ntopng as a diagnostic tool rather than as a continuous service. FWIW, the need to keep disk writes under control was a significant consideration in ANDwatch development.
There are other commonly used packages that produce significant amounts of disk writes, not all of which are immediately obvious. I believe their maintainers are generally aware of these issues, and are working to address them.
-
@dennypage
Thank you for the explanation!
Regards,
fireodo
-
Something that might help is increasing the default async txg timer, which defaults to 5 seconds. I am about to go to sleep, but on a couple of pfSense VMs I tested the impact of increasing it, and it made a significant dent in the writes logged by the hypervisor for the VM. This has no impact on sync writes.
Or maybe if a UPS is detected via one of the UPS packages, it could reconfigure it or something.
When I wake up, if I remember, I will post the exact tunable to change and the exact savings on writes I got.
-
@chrcoluk said in Another Netgate with storage failure, 6 in total so far:
Something that might help is increasing the default async txg timer, which defaults to 5 seconds.
See here: Tuning
-
@fireodo The txg timeout is the one.
I also configured 'zfs set sync=disabled' as a test and found that it made absolutely no difference, so all, or the vast majority, of the writes must be async.
The txg timeout also doesn't need to go as high as 120; boosting it to 30 is enough.
So my recommendation to the Netgate developers is to keep the zfs sync setting at its default and boost 'vfs.zfs.txg.timeout' to 30.
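To put the timer values mentioned in this thread (5, 30 and 120 seconds) in perspective, here is a small Python sketch that counts how many timer-driven transaction group commits can happen per day at each setting. It assumes a steady trickle of async writes, so that a commit fires every time the timeout expires. The function name is made up for illustration, and the actual reduction in bytes written depends on the workload.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_timed_txg_commits_per_day(txg_timeout_s):
    """Upper bound on commits per day triggered purely by the txg timer."""
    if txg_timeout_s <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return SECONDS_PER_DAY // txg_timeout_s


# Timeout values discussed above: the 5-second default, 30, and 120.
for timeout in (5, 30, 120):
    commits = max_timed_txg_commits_per_day(timeout)
    print(f"vfs.zfs.txg.timeout={timeout}: at most {commits} timed commits per day")
```

Under that assumption, going from 5 to 30 seconds cuts the number of timed commits to one sixth, which is in line with the observation that 30 already captures most of the benefit.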