Another Netgate with storage failure, 6 in total so far
-
@dane_h I used an Intel Optane 16GB drive https://www.ebay.co.uk/itm/395684843954
-
@punting_packets That appears to be the same as this one on Amaz.
-
@dane_h Yep, I'd agree.
-
These eMMC topics have been here for a while. I’ve had two SG-1100 units fail on eMMC, one of which I fixed using an old SSD in a USB enclosure.
It was enough for me to order a Max model (SG-2100 with 128GB SSD preinstalled by Netgate), just to stay on the safe side of things.
-
@dane_h I ordered and installed this one, up and running. Price about the same.
https://www.amazon.com/dp/B08TTDQ5WH?ref=fed_asin_title&th=1
-
@dstaylor FWIW, I use the same one.
-
@andrew_cb
I am beyond frustrated with Netgate. The whole point of buying Netgate, as opposed to using cheap mini PCs and installing pfSense, was to avoid these kinds of surprises. This particular unit is installed at an office with no IT staff on the other side of the world. We might have to send them a new unit, and getting it swapped out may be a challenge.
We'll have to audit the other units that we have deployed (thankfully stateside) and see which ones are eMMC and which are SSD.
-
@andrew_cb and @stephenw10 Some questions:
#1 Is every eMMC-equipped Netgate prone to be affected? Or are there just a limited number of occurrences?
#2 Does the eMMC production series have any influence, or is it simply more writes = issue?
A good friend of mine is running a 4100 base. He believes he’s fine regarding the eMMC issue because he doesn’t do much logging. I don’t believe he even checks his eMMC health periodically; he’s not concerned about it.
-
@Cabledude said in Another Netgate with storage failure, 6 in total so far:
doesn’t do much logging
It's very relative hence the (my) list of mitigating settings above. The default deny rules log. Is pfSense behind an ISP router that blocks incoming? Suricata logs HTTP requests. Some people leave the dashboard open which logs every web request for each widget update. pfBlocker DNSBL logs DNS requests, and a few feeds like UT1 are gigantic.
My 2100 at home is from October 2020 and it shows 10% used:
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
eMMC (as a technology) has fewer "disk writes per day" than SSD. It is also usually much smaller. So writing (completely making up a number here) 5 GB per day has way more impact on an 8 GB eMMC than a 128 GB SSD. Which, overall, is the point of this thread.
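For anyone reading these fields for the first time, here is a minimal Python sketch of how the three values quoted above are usually interpreted. It assumes the standard JEDEC eMMC 5.0+ encoding (each life-time step covers 10% of estimated life, and 0x0B means the estimate has been exceeded). The function names `parse_health`, `decode_life_time`, and `decode_pre_eol` are made up for illustration and are not part of pfSense or any Netgate tool.

```python
import re

# Sample text copied from the post above (assumed output format).
SAMPLE = """\
eMMC Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
eMMC Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x01
eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01
"""

# Assumed meaning of the PRE_EOL_INFO values per the JEDEC eMMC spec.
PRE_EOL = {
    0x00: "not defined",
    0x01: "normal",
    0x02: "warning (about 80% of reserved blocks consumed)",
    0x03: "urgent (about 90% of reserved blocks consumed)",
}


def parse_health(text: str) -> dict[str, int]:
    """Extract {field_name: value} pairs from the health report text."""
    pattern = re.compile(r"\[(EXT_CSD_[A-Z_]+)\]:\s*(0x[0-9a-fA-F]+)")
    return {name: int(value, 16) for name, value in pattern.findall(text)}


def decode_life_time(value: int) -> str:
    """Translate a DEVICE_LIFE_TIME_EST value into a wear range."""
    if value == 0x00:
        return "not defined"
    if 0x01 <= value <= 0x0A:
        return f"{(value - 1) * 10}-{value * 10}% of estimated life used"
    if value == 0x0B:
        return "exceeded maximum estimated life"
    return "reserved"


def decode_pre_eol(value: int) -> str:
    """Translate a PRE_EOL_INFO value into a status string."""
    return PRE_EOL.get(value, "reserved")


fields = parse_health(SAMPLE)
for name, value in fields.items():
    if "LIFE_TIME" in name:
        print(name, "->", decode_life_time(value))
    else:
        print(name, "->", decode_pre_eol(value))
```

Read this way, 0x01 means somewhere between 0% and 10% of the estimated life has been used, so "10% used" is the upper bound of that step.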
-
Yup, the larger the drive, the fewer writes each individual cell sees for a fixed total amount written to the drive. So larger drives are less affected.
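To put rough numbers on that, here is a back-of-the-envelope sketch using the made-up 5 GB per day figure from the post above. It assumes ideal wear leveling and ignores write amplification and over-provisioning, so treat it as a comparison of scale, not a lifetime prediction. The function name `drive_writes_per_day` is just for illustration.

```python
def drive_writes_per_day(gb_written_per_day: float, capacity_gb: float) -> float:
    """Full-device overwrites per day, assuming writes spread evenly over all cells."""
    return gb_written_per_day / capacity_gb


DAILY_WRITES_GB = 5  # the made-up figure from the post above

for label, capacity_gb in (("8 GB eMMC", 8), ("128 GB SSD", 128)):
    dwpd = drive_writes_per_day(DAILY_WRITES_GB, capacity_gb)
    print(f"{label}: {dwpd:.3f} full-drive writes per day")

# Same workload, 16x the capacity: each cell sees 1/16 of the writes.
ratio = drive_writes_per_day(DAILY_WRITES_GB, 8) / drive_writes_per_day(DAILY_WRITES_GB, 128)
print(f"The 8 GB device is cycled {ratio:.0f}x as often as the 128 GB one")
```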
-
Wildly speculating around the 4100, it appears enough damage to the eMMC can brick the boxes too!
One of mine won't POST now, precluding my ability to install NVMe at all. The LEDs on the board indicate activity on one flash drive after the reset indicator flickers, without any console output or getting past the orange circle of death. This is even after pulling the CMOS battery, NVMe, etc.
-
Of the confirmed eMMC failures we've seen, most do not fail like that. In fact, I don't think we've seen a single failure that presented like that in person. There was one user here on the forum who reported that removing the eMMC chip allowed it to boot from NVMe. So far that is unconfirmed, though. So it could be some other failure.
-
@stephenw10 Figures I'd be a unicorn. To be fair, I suspect once a unit is deemed a brick, they probably seldom make it back to your bench from customers unless they're in the short window.
-
Indeed if it fails to POST entirely it's difficult to confirm any sort of cause.
-
@stephenw10 But hey, thanks for the reminder this box is now into the "can't hurt to try" zone where I get to play with solder! Now to go find my magnifier and a certain flash chip ;)
-
This morning, I was surprised to find that my threads in /r/pfsense and /r/netgate have been deleted. Fortunately, I still have screenshots of an interesting post from kphillips-netgate (@kphillips), in which he says:
...I think it's important to err on the side of letting people discuss things without overbearing moderation unless it becomes necessary...
Interesting. Can you find out why both Reddit threads were deleted and who made that decision?
...[I] process RMA support tickets for devices every day...
...Beware of confirmation bias...
These are good points to keep in mind as they tie into his next statement:
I haven't seen any particularly unusual numbers of RMAs for any particularly product in our lineup.
What specifically is considered to be an RMA? Is this all claims for RMA, or only claims that have been accepted as warranty? What is the ratio of approved RMAs versus RMAs denied because they are past the 1-year warranty? It appears that a significant number of posts are about devices that failed after the 1-year warranty, and many users have had their warranty claim rejected or did not bother contacting Netgate since they were out of warranty. This would significantly alter the number and ratio of RMA claims.
...I'm not trying to admonish or belittle anybody here...
...sometime's it's totally by accident and I'm not trying to "blame shift" here...
I am glad to hear that you personally are not trying to admonish, belittle, or blame shift, but...
...You will only see people who ran into issues...
It seems that others do not share your view of giving the user the benefit of the doubt. In all the posts made by users who encountered storage failure, what we do see is Netgate consistently blaming the user for causing the storage failure, and never apologizing or showing empathy.
We have a page outlining many of the packages that need an SSD here (linked to the page).
I have repeatedly mentioned that this page is NOT linked anywhere. The only way you can be aware of its existence is by searching or following a link from someone. It is impossible for a purchaser to be aware of a) storage wear caused by high writes, b) what packages and settings can cause high writes, and c) the decision criteria for choosing a MAX model. If anyone can show me a direct link to the "Supported pfSense Plus Packages" page on the Netgate website, I will be forever humbled.
I run a 6100 as my edge for work at Netgate with on eMMC (no NVME SSD installed) and it gets worked.....HARD. It's my only 6100 I have and I use it for new release building, bug testing, package testing, and much more. It has been in continuous operation for about 3 years with little to know downtime. [screenshots showing eMMC health at 0x05, 0x06, 0x01, and a manufacturing date of 06/2022]
These VERY interesting data points are provided to us by a Netgate employee. Remember what another Netgate employee @jwt said in post #34?
...the principle difference between eMMC and NVMe or SSD device is the amount of flash present on a typical eMMC .vs SSD or NVMe drive...
Let us analyze these statements further.
jwt asserts that there is no significant difference between eMMC, SSD, or NVMe other than the total amount of flash. Okay, so then eMMC is not a technologically inferior storage medium.
Now, kphillips helpfully provides some evidence of how he has a 6100 with eMMC storage and he has worked it "HARD" for 3 years and yet the storage wear is only at 50-60%.
This is where things start getting confusing...
If the statement made by jwt is correct, and the data supplied by kphillips is true and representative of the durability of Netgate devices with eMMC storage, then why are so many users experiencing eMMC failure in under 3 years? And why is there a presumption that the user is at fault for causing the failure? And why does Netgate never express concern about these "rare" incidents? If kphillips were to create a post about his 6100 dying under these conditions, I wonder what kind of response he would receive? Is kphillips trying to prove that the 16GB of onboard eMMC storage actually is fine when "worked HARD"?
On the other hand, if kphillips' data is simply anecdotal and does not represent what a user can expect from the onboard eMMC storage, then it is irrelevant to this discussion.
So, which is it?
Is eMMC endurance similar to SSD/NVMe, so that Netgate devices with eMMC should be durable enough to handle package usage as kphillips demonstrated? In that case the failed devices were simply equipped with faulty eMMC, the users did nothing wrong, and it is just tough luck that the failure occurred after the 1-year warranty expired;
--OR--
Is the 16GB of onboard eMMC storage a known weak point that needs to be clearly identified, with the user provided abundant warnings on the product page, on the package manager page, whenever installing a package, and on the settings page of each package?
-
@stephenw10 There have been a few posts now on here and Reddit about successfully reviving a dead Netgate by removing the eMMC. I have a 2-year-old 4100 here that had the eMMC die. I got it working on a USB flash drive, only for it to go completely dead after restoring the config.
I have not had time to dig into it further, but if it truly is dead then I will try removing the eMMC chips to see if that revives it.
-
Here are screenshots of @kphillips's post for reference:
-
@andrew_cb I reached out to kphillips-netgate last night on Reddit and suggested we have a call to discuss this situation further, clarify the issues, and hopefully identify solutions. He asked for clarification of what threads I was referring to, and I sent him links to this thread and two others.
He has not gotten back to me yet, so I will update here when/if I receive any further responses from him.