Netgate Discussion Forum

    Another Netgate with storage failure, 6 in total so far

    Official Netgate® Hardware
    • P
      Patch @dennypage

      My reading is that the write amplification caused by changing the file system was not accompanied by an increase in the write endurance of the non-volatile memory.

      Given that this is purely within Netgate's design decisions, I think it likely they will fix it going forward. They may seek to limit warranty repair costs, though, but that's a reputation-damage vs. upfront-cost marketing decision.

      • A
        andrew_cb @dennypage

        @dennypage I understand what you are saying, but I believe that clarity and urgency are needed in this situation.

        Having a gray area of "it's not supported but people can get their hardware replaced if it dies under warranty" is not a sustainable situation. If Netgate were to declare that installing packages voids the hardware warranty (whether on Base and/or Max versions), then that is a clear line that users can follow. It would then be less likely that a user would purchase a Netgate device and use it in a manner that would cause the onboard storage to fail before the 1-year warranty elapses, and if that were to happen, there would at least be a clear disclaimer to point to.

        Netgate has not released any RMA figures, so it is hard to know the real scope of the problem. Plus, many failures happen after the 1-year warranty, so the failure rate is likely to be severely understated. Do you suspect that many devices fail during the warranty period due to the use of packages?

        If the store pages simply had an info bubble that recommended getting the Max version for most use cases, then people would likely buy the Max version - there would be no reason to complain, and storage failures would be rare. Masking the problem behind tribal knowledge and a secret decoder ring is not helping anybody.

        Just today I spoke with a user with a 4100 with dead eMMC, and another user that discovered the storage on their 2100 is critically worn. Both were unaware of the issue until they read my posts. Concerns about eMMC storage wearout were raised over 3 years ago. Simply waiting quietly has not improved the situation. How many more devices will be unknowingly purchased and how many more will fail while we continue waiting?

        Ask yourself how it is that others and I are putting in so much effort to spread awareness and help users when, so far, Netgate's position on the matter is "you're using it wrong."
        We should expect Netgate to do more to rectify the problem, not less.

        • dennypageD
          dennypage @andrew_cb

          FWIW, I carelessly burned through the eMMC on my own 6100. After installing an NVMe drive, I spent some time digging into disk writes to discover where they originated. On my system, it turned out that over 90% of the writes resulted from package operations. Yes, over 90%, and this is what killed my eMMC. Ultimately, I felt that I was responsible for my own decisions in this regard. You may feel differently.

          • fireodoF
            fireodo @dennypage

            @dennypage said in Another Netgate with storage failure, 6 in total so far:

            On my system, it turned out that over 90% of the writes resulted from package operations.

            Would you be so kind and name the packages?

            Kettop Mi4300YL CPU: i5-4300Y @ 1.60GHz RAM: 8GB Ethernet Ports: 4
            SSD: SanDisk pSSD-S2 16GB (ZFS) WiFi: WLE200NX
            pfsense 2.7.2 CE
            Packages: Apcupsd Cron Iftop Iperf LCDproc Nmap pfBlockerNG RRD_Summary Shellcmd Snort Speedtest System_Patches.

            • K
              kingsleyadam @dennypage

              @dennypage said in Another Netgate with storage failure, 6 in total so far:

              Ultimately, I felt that I was responsible for my own decisions in this regard.

              This is true IF you were aware that using packages could have such a negative impact on the drive, and then you decided to do it anyway.

              I bought the 6100 base because I knew I didn’t need a lot of storage space; I thought that was the differentiating factor. I didn’t realize I was getting a neutered device.

              • S
                SteveITS Galactic Empire @SteveITS

                @fireodo said in Another Netgate with storage failure, 6 in total so far:

                Would you be so kind and name the packages?

                Probably something on this list ("Storage Requirements" column):

                @SteveITS said in Another Netgate with storage failure, 6 in total so far:

                Some packages (https://www.netgate.com/supported-pfsense-plus-packages),

                Pre-2.7.2/23.09: Only install packages for your version, or risk breaking it. Select your branch in System/Update/Update Settings.
                When upgrading, allow 10-15 minutes to restart, or more depending on packages and device speed.
                Upvote 👍 helpful posts!

                • dennypageD
                  dennypage @fireodo

                  @fireodo said in Another Netgate with storage failure, 6 in total so far:

                  Would you be so kind and name the packages?

                  I can (and will) only speak to the packages that I wrote and/or maintain [Avahi, lldpd, mDNS Bridge, ntopng, nut, and the coming ANDwatch]. Of those, the only one that I would say is a problem would be ntopng, and I recommend using ntopng as a diagnostic tool rather than as a continuous service. FWIW, the need to keep disk writes under control was a significant consideration in ANDwatch development.

                  There are other commonly used packages that produce significant amounts of disk writes, not all of which are immediately obvious. I believe their maintainers are generally aware of these issues, and are working to address them.

                  • fireodoF
                    fireodo @dennypage

                    @dennypage
                    Thank you for the explanation!
                    Regards,
                    fireodo


                    • C
                      chrcoluk

                      Something that might help is increasing the default async txg timer, which defaults to 5 seconds. I am about to go to sleep, but on a couple of pfSense VMs I tested the impact of increasing it, and it made a significant dent in the writes logged by the hypervisor for the VM. This has no impact on sync writes.
                      Or maybe, if a UPS is detected via one of the UPS packages, the system could reconfigure the timer, or something.

                      When I wake up, if I remember, I will post the exact tunable to change and the exact savings on writes I got.

                      pfSense CE 2.7.2

                      • fireodoF
                        fireodo @chrcoluk

                        @chrcoluk said in Another Netgate with storage failure, 6 in total so far:

                        Something that might help is increasing the default async txg timer, which defaults to 5 seconds.

                        See here: Tuning


                        • C
                          chrcoluk @fireodo

                          @fireodo The txg timeout is the one.

                          I also set 'zfs set sync=disabled' as a test and found that it made absolutely no difference, so all the writes, or the vast majority, must be async.

                          The txg timeout also doesn't need to go as high as 120; boosting it to 30 is enough.

                          So my recommendation to Netgate developers is to keep the ZFS sync setting at its default and boost 'vfs.zfs.txg.timeout' to 30.
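
                          For anyone who wants to try this, a minimal sketch from a shell on the firewall (Diagnostics > Command Prompt or SSH) could look like the following. It assumes a ZFS install. A change made this way lasts only until the next reboot; adding the same name and value under System > Advanced > System Tunables is presumably how it would be made persistent.

                          ```shell
                          # Show the current transaction group (txg) timeout in seconds.
                          sysctl vfs.zfs.txg.timeout

                          # Raise it to 30 seconds on the running system (not persistent).
                          sysctl vfs.zfs.txg.timeout=30
                          ```

                          The trade-off is that up to roughly 30 seconds of async writes can be lost on a sudden power cut, which is why a UPS was brought up earlier.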


                          • J
                            Jare 0 @dennypage

                            @dennypage said in Another Netgate with storage failure, 6 in total so far:

                            FWIW, I carelessly burned through the eMMC on my own 6100. After installing an NVMe drive, I spent some time digging into disk writes to discover where they originated. On my system, it turned out that over 90% of the writes resulted from package operations. Yes, over 90%, and this is what killed my eMMC. Ultimately, I felt that I was responsible for my own decisions in this regard. You may feel differently.

                            Don't kick yourself. I have two 6100's that couldn't be more vanilla, zero packages from day one and they only push what I would consider to be light traffic for these units. Over 100% used up...

                            6100_wear.png

                            Probably the default logging rules and ZFS writes did mine in, but I'm not nearly qualified enough to stand by that statement. One is just over 3 years and the other is 2 years. The newer one has some general system logs that look suspect (missing file errors) but again I don't really know what I'm looking at. I will install SSDs in both and hopefully move on.

                            It would have been nice if there was a doc outlining optimal setup for a base model, or some sort of warning about the limitation of the eMMC. No doubt I would have ponied up the extra 100 per unit to get the max version. It's a shame; these units just chug along, I would even dare to say bulletproof. That opinion took a little hit after this experience...

                            • M
                              Mission-Ghost @Jare 0

                              @Jare-0 said in Another Netgate with storage failure, 6 in total so far:

                              FWIW, I carelessly burned through the eMMC on my own 6100.

                              I'm all for people taking responsibility for their actions when they should probably know their actions will have adverse consequences and they've been warned or could reasonably figure out what they're about to do is damaging.

                              I'm happy with my Netgate products and pfSense. But it's not reasonable to expect people to know better or be responsible for their actions when an ordinary and customary use for a computing device (including documented packages) can run it to failure in barely enough time for the warranty to run out.

                              Louis Rossmann would love this.

                              • A
                                andrew_cb @Jare 0

                                @Jare-0 said in Another Netgate with storage failure, 6 in total so far:

                                I have two 6100's that couldn't be more vanilla, zero packages from day one and they only push what I would consider to be light traffic for these units. Over 100% used up...

                                Nearly all of our devices are the same - very basic, with only the Zabbix package for monitoring. We are seeing eMMC wearout after 2 to 3 years in service.

                                Probably the default logging rules and ZFS writes did mine in, but I'm not nearly qualified enough to stand by that statement. One is just over 3 years and the other is 2 years. The newer one has some general system logs that look suspect (missing file errors) but again I don't really know what I'm looking at.

                                Our data shows that devices using ZFS have an average write rate 2.5 to 6.5 times higher than devices using UFS, so that appears to be what is wearing out the eMMC. This is further supported by the fact that our old 3100 and 7100 devices using UFS, which are 6 to 7 years old, are still under 50% wear, while our newer 4100 and 6100 devices with ZFS are at 100%+ in under 3 years.
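
                                As a rough cross-check of those figures, the wear percentages and ages quoted above can be turned into a wear rate per year. This is only a back-of-the-envelope sketch: it assumes wear accumulates linearly and that the eMMC parts in the different models have comparable endurance, neither of which is confirmed here, and wear_per_year is a hypothetical helper name.

                                ```shell
                                #!/bin/sh
                                # Percent of rated eMMC life consumed per year, assuming linear wear.
                                # $1 = reported wear in percent, $2 = years in service.
                                wear_per_year() {
                                    awk -v wear="$1" -v years="$2" 'BEGIN { printf "%.1f\n", wear / years }'
                                }

                                # UFS devices: "under 50%" after 6 years, so this is an upper bound.
                                ufs_rate=$(wear_per_year 50 6)
                                # ZFS devices: "100%+" in under 3 years, so this is a lower bound.
                                zfs_rate=$(wear_per_year 100 3)

                                echo "UFS: at most ${ufs_rate}% per year"
                                echo "ZFS: at least ${zfs_rate}% per year"
                                awk -v z="$zfs_rate" -v u="$ufs_rate" 'BEGIN { printf "ratio: at least %.1fx\n", z / u }'
                                ```

                                Under those assumptions UFS comes out at no more than about 8.3% per year and ZFS at no less than about 33.3% per year, a ratio of at least roughly 4x, which sits inside the 2.5 to 6.5 times range measured for write rates.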

                                It would have been nice if there was a doc outlining optimal setup for a base model, or some sort of warning about the limitation of the eMMC.

                                Word is that changes are in the works so we can look forward to that.

                                No doubt I would have ponied up the extra 100 per unit to get the max version.

                                I think many others feel the same way, since the cost of an SSD is a fraction of the cost of a failure. An SSD is essentially required one way or the other, so there is no downside to getting the Max version.

                                I suggest that it makes more sense to consider the "Max" to be the regular version, and the Base is really more of a "Lite" since it cannot perform most pfSense functions without significant compromises.

                                • A
                                  andrew_cb @Mission-Ghost

                                  @Mission-Ghost said in Another Netgate with storage failure, 6 in total so far:

                                  @Jare-0 said in Another Netgate with storage failure, 6 in total so far:

                                  FWIW, I carelessly burned through the eMMC on my own 6100.

                                  I'm all for people taking responsibility for their actions when they should probably know their actions will have adverse consequences and they've been warned or could reasonably figure out what they're about to do is damaging.

                                  I'm happy with my Netgate products and pfSense. But it's not reasonable to expect people to know better or be responsible for their actions when an ordinary and customary use for a computing device (including documented packages) can run it to failure in barely enough time for the warranty to run out.

                                  I fully agree.

                                  • J
                                    jared.silva

                                    Apparently I migrated to a USB thumb drive just in time. I rebooted my Netgate 1100 the other day, and occasionally it does not recognize the USB thumb drive it is now installed to. I tried to boot from the eMMC and it no longer can.

                                    • S
                                      serbus

                                      Hello!

                                      Is the 4200 BASE with only eMMC still for sale?
                                      I only see the 4200 MAX (with NVMe SSD) available.
                                      Also, I just bought a 4200 MAX a week or so ago for $649. The MAX is now $599???

                                      John

                                      Lex parsimoniae

                                      • A
                                        andrew_cb @jared.silva

                                        @jared-silva Did you clean the eMMC as per these steps when you installed the USB drive? If not, then your 1100 might still have been booting from the eMMC, which would explain why it doesn't recognize the USB drive.

                                        • J
                                          jared.silva @andrew_cb

                                          @andrew_cb Thanks, I was aware of wiping the eMMC but I was not aware of this page. I avoided wiping it in case things went wrong in the migration. I ran the following commands to set the boot order to USB and then eMMC when migrating:

                                          Marvell>> setenv bootcmd 'run usbboot; run emmcboot;'
                                          Marvell>> saveenv
                                          Saving Environment to SPI Flash... SF: Detected mx25u3235f with page size 256 Bytes, erase size 64 KiB, total 4 MiB
                                          Erasing SPI flash...Writing to SPI flash...done
                                          OK
                                          Marvell>> run usbboot
                                          

                                          There are times when the USB is not detected (due to timing?) so it will then try to boot from the eMMC.

                                          Current situation is:

                                          zpool status
                                            pool: pfSense
                                           state: ONLINE
                                          status: Some supported and requested features are not enabled on the pool.
                                                  The pool can still be used, but some features are unavailable.
                                          action: Enable all features using 'zpool upgrade'. Once this is done,
                                                  the pool may no longer be accessible by software that does not support
                                                  the features. See zpool-features(7) for details.
                                          config:
                                          
                                                  NAME        STATE     READ WRITE CKSUM
                                                  pfSense     ONLINE       0     0     0
                                                    da0p3     ONLINE       0     0     0
                                          
                                          geom -t
                                          Geom                  Class      Provider
                                          flash/spi0            DISK       flash/spi0
                                            flash/spi0          DEV
                                          mmcsd0                DISK       mmcsd0
                                            mmcsd0              DEV
                                            mmcsd0              PART       mmcsd0s1
                                              mmcsd0s1          DEV
                                              mmcsd0s1          LABEL      msdosfs/EFISYS
                                                msdosfs/EFISYS  DEV
                                            mmcsd0              PART       mmcsd0s2
                                              mmcsd0s2          DEV
                                              mmcsd0s2          LABEL      msdosfs/DTBFAT0
                                                msdosfs/DTBFAT0 DEV
                                            mmcsd0              PART       mmcsd0s3
                                              mmcsd0s3          DEV
                                              mmcsd0s3          PART       mmcsd0s3a
                                                mmcsd0s3a       DEV
                                          mmcsd0boot0           DISK       mmcsd0boot0
                                            mmcsd0boot0         DEV
                                          mmcsd0boot1           DISK       mmcsd0boot1
                                            mmcsd0boot1         DEV
                                          da0                   DISK       da0
                                            da0                 DEV
                                            da0                 PART       da0p1
                                              da0p1             DEV
                                              da0p1             LABEL      gpt/efiboot1
                                                gpt/efiboot1    DEV
                                            da0                 PART       da0p2
                                              da0p2             DEV
                                            da0                 PART       da0p3
                                              da0p3             DEV
                                              zfs::vdev         ZFS::VDEV
                                          

                                          I am confused as to what commands to run from Wipe Metadata, as the examples don't seem to match the Using the Geom Tree section. gmirror status has no output.

                                          I take it I should run:

                                          zpool labelclear -f /dev/mmcsd0   # example has /dev/mmcsd0p4
                                          gpart destroy -F mmcsd0
                                          dd if=/dev/zero of=/dev/mmcsd0 bs=1M count=1 status=progress
                                          

                                          ?

                                          Thanks!

                                          • stephenw10S
                                            stephenw10 Netgate Administrator @jared.silva

                                            @jared-silva said in Another Netgate with storage failure, 6 in total so far:

                                            zpool labelclear -f /dev/mmcsd0 (example has /dev/mmcsd0p4)

                                            You don't have an mmcsd0p4 device. gpart list would show you if you need to run that and on what. You may not have ZFS there.
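
                                            A read-only way to check this before wiping anything might look like the following. It is a sketch that assumes the eMMC is mmcsd0, as in the geom -t output quoted earlier in the thread. The zdb step is an extra check not mentioned above, and mmcsd0s3a is taken from that same output.

                                            ```shell
                                            # Show the partition layout of the eMMC (read-only).
                                            gpart show mmcsd0

                                            # List every partition with its type and label details (read-only).
                                            gpart list mmcsd0

                                            # Assumption: mmcsd0s3a is the innermost partition seen in the geom output.
                                            # zdb -l prints any ZFS labels found on the device; if it cannot read
                                            # any, there is no ZFS label there for zpool labelclear to clear.
                                            zdb -l /dev/mmcsd0s3a
                                            ```

                                            None of these commands write to the disk, so they are safe to run before deciding which of the wipe steps apply.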

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.