Netgate Discussion Forum

    Extreme SSD wear on Netgate 8200 MAX - ntopng likely culprit with ZFS

    8 Posts 5 Posters 645 Views
      axellarsson

      I replaced an older SG-5100 with an 8200 MAX just shy of two years ago. Before that, I ran the SG-5100 on its default eMMC storage for 4 years, including write-heavy packages like pfBlockerNG and ntopng, without any issues.

      Due to some of the posts I've been seeing on disk wear, I decided to look at my S.M.A.R.T stats for the 8200 this weekend, and was alarmed by what I saw with only 2 years of usage:

      === START OF SMART DATA SECTION ===
      SMART overall-health self-assessment test result: PASSED
      
      SMART/Health Information (NVMe Log 0x02)
      Critical Warning:                   0x00
      Temperature:                        38 Celsius
      Available Spare:                    100%
      Available Spare Threshold:          1%
      Percentage Used:                    99%
      Data Units Read:                    450,632 [230 GB]
      Data Units Written:                 367,252,380 [188 TB]
      Host Read Commands:                 4,492,871
      Host Write Commands:                3,452,232,979
      Controller Busy Time:               31,299
      Power Cycles:                       38
      Power On Hours:                     6,553
      Unsafe Shutdowns:                   17
      Media and Data Integrity Errors:    0
      Error Information Log Entries:      128
      Warning  Comp. Temperature Time:    0
      Critical Comp. Temperature Time:    0
      Temperature Sensor 1:               63 Celsius
      Temperature Sensor 2:               38 Celsius
      Temperature Sensor 3:               38 Celsius
      Temperature Sensor 4:               37 Celsius
      

      188 TB written, 99% of lifetime used, but it appears that the SSD has not tapped into any spare capacity yet. The disk is very under-utilized, with < 6 GB used of the 128 GB SSD that is included in the 8200 MAX, and it appears that Netgate configured the ZFS pools in these things with autotrim turned on, so this is probably giving the SSD controller plenty of room for wear leveling.
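As a sanity check on the totals above: the NVMe specification reports "Data Units" in thousands of 512-byte blocks, so one unit is 512,000 bytes, and smartctl derives the bracketed figures from that. A minimal Python sketch of the conversion, using the counters from the output above (the function and variable names are only illustrative):

```python
# One NVMe "data unit" is 1,000 x 512-byte blocks = 512,000 bytes.
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_bytes(units: int) -> int:
    """Convert an NVMe SMART 'Data Units' counter to bytes."""
    return units * BYTES_PER_DATA_UNIT


# Counters copied from the SMART output in this post.
written = data_units_to_bytes(367_252_380)
read = data_units_to_bytes(450_632)

print(f"written: {written / 1e12:.1f} TB")
print(f"read:    {read / 1e9:.1f} GB")
```

The results line up with the 188 TB and 230 GB that smartctl prints in brackets, so the bracketed totals are consistent with the raw counters.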

      I believe the culprit for the excessive writing is ntopng. When I disable ntopng, iostat -x shows kw/s dropping from 4000+ to 100-300. I've disabled ntopng for now to limit further wear until I have this figured out.
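To put the iostat figures in perspective, here is a rough Python sketch that turns a sustained kw/s rate into terabytes per year. It assumes the rate holds around the clock and treats a kilobyte as 1,000 bytes (an assumption; using 1,024 raises the results by about 2%). The function name is only illustrative.

```python
SECONDS_PER_DAY = 86_400


def tb_per_year(kb_per_second: float) -> float:
    """Sustained write rate in kB/s -> terabytes per year (1 kB = 1,000 bytes assumed)."""
    return kb_per_second * 1000 * SECONDS_PER_DAY * 365 / 1e12


# Rates quoted in the post; the "disabled" range is shown at both ends.
for label, rate in [
    ("ntopng enabled  (~4000 kw/s)", 4000),
    ("ntopng disabled (~100 kw/s)", 100),
    ("ntopng disabled (~300 kw/s)", 300),
]:
    print(f"{label}: {tb_per_year(rate):.1f} TB/year")
```

A sustained 4000 kB/s works out to well over 100 TB per year, which is the same order of magnitude as the 188 TB reported after just under two years.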

      I suspect I never ran into any issues with eMMC wear-out on the old SG-5100 because that used a UFS filesystem and IIRC on those smaller systems, the ramdisks were set up by default.

      Questions for the forum:

      1. Has anyone else seen SSD wear this extreme on one of the MAX systems and what should I expect in terms of the lifetime of this SSD? Am I already on borrowed time or should I take some comfort that it hasn't tapped into the spare yet?
      2. Netgate says these devices aren't user replaceable, but if this SSD is already running on borrowed time, I obviously don't want to replace a firewall that is less than 2 years old just to swap out the SSD. Has anyone replaced one of these drives and can share their experience?
      3. Any recommendations to reduce writes from ntopng? I already have timeseries retention set to 7 days (holdover from when I migrated the config from the SG-5100 which had very limited space). This is a decent sized home network and homelab with about 115 devices. Total size of /var/db/ntopng is < 600MB so it seems like we are seeing some extreme write amplification here.
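For a sense of scale on the last question, a rough Python sketch comparing total host writes with the size of the ntopng data directory. It assumes, for illustration only, that all 188 TB came from ntopng, and it uses 600 MB as the directory size even though the post gives that as an upper bound. This is not the NAND-level write amplification factor, just a ratio of bytes written to bytes stored.

```python
total_written_bytes = 188e12   # "188 TB" from the SMART output
ntopng_dir_bytes = 600e6       # "< 600MB" quoted for /var/db/ntopng

# How many times over the data directory's worth of bytes has been written.
rewrite_ratio = total_written_bytes / ntopng_dir_bytes

print(f"host writes are about {rewrite_ratio:,.0f}x the size of the ntopng data directory")
```

In other words, the equivalent of the whole directory has been written hundreds of thousands of times.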
        stephenw10 Netgate Administrator

        For comparison on a test 8200 here I see:

        === START OF SMART DATA SECTION ===
        SMART overall-health self-assessment test result: PASSED
        
        SMART/Health Information (NVMe Log 0x02)
        Critical Warning:                   0x00
        Temperature:                        42 Celsius
        Available Spare:                    100%
        Available Spare Threshold:          1%
        Percentage Used:                    1%
        Data Units Read:                    177,777 [91.0 GB]
        Data Units Written:                 6,000,284 [3.07 TB]
        Host Read Commands:                 3,006,035
        Host Write Commands:                261,017,481
        Controller Busy Time:               501
        Power Cycles:                       301
        Power On Hours:                     6,553
        Unsafe Shutdowns:                   236
        Media and Data Integrity Errors:    0
        Error Information Log Entries:      117
        Warning  Comp. Temperature Time:    0
        Critical Comp. Temperature Time:    0
        Temperature Sensor 1:               66 Celsius
        Temperature Sensor 2:               42 Celsius
        Temperature Sensor 3:               42 Celsius
        Temperature Sensor 4:               43 Celsius
        

        One notable thing is that the power-on hours value is exactly the same as yours, which seems very unlikely.

        But certainly yours is writing at a much higher rate compared to the read values.
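The comparison can be made concrete with a short Python sketch that computes the written-to-read ratio from the "Data Units" counters in the two SMART outputs. The variable names are only illustrative.

```python
def write_read_ratio(written_units: int, read_units: int) -> float:
    """Ratio of NVMe 'Data Units Written' to 'Data Units Read'."""
    return written_units / read_units


# Counters copied from the two SMART outputs in this thread.
production_8200 = write_read_ratio(367_252_380, 450_632)
test_8200 = write_read_ratio(6_000_284, 177_777)

# How much more the production unit has written than the test unit.
written_factor = 367_252_380 / 6_000_284

print(f"production 8200 MAX: {production_8200:.0f} units written per unit read")
print(f"test 8200:           {test_8200:.1f} units written per unit read")
print(f"total written:       {written_factor:.0f}x the test unit")
```

Both systems write far more than they read, which is normal for a firewall, but the production unit's ratio is more than an order of magnitude higher.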

        Replacing those SSDs is not that hard if you have some experience IMO.

          keyser Rebel Alliance @axellarsson

          @axellarsson If you are even a little handy with a screwdriver, replacing the SSD is very simple. I found a YouTube guide a long time ago when I added an SSD to my 6100 (same chassis as the 8200).
          Also, the SSDs for this thing are dirt cheap. I bought a 512 GB SSD to avoid wearing it out too quickly, and it set me back less than $100.

          Love the no fuss of using the official appliances :-)

            JonathanLee

            Good job noticing this; that is extreme use. I can safely say SSDs can do some work, wow!

            Make sure to upvote

              johnpoz LAYER 8 Global Moderator @stephenw10

              @stephenw10 said in Extreme SSD wear on Netgate 8200 MAX - ntopng likely culprit with ZFS:

              Power On Hours: 6,553

              That has to be some sort of glitch; he stated he switched to this a little over 2 years ago. Well, that many hours is only 273 days. Nowhere close to 2 years, and I doubt it was turned off more days than it's been on in a 2 year period ;)
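The arithmetic behind that, as a small Python sketch. The "expected" figure assumes roughly two years of continuous uptime, which is an approximation of what the original post describes.

```python
HOURS_PER_DAY = 24

reported_hours = 6_553                       # Power On Hours from the SMART output
reported_days = reported_hours / HOURS_PER_DAY

expected_hours = 2 * 365 * HOURS_PER_DAY     # about two years powered on

print(f"reported: {reported_hours} h = {reported_days:.0f} days")
print(f"expected: about {expected_hours} h for two years of uptime")
```

The reported value is well under half of what two years of near-continuous operation would produce.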


                axellarsson @johnpoz

                Yes, the Power On Hours value is incorrect. It seems to be an issue in the SMART reporting, as it exactly matches the number Stephen reported above from a test system.

                The firewall has been up since I put it in service in April 2023, with reboots only for pfSense upgrades.

                  axellarsson

                  My firewall just locked up hard (no ping response and no response from the serial console) and required a power cycle to reboot. Interestingly, after the reboot, the S.M.A.R.T stats show a reduction of more than half for data units read and written, as well as for the host read/write commands. These S.M.A.R.T results are from less than an hour after the reboot, so clearly these are not "since reboot" values.

                  I also note that the Power On Hours value is stuck exactly where it was before, lending credence to it not being a reliable number.

                  Any thoughts as to what is going on here?

                  === START OF SMART DATA SECTION ===
                  SMART overall-health self-assessment test result: PASSED
                  
                  SMART/Health Information (NVMe Log 0x02)
                  Critical Warning:                   0x00
                  Temperature:                        40 Celsius
                  Available Spare:                    100%
                  Available Spare Threshold:          1%
                  Percentage Used:                    100%
                  Data Units Read:                    138,622 [70.9 GB]
                  Data Units Written:                 113,308,631 [58.0 TB]
                  Host Read Commands:                 2,535,666
                  Host Write Commands:                1,142,150,528
                  Controller Busy Time:               18,241
                  Power Cycles:                       39
                  Power On Hours:                     6,553
                  Unsafe Shutdowns:                   18
                  Media and Data Integrity Errors:    0
                  Error Information Log Entries:      127
                  Warning  Comp. Temperature Time:    0
                  Critical Comp. Temperature Time:    0
                  Temperature Sensor 1:               66 Celsius
                  Temperature Sensor 2:               41 Celsius
                  Temperature Sensor 3:               41 Celsius
                  Temperature Sensor 4:               41 Celsius
                  
                  Error Information (NVMe Log 0x01, 16 of 64 entries)
                  Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
                    0        127     0  0x0008  0x4004  0x000            0     0     -  Invalid Field in Command
                  
                  Self-tests not supported
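These SMART counters are lifetime totals, so they should normally only ever increase. A minimal Python sketch that compares the values from the first post with the ones above and lists every counter that went backwards (the dictionary layout and function name are only illustrative):

```python
# Counters copied from the SMART outputs before and after the lock-up.
before = {
    "Data Units Read": 450_632,
    "Data Units Written": 367_252_380,
    "Host Read Commands": 4_492_871,
    "Host Write Commands": 3_452_232_979,
    "Controller Busy Time": 31_299,
    "Power Cycles": 38,
    "Power On Hours": 6_553,
    "Unsafe Shutdowns": 17,
    "Error Information Log Entries": 128,
}
after = {
    "Data Units Read": 138_622,
    "Data Units Written": 113_308_631,
    "Host Read Commands": 2_535_666,
    "Host Write Commands": 1_142_150_528,
    "Controller Busy Time": 18_241,
    "Power Cycles": 39,
    "Power On Hours": 6_553,
    "Unsafe Shutdowns": 18,
    "Error Information Log Entries": 127,
}


def regressed_counters(old: dict, new: dict) -> dict:
    """Return {name: (old, new)} for every counter whose value decreased."""
    return {k: (old[k], new[k]) for k in old if new[k] < old[k]}


regressed = regressed_counters(before, after)
for name, (old, new) in regressed.items():
    print(f"{name}: {old:,} -> {new:,} ({new / old:.0%} of previous value)")
```

Power Cycles and Unsafe Shutdowns each went up by one, as expected after a hard power cycle, while the traffic counters, Controller Busy Time, and the error log count all dropped.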
                    stephenw10 Netgate Administrator

                    Hmm, not seen that. It's the original SSD?

                    Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.