Extreme SSD wear on Netgate 8200 MAX - ntopng likely culprit with ZFS

axellarsson

I replaced an older SG-5100 with an 8200 MAX just shy of two years ago. Before that, I had run the SG-5100 on its default eMMC storage for 4 years, including write-heavy packages like pfBlockerNG and ntopng, without any issues.

Due to some of the posts I've been seeing on disk wear, I decided to look at the S.M.A.R.T. stats for the 8200 this weekend, and was alarmed by what I saw after only 2 years of usage:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        38 Celsius
Available Spare:                    100%
Available Spare Threshold:          1%
Percentage Used:                    99%
Data Units Read:                    450,632 [230 GB]
Data Units Written:                 367,252,380 [188 TB]
Host Read Commands:                 4,492,871
Host Write Commands:                3,452,232,979
Controller Busy Time:               31,299
Power Cycles:                       38
Power On Hours:                     6,553
Unsafe Shutdowns:                   17
Media and Data Integrity Errors:    0
Error Information Log Entries:      128
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               63 Celsius
Temperature Sensor 2:               38 Celsius
Temperature Sensor 3:               38 Celsius
Temperature Sensor 4:               37 Celsius

188 TB written, 99% of lifetime used, but it appears that the SSD has not tapped into any spare capacity yet. The disk is very under-utilized, with < 6 GB used of the 128 GB SSD that is included in the 8200 MAX, and it appears that Netgate configured the ZFS pools in these things with autotrim turned on, so this is probably giving the SSD controller plenty of room for wear leveling.
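
As a sanity check on the figures above, here is a minimal Python sketch of how the "Data Units Written" counter maps to terabytes. It relies on the NVMe convention that one data unit is 1000 blocks of 512 bytes, and it assumes the 128 GB capacity is a decimal figure. The function names are made up for illustration.

```python
# One NVMe "data unit" in the SMART/Health log is 1000 * 512 bytes.
NVME_DATA_UNIT_BYTES = 512 * 1000


def data_units_to_tb(units: int) -> float:
    """Convert an NVMe data-unit counter to decimal terabytes."""
    return units * NVME_DATA_UNIT_BYTES / 1e12


def full_drive_writes(units: int, capacity_gb: float) -> float:
    """How many times the whole drive capacity has been written.

    capacity_gb is assumed to be decimal gigabytes (128 for this SSD).
    """
    return units * NVME_DATA_UNIT_BYTES / (capacity_gb * 1e9)


if __name__ == "__main__":
    written_units = 367_252_380  # "Data Units Written" from the output above
    print(f"{data_units_to_tb(written_units):.1f} TB written")
    print(f"{full_drive_writes(written_units, 128):.0f} full drive writes")
```

By this estimate the host has written the full capacity of the drive well over a thousand times, which fits the 99% "Percentage Used" figure.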

I believe the culprit for the excessive writing is ntopng. When I disable ntopng, iostat -x shows kw/s dropping from 4000+ to 100-300. I've disabled ntopng for now to limit further wear until I have this figured out.
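
To put those iostat numbers in perspective, here is a rough Python sketch that projects a sustained write rate to terabytes per year. It assumes the kw/s column is kilobytes written per second and treats a kilobyte as 1000 bytes (using 1024 would raise the results by about 2%). The function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def tb_per_year(kb_per_s: float, bytes_per_kb: int = 1000) -> float:
    """Project a sustained write rate (kB/s) to decimal terabytes per year."""
    return kb_per_s * bytes_per_kb * SECONDS_PER_YEAR / 1e12


if __name__ == "__main__":
    # Rates quoted above: 4000+ with ntopng running, 100-300 with it disabled.
    for rate in (4000, 300, 100):
        print(f"{rate:>5} kB/s -> {tb_per_year(rate):.1f} TB/year")
```

A sustained 4000 kB/s works out to roughly 126 TB per year, which is in the same range as 188 TB written in just under two years. At 100-300 kB/s the projection drops to single-digit TB per year.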

I suspect I never ran into any issues with eMMC wear-out on the old SG-5100 because that used a UFS filesystem and IIRC on those smaller systems, the ramdisks were set up by default.

Questions for the forum:

Has anyone else seen SSD wear this extreme on one of the MAX systems and what should I expect in terms of the lifetime of this SSD? Am I already on borrowed time or should I take some comfort that it hasn't tapped into the spare yet?
Netgate says these devices aren't user-replaceable, but if this SSD is already running on borrowed time, I obviously don't want to replace a firewall that is less than 2 years old just to swap out the SSD. Has anyone replaced one of these drives and can share their experience?
Any recommendations to reduce writes from ntopng? I already have timeseries retention set to 7 days (a holdover from when I migrated the config from the SG-5100, which had very limited space). This is a decent-sized home network and homelab with about 115 devices. The total size of /var/db/ntopng is < 600 MB, so it seems like we are seeing some extreme write amplification here.

stephenw10

For comparison on a test 8200 here I see:

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          1%
Percentage Used:                    1%
Data Units Read:                    177,777 [91.0 GB]
Data Units Written:                 6,000,284 [3.07 TB]
Host Read Commands:                 3,006,035
Host Write Commands:                261,017,481
Controller Busy Time:               501
Power Cycles:                       301
Power On Hours:                     6,553
Unsafe Shutdowns:                   236
Media and Data Integrity Errors:    0
Error Information Log Entries:      117
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               66 Celsius
Temperature Sensor 2:               42 Celsius
Temperature Sensor 3:               42 Celsius
Temperature Sensor 4:               43 Celsius

One notable value is that the power-on hours show exactly the same as yours which seems very unlikely.

But certainly yours is writing at a much higher rate compared to the read values.

Replacing those SSDs is not that hard if you have some experience IMO.

keyser

@axellarsson If you are just a little handy with a screwdriver, replacing the SSD is very, very simple. I found a YouTube guide a long time ago when I added an SSD to my 6100 (same chassis as the 8200).
Also, the SSDs for this thing are dirt cheap. I bought a 512 GB SSD to avoid wearing it out too quickly, and it set me back less than $100.

JonathanLee

Good job noticing this; that is extreme use. I can safely say SSDs can do some work, wow!!

johnpoz

@stephenw10 said in Extreme SSD wear on Netgate 8200 MAX - ntopng likely culprit with ZFS:

Power On Hours: 6,553

That has to be some sort of glitch; he stated he switched to this just shy of 2 years ago. Well, that many hours is only 273 days. Nowhere close to 2 years. I doubt it was turned off more days than it's been on in a 2-year period ;)
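
The arithmetic behind that 273-day figure, as a small Python sketch (the two-year comparison assumes continuous uptime, and the names are made up for illustration):

```python
HOURS_PER_DAY = 24


def hours_to_days(hours: float) -> float:
    """Convert a Power On Hours counter to days."""
    return hours / HOURS_PER_DAY


if __name__ == "__main__":
    reported_hours = 6_553  # "Power On Hours" from the SMART output
    expected_hours = 2 * 365 * HOURS_PER_DAY  # two years of continuous uptime
    print(f"reported: {hours_to_days(reported_hours):.0f} days")
    print(f"two years of uptime would be {expected_hours} hours")
```

Two years of near-continuous uptime would show on the order of 17,000 hours, so a reading of 6,553 does not fit the stated usage.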

axellarsson

Yes, the Power On Hours value is incorrect. It seems to be an issue in the SMART value reporting, as it exactly matches the number Stephen reported above on a test system.

The firewall has been up since I put it in service in April 2023, with reboots only for pfSense upgrades.

axellarsson

My firewall just locked up hard (no ping response and no response from the serial console) and required a power cycle to reboot. Interestingly, after the reboot, the S.M.A.R.T. stats show a reduction of more than half for data units read and written and for host write commands (host read commands dropped by about 44%). These S.M.A.R.T. results are from less than an hour after reboot, so clearly these are not "since reboot" values.

I also note that the Power On Hours value is stuck exactly where it was before, lending credence to it not being a reliable number.

Any thoughts as to what is going on here?

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          1%
Percentage Used:                    100%
Data Units Read:                    138,622 [70.9 GB]
Data Units Written:                 113,308,631 [58.0 TB]
Host Read Commands:                 2,535,666
Host Write Commands:                1,142,150,528
Controller Busy Time:               18,241
Power Cycles:                       39
Power On Hours:                     6,553
Unsafe Shutdowns:                   18
Media and Data Integrity Errors:    0
Error Information Log Entries:      127
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               66 Celsius
Temperature Sensor 2:               41 Celsius
Temperature Sensor 3:               41 Celsius
Temperature Sensor 4:               41 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0        127     0  0x0008  0x4004  0x000            0     0     -  Invalid Field in Command

Self-tests not supported
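
For anyone comparing the two dumps from this system, here is a small Python sketch that computes how far each counter dropped between the first SMART output and the one taken after the lockup. The values are copied from the two outputs in this thread; the dictionary and function names are made up for illustration.

```python
# Counters from the first SMART dump (before the lockup).
before = {
    "Data Units Read": 450_632,
    "Data Units Written": 367_252_380,
    "Host Read Commands": 4_492_871,
    "Host Write Commands": 3_452_232_979,
}

# Same counters from the dump taken shortly after the power cycle.
after = {
    "Data Units Read": 138_622,
    "Data Units Written": 113_308_631,
    "Host Read Commands": 2_535_666,
    "Host Write Commands": 1_142_150_528,
}


def percent_drop(old: int, new: int) -> float:
    """Percentage by which a counter decreased from old to new."""
    return (old - new) / old * 100


drops = {name: percent_drop(before[name], after[name]) for name in before}

if __name__ == "__main__":
    for name, drop in drops.items():
        print(f"{name:<22} dropped by {drop:.1f}%")
```

The data-unit counters and host write commands fall by roughly two thirds, while host read commands fall by a bit under half. These are lifetime counters that should only ever increase, so any decrease at all points to a reporting problem rather than real usage.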

stephenw10

Hmm, not seen that. It's the original SSD?