Extreme SSD wear on Netgate 8200 MAX - ntopng likely culprit with ZFS
-
I replaced an older SG-5100 with an 8200 MAX just shy of two years ago. Before that, I ran the SG-5100 on its default eMMC storage for 4 years, including write-heavy packages like pfBlockerNG and ntopng, without any issues.
Because of some of the posts I've been seeing on disk wear, I decided to look at the S.M.A.R.T. stats for the 8200 this weekend, and was alarmed by what I saw after only 2 years of usage:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 38 Celsius
Available Spare: 100%
Available Spare Threshold: 1%
Percentage Used: 99%
Data Units Read: 450,632 [230 GB]
Data Units Written: 367,252,380 [188 TB]
Host Read Commands: 4,492,871
Host Write Commands: 3,452,232,979
Controller Busy Time: 31,299
Power Cycles: 38
Power On Hours: 6,553
Unsafe Shutdowns: 17
Media and Data Integrity Errors: 0
Error Information Log Entries: 128
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 63 Celsius
Temperature Sensor 2: 38 Celsius
Temperature Sensor 3: 38 Celsius
Temperature Sensor 4: 37 Celsius
188 TB written, 99% of lifetime used, but it appears that the SSD has not tapped into any spare capacity yet. The disk is very under-utilized, with < 6 GB used of the 128 GB SSD that is included in the 8200 MAX, and it appears that Netgate configured the ZFS pools in these things with autotrim turned on, so this is probably giving the SSD controller plenty of room for wear leveling.
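For a sense of scale, the raw SMART counter can be turned into an average write rate. This is only a rough sketch: it assumes the NVMe convention that one data unit is 1000 blocks of 512 bytes, and it assumes about 700 days in service, which is my estimate for "just shy of two years" and not a measured value.

```python
# NVMe SMART logs count data units in thousands of 512-byte blocks.
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_tb(units: int) -> float:
    """Convert NVMe data units to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def average_write_rate_mb_s(units: int, days_in_service: float) -> float:
    """Average sustained write rate in MB/s over the service period."""
    seconds = days_in_service * 86400
    return units * BYTES_PER_DATA_UNIT / seconds / 1e6


written_units = 367_252_380  # "Data Units Written" from the SMART output
days = 700                   # ASSUMPTION: roughly "just shy of two years"

print(f"{data_units_to_tb(written_units):.0f} TB written")
print(f"{average_write_rate_mb_s(written_units, days):.1f} MB/s average")
```

Under those assumptions the drive has averaged roughly 3 MB/s of writes around the clock.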
I believe the culprit for the excessive writing is ntopng. When I disable ntopng, iostat -x shows kw/s dropping from 4000+ to 100-300. I've disabled ntopng for now to reduce further wear until I have this figured out.
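To put the iostat numbers in perspective, here is a small sketch that projects a sustained kw/s figure to terabytes written per year. It assumes kw/s means kilobytes written per second and that a kilobyte here is 1024 bytes; if iostat uses 1000-byte kilobytes the results are about 2% lower.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def tb_per_year(kw_per_s: float, bytes_per_kb: int = 1024) -> float:
    """Project a sustained write rate (kB/s) to decimal TB per year."""
    return kw_per_s * bytes_per_kb * SECONDS_PER_YEAR / 1e12


# Rates observed with iostat -x, with and without ntopng running.
observed = [
    ("ntopng enabled", 4000),
    ("ntopng disabled (low)", 100),
    ("ntopng disabled (high)", 300),
]

for label, rate in observed:
    print(f"{label}: {tb_per_year(rate):.1f} TB/year")
```

At 4000 kw/s this comes to well over 100 TB per year, which is the same order of magnitude as the 188 TB the drive reports after about two years.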
I suspect I never ran into any issues with eMMC wear-out on the old SG-5100 because that used a UFS filesystem and IIRC on those smaller systems, the ramdisks were set up by default.
Questions for the forum:
- Has anyone else seen SSD wear this extreme on one of the MAX systems and what should I expect in terms of the lifetime of this SSD? Am I already on borrowed time or should I take some comfort that it hasn't tapped into the spare yet?
- Netgate says these drives aren't user replaceable, but if this SSD is already running on borrowed time, I obviously don't want to replace a firewall that is less than 2 years old just to swap out the SSD. Has anyone replaced one of these drives and can share their experience?
- Any recommendations to reduce writes from ntopng? I already have timeseries retention set to 7 days (holdover from when I migrated the config from the SG-5100 which had very limited space). This is a decent sized home network and homelab with about 115 devices. Total size of /var/db/ntopng is < 600MB so it seems like we are seeing some extreme write amplification here.
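As a rough illustration of why the write volume looks out of proportion, the sketch below compares the total written to the size of /var/db/ntopng. It attributes all writes to ntopng and assumes about 700 days in service; both are simplifications on my part, not measurements.

```python
def rewrite_count(total_written_bytes: float, dataset_bytes: float) -> float:
    """How many times the dataset could have been rewritten in full."""
    return total_written_bytes / dataset_bytes


total_written = 188e12  # 188 TB, from SMART "Data Units Written"
dataset = 600e6         # /var/db/ntopng is under 600 MB, so this is an upper bound
days = 700              # ASSUMPTION: roughly "just shy of two years"

rewrites = rewrite_count(total_written, dataset)
print(f"{rewrites:,.0f} full rewrites, about {rewrites / days:,.0f} per day")
```

Since the directory is smaller than 600 MB, the real ratio would be at least this high: the equivalent of rewriting the whole dataset every few minutes.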
-
For comparison on a test 8200 here I see:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 42 Celsius
Available Spare: 100%
Available Spare Threshold: 1%
Percentage Used: 1%
Data Units Read: 177,777 [91.0 GB]
Data Units Written: 6,000,284 [3.07 TB]
Host Read Commands: 3,006,035
Host Write Commands: 261,017,481
Controller Busy Time: 501
Power Cycles: 301
Power On Hours: 6,553
Unsafe Shutdowns: 236
Media and Data Integrity Errors: 0
Error Information Log Entries: 117
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 66 Celsius
Temperature Sensor 2: 42 Celsius
Temperature Sensor 3: 42 Celsius
Temperature Sensor 4: 43 Celsius
One notable value is the power-on hours, which are exactly the same as yours. That seems very unlikely to be a coincidence.
But certainly yours is writing at a much higher rate compared to the read values.
Replacing those SSDs is not that hard if you have some experience IMO.
-
@axellarsson If you are even a little handy with a screwdriver, replacing the SSD is very simple. I found a YouTube guide a long time ago when I added an SSD to my 6100 (same chassis as the 8200).
Also, the SSDs for this thing are dirt cheap. I bought a 512 GB SSD to avoid wearing it out too quickly, and it set me back less than $100.
-
Good job noticing this; that is extreme use. I can safely say SSD drives can do some work, wow!!
-
@stephenw10 said in Extreme SSD wear on Netgate 8200 MAX - ntopng likely culprit with ZFS:
Power On Hours: 6,553
That has to be some sort of glitch; he stated he switched to this just shy of 2 years ago. That many hours is only 273 days, nowhere close to 2 years. I doubt it was turned off more days than it's been on in a two-year period ;)
-
Yes, the Power On Hours value is incorrect. It seems to be an issue in the SMART reporting, as it exactly matches the number Stephen reported above on a test system.
The firewall has been up since I put it in service in April 2023, with reboots only for pfSense upgrades.