Netgate 2100 Life Expectancy
-
Hi, I bought a Netgate 2100 Max about 1.5 years ago and have been using it since as a very basic firewall (no fancy packages installed, really just aws-wizard, bandwidthd, cron, ipsec-profile-wizard, openvpn-client-export, Status_Traffic_Totals). VPN access is just for emergency purposes, so performance doesn't matter; I haven't had to use it yet.
I bought the Max version (with SATA SSD) so that I wouldn't have to worry about eMMC wearout. Out of curiosity, I just checked SMART stats, and saw:
=== START OF INFORMATION SECTION ===
Model Family: Silicon Motion based SSDs
Device Model: TS32GMTS400S
Serial Number: H643800641
LU WWN Device Id: 5 7c3548 20335ee41
Firmware Version: S0903D
User Capacity: 32,017,047,552 bytes [32.0 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jan 7 09:19:49 2025 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Disabled
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity was never started.
    Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: ( 0) seconds.
Offline data collection capabilities: (0x71) SMART execute Offline immediate.
    No Auto Offline data collection support.
    Suspend Offline collection upon new command.
    No Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 1) minutes.
Conveyance self-test routine recommended polling time: ( 1) minutes.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID#  ATTRIBUTE_NAME           FLAGS   VALUE WORST THRESH FAIL RAW_VALUE
  1  Raw_Read_Error_Rate      ------  100   100   000    -    0
  5  Reallocated_Sector_Ct    ------  100   100   010    -    0
  9  Power_On_Hours           ------  100   100   000    -    661
 12  Power_Cycle_Count        ------  100   100   000    -    18
160  Uncorrectable_Error_Cnt  ------  100   100   000    -    0
161  Valid_Spare_Block_Cnt    ------  100   100   012    -    27
163  Initial_Bad_Block_Count  ------  100   100   000    -    5
164  Total_Erase_Count        ------  100   100   000    -    1038587
165  Max_Erase_Count          ------  100   100   000    -    1618
166  Min_Erase_Count          ------  100   100   000    -    1518
167  Average_Erase_Count      ------  100   100   027    -    1571
168  Max_Erase_Count_of_Spec  ------  100   100   000    -    3000
169  Remaining_Lifetime_Perc  ------  100   100   000    -    48
175  Program_Fail_Count_Chip  ------  100   100   000    -    0
176  Erase_Fail_Count_Chip    ------  100   100   000    -    0
177  Wear_Leveling_Count      ------  100   100   000    -    6342
178  Runtime_Invalid_Blk_Cnt  ------  100   100   000    -    0
181  Program_Fail_Cnt_Total   ------  100   100   000    -    0
182  Erase_Fail_Count_Total   ------  100   100   000    -    0
192  Power-Off_Retract_Count  ------  100   100   000    -    6
194  Temperature_Celsius      ------  100   100   000    -    46
195  Hardware_ECC_Recovered   ------  100   100   000    -    150166626
196  Reallocated_Event_Count  ------  100   100   000    -    0
197  Current_Pending_Sector   ------  100   100   000    -    0
198  Offline_Uncorrectable    ------  100   100   000    -    0
199  UDMA_CRC_Error_Count     ------  100   100   000    -    0
232  Available_Reservd_Space  ------  100   100   000    -    100
241  Host_Writes_32MiB        ------  100   100   000    -    393919
242  Host_Reads_32MiB         ------  100   100   000    -    82
245  TLC_Writes_32MiB         ------  100   100   000    -    1606564
I'm a bit surprised to see the wearout on this SSD. Remaining_Lifetime_Perc reads 48. Am I misreading this, or will my SSD die within 3 years of purchase despite my really light use?
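As a cross-check on those numbers, here is a minimal Python sketch that recomputes the wear figure from the erase counts and compares host writes with flash writes. It rests on two assumptions that the smartctl output does not confirm: that the remaining-life attribute is derived from Average_Erase_Count against Max_Erase_Count_of_Spec, and that attributes 241 and 245 count units of 32 MiB, as their names suggest.

```python
# Raw values copied from the SMART attribute table in this post.
AVERAGE_ERASE_COUNT = 1571      # attribute 167
MAX_ERASE_COUNT_OF_SPEC = 3000  # attribute 168
HOST_WRITES_32MIB = 393919      # attribute 241
TLC_WRITES_32MIB = 1606564      # attribute 245

# Assumption: one count equals 32 MiB, as the attribute names suggest.
UNIT_BYTES = 32 * 1024 ** 2

# Assumption: remaining life = 1 - (average erases / rated erases).
remaining_percent = round((1 - AVERAGE_ERASE_COUNT / MAX_ERASE_COUNT_OF_SPEC) * 100)

host_bytes = HOST_WRITES_32MIB * UNIT_BYTES
nand_bytes = TLC_WRITES_32MIB * UNIT_BYTES

# Ratio of data written to flash versus data the host asked to write.
write_amplification = nand_bytes / host_bytes

print(f"Remaining life from erase counts: {remaining_percent}%")
print(f"Host writes: {host_bytes / 1e12:.1f} TB, flash writes: {nand_bytes / 1e12:.1f} TB")
print(f"Write amplification: {write_amplification:.1f}x")
```

Under those assumptions the erase counts reproduce the reported 48%, and the drive has written roughly four times as much to flash as the host sent, so the wear figure at least looks internally consistent.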
I'm tempted to be disappointed about the quality of this 500 bucks purchase of an ARM device (official hardware, official distributor in Europe, bought to support Netgate and this project). But I still hope that I'm misreading the stats, also because Power_On_Hours looks ridiculously low at 661 hours. I put a Raspberry Pi 4 with a SATA SSD into service at about the same time, and that SSD still shows 0% wearout.
Thanks!
-
Mmm, I'm not sure that's accurate. I agree the power on hours cannot be correct. And it also shows 100% reserved space still available.
-
@highc Depending on your basic firewall rules and their logging settings, logging can also REALLY punish an SSD. 32 GB SSDs do not have an enormous amount of write capacity, simply because of their size and limited number of flash cells.
-
@stephenw10 said in Netgate 2100 Life Expectancy:
Mmm, I'm not sure that's accurate. I agree the power on hours cannot be correct. And it also shows 100% reserved space still available.
Yes, indeed. Any idea how to find out which figure is right and which is wrong? I guess I could monitor the SMART values much more closely from now on and see how they evolve. But I'm not sure that makes sense if the figures are not plausible/reliable...
In terms of a possible replacement, as SATA m.2 SSDs on the market are getting fewer and fewer: Is there any limitation (size-, brand-, controller-, or otherwise) of SATA m.2 SSDs which would be supported? Is there space for a bigger one than 2242, 2280 maybe?
@keyser said in Netgate 2100 Life Expectancy:
@highc Depending on your basic firewall rules and their logging settings, logging can also REALLY punish an SSD. 32 GB SSDs do not have an enormous amount of write capacity, simply because of their size and limited number of flash cells.
Yep, I understand. I've (for 24 hours now) moved /tmp and /var to a RAM disk. Of that, /tmp is currently using 1.3 MB, /var 30 MB. Not sure that is excessive, but I also don't know how much of the data in /var gets rewritten rather than growing, and how often.
I did google for the specs of the SSD which smartctl is showing. It appears to be the 32 GB variant of this one: https://www.transcend-info.com/Products/No-642
The page gives 90 TBW for the 32 GB model. By my math, a three-year overall life span (about 50% wearout after 1.5 years) would correspond to more than 2.5 drive writes per day. That seems like far more write volume than the figures above suggest I have been using. But, again, I have very limited insight into the details.
-
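For reference, the endurance arithmetic from the post above (90 TBW, 32 GB, a three-year life span) can be sketched in Python as follows. Decimal units, a 365-day year, and the reading of Host_Writes_32MiB as 32 MiB units are assumptions; the helper name `drive_writes_per_day` is made up for this example.

```python
SSD_CAPACITY_BYTES = 32e9   # 32 GB model, decimal units assumed
RATED_TBW_BYTES = 90e12     # 90 TBW from the linked spec page
DAYS_PER_YEAR = 365


def drive_writes_per_day(total_bytes, days, capacity_bytes=SSD_CAPACITY_BYTES):
    """Average number of full-drive writes per day over the given period."""
    return total_bytes / days / capacity_bytes


# Write rate that would use up the full rating in three years.
implied_dwpd = drive_writes_per_day(RATED_TBW_BYTES, 3 * DAYS_PER_YEAR)

# Host writes actually reported by attribute 241 over roughly 1.5 years.
host_bytes = 393919 * 32 * 1024 ** 2
observed_dwpd = drive_writes_per_day(host_bytes, 1.5 * DAYS_PER_YEAR)

print(f"Implied by rating: {implied_dwpd:.2f} drive writes/day")
print(f"Observed host writes: {observed_dwpd:.2f} drive writes/day")
```

On these assumptions the rating implies about 2.6 drive writes per day, while the reported host writes work out to well under one, which matches the impression that the wear is out of proportion to the host write volume.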
The 2100 board has an additional mounting hole for m.2 drives at 2260 BUT I don't know if the stud is movable on production devices. I would stick with 2242 if possible.
I have a different SSD in a 2100 here; it doesn't list a wear percentage value directly like that. But it does also show low power-on hours.
-
@highc Yeah, it can be really hard to tell what's going on. Generally, SSDs and logging do NOT combine well because of how SSDs work. Internally they operate on a minimum page size (a page is a block of cells that has to be written simultaneously). Those pages are usually 64 KB and up, and depending on the quality of the SSD and how logging is done in the OS/filesystem, it can be a REAL killer of SSDs.
For example, if the OS always flushes writes on each log line, and the SSD does not have an SLC/memory cache, then in the worst case each log line could cause a real 64 KB write even though it's only a 100-byte log line.
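As a rough illustration of that worst case, here is a small Python sketch. The 64 KiB page size and 100-byte log line are the example figures from this post, not measured values for any particular SSD, and the function name is made up for the example.

```python
import math


def worst_case_amplification(payload_bytes, page_bytes):
    """Bytes physically written per byte of payload when every flush
    forces whole pages to be written."""
    pages = math.ceil(payload_bytes / page_bytes)
    return pages * page_bytes / payload_bytes


LOG_LINE_BYTES = 100
PAGE_BYTES = 64 * 1024  # assumed minimum write unit (64 KiB)

amplification = worst_case_amplification(LOG_LINE_BYTES, PAGE_BYTES)
print(f"Worst-case write amplification: {amplification:.0f}x")
```

With these figures a single flushed log line costs several hundred times its own size in flash writes.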
That kind of write amplification can kill even large SSDs in short order.
-
@keyser: Yes, I know. I'm running a handful of Proxmox servers in a few locations, and for those, for this very reason, I always go for reliable SSDs and/or run log2ram to reduce write stress on the SSDs (in particular with underlying ZFS). I previously had a pretty extensive pfSense setup (including firewalling between 10G subnets) on a SuperMicro 5018D-FN4T, with much more extensive logging than currently and additional services running. No wear-out issue at all on the 128 GB SSD used in it over 3 years.
@stephenw10 Thanks for the additional details. Sticking to 2242 essentially also means sticking to Transcend (not a lot of alternatives available in the market here). And the SSD in the Netgate 2100 box is from the series of allegedly most reliable Transcend SSDs. I'm not sure replacing the SSD with one from the same series is a promising route.
I have been monitoring SMART values. Power_On_Hours doesn't move. The readings on blocks written etc. do. There is a second reading on device health that also reflects the 48% left:
Device Statistics (GP Log 0x04)
Page  Offset Size  Value Flags Description
0x01  =====  =     =     ===   == General Statistics (rev 2) ==
0x07  0x008  1     52    ---   Percentage Used Endurance Indicator
I might just monitor things and try to get a backup box (tbd) so that I have a solution in my back pocket if this one really should give up soon.
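A minimal Python sketch of how such monitoring could be turned into a projection: take two readings of the Percentage Used indicator and extrapolate linearly. The 52% value is the reading quoted above; both dates and the second reading are hypothetical placeholders, and linear wear is itself an assumption.

```python
from datetime import date


def days_until_worn_out(first, second):
    """Linear projection of days left until Percentage Used reaches 100.

    Each reading is a (date, percent_used) tuple. Returns None when the
    readings show no measurable wear between them.
    """
    (day_a, used_a), (day_b, used_b) = first, second
    elapsed_days = (day_b - day_a).days
    used_delta = used_b - used_a
    if elapsed_days <= 0 or used_delta <= 0:
        return None
    return (100 - used_b) * elapsed_days / used_delta


first_reading = (date(2025, 1, 7), 52)   # 52% used as quoted above; date is illustrative
second_reading = (date(2025, 2, 6), 53)  # hypothetical follow-up reading

print(days_until_worn_out(first_reading, second_reading))
```

Because the indicator only moves in whole percent steps, readings taken a few days apart will often return None; samples weeks apart give a more useful slope.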
-
FWIW for anyone finding this, there are a handful of ways to reduce disk writes I've posted in other threads, that we routinely do...
- https://www.netgate.com/supported-pfsense-plus-packages lists which packages "require" or recommend SSD over eMMC
- turn off logging of the default block rules
- turn off logging of the bogon rules
- turn off Suricata logging of HTTP requests
- turn off pfBlocker DNSBL logging
- create a "don't log" rule for IGMP
- (can enable these for logging if ever needed)
- don't view the dashboard 24x7 (each widget logs the web server request to update the widget)
- use RAM disk
-
Thanks for pulling this together again, @SteveITS. Other than the use of a RAM disk, all of this has been implemented here from the start.