SSD read/write - how long will it last
-
Hey everyone,
Does anyone have details on how many reads/writes the SSDs in the MAX units are rated for? I have a 6100 and, as expected, Suricata just hammers the SSD with tons of writes.
I can't find any data on the spec sheet. SMART status is below:
smartctl 7.3 2022-02-28 r5338 [FreeBSD 12.3-STABLE amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       M.2 (P80) 3TE6
Serial Number:                      YCA12203170170376
Firmware Version:                   V20B09
PCI Vendor/Subsystem ID:            0x1bc0
IEEE OUI Identifier:                0x24693e
Controller ID:                      0
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          120,034,123,776 [120 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            24693e 36ba029988
Local Time is:                      Fri Feb  3 18:53:04 2023 EST
Firmware Updates (0x02):            1 Slot
Optional Admin Commands (0x0006):   Format Frmw_DL
Optional NVM Commands (0x000f):     Comp Wr_Unc DS_Mngmt Wr_Zero
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         256 Pages
Warning  Comp. Temp. Threshold:     70 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.50W       -        -    0  0  0  0        5       5
 1 +     5.80W       -        -    1  1  1  1       30      30
 2 +     3.60W       -        -    2  2  2  2      100     100
 3 -   0.0500W       -        -    3  3  3  3     5000   10000
 4 -   0.0025W       -        -    4  4  4  4     5000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        59 Celsius
Available Spare:                    100%
Available Spare Threshold:          1%
Percentage Used:                    3%
Data Units Read:                    7,935 [4.06 GB]
Data Units Written:                 12,477,847 [6.38 TB]
Host Read Commands:                 99,971
Host Write Commands:                285,963,709
Controller Busy Time:               6,392
Power Cycles:                       23
Power On Hours:                     3,301
Unsafe Shutdowns:                   16
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               83 Celsius
Temperature Sensor 2:               59 Celsius
Temperature Sensor 3:               59 Celsius
Temperature Sensor 4:               61 Celsius
Thermal Temp. 1 Transition Count:   1
Thermal Temp. 1 Total Time:         25290

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
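(If you just want the wear-related counters, something like this pulls them out; /dev/nvme0 is a guess for how the drive enumerates on the 6100:)

smartctl -a /dev/nvme0 | egrep 'Percentage Used|Data Units Written|Power On Hours'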
-
@michmoor you have written 6TB in 137 days (3300 power on hours)? That seems a bit crazy...
-
@johnpoz I’m not amazingly surprised. My new build last August hammered my new SATA drive with over 300 GB of written data in a couple of weeks. I then bumped up the RAM and moved to a RAM disk for logging etc., and since then have only reached 374 GB total in 6 months.
-
@michmoor That seems crazy high. I know Netgate recommends SSD for several packages but we haven’t had an issue running Suricata on client routers without SSD. Log display often covers days to weeks. It’s all relative though, so how do you have it configured, and how much traffic? If it's on WAN it will scan and alert on everything before it gets blocked by the firewall.
Anyway, the PCI vendor is Innodisk, not that that helps much without a model number. SSD endurance is usually rated in “n” drive writes per day (DWPD), so on a 120 GB drive a low rating of 0.5 DWPD would be about 60 GB per day, and 1 DWPD would be about 120 GB per day. Total guesstimate. Your 6 TB over 137 days works out to 6000/137 ≈ 43.8 GB per day.
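As a sanity check on that guesstimate (NVMe counts writes in data units of 512,000 bytes; the hours are from the SMART dump above, and the result comes out a touch higher than 43.8 because the drive has really written 6.39 TB, not a round 6):

# 12,477,847 data units x 512,000 bytes, spread over 3,301 power-on hours
echo 'scale=1; (12477847 * 512000) / 10^9 / (3301 / 24)' | bc
# prints 46.4 -> roughly 46 GB written per day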
We also usually use RAM drives now but with that amount of logging it might not be feasible.
-
@steveits @johnpoz Thanks. So for the firewall rules, I do have logging enabled on every rule. That's sent to a syslog server and it's needed for compliance. But I really feel the intensity of the writes came from Suricata. By default it logs almost everything about a flow, plus DNS/TLS.
So on your own boxes, what kind of reads/writes are you seeing?
-
@michmoor Ah, I do uncheck "Enable HTTP Log" which could make a big difference too. ("Suricata will log decoded HTTP traffic for the interface. Default is Checked.")
One router on CE has a spinning drive so it doesn't log that info. It has this:
/var/log/suricata/suricata_em01532: ls -l
total 2196
-rw-r--r--  1 root  wheel   334962 Feb  3 22:55 alerts.log
-rw-r--r--  1 root  wheel        0 Nov 15 17:55 block.log
-rw-r--r--  1 root  wheel  1242618 Feb  3 22:55 block.log.2022_1115_1755
-rw-r--r--  1 root  wheel    89352 Feb  3 18:23 sid_changes.log
-rw-r--r--  1 root  wheel   576904 Feb  3 18:25 suricata.log
and
Filesystem            Size    Used   Avail Capacity  Mounted on
pfSense/ROOT/default  405G    804M    404G     0%    /
devfs                 1.0K    1.0K      0B   100%    /dev
pfSense/home          404G     96K    404G     0%    /home
pfSense               404G     96K    404G     0%    /pfSense
pfSense/cf            404G     96K    404G     0%    /cf
pfSense/var           404G    164K    404G     0%    /var
pfSense/cf/conf       404G    5.2M    404G     0%    /cf/conf
pfSense/var/empty     404G     96K    404G     0%    /var/empty
pfSense/var/tmp       404G     96K    404G     0%    /var/tmp
pfSense/var/log       404G    120K    404G     0%    /var/log
pfSense/var/cache     404G     96K    404G     0%    /var/cache
pfSense/var/db        404G    372K    404G     0%    /var/db
tmpfs                 512M    356K    512M     0%    /tmp
tmpfs                 1.0G     45M    979M     4%    /var
devfs                 1.0K    1.0K      0B   100%    /var/dhcpd/dev
pfSense/reservation   449G     96K    449G     0%    /pfSense/reservation
On that one the 250-entry log page goes back to early December. Our data center is different; its 250 entries go back an hour or two.
I had looked at several Netgate routers with eMMC drives a while back and wear was something reasonable like 10-20% for most, and they were not new, so I was not concerned. Plus we were enabling the RAM disk at the time.
For anyone curious, here are Netgate's package vs. disk recommendations.
-
@steveits Nice link, I hadn't come across that page before. While I get haproxy being on there if it's actually under heavy use, I use it for a handful of users to access a couple of resources now and then. I think my eMMC should be fine ;)
-
@johnpoz @SteveITS Thanks both of you. So reviewing the config history of this box, I think I found the culprits:
- Someone had ntopng running for a few days.
- Suricata had logging enabled for multiple traffic types (http/dns/tls). This was me, so I'm guilty. :)
OK, I'm not too concerned then about the usage on the disk. Thanks for the tip on Enable HTTP Log @SteveITS. I'm thinking about disabling TLS logging as well, and maybe DNS; I already have unbound logging anyway.
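Before flipping those off, a quick check of which log files are actually the heavy ones (the per-interface directory name will differ from the listing above):

du -k /var/log/suricata/suricata_*/* | sort -rn | head
# biggest log files first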
-
@johnpoz said in SSD read/write - how long will it last:
written 6TB in 137 days
Agree that is a lot, given the terabytes written (TBW) rating for an SSD in that size range is probably between 60 and 150 TB, though drives often last longer in practice: https://www.ontrack.com/en-au/blog/how-long-do-ssds-really-last and https://www.ionos.com/digitalguide/server/security/ssd-life-span/
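At the ~46 GB/day rate worked out above, those assumed ratings (ballpark figures from the linked articles, not this drive's spec sheet) translate roughly to:

echo 'scale=1; 60000 / 46.4 / 365' | bc    # ~3.5 years if rated 60 TBW
echo 'scale=1; 150000 / 46.4 / 365' | bc   # ~8.8 years if rated 150 TBW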
-
I was running pfSense Plus 22.05, and now 23.01, under Proxmox. SMART for my SSD shows roughly 40-50 GB written per day. Some of that is write amplification due to nested ZFS, but pfSense is still writing an insane amount per day.
There's not that much logging going on; it's a home internet connection. I don't use an IDS/IPS package. All I have installed is pfBlockerNG, acme, iperf, Status_Traffic_Totals, and System_patches. I am running a DNS resolver, if that matters. I only see syslogd writing stuff every few seconds in top.
My Proxmox box is using a Samsung 980, which is relatively durable, but I'd rather not thrash it needlessly either. Something about how pfSense writes is extremely inefficient.
Here is what the IO graph for pfSense looks like in Proxmox:
Here is zpool iostat output:
zpool iostat -y 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0     87      0  1.16M
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0     89      0  1.17M
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0     97      0  2.21M
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0     85      0  1.16M
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0     91      0  1.20M
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0     87      0  1.17M
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0     86      0  1.12M
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
pfSense     3.53G  59.0G      0      0      0      0
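Back-of-the-envelope, a ~1.2M burst every ~10 seconds is only about 10 GB/day as seen inside the VM, so if SMART says 40-50 GB/day the rest would be amplification below the guest (rough arithmetic, assuming that burst pattern holds all day):

echo 'scale=1; 1.2 * (86400 / 10) / 1024' | bc
# prints 10.1 -> ~10 GB/day of guest-level writes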
-
@adriftatlas As noted above, if it’s logging then a RAM disk will help.
pfBlockerNG can be set to log DNSBL. pfSense logs the default block rule by default.
Could it be using swap? (Low on memory)
-
It has 4GB of RAM and it's only using 33%.
I stopped the syslogd service temporarily and something was still writing to disk excessively so it's not the culprit.
This command shows zfskern writing a lot, but what is it writing, and why?
top -m io -o write -IS -d 1
last pid: 97225;  load averages:  0.17,  0.17,  0.14    up 2+20:53:44  20:25:57
80 processes:  2 running, 76 sleeping, 2 waiting
CPU:  0.3% user,  0.1% nice,  1.2% system,  0.2% interrupt, 98.2% idle
Mem: 89M Active, 191M Inact, 1383M Wired, 2166M Free
ARC: 842M Total, 678M MFU, 91M MRU, 2298K Anon, 9317K Header, 59M Other
     699M Compressed, 1473M Uncompressed, 2.11:1 Ratio
Swap: 1024M Total, 1024M Free

  PID USERNAME      VCSW  IVCSW   READ    WRITE FAULT    TOTAL PERCENT COMMAND
    6 root       2207881  26057    313  1142462     0  1142775  82.97% zfskern
54361 root         35902   7684    313   151637    80   152030  11.04% php-fpm
  359 root         34589   8604    111    76066    28    76205   5.53% php-fpm
14040 dhcpd       969664  20228     12     2585    20     2617   0.19% dhcpd
84008 root          3607    224      0     2060     0     2060   0.15% syslogd
  358 root         36037   7802     54      400    35      489   0.04% php-fpm
82882 root          2927    152      0      353     0      353   0.03% php_pfb
18850 _dhcp         5303    182      0      287     0      287   0.02% dhclient
80651 root           235     53      0      216     0      216   0.02% vnstatd
 3607 root        258782   7119      0       25     0       25   0.00% ntpd
98600 unbound     857611 289739      1        7     8       16   0.00% unbound
 4737 root         14994   5587     71        4     2       77   0.01% nginx
38639 root             9      4      7        2     5       14   0.00% login
 5065 root          4456   2591      4        2     1        7   0.00% nginx
80827 root            18      3      0        2     0        2   0.00% sh
 4892 root          5399   2003      0        2     0        2   0.00% nginx
63231 root            18      3      0        2     0        2   0.00% sh
22895 root        282285  17201      0        0     2        2   0.00% filterlog
    1 root           272      7     85        0    15      100   0.01% init
  357 root        243687   8432      1        0     0        1   0.00% php-fpm
67497 root          3140    244      5        0     8       13   0.00% tcsh
  397 root           626    182      1        0     4        5   0.00% check_reload_status
    0 root      92073647  98734     42        0     0       42   0.00% kernel
41040 root            10     26      5        0     4        9   0.00% sh
62768 root          9323    597      2        0     9       11   0.00% sshd
83376 root             6      1      4        0     5        9   0.00% iperf3
38864 root             4      2      1        0     0        1   0.00% getty
 2194 root         13892   1543      1        0     0        1   0.00% cron
42041 root            14      1      1        0     0        1   0.00% sh
66300 root          9007    859      0        0     2        2   0.00% qemu-ga
12509 root            10      1      1        0     0        1   0.00% sshd
-
@michmoor If you look closely at that report, it says Percentage Used = 3%.
So there is lots and lots of life left in your SSD :-)
-
@keyser you sure percentage used isn’t talking about disk space?
-
@michmoor said in SSD read/write - how long will it last:
@keyser you sure percentage used isn’t talking about disk space?
Well, pretty much 100% sure. The SSD itself cannot know how much disk space the OS considers allocated and used, especially not if the OS does not support TRIM. Also, the S.M.A.R.T. health tools always report on the actual physical state of the SSD, not the filesystem state on it. Percentage Used here is the NVMe endurance estimate: how much of the drive's rated life has been consumed.
-
@adriftatlas said in SSD read/write - how long will it last:
  PID USERNAME      VCSW  IVCSW   READ    WRITE FAULT    TOTAL PERCENT COMMAND
    6 root       2207881  26057    313  1142462     0  1142775  82.97% zfskern
WRITE is a counter, I believe? So it should increment forever. I pulled up a router on 2.6 with ZFS and see 5315859 (incrementing slowly), but that's 284 days of uptime. zfskern should include the ZFS scrub.
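If you want a rate out of that counter, sampling it twice is a crude but workable sketch (FreeBSD top in batch mode; subtract the two WRITE values and divide by the interval):

top -b -m io -o write -IS | grep zfskern
sleep 60
top -b -m io -o write -IS | grep zfskern
# difference in WRITE / 60 = write ops per second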
What does
iostat -x
show? On that same router, which is also using a RAM disk, I see:
iostat -x
                        extended device statistics
device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t  qlen   %b
ada0           0       0      0.0      1.2     6     0    24     3     0    0
cd0            0       0      0.0      0.0     0     0     1     1     0    0
pass0          0       0      0.0      0.0     4     1    79    22     0    0
pass1          0       0      0.0      0.0     0     0     0     0     0    0
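Note iostat's first report is an average since boot, so for live rates an interval sample is more telling (standard FreeBSD iostat flags):

iostat -x -w 10 -c 6
# -w 10 = sample every 10 seconds, -c 6 = six reports; skip the first one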
-
I would go with a small Raspberry Pi 3/4 with 2/4 GB and a big mSATA or M.2 SSD with TRIM support as a logging server.
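On the Pi side, receiving pfSense's remote syslog only takes a few lines of rsyslog config (a sketch; the firewall IP and log path are placeholders):

# /etc/rsyslog.d/pfsense.conf -- listen on UDP 514, file pfSense's messages separately
module(load="imudp")
input(type="imudp" port="514")
:fromhost-ip, isequal, "192.168.1.1" /var/log/pfsense.log
& stop
-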
                        extended device statistics
device       r/s     w/s     kr/s     kw/s  ms/r  ms/w  ms/o  ms/t  qlen   %b
da0            0      11      2.1    149.9     0     0     5     1     0    0
pass0          0       0      0.0      0.0     0     0     0     0     0    0
I'd like to understand what is writing, though. I'd rather not resort to using a RAM disk, as it can cause other issues. There is no reason anything should be writing that much on a mostly idle router.
-
@adriftatlas The usual HUGE suspects are pfSense add-on packages, in order of typical write activity:
1: Suricata
2: Snort
3: pfBlockerNG (especially with reply logging turned on)
4: ntopng
All of them have options to configure how much (or how little) logging they actually do.
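A quick du of the usual log locations shows which one is eating a given box (these paths are educated guesses for the default package locations and may vary by version):

du -sh /var/log/suricata /var/log/snort /var/log/pfblockerng /var/db/ntopng 2>/dev/null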
-
@adriftatlas Hmm yeah, that's roughly 125x more (149.9 vs 1.2 kw/s). Not that I have a comparison handy.
Re: suspects, also check any package mentioning SSD here:
https://www.netgate.com/supported-pfsense-plus-packages