Netgate Discussion Forum

    ZFS and the upcoming 2.6

    • johnpozJ
      johnpoz LAYER 8 Global Moderator @mcury
      last edited by johnpoz

      @mcury said in ZFS and the upcoming 2.6:

      This will be my opportunity to clean my aliases and rules

      Never a bad idea to take the down time as time to clean up..

      I wanted to test the recovery process of the config, which worked pretty freaking well if you ask me.. Other than some oddness with a couple of packages, which was no biggy and I could have expected that if I'd done my due diligence and looked over any recent redmines reported ;)

      If you take that out of the picture, it went as smooth as you would expect.. Literally a reboot and back up and running.

      edit: If I was going to nitpick - the longer than normal time to get a link to the install media. In the past it's been as fast as like 4 minutes or something. But hey, make such a request a few minutes after the release was announced - I am more than happy with the couple of hours it took.. I would expect they got bombed..

      An intelligent man is sometimes forced to be drunk to spend time with his fools
      If you get confused: Listen to the Music Play
      Please don't Chat/PM me for help, unless mod related
      SG-4860 24.11 | Lab VMs 2.8, 24.11

      • bingo600B
        bingo600 @johnpoz
        last edited by

        @johnpoz
        Are you the first to take a ZFS snapshot on 2.6.0 and test "Rollback"? 😊

        https://forum.netgate.com/topic/165335/fun-with-zfs-snapshots-and-rollback?_=1644864380266
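
        For anyone curious, the basic flow is a recursive snapshot before the change and a rollback if it goes wrong. A minimal sketch below - the pool and dataset names are assumptions, so confirm yours with zfs list first, and rolling back the live root dataset is normally done from single-user mode or a recovery shell:

        # Take a recursive snapshot of the whole pool before upgrading (pool name is an example)
        zfs snapshot -r pfSense@pre-2.6-upgrade
        # Confirm the snapshots exist
        zfs list -t snapshot
        # Roll a dataset back to that snapshot if needed (dataset name is an example)
        zfs rollback pfSense/ROOT/default@pre-2.6-upgrade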

        If you find my answer useful - Please give the post a šŸ‘ - "thumbs up"

        pfSense+ 23.05.1 (ZFS)

        QOTOM-Q355G4 Quad Lan.
        CPU : Core i5 5250U, Ram : 8GB Kingston DDR3LV 1600
        LAN : 4 x Intel 211, Disk : 240G SAMSUNG MZ7L3240HCHQ SSD

        • M
          mcury Rebel Alliance @johnpoz
          last edited by mcury

          @johnpoz Pendrive is ready.. :)
          Now it's going to be a painful wait..

          # zero out the first 1 MiB of the USB drive to clear any old partition data
          $ sudo dd if=/dev/zero of=/dev/sdc bs=1M count=1
          1+0 records in
          1+0 records out
          1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.026902 s, 39.0 MB/s
          # write the Netgate 3100 recovery image to the drive
          $ sudo dd if=pfSense-plus-Netgate-3100-recovery-22.01-RELEASE-armv7.img of=/dev/sdc bs=4M
          332+1 records in
          332+1 records out
          1392517632 bytes (1.4 GB, 1.3 GiB) copied, 56.4994 s, 24.6 MB/s
          

          @johnpoz said in ZFS and the upcoming 2.6:

          edit: If I was going to nitpick - the longer than normal time to get a link to the install media. In the past it's been as fast as like 4 minutes or something. But hey, make such a request a few minutes after the release was announced - I am more than happy with the couple of hours it took.. I would expect they got bombed..

          Yes, with previous releases it was a maximum of 30 minutes to get a response.. But this is a good sign, it means they are getting more Plus customers, which is always good.
          Also, around 4 hours for a response from TAC is perfectly fine.

          dead on arrival, nowhere to be found.

          • occamsrazorO
            occamsrazor
            last edited by

            Upgraded from 2.5.2 to 2.6. Was already running ZFS with the "old" layout and did not do a fresh install to get the new layout. Everything is working fine - I just mention it in case anyone plans to do the same. I'd like to move to the new ZFS layout at some point, but that will wait for another day.....
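
            For anyone unsure which layout a box is on, listing the pool and datasets shows the difference at a glance. A minimal sketch - the exact pool and dataset names differ between the old and new layouts, so treat the output as the source of truth:

            # Show the pool(s) and the dataset hierarchy with mountpoints
            zpool list
            zfs list -o name,used,mountpoint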

            pfSense CE on Qotom Q355G4 8GB RAM/60GB SSD
            Ubiquiti Unifi wired and wireless network, APC UPSs
            Mac OSX and IOS devices, QNAP NAS

            • M
              mcury Rebel Alliance @mcury
              last edited by

              Installed from scratch, SG-3100 running perfectly with this release.
              Packages tested so far: NUT, pfBlockerNG, WireGuard, ACME.

              • M
                mcury Rebel Alliance @johnpoz
                last edited by mcury

                @johnpoz This version is running faster and more smoothly than 21.05_2... Not sure if it's the placebo effect or the fact that they fixed bug #11466

                • johnpozJ
                  johnpoz LAYER 8 Global Moderator @mcury
                  last edited by

                  @mcury yeah not sure - could be placebo.. But does seem a bit snappier on the interface..

                  • M
                    mcury Rebel Alliance @johnpoz
                    last edited by

                    @johnpoz said in ZFS and the upcoming 2.6:

                    @mcury yeah not sure - could be placebo.. But does seem a bit snappier on the interface..

                    The internet seems faster. I have asked a few users here what they think, and they all told me that things are opening faster.. So this placebo virus got us all, I guess hehe

                    • johnpozJ
                      johnpoz LAYER 8 Global Moderator @mcury
                      last edited by

                      @mcury well I have never had any issues with pegging my download/upload from the ISP.. I just tested and it's all still there like normal.

                      Never really had any issues with DNS, etc. The overall internet seems the same, but the pfSense webgui could maybe be a bit snappier.. Not anything significant that I could quantify..

                      As long as they are not complaining it's slower ;) We would normally never tell users when we were making a change that should make something faster - because as soon as you mention changing anything.. something is slower or not working because of it ;)

                      I could say I changed the black toner in the printer, and they would say that broke the freaking internet ;)

                      • M
                        mcury Rebel Alliance @johnpoz
                        last edited by

                        @johnpoz said in ZFS and the upcoming 2.6:

                        As long as they are not complaining it's slower ;) We would normally never tell users when we were making a change that should make something faster - because as soon as you mention changing anything.. something is slower or not working because of it ;)
                        I could say I changed the black toner in the printer, and they would say that broke the freaking internet ;)

                        hahaha, users are like mothers, we all have one and they are pretty similar; the only thing that changes is the address.. :)

                        • K
                          karlfife @jimp
                          last edited by karlfife

                          @jimp
                          The new ZFS filesystem defaults make me think flash drive endurance could be significantly improved by (simply) lowering the ZFS recordsize property (the maximum record size) on the /var/log dataset (or any dataset with frequent, short, appended writes).

                          In its current default setup on pfSense, ZFS needlessly amplifies flash wear, which may be significant for installs using flash media with SATADOM-like write endurance. If my understanding of ZFS is correct, it goes like this:

                          For a new logfile (e.g. the first line written to a new logfile), ZFS lays down a small ZFS record (far below the default maximum 'recordsize' of 128K). The small record fits into a single 4K disk block, so it results in a single program/erase (P/E) cycle on flash.

                          When the next line is appended to the logfile, the copy-on-write nature of ZFS results in an entirely new ZFS record being written to disk (containing the original line, the appended line, and a new checksum for the record), followed by an atomic pointer switch to reference the new record. The new record is larger but still fits into a single 4K disk block. So far, each appended line results in a single P/E cycle (as one might expect with any filesystem on flash).

                          As the logfile grows, the growing ZFS record eventually becomes too large to fit into a single 4K disk block, so each appended line results in a new ZFS record that spans 2 disk blocks. Thus, with each new line appended to the logfile there are two P/E cycles (instead of just one with a 'normal' filesystem). You may see where I'm going with this.

                          When the ZFS record has grown to > 124K (i.e. just shy of the default 128K maximum), each appended line results in a new ZFS record spanning 32 4K blocks - every time a measly ~100 bytes are appended to the logfile. That means 32 P/E cycles instead of one. This continues until the 128K record size is reached, at which point ZFS starts a NEW record for the file and begins filling it, one line at a time.

                          Obviously, write/wear amplification will be (on average) half of the worst-case scenario (16x vs 32x), and ZFS records are recordsize bytes BEFORE they are compressed, so compression improves this scenario by the compression ratio (which for text logfiles and LZ4 would be about 3.5:1 best case). That still pencils out to more than a 4x increase in drive wear compared to (simply) lowering recordsize from 128K to roughly 12K. At 12K, assuming 3:1 compression, a ZFS record would always fit into one 4K disk block and thus result in a single P/E cycle per log event.

                          We have several systems that have been running pfSense ZFS-on-root since it was first available (using the defaults for the ZFS install option). Most of those systems are on track to write-exhaust their SATADOMs (16 GB Innodisk, wear-leveling) in 4-8 years (depending on each system's log churn). I will test setting recordsize to 12K on the /var/log dataset and confirm that SMART reports an improvement in the rate of drive wear in line with what I expect from the description here.
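
                          For reference, drive wear can be tracked with SMART. A minimal sketch below, assuming smartctl is available and the boot disk is /dev/ada0 (adjust for your system); the wear-related attribute names vary by vendor, so filtering on a few keywords is the practical approach:

                          # Dump all SMART attributes for the boot disk
                          smartctl -A /dev/ada0
                          # Vendor-specific wear indicators usually mention wear, lifetime or bytes written
                          smartctl -A /dev/ada0 | grep -Ei 'wear|lifetime|written|erase'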

                          The described write-amplification behavior bears itself out in testing (of observed wear), but if anyone is able to refine my characterization (or correct me where it's wrong), that would be appreciated.

                          I know some write endurance inefficiencies can be made irrelevant by way of write-cache mechanisms on performance-oriented SSDs, but for typical SATADOMs in widespread use, I think it is a worthy discussion.

                          Perhaps a well-chosen change in the default recordsize for certain datasets would prevent a nasty surprise for someone who hasn't accounted for a > 4x increase in flash wear under the new defaults.
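
                          For anyone who wants to experiment, a minimal sketch of inspecting and lowering the property. The dataset name below is an assumption - check yours with zfs list - and note that recordsize only accepts powers of two, so 8K or 16K is as close as you can get to the ~12K discussed above. The new size only applies to records written after the change:

                          # Show the current recordsize and compression for every dataset
                          zfs list -o name,recordsize,compression
                          # Lower the cap on the log dataset (substitute the name from the listing above)
                          zfs set recordsize=16K pfSense/var/log
                          # Verify the change
                          zfs get recordsize,compression pfSense/var/log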

                          • keyserK
                            keyser Rebel Alliance @karlfife
                            last edited by keyser

                            @karlfife Interesting analysis, but I don't think it is entirely correct. You need to take the erase block size of the SSD into account.

                            Yes, you can write to unused/erased SSD blocks in 4K pages, but the SSD can only erase/overwrite previously written pages in larger blocks (typically 128K to 512K blocks)
                            Read about the inner workings of a SSD here: https://www.anandtech.com/show/2738/5

                            But you still make a good point, if the ZFS append/rewrite behavior you describe is correct.

                            Btw: I believe most SATADOMs and eMMCs use SLC caching of page writes to allow high-performance fills/erases of blocks of 4K pages. So a lot of wear might be "cancelled" by the fact that ZFS immediately releases a 4K block once it is written again with a new appended line, so the block might not have made it to the MLC/TLC NAND pages.
                            But regardless - your point remains. A lot of write amplification might be avoided by setting a lower ZFS record size.

                            Love the no fuss of using the official appliances :-)

                            • C
                              CLEsports @jimp
                              last edited by

                              @jimp I have a question about the new /pfSense/reservation volume. Is this configured so that if the disks run out of space, there is more to allocate to the ZFS datasets? If so, wouldn't it be better to put a quota on /var/log, since that's generally where runaway space consumption happens?
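
                              For reference, a per-dataset quota like that can be set directly. A minimal sketch, with the dataset name assumed - confirm the actual name with zfs list - and the 2G figure is just an example:

                              # Cap the log dataset so runaway logs cannot fill the pool
                              zfs set quota=2G pfSense/var/log
                              # Check the quota and current usage
                              zfs get quota,used pfSense/var/log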

                              • K
                                karlfife @keyser
                                last edited by karlfife

                                @keyser What is your thought on the impact to ZFS of needing to erase larger groups of 4K disk blocks at a time (on flash modules, up to 128 of them in a larger 512KB 'block', per the referenced article)?

                                Top-of-mind: in the scenario of a very full disk, I can imagine ANY filesystem would be in a bad place if each write required first burning down 512KB, changing a few bytes, and writing it all back out to disk.

                                Are you thinking this has special performance considerations for ZFS?

                                • keyserK
                                  keyser Rebel Alliance @karlfife
                                  last edited by keyser

                                  @karlfife My point was that it's actually (theoretically) a lot worse than the penalty you suggest, but that very fact may also mean the wear caused by the current record size might not be reduced much by smaller record sizes - because the block-erase penalty could still be triggered on almost the same scale.
