NanoBSD Image Block Alignment [MisAlignment]
-
Can assure you that no alignment will fix my CFs that take over a minute to run /etc/rc.conf_mount_ro; nor will alignment fix the issue for the guy linked with config.xml corruption above where it takes 3+ minutes for him. There's a serious kernel bug somewhere.
And you know… because you've aligned your pfSense partitions?
No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw it still takes damn ages without any config changes done and with nothing to write anywhere.
Sigh. Reminds me of the debates with the 4K drive guys who were screaming in a similar manner. No, writing to misaligned partitions is NOT 110x slower.
-
Reading back through that, it occurs to me that using an I/O scheduler that behaves like Linux's Deadline would help a ton, once things are aligned that is. Something FIFO-based that will still combine writes to the same filesystem blocks.
This would save a ton of code re-writing in order to write things like small config changes to flash disks optimally, which we wouldn't want to do anyways, because it would create the very issues we are trying to solve here, on non flash storage based installs.
-
No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw it still takes damn ages without any config changes done and with nothing to write anywhere.
Right, because package installs are the best case 0.04MB/sec we talked about. Remount read only does indeed write data to disk. Go read the kernel patch that was pulled. And try to keep up.
Sigh. Reminds me of the debates with the 4K drive guys who were screaming in a similar manner. No, writing to misaligned partitions is NOT 110x slower.
Sigh indeed. Turns out, those screaming 4K drive guys were right. Do your homework. And flash devices, because they cannot simply overwrite a sector, and don't have 128MB of controller cache to work with, are many times more affected by this.
I know these things because I've not only done the research, but I've done the real world implementation testing to go with it.
Here, I'll throw you a bone. Pop quiz:
Q: Why does OpenBSD align to sector 64, and why do Windows newer than XP, mdadm, LVM2, and Linux newer than forever ago ALL align to 1MB now?
A: Because alignment.
-
No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw it still takes damn ages without any config changes done and with nothing to write anywhere.
Right, because package installs are the best case 0.04MB/sec we talked about.
And somehow that does not happen. Also, the subsequent RO remount does not take weeks. Miracle, I guess. You have a gaping hole somewhere in your write amplification theory.
Look, I'm not against proper alignment. I'm merely tired of this "oh noes, we are misaligned, the sky will fall and universe will collapse into a blackhole" bullshit. Seriously tired. Bye.
-
@cmb:
nano never had any of the corruption issues because it's always run with sync.
Fixing the ro->rw mount slowness is definitely a priority for 2.2.4. It wasn't as pronounced on the hardware we have as it is for a number of others. We'll review the alignment as well. There's a bug ticket open. https://redmine.pfsense.org/issues/4814
It would appear that people who understand the issue simply don't agree with you.
No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw it still takes damn ages without any config changes done and with nothing to write anywhere.
Right, because package installs are the best case 0.04MB/sec we talked about.
And somehow that does not happen. Also, the subsequent RO remount does not take weeks. Miracle, I guess. You have a gaping hole somewhere in your write amplification theory.
Look, I'm not against proper alignment. I'm merely tired of this "oh noes, we are misaligned, the sky will fall and universe will collapse into a blackhole" bullshit. Seriously tired. Bye.
Then you should learn how to do math, so people listen to you.
0.04MB/sec does not take weeks for a, let's say, 5MB package. It would, however, using this thing called math, take roughly 2 minutes. Let's double that for the sake of argument for the download + the install = 4 minutes.
Where in the because dumb do you get weeks, especially for a simple remount read only?
Between pfSense 2.1.x and 2.2.x, we simply see the time it takes to do a simple remount read only go up by roughly 110x on decent CF cards like mine, same or more for most, less for some lucky / smart flash buyers. We have always seen this slowdown with non remounting flash based writes! Always. We just didn't see the remount read only delay, because it was kernel patched.
No wonder you're tired, get some sleep kiddo.
-
No shit. So writing 200KB takes the same as writing 5MB now? Yeah, get some sleep with your theories and fix your math. Bye.
-
No shit. So writing 200KB takes the same as writing 5MB now? Yeah, get some sleep with your theories and fix your math. Bye.
Not a single word of that makes any sense at all.
Either show your math, cite your sources, or stop trolling.
-
My math is pretty simple. With your 0.04MB/sec "math" - or, as you said, "BEST, full consecutive filesystem block write scenario", I could write max. 2.4MB / minute. Shockingly, those 150+ megs worth of packages take some 15 minutes to reinstall on the shitty Alix box, out of which a large part is spent downloading the stuff and on configuration.
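For anyone who wants to check the numbers being thrown around, here is a minimal Python sketch of the arithmetic. It assumes nothing beyond the figures already quoted in this thread (0.04MB/sec, a 5MB package, 150MB of packages); the function name is just for illustration.

```python
def seconds_to_write(size_mb: float, rate_mb_per_s: float) -> float:
    """Time needed to write size_mb at a sustained rate_mb_per_s."""
    return size_mb / rate_mb_per_s


RATE = 0.04  # MB/sec, the figure quoted in the thread

per_minute_mb = RATE * 60                              # about 2.4 MB per minute
one_package_s = seconds_to_write(5, RATE)              # about 125 s, roughly 2 minutes
all_packages_min = seconds_to_write(150, RATE) / 60    # about 62.5 minutes

print(f"{per_minute_mb:.1f} MB/min, 5MB in {one_package_s:.0f}s, "
      f"150MB in {all_packages_min:.1f} min")
```

So at a flat 0.04MB/sec, 150MB would take about an hour, not weeks, and not the 15 minutes reported either. Whether a flat rate is the right model for either workload is exactly what is being argued here.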
Your "math" also totally fails to explain why "/etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw" still takes damn ages without any config changes done and with nothing to write anywhere.
Sigh. How about finding the real bug?
-
Quoting Jim Thompson with his permission from internal discussion:
All forms of flash suffer from write amplification. Since there are different forms of flash, some have a larger write amplification issue than others.
This is also true for different workloads. So that much is true, thus, there is “some” validity to the claims made in that post.
What is not valid is his recommended path toward a fix.
Quoting the relevant part of the OpenBSD page he cites:
(fdisk platforms) Leave first track free: On platforms using fdisk(8), you should leave the first logical track unused, both in disklabel(8) and in fdisk(8). On "modern" computers (i.e., almost everything that will run OpenBSD), the exact amount doesn't really matter, though for performance reasons on the newest disks, having partitions aligned at 4k boundaries is good for performance. For this reason, OpenBSD now defaults to starting the first partition at block 64 instead of 63.
So first let’s define some terms.
Ordinary spinning rust hard disks are made of platters that have tracks. Tracks are the thin concentric circular strips of sectors.
Cylinders are a collection of tracks, stacked vertically. In CHS addressing the sector numbers always start at 1, there is no sector 0.
The Unix communities employ the term block to refer to a sector or group of sectors.
The CHS addressing supported in IBM-PC compatible BIOS code used eight bits for, theoretically, up to 256 heads counted as head 0 up to 255. However, a bug in all versions of DOS up to and including 7.10 will cause these operating systems to crash on boot when encountering volumes with 256 heads. Therefore, all compatible BIOSes will use mappings with up to 255 heads only, including in virtual 255×63 geometries.
So, CHS addressing starts at 0/0/1 with a maximal value of 1023/255/63.
The fdisk utility normally displays partition table information using 1024-byte blocks, but also uses the word sector to help describe a disk's size in the phrase, 63 sectors per track.
In other words, the LBA sector number 63 corresponds to cylinder 0, head 1, sector 1 in the CHS format, which is the first sector you can use in the MBR format. The source of the confusion is that 63 is not (evenly) divisible by 8.
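The LBA 63 to CHS 0/1/1 mapping can be checked with a few lines of Python. This is only a sketch, assuming the virtual 255×63 geometry mentioned above; the function names are made up for illustration.

```python
HEADS = 255              # heads per cylinder in the virtual geometry
SECTORS_PER_TRACK = 63   # sectors per track; CHS sectors are numbered from 1


def lba_to_chs(lba, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Convert a 0-based LBA into a (cylinder, head, sector) tuple."""
    cylinder, remainder = divmod(lba, heads * spt)
    head, sector_index = divmod(remainder, spt)
    return cylinder, head, sector_index + 1  # sector numbering starts at 1


def chs_to_lba(cylinder, head, sector, heads=HEADS, spt=SECTORS_PER_TRACK):
    """Inverse of lba_to_chs."""
    return (cylinder * heads + head) * spt + (sector - 1)


print(lba_to_chs(0))   # (0, 0, 1): where the MBR lives
print(lba_to_chs(63))  # (0, 1, 1): first sector of the second track
print(63 % 8)          # 7: LBA 63 is not a multiple of 8 sectors (4KiB)
```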
Now is when we mention that flash-based devices don’t have platters, or heads, or tracks. They are, quite simply, a group of flash sectors, the whole C/H/S thing flies out the window, and we can address any sector as easily as another. Moreover, many flash sectors these days are 2K in size.
Now, remember that CompactFlash doesn’t have very good algorithms for dealing with modern systems. This is one of the reasons we have moved away from it. A CF device contains an ATA controller and appears to the host device as if it were a hard disk. It is literally plugged into an IDE channel. CF wear leveling algorithms are proprietary and undocumented; they are "secret sauce". Some will be better than others.
It is often believed that disc partitions have to be aligned to cylinder or track boundaries. This is not in fact true and never really has been. There are alignment considerations for disc partitions, but they have nothing to do with cylinders, and they aren't mandatory. Operating systems will still work with misaligned partitions, just more slowly for some (not all) disc unit models.
The idea that disc partitions have to be aligned to cylinder boundaries is nonsense on its face. Millions of people have had discs where the first primary partition began on track zero, sector one, head one with no ill effect whatsoever on operating systems from MS-DOS through Windows NT to OS/2. That was, after all, the default that fdisk/Disk Manager on those operating systems used for almost two decades. At best, the purported alignment requirement would have been a track alignment, with all partitions starting at sector one (Sectors are numbered from one, remember.) of any given track.
But this is not true, either. No version of any operating system has actually required this. Even MS-DOS was quite happy to have disc partitions starting at sectors other than 1. The only things that have required this have been disc partitioning utilities. There's been a bit of circular logic about this. The disc partitioning utilities enforced the requirement because their authors thought that it was a requirement, but people only thought that it was a requirement because fdisk and the like enforced it. It was what the partitioning utility programs enforced — so the logic went — so it must have been a restriction. In fact it never was, and no operating system itself has any trouble with this.
The idea of track alignment is daft anyway. It's pointless because it doesn't align things to any valid boundary on the disc unit itself. There's no performance or other benefit, because the physical layout of the partitions on the disc will not be aligned to the actual physical tracks on the disc by aligning them to the software-visible track size.
• The "tracks" that system softwares see at the ATA command register level aren't actually the real tracks on the disc itself, and haven't been since the advent of zoned bit recording (ZBR) in the early 1990s. Tracks are not, in fact, equally sized across the whole disc with ZBR; even though that's how discs are presented to software via the (old) cylinder+head+sector I/O command interface for ATA disc units.
• Unlike ATA, the SCSI command level has always operated in terms of logical block numbers, and not in terms of a cylinder+head+sector system in the first place. In the SCSI world, from the start the idea that system software necessarily even knew where the physical track boundaries were was incorrect. Indeed, PC firmwares for SCSI hard discs have to invent disc geometries, largely from thin air, for the benefits of old PC/AT and PC98 firmwares and operating systems that expect discs to be addressed, at the disc unit I/O command level, in terms of a three-dimensional CHS geometry. Alignment to a geometry that's just made up anyway by the machine firmware is just pointless.
With much fanfare, Microsoft finally eradicated enforcement of this entirely useless and pointless notion from the Windows NT Disk Manager, in 2008 (i.e. with the releases of Windows Vista Service Pack 1 and Windows Server 2008). Indeed, for years before that, since 2003, it had been recommending to Exchange Server and Microsoft SQL Server administrators that they use diskpart to align disc partitions to 4KiB multiples for performance reasons. (Some of the performance reasons given in early years were spurious, since they were based upon the erroneous premise that software-visible track boundaries were also physical track boundaries. But the end result, in light of later hardware developments, was right despite that.)
Neither FreeBSD nor Linux has caught up. The fdisk utility in FreeBSD and Linux still complains about partitions not aligned to track boundaries.
There is a disc partition alignment rule that does reflect the actual hardware. It is the rule that partitions be aligned to 4KiB boundaries. However, this rule only makes sense for some hard disc models. In some hard disc models, the internal sector size has been increased from 512B to 4KiB. At the I/O command level, as system softwares access the disc, the sector size is still 512B. Such discs are known as "512 byte emulation" discs. There are also "4KiB native" discs, where the sector size at the I/O command level is also 4KiB. But it was a while before any but a few operating systems could cope with sector sizes other than 512 bytes at the ATA/SCSI I/O command level, so we got 512e disks for a while.
What happens on such "512e" discs is that whenever the operating system or the firmware reads a 512B sector, the disc unit itself is actually reading a whole 4KiB and handing the firmware/operating system the appropriate one-eighth; and whenever the firmware/operating system writes a 512B sector, the disc unit is actually reading a whole 4KiB sector, modifying one eighth, and writing the whole 4KiB back again.
This may seem like a performance killer, as every I/O operation is, under the covers, eight times its apparent size. Fortunately, there's a way to hide the performance cost. This takes advantage of the fact that many operating systems like to do most of their I/O in 4KiB multiples anyway. All paging I/O on x86 operating systems is done in 4KiB multiples, for example, and many operating systems, including Windows, FreeBSD, Linux and Solaris, use the paging mechanism for ordinary file I/O. So the operating system will usually be reading and writing (a multiple of) eight 0.5KiB sectors in a single I/O operation.
So it's simply necessary to ensure that those eight 512B sectors are contiguous and aligned to an actual 4KiB sector on the disc. The "natural" I/O boundaries used by the operating system must align with the internal, hidden, 4KiB boundaries of the physical disc. The eight 512B sectors in the I/O command must not span two or more 4KiB physical sectors; but must be exactly one 4KiB sector, and in the right order within that sector.
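To make the 512e behaviour concrete, here is a rough Python sketch that counts how many 4KiB physical sectors a write touches and whether the drive has to do a read-modify-write. It models only the sector arithmetic described above, not any real drive firmware; the names are hypothetical.

```python
LOGICAL_BYTES = 512
PHYSICAL_BYTES = 4096
RATIO = PHYSICAL_BYTES // LOGICAL_BYTES  # 8 logical sectors per physical sector


def physical_sectors_touched(start_lba, count):
    """Number of 4KiB physical sectors covered by `count` 512B sectors."""
    first = start_lba // RATIO
    last = (start_lba + count - 1) // RATIO
    return last - first + 1


def needs_read_modify_write(start_lba, count):
    """True if the write only partially covers some physical sector."""
    return start_lba % RATIO != 0 or (start_lba + count) % RATIO != 0


# An 8-sector (4KiB) write starting at an aligned LBA maps to one physical sector.
print(physical_sectors_touched(64, 8), needs_read_modify_write(64, 8))  # 1 False
# The same write shifted to LBA 63 straddles two physical sectors.
print(physical_sectors_touched(63, 8), needs_read_modify_write(63, 8))  # 2 True
```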
The way that this 4KiB alignment is achieved is threefold:
1) Partitions are aligned to 4KiB boundaries relative to the start of the entire disc. The start, and end, of every partition is an integral number of 4KiB sectors from the start of the entire disc.
2) On-disc data structures within a volume are aligned to 4KiB boundaries relative to the start of their containing partitions. If a disc volume format employs concepts such as "zones", "cylinder groups", and so forth, as volume formats with BSD Unix influences such as UFS and EXT2/3/4 do, they must be integer multiples of 4KiB in size. Neither FAT nor NTFS has such concepts, but FAT volumes similarly have to ensure that the total size of the FATs and reserved sectors at the beginning of a volume is an integer multiple of 4KiB, so that the data clusters following them are aligned to 4KiB multiples.
3) The volume's space allocation unit ("cluster") size is an integer multiple of 4KiB. You'll find that Windows NT's tools nowadays discourage cluster sizes for FAT and NTFS volumes that are less than 4KiB. (Since cluster sizes are powers of two, larger cluster sizes are always going to be multiples of 4KiB.) UFS has had a basic allocation unit of 4KiB for several decades.
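The three rules above boil down to a handful of modulo checks. Here is a minimal Python sketch; the function and parameter names are invented for illustration, and the values would have to come from your own partition table and filesystem, not from anything in this thread.

```python
SECTOR_BYTES = 512
FOUR_KIB = 4096


def alignment_report(partition_start_lba, partition_sectors,
                     data_offset_bytes, cluster_size_bytes):
    """Check the three 4KiB alignment rules for one partition.

    data_offset_bytes is the offset of the first data cluster (or on-disc
    structure of interest) from the start of the partition.
    """
    start_bytes = partition_start_lba * SECTOR_BYTES
    end_bytes = (partition_start_lba + partition_sectors) * SECTOR_BYTES
    return {
        # Rule 1: partition start and end sit on 4KiB boundaries.
        "partition_start_aligned": start_bytes % FOUR_KIB == 0,
        "partition_end_aligned": end_bytes % FOUR_KIB == 0,
        # Rule 2: on-disc structures are 4KiB-aligned within the partition.
        "data_area_aligned": data_offset_bytes % FOUR_KIB == 0,
        # Rule 3: allocation unit is a multiple of 4KiB.
        "cluster_size_ok": cluster_size_bytes % FOUR_KIB == 0,
    }


# A partition starting at sector 63 fails rule 1 even if everything inside it is fine.
print(alignment_report(63, 8192, 65536, 4096))
```

Note that if rule 1 fails, rules 2 and 3 only hold relative to the partition, so every block inside ends up off by the same amount on the physical device.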
TLDR version: no such alignment problem exists.
-
My math is pretty simple. With your 0.04MB/sec "math" - or, as you said, "BEST, full consecutive filesystem block write scenario", I could write max. 2.4MB / minute. Shockingly, those 150+ megs worth of packages take some 15 minutes to reinstall on the shitty Alix box, out of which a large part is spent downloading the stuff and on configuration.
Your "math" also totally fails to explain why "/etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw" still takes damn ages without any config changes done and with nothing to write anywhere.
Sigh. How about finding the real bug?
Some of that would almost make sense if you had the exact same CF card I do, but since you don't, and since I don't run my pfSense install on an ALIX board, nothing about your setup is comparable. You haven't once posted the write speeds you see on your setup, and nobody smart will pay you any attention until you do.
-
@cmb:
TLDR version: no such alignment problem exists.
For those two decades, all disks used 512 byte blocks, and thus, it was impossible to misalign them, as any number of sectors * 512 will indeed divide evenly back into 512. Even mentioning 512 byte block magnetic drives in this discussion makes no sense whatsoever, as it has nothing to do with it, and shows how little understanding one has of this whole topic.
Disk sectors actually start at 0, again, hardware common knowledge. Sector 0 is reserved for specific data, which is why you can't start a partition there.
No OSes required track alignment because drives generally hid this physical information, as they still do, and as flash drives emulate it, without any regard for the actual physical layer.
The middle lost me, it again has nothing to do with this conversation. The thread is not titled "history and behavior of old ass hardware & software" for a reason.
If you read what I wrote, we aren't trying to align to disk geometry, we are aligning to 4K or 4MB from sector 1, as 0 is not counted in this alignment, being reserved. Again, why are we posting the history of disk geometry?
Old Linux fdisk does not complain about track boundaries if you run it with the correct parameters, and newer (like, last 5+ years) Linux fdisk doesn't complain at all. And it does indeed default to 1MB, or more commonly known as sector 2048. Common knowledge.
Honestly, who on the pfSense dev team doesn't know about 512 byte, 512e, and 4K magnetic discs? And why are we still posting about magnetic disks? Oighhh.
Ok, sense is finally being made, in the last paragraph. But we are still on magnetic disks, and haven't moved on to flash disks, which is the main point of this thread.
TLDR version: such alignment problem exists, as partitions are not aligned to 4K, thus, the filesystem blocks of 4K each from the start of the partition, are all equally misaligned. Does anybody read the thread before they post?
Using the most compatible fake disk geometry, the first partition should start at a minimum of sector 64. NanoBSD's start at fake sector 63. Sector 0 is not counted here, as it is reserved anyways, flash makers know this, and account accordingly. Or, if you want to trust Jim without fact checking, there is no sector 0, either way, I don't care, the math is the same:
63 * 512 = 32256 bytes; 32256 / 4096 = 7.875. Not evenly divisible by 4K.
64 * 512 = 32768 bytes; 32768 / 4096 = 8. Evenly divisible by 4K.
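The same check in Python, plus a helper that rounds a start sector up to the next aligned one. A minimal sketch assuming 512 byte reported sectors, as in this thread; the helper names are made up.

```python
SECTOR_BYTES = 512


def is_aligned(start_sector, boundary_bytes=4096):
    """True if the partition's byte offset is a multiple of boundary_bytes."""
    return (start_sector * SECTOR_BYTES) % boundary_bytes == 0


def next_aligned_sector(start_sector, boundary_bytes=4096):
    """Round start_sector up to the next sector on a boundary_bytes boundary."""
    sectors_per_boundary = boundary_bytes // SECTOR_BYTES
    # Ceiling division without floats.
    return -(-start_sector // sectors_per_boundary) * sectors_per_boundary


print(is_aligned(63))                          # False: 32256 / 4096 = 7.875
print(is_aligned(64))                          # True:  32768 / 4096 = 8
print(next_aligned_sector(63))                 # 64   (4K alignment)
print(next_aligned_sector(63, 1024 * 1024))    # 2048 (1MB alignment)
```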
Remember, Jim says this is extremely important for alignment. It is step 1 after all.
Thank you and good night.
If you need references, I will be happy to post a set of links for fact checking every single word I've typed.
-
Seriously, this is ALL info I've covered in the OP of this thread. I can only say the same exact thing, so many different ways. Please read it. If you don't understand, fine, ask questions, or fact check.
If you don't want this fixed, like, ever. Or you just love the idea of manually rebuilding full NanoBSD images yourself to fix the alignment issue, just make posts about how wrong I am, without any legitimate reason as to why, and totally fuck this thread to the point that reasonable people don't even want to read through it.
-
Some of what I snipped out in the interest of (some) brevity probably left that making less sense, I added part of it back in there in the previous post.
-
What your quote ends with, is exactly what I've been saying, and what I posted in response to it still applies exactly the same.
-
Ok, my bad, one added thing doesn't check out, which is the 2K flash erase block size said to be "common". So common in fact I've never ever seen it that small, unless we are getting into SSDs.
Cheap flash, that we should be focused on, has an erase block size of 4K or larger. Some USB flash drives have been reported to use minimum erase blocks as large as 1MB, also all noted in my first post. Yeah, generally people just throw these away because the performance is so horrible, but it's good to let people know they might be dealing with such a device.
All of this can be found using a simple dd raw write test script, I can post my version of it if anyone is interested. SD cards on a native (non USB) interface report this value to Linux where it can easily be read, not sure about FreeBSD. A native SD interface would be one such as found on most Android devices, which handily is already running Linux.
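For anyone who wants to try this kind of test without dd, here is a rough Python stand-in. To be clear, this is not the script mentioned above, just a sketch of the same idea: write zeroes in chunks of descending size, force each chunk to the device, and compare the timings. The function name and parameters are hypothetical. It writes to whatever path you give it, so point it at a scratch file on the flash device, not at a raw device you care about.

```python
import os
import time


def time_writes(path, block_sizes, total_bytes):
    """Return {block_size: seconds} for writing total_bytes of zeroes per size."""
    results = {}
    for block_size in block_sizes:
        buf = bytes(block_size)  # a zero-filled buffer, like reading /dev/zero
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            start = time.perf_counter()
            written = 0
            while written < total_bytes:
                written += os.write(fd, buf)
                os.fsync(fd)  # push each chunk out instead of letting it sit in cache
            results[block_size] = time.perf_counter() - start
        finally:
            os.close(fd)
    return results


if __name__ == "__main__":
    # Descending sizes from 8MB down to 4K; the file name is a placeholder.
    sizes = [8 * 1024 * 1024 >> shift for shift in range(0, 12)]
    for size, seconds in time_writes("flash_test.bin", sizes, 8 * 1024 * 1024).items():
        print(f"{size:>8} bytes/write: {seconds:.3f}s")
```

Going through a filesystem adds its own overhead, so treat the numbers as relative, not absolute. The thing to look for is the block size where the time suddenly jumps.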
-
Honestly, who on the pfSense dev team doesn't know about 512 byte, 512e, and 4K magnetic discs?
No one, but if you get Jim on his soap box… :)
If you need references, I will be happy to post a set of links for fact checking every single word I've typed.
Please do. Better yet, post real world results of "wrong" vs. "right" as relevant to our embedded images.
And don't get all worked up with me because doktornotor is a dick.
-
@cmb:
Please do. Better yet, post real world results of "wrong" vs. "right" as relevant to our embedded images.
The results I have posted are real world, my references will be as well. There are some on the forum I've linked to already, see my second post, and comments under it. Anything that is misaligned, more specifically to sector 63, I would think is relevant, would you agree?
Someone would have to manually rebuild one of the NanoBSD images to test proper alignment on an actual NanoBSD install. I don't have a good FreeBSD VM to work with right now, or time unfortunately. Someone else could do this much easier / faster.
The fact that OpenBSD has made this change already, noting performance, is pretty solid as far as I'm concerned.
@cmb:
And don't get all worked up with me because doktornotor is a dick.
Fair enough, lol.
References in next post…
-
References
Relevant Alignment Info for FAT32 SD. All applies except for the bits about FAT, obviously. Pay attention to the MBR / partition layout using sectors to calculate alignment.
http://3gfp.com/wp/2014/07/formatting-sd-cards-for-speed-and-lifetime/
Block Device Attributes I spoke about.
https://www.kernel.org/doc/Documentation/mmc/mmc-dev-attrs.txt
Measuring Flash Block Size, wrote my test script based on this. I just dd write from /dev/zero to smaller block sizes in descending order starting from 8MB, it works great. The first size that bombs out as being way slower / longer, is one below your smallest erase block. This is for ANY flash based storage device.
http://kim.oyhus.no/FlashBlockSize.html
Op, I was off, here we have flash devices using erase block sizes of 8MB, I said much smaller, so the issue can be worse than I laid out.
https://www.raspberrypi.org/forums/viewtopic.php?t=11258&p=123670
Speed of USB Flash Devices. So wish they recorded more filesystem details, but still a good reference. You can easily see how the exact same device reads / writes far slower in different circumstances, even with the same format. And which devices to look for when buying.
http://usbflashspeed.com/
Edit: FreeBSD Specific References Added
Good Alignment Info for UFS / ZFS on FreeBSD
http://ivoras.net/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html
Discussion about forcing alignment with fdisk, and why it's not updated I suppose (gpart).
https://forums.freebsd.org/threads/gpart-trying-to-force-mbr-partitions-to-be-cylinder-aligned.36439/
Awesome example on FreeBSD using an actual drive that reports 4K sectors. This specific example I had not seen before. Remember, our flash drives still report 512 byte sectors, the NanoBSD images will end up aligned to KB or MB boundaries, not specific sectors. And the existing UFS filesystem is already using a 4K fragment size, so the partition is all that's left to fix.
https://forums.freebsd.org/threads/ufs-sector-and-alignment-explanation.42208/
-
Dumb dick question: When you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw it still takes damn ages without any config changes done and with nothing to write anywhere. How does that fit your write amplification theories? For some mysterious reason still unanswered. ::)
Regards,
Mr. Dumb Dick
-
Dumb dick question: When you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw it still takes damn ages without any config changes done and with nothing to write anywhere. How does that fit your write amplification theories? For some mysterious reason still unanswered. ::)
Regards,
Mr. Dumb Dick
A. Because you keep referring to documented referenced facts as theories, like an ignorant dick.
B. Because no one likes teaching ignorant dicks how to use Google.