NanoBSD Image Block Alignment [MisAlignment]
-
This was verified very recently by the pfSense devs themselves in recent commits to 2.2.3 and an issue with corrupt user account files, see here for details: https://redmine.pfsense.org/issues/4523
And then you get this with the miraculous sync solution: https://redmine.pfsense.org/issues/4803 - go figure.
Not to mention, nothing like corrupt user account files was observed before. Really. No massive FS corruption reports anywhere. Until 2.2.x recently.
The short version: async breaks stuff, sync doesn't. So let's fix the sync mode issues with proper alignment, and get the best of both worlds.
Can assure you that no alignment will fix my CFs that take over a minute to /etc/rc.conf_mount_ro; nor will alignment fix the issue for the guy linked with config.xml corruption above where it takes 3+ minutes for him. There's a serious kernel bug somewhere:
$ ls -la /cf/conf/config.xml
-rw-r--r-- 1 root wheel 205753 Jul 1 09:30 /cf/conf/config.xml
It does not take minutes to write 0.2 MB to CF on a sane (file)system, with even the worst misalignment you could ever imagine. Even at the 0.04MB/sec figure you pulled out of somewhere it would only take 5 secs. Not over a minute. Not 3 minutes.
-
I will add a +1 for getting back more sane times for the RW->RO transition on nanoBSD.
On 2.2.3 I now go to Diags->nanoBSD and set the thing to RW, then do a bunch of changes, then set it back to RO. That way I get just a single transition back to RO. Of course I often forget, press Save on something, then wait… and realize "oh, I should have switched to RW! I need to make more than 1 change."
And if I forget to switch back to RO at the end then the system is left with pending async writes for goodness knows how long.
-
nano never had any of the corruption issues because it's always run with sync.
Fixing the ro->rw mount slowness is definitely a priority for 2.2.4. It wasn't as pronounced on the hardware we have as it is for a number of others. We'll review the alignment as well. There's a bug ticket open. https://redmine.pfsense.org/issues/4814
-
And then you get this with the miraculous sync solution: https://redmine.pfsense.org/issues/4803 - go figure.
Yup, that's completely expected, since the flash storage device is busy writing data the entire time we are waiting for the remount read only to complete.
Not to mention, nothing like corrupt user account files was observed before. Really. No massive FS corruption reports anywhere. Until 2.2.x recently.
Yup, kernel change (new disk modules) + old pulled kernel patch = current behavior.
Can assure you that no alignment will fix my CFs that take over a minute to /etc/rc.conf_mount_ro; nor will alignment fix the issue for the guy linked with config.xml corruption above where it takes 3+ minutes for him. There's a serious kernel bug somewhere:
And you know… because you've aligned your pfSense partitions?
$ ls -la /cf/conf/config.xml
-rw-r--r-- 1 root wheel 205753 Jul 1 09:30 /cf/conf/config.xml
It does not take minutes to write 0.2 MB to CF on a sane (file)system, with even the worst misalignment you could ever imagine. Even at the 0.04MB/sec figure you pulled out of somewhere it would only take 5 secs. Not over a minute. Not 3 minutes.
Actually, it absolutely can. 0.04MB/sec is a BEST, full consecutive filesystem block write scenario. And you're completely forgetting the remount read only that comes after. Let's take this value, for example's sake. On a CF device that gets 4.4MB/sec writes all day long, we are looking at a decrease in speed of 110x. Let's take this, apply it to the guy whose config writes are taking 3+ minutes, and we get a time of 1.6 seconds. Still think it won't help him? Ok, how about you: your remount time turns into 0.5 seconds. Isn't math fun?
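For anyone who wants to check the arithmetic in this post, here is a minimal Python sketch. It assumes only the figures quoted in the thread (4.4MB/sec and 0.04MB/sec write speeds, a 3 minute config write and a 1 minute remount); the function names are made up for illustration, and it says nothing about whether alignment alone would actually recover the full factor on a given card.

```python
# Figures quoted in the thread; everything else here is illustrative.
ALIGNED_MB_PER_SEC = 4.4      # write speed quoted for the CF card
MISALIGNED_MB_PER_SEC = 0.04  # "best case" misaligned figure from the thread


def slowdown_factor(fast: float, slow: float) -> float:
    """How many times slower the slow rate is than the fast rate."""
    return fast / slow


def projected_seconds(observed_seconds: float, factor: float) -> float:
    """Time the same work would take if the whole slowdown were removed."""
    return observed_seconds / factor


factor = slowdown_factor(ALIGNED_MB_PER_SEC, MISALIGNED_MB_PER_SEC)
print(f"slowdown: {factor:.0f}x")
print(f"3 minute config write -> {projected_seconds(180, factor):.1f} s")
print(f"1 minute remount      -> {projected_seconds(60, factor):.1f} s")
```

This only reproduces the 110x, 1.6 second and 0.5 second numbers; it assumes the entire delay scales with write speed, which is exactly the point being argued.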
Now, let's consider that this best case scenario write speed, is not even close to the use case. The use case is typically far slower, taking even longer. When writing config.xml changes (actually not all that bad, if kept mounted read write), remounting the filesystem read only, or really anything but a dd test write, things break down considerably…
Let's assume the most common erase block size of 4K, and the most common controller cache size of 4K, to read-erase-write-verifyread each and every 4K flash block, 1 block at a time. This needs to be done where even 1 single byte of data changes in that 4K flash block. NanoBSD uses a FIFO disk buffer, so writes to the same blocks are not combined in any way.
So, best case scenario, we want to rewrite the entire config.xml. But we don't, do we? We want to change a few bytes. So the small few byte write gets passed to the flash device, which has to (after all the OS level stuff) read-erase-write-verifyread an ENTIRE 4K block, best case. Worst case, the few bytes we changed, are not written consecutively, broken up into XML sections, some even crossing an erase block boundary, because the filesystem is not aligned with the storage device, and now, our small little few byte writes, turn into multiple read-erase-write-verifyread's for no less than 4K-8K of data, each, that nothing but one of the cheapest flash controllers there is has to deal with, with only 4K of memory to work with. Welcome to write amplification. Here, we deal with it in many orders of magnitude greater than our best case scenario of a 110x slowdown.
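The boundary-crossing effect described above can be sketched in a few lines of Python. This is a simplified model, not a description of any real controller: it assumes a 4K erase block (the figure used in this post), assumes erase blocks are laid out from byte 0 of the device, and ignores wear leveling and caching. The function names are hypothetical.

```python
ERASE_BLOCK = 4096  # assumed erase block size in bytes (the post's 4K figure)
SECTOR = 512        # logical sector size in bytes


def erase_blocks_touched(device_offset: int, length: int,
                         block: int = ERASE_BLOCK) -> int:
    """Count the erase blocks that a write of `length` bytes overlaps."""
    if length <= 0:
        return 0
    first = device_offset // block
    last = (device_offset + length - 1) // block
    return last - first + 1


def write_amplification(device_offset: int, length: int,
                        block: int = ERASE_BLOCK) -> float:
    """Bytes the flash has to rewrite divided by bytes actually changed."""
    return erase_blocks_touched(device_offset, length, block) * block / length


# The first 4K filesystem block of a partition starting at sector 63 vs 64.
for start_sector in (63, 64):
    offset = start_sector * SECTOR
    print(start_sector, erase_blocks_touched(offset, 4096))

# A 16 byte change inside one erase block still rewrites the whole block.
print(write_amplification(64 * SECTOR + 100, 16))
```

Under these assumptions a 4K filesystem block that starts at sector 63 straddles two erase blocks, while the same block starting at sector 64 fits in one, and a 16 byte change costs a full 4096 byte rewrite (a factor of 256).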
Now, toss in a remount read only on top of all this. Sweet. Remounting read only not only has to wait for all of that awesomeness we talked about in the last paragraph to happen, but it comes with its own slew of writes to things like metadata, inodes, and other random "few bytes here, one byte there" type of writes, that are again, FIFO'd out, so that every single one of them, causes a 4K read-erase-write-verifyread, or 8K if also falling on a flash boundary, and we have our delay. A delay so bad mind you, that they patched the kernel to get rid of it up to this point, with a patch that deliberately discards data.
The old kernel patch that was removed, dumped all that data that a remount read only writes, so the "issue" was gone as far as most were concerned. This is also why this patch causes filesystem instability, and why they will not reapply it for any reason. NanoBSD pfSense was able to get away with doing this without corruption, because the discarded data that would break a "normal" FreeBSD system, has to do with the data that is stored on RAM disks for NanoBSD, RAM disks that are never remounted. Specifically not the data stored on the root slice, especially because NanoBSD doesn't expect to be able to write to the root slice 99% of the time, and doesn't.
And finally, even if this doesn't "solve" the remount read only issue, or filesystem corruption, or even make your teeth whiter, it's just the right way to do things. Flash writes WILL be faster, flash device wear WILL go down, we can enjoy all of the things it DOES fix, and all the future bugs it will prevent.
-
Can assure you that no alignment will fix my CFs that take over a minute to /etc/rc.conf_mount_ro; nor will alignment fix the issue for the guy linked with config.xml corruption above where it takes 3+ minutes for him. There's a serious kernel bug somewhere:
And you know… because you've aligned your pfSense partitions?
No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw then it still takes damn ages without any config changes done and with nothing to write anywhere.
Sigh. Reminds me of the debates with the 4K drive guys who were screaming in a similar manner. No, writing to misaligned partitions is NOT 110x slower.
-
Reading back through that, it occurs to me that using an I/O scheduler that behaves like Linux's Deadline would help a ton, once things are aligned that is. Something FIFO based but that will combine writes to the same filesystem blocks.
This would save a ton of code re-writing in order to write things like small config changes to flash disks optimally, which we wouldn't want to do anyways, because it would create the very issues we are trying to solve here, on non flash storage based installs.
-
No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw then it still takes damn ages without any config changes done and with nothing to write anywhere.
Right, because package installs are the best case 0.04MB/sec we talked about. Remount read only does indeed write data to disk. Go read the kernel patch that was pulled. And try to keep up.
Sigh. Reminds me of the debates with the 4K drive guys who were screaming in a similar manner. No, writing to misaligned partitions is NOT 110x slower.
Sigh indeed. Turns out, those screaming 4K drive guys were right. Do your homework. And flash devices, because they cannot simply overwrite a sector, and don't have 128MB of controller cache to work with, are many times more affected by this.
I know these things because I've not only done the research, but I've done the real world implementation testing to go with it.
Here, I'll throw you a bone. Pop quiz:
Q: Why does OpenBSD align to sector 64, Windows newer than XP, along with mdadm, LVM2, Linux newer than forever ago, ALL align to 1MB now?
A: Because alignment.
-
No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw then it still takes damn ages without any config changes done and with nothing to write anywhere.
Right, because package installs are the best case 0.04MB/sec we talked about.
And somehow that does not happen. Also, the subsequent RO remount does not take weeks. Miracle, I guess. You have a gaping hole somewhere in your write amplification theory.
Look, I'm not against proper alignment. I'm merely tired of this "oh noes, we are misaligned, the sky will fall and universe will collapse into a blackhole" bullshit. Seriously tired. Bye.
-
@cmb:
nano never had any of the corruption issues because it's always run with sync.
Fixing the ro->rw mount slowness is definitely a priority for 2.2.4. It wasn't as pronounced on the hardware we have as it is for a number of others. We'll review the alignment as well. There's a bug ticket open. https://redmine.pfsense.org/issues/4814
It would appear, that people who understand the issue, simply don't agree with you.
No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw then it still takes damn ages without any config changes done and with nothing to write anywhere.
Right, because package installs are the best case 0.04MB/sec we talked about.
And somehow that does not happen. Also, the subsequent RO remount does not take weeks. Miracle, I guess. You have a gaping hole somewhere in your write amplification theory.
Look, I'm not against proper alignment. I'm merely tired of this "oh noes, we are misaligned, the sky will fall and universe will collapse into a blackhole" bullshit. Seriously tired. Bye.
Then you should learn how to do math, so people listen to you.
0.04MB/sec does not take weeks for a, let's say, 5MB package. It would, however, using this thing called math, take roughly 2 minutes. Let's double that for the sake of argument for the download + the install = 4 minutes.
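A quick way to sanity-check these time estimates is to divide size by rate. The Python sketch below does that for the 5MB package example and for the 205753 byte config.xml shown earlier in the thread. It assumes 1MB = 1,000,000 bytes and a constant 0.04MB/sec, both simplifications; the helper name is made up.

```python
RATE_MB_PER_SEC = 0.04   # write rate being argued about in the thread
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def write_seconds(size_mb: float, rate_mb_per_sec: float) -> float:
    """Seconds needed to write size_mb at a constant rate."""
    return size_mb / rate_mb_per_sec


package = write_seconds(5, RATE_MB_PER_SEC)
config = write_seconds(205753 / BYTES_PER_MB, RATE_MB_PER_SEC)

print(f"5 MB package: {package:.0f} s (about {package / 60:.1f} min)")
print(f"config.xml:   {config:.1f} s")
```

At that rate the package comes out at roughly 125 seconds and the config file at roughly 5 seconds, which matches the figures both sides have quoted.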
Where in the because dumb do you get weeks, especially for a simple remount read only?
We simply get a magnitude of time, between pfSense 2.1.x and 2.2.x, in the order of roughly 110x on decent CF cards like mine, same or more for most, less for some lucky / smart flash buyers, that it takes to do a simple remount read only. We have always seen this slowdown with non remounting flash based writes! Always. We just didn't see the remount read only delay, because it was kernel patched.
No wonder you're tired, get some sleep kiddo.
-
No shit. So writing 200KB takes the same as writing 5MB now? Yeah, get some sleep with your theories and fix your math. Bye.
-
No shit. So writing 200KB takes the same as writing 5MB now? Yeah, get some sleep with your theories and fix your math. Bye.
Not a single word of that makes any sense at all.
Either show your math, cite your sources, or stop trolling.
-
My math is pretty simple. With your 0.04MB/sec "math" - or, as you said, "BEST, full consecutive filesystem block write scenario", I could write max. 2.4MB / minute. Shockingly, those 150+ megs worth of packages take some 15 minutes to reinstall on the shitty Alix box, out of which a large part is spent with downloading the stuff and configuration.
Your "math" also totally fails to explain why "/etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw then it still takes damn ages without any config changes done and with nothing to write anywhere."
Sigh. How about finding the real bug?
-
Quoting Jim Thompson with his permission from internal discussion:
All forms of flash suffer from write amplification. Since there are different forms of flash, some have a larger issue with write amplification than others.
This is also true for different workloads. So that much is true, thus, there is “some” validity to the claims made in that post.
What is not valid is his recommended path toward a fix.
Quoting the relevant part of the OpenBSD page he cites:
(fdisk platforms) Leave first track free: On platforms using fdisk(8), you should leave the first logical track unused, both in disklabel(8) and in fdisk(8). On "modern" computers (i.e., almost everything that will run OpenBSD), the exact amount doesn't really matter, though for performance reasons on the newest disks, having partitions aligned at 4k boundaries is good for performance. For this reason, OpenBSD now defaults to starting the first partition at block 64 instead of 63.
So first let’s define some terms.
Ordinary spinning rust hard disks are made of platters that have tracks. Tracks are the thin concentric circular strips of sectors.
Cylinders are a collection of tracks, stacked vertically. In CHS addressing the sector numbers always start at 1, there is no sector 0.
The Unix communities employ the term block to refer to a sector or group of sectors.
The CHS addressing supported in IBM-PC compatible BIOSes code used eight bits for, theoretically, up to 256 heads counted as head 0 up to 255. However, a bug in all versions of DOS up to and including 7.10 will cause these operating systems to crash on boot when encountering volumes with 256 heads. Therefore, all compatible BIOSes will use mappings with up to 255 heads only, including in virtual 255×63 geometries.
So, CHS addressing starts at 0/0/1 with a maximal value 1023/255/63
The fdisk utility normally displays partition table information using 1024-byte blocks, but also uses the word sector to help describe a disk's size in the phrase, 63 sectors per track.
In other words, the LBA sector number 63 corresponds to cylinder 0, head 1, sector 1 in the CHS format, which is the first sector you can use in the MBR format. The source of the confusion is that 63 is not (evenly) divisible by 8.
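The LBA to CHS mapping mentioned here follows the standard formula LBA = (C × heads + H) × sectors-per-track + (S − 1). A minimal Python sketch, assuming the virtual 255×63 geometry discussed above (the function names are just for illustration):

```python
HEADS = 255             # heads per cylinder in the virtual geometry
SECTORS_PER_TRACK = 63  # sectors per track in the virtual geometry


def lba_to_chs(lba: int, heads: int = HEADS,
               spt: int = SECTORS_PER_TRACK) -> tuple[int, int, int]:
    """Convert a zero-based LBA to (cylinder, head, sector); sectors start at 1."""
    cylinder, remainder = divmod(lba, heads * spt)
    head, sector_index = divmod(remainder, spt)
    return cylinder, head, sector_index + 1


def chs_to_lba(cylinder: int, head: int, sector: int,
               heads: int = HEADS, spt: int = SECTORS_PER_TRACK) -> int:
    """Inverse mapping: (cylinder, head, sector) back to a zero-based LBA."""
    return (cylinder * heads + head) * spt + (sector - 1)


print(lba_to_chs(0))             # (0, 0, 1)
print(lba_to_chs(63))            # (0, 1, 1)
print(chs_to_lba(0, 1, 1) % 8)   # 7, so LBA 63 is not a multiple of 8 sectors
```

Because 63 leaves a remainder of 7 when divided by 8, a partition starting at CHS 0/1/1 does not begin on a 4K boundary when sectors are 512 bytes.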
Now is when we mention that flash-based devices don’t have platters, or heads, or tracks. They are, quite simply, a group of flash sectors, the whole C/H/S thing flies out the window, and we can address any sector as easily as another. Moreover, many flash sectors these days are 2K in size.
Now, remember that CompactFlash doesn’t have very good algorithms for dealing with modern systems. This is one of the reasons we have moved away from it. A CF device contains an ATA controller and appears to the host device as if it were a hard disk. It is literally plugged into an IDE channel. CF wear leveling algorithms are proprietary and undocumented, they are "secret sauce". Some will be better than others.
It is often believed that disc partitions have to be aligned to cylinder or track boundaries. This is not in fact true and never really has been. There are alignment considerations for disc partitions, but they have nothing to do with cylinders, and they aren't mandatory. Operating systems will still work with misaligned partitions, just more slowly for some (not all) disc unit models.
The idea that disc partitions have to be aligned to cylinder boundaries is nonsense on its face. Millions of people have had discs where the first primary partition began on track zero, sector one, head one with no ill effect whatsoever on operating systems from MS-DOS through Windows NT to OS/2. That was, after all, the default that fdisk/Disk Manager on those operating systems used for almost two decades. At best, the purported alignment requirement would have been a track alignment, with all partitions starting at sector one (Sectors are numbered from one, remember.) of any given track.
But this is not true, either. No version of any operating system has actually required this. Even MS-DOS was quite happy to have disc partitions starting at sectors other than 1. The only things that have required this have been disc partitioning utilities. There's been a bit of circular logic about this. The disc partitioning utilities enforced the requirement because their authors thought that it was a requirement, but people only thought that it was a requirement because fdisk and the like enforced it. It was what the partitioning utility programs enforced — so the logic went — so it must have been a restriction. In fact it never was, and no operating system itself has any trouble with this.
The idea of track alignment is daft anyway. It's pointless because it doesn't align things to any valid boundary on the disc unit itself. There's no performance or other benefit, because the physical layout of the partitions on the disc will not be aligned to the actual physical tracks on the disc by aligning them to the software-visible track size.
• The "tracks" that system softwares see at the ATA command register level aren't actually the real tracks on the disc itself, and haven't been since the advent of zoned bit recording (ZBR) in the early 1990s. Tracks are not, in fact, equally sized across the whole disc with ZBR; even though that's how discs are presented to software via the (old) cylinder+head+sector I/O command interface for ATA disc units.
• Unlike ATA, the SCSI command level has always operated in terms of logical block numbers, and not in terms of a cylinder+head+sector system in the first place. In the SCSI world, from the start the idea that system software necessarily even knew where the physical track boundaries were was incorrect. Indeed, PC firmwares for SCSI hard discs have to invent disc geometries, largely from thin air, for the benefits of old PC/AT and PC98 firmwares and operating systems that expect discs to be addressed, at the disc unit I/O command level, in terms of a three-dimensional CHS geometry. Alignment to a geometry that's just made up anyway by the machine firmware is just pointless.
With much fanfare, Microsoft finally eradicated enforcement of this entirely useless and pointless notion from the Windows NT Disk Manager, in 2008 (i.e. with the releases of Windows Vista Service Pack 1 and Windows Server 2008). Indeed, for years before that, since 2003, it had been recommending to Exchange Server and Microsoft SQL Server administrators that they use diskpart to align disc partitions to 4KiB multiples for performance reasons. (Some of the performance reasons given in early years were spurious, since they were based upon the erroneous premise that software-visible track boundaries were also physical track boundaries. But the end result, in light of later hardware developments, was right despite that.)
Neither FreeBSD nor Linux has caught up. The fdisk utility in FreeBSD and Linux still complains about partitions not aligned to track boundaries.
There is a disc partition alignment rule that does reflect the actual hardware. It is the rule that partitions be aligned to 4KiB boundaries. However, this rule only makes sense for some hard disc models. In some hard disc models, the internal sector size has been increased from 512B to 4KiB. At the I/O command level, as system softwares access the disc, the sector size is still 512B. Such discs are known as "512 byte emulation" discs. There are also "4KiB native" discs, where the sector size at the I/O command level is also 4KiB. But it was a while before any but a few operating systems could cope with sector sizes other than 512 bytes at the ATA/SCSI I/O command level, so we got 512e disks for a while.
What happens on such "512e" discs is that whenever the operating system or the firmware reads a 512B sector, the disc unit itself is actually reading a whole 4KiB and handing the firmware/operating system the appropriate one-eighth; and whenever the firmware/operating system writes a 512B sector, the disc unit is actually reading a whole 4KiB sector, modifying one eighth, and writing the whole 4KiB back again.
This may seem like a performance killer, as every I/O operation is, under the covers, eight times its apparent size. Fortunately, there's a way to hide the performance cost. This takes advantage of the fact that many operating systems like to do most of their I/O in 4KiB multiples anyway. All paging I/O on x86 operating systems is done in 4KiB multiples, for example, and many operating systems, including Windows, FreeBSD, Linux and Solaris, use the paging mechanism for ordinary file I/O. So the operating system will usually be reading and writing (a multiple of) eight 0.5KiB sectors in a single I/O operation.
So it's simply necessary to ensure that those eight 512B sectors are contiguous and aligned to an actual 4KiB sector on the disc. The "natural" I/O boundaries used by the operating system must align with the internal, hidden, 4KiB boundaries of the physical disc. The eight 512B sectors in the I/O command must not span two or more 4KiB physical sectors; but must be exactly one 4KiB sector, and in the right order within that sector.
The way that this 4KiB alignment is achieved is threefold:
1) Partitions are aligned to 4KiB boundaries relative to the start of the entire disc. The start, and end, of every partition is an integral number of 4KiB sectors from the start of the entire disc.
2) On-disc data structures within a volume are aligned to 4KiB boundaries relative to the start of their containing partitions. If a disc volume format employs concepts such as "zones", "cylinder groups", and so forth, as volume formats with BSD Unix influences such as UFS and EXT2/3/4 do, they must be integer multiples of 4KiB in size. Neither FAT nor NTFS has such concepts, but FAT volumes similarly have to ensure that the total size of the FATs and reserved sectors at the beginning of a volume is an integer multiple of 4KiB, so that the data clusters following them are aligned to 4KiB multiples.
3) The volume's space allocation unit ("cluster") size is an integer multiple of 4KiB. You'll find that Windows NT's tools nowadays discourage cluster sizes for FAT and NTFS volumes that are less than 4KiB. (Since cluster sizes are powers of two, larger cluster sizes are always going to be multiples of 4KiB.) UFS has had a basic allocation unit of 4KiB for several decades.
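The three conditions above can be expressed as a small checker. The Python sketch below is only an illustration: the `Volume` fields and the `alignment_problems` helper are hypothetical names, the values would have to be read from the partition table and filesystem superblock by some other means, and it assumes 512 byte logical sectors over 4KiB physical sectors.

```python
from dataclasses import dataclass

LOGICAL_SECTOR = 512    # sector size seen at the I/O command level
PHYSICAL_SECTOR = 4096  # assumed internal sector size


@dataclass
class Volume:
    start_sector: int       # partition start, in logical sectors from disc start
    sector_count: int       # partition length, in logical sectors
    data_offset_bytes: int  # offset of the first data area inside the partition
    cluster_bytes: int      # filesystem allocation unit size


def alignment_problems(vol: Volume) -> list[str]:
    """Return the names of the 4KiB alignment conditions the volume violates."""
    problems = []
    if (vol.start_sector * LOGICAL_SECTOR) % PHYSICAL_SECTOR:
        problems.append("partition start")
    if ((vol.start_sector + vol.sector_count) * LOGICAL_SECTOR) % PHYSICAL_SECTOR:
        problems.append("partition end")
    if vol.data_offset_bytes % PHYSICAL_SECTOR:
        problems.append("on-disc structures")
    if vol.cluster_bytes % PHYSICAL_SECTOR:
        problems.append("allocation unit")
    return problems


print(alignment_problems(Volume(63, 1000, 8192, 4096)))
print(alignment_problems(Volume(64, 2048, 8192, 4096)))
```

With these made-up example volumes, the one starting at sector 63 fails the partition start and end checks, while the one starting at sector 64 passes all of them.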
TLDR version: no such alignment problem exists.
-
My math is pretty simple. With your 0.04MB/sec "math" - or, as you said, "BEST, full consecutive filesystem block write scenario", I could write max. 2.4MB / minute. Shockingly, those 150+ megs worth of packages take some 15 minutes to reinstall on the shitty Alix box, out of which a large part is spent with downloading the stuff and configuration.
Your "math" also totally fails to explain why "/etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw then it still takes damn ages without any config changes done and with nothing to write anywhere."
Sigh. How about finding the real bug?
Some of that would almost make sense if you had the exact same CF card I do, but since you don't, and since I don't run my pfSense install on an ALIX board, nothing about your setup is comparable. You haven't once posted the write speeds you see on your setup, and nobody smart will pay you any attention until you do.
-
@cmb:
TLDR version: no such alignment problem exists.
For those two decades, all disks used 512 byte blocks, and thus, it was impossible to misalign them, as any number of sectors * 512 will indeed divide evenly back into 512. Even mentioning 512 byte block magnetic drives in this discussion makes no sense whatsoever, as it has nothing to do with it, and shows how little understanding one has of this whole topic.
Disk sectors actually start at 0, again, hardware common knowledge. Sector 0 is reserved for specific data, which is why you can't start a partition there.
No OS's required track alignment because drives generally hid this physical information, as they still do, and as flash drives emulate it, without any regard for the actual physical layer.
The middle lost me, it again has nothing to do with this conversation. The thread is not titled "history and behavior of old ass hardware & software" for a reason.
If you read what I wrote, we aren't trying to align to disk geometry, we are aligning to 4K or 4MB from sector 1, as 0 is not counted in this alignment, being reserved. Again, why are we posting the history of disk geometry?
Old Linux fdisk does not complain about track boundaries if you run it with the correct parameters, and newer (like, last 5+ years) Linux fdisk doesn't complain at all. And it does indeed default to 1MB, or more commonly known as sector 2048. Common knowledge.
Honestly, who on the pfSense dev team doesn't know about 512 byte, 512e, and 4K magnetic discs. And why are we still posting about magnetic disks? Oighhh.
Ok, sense is finally being made, in the last paragraph. But we are still on magnetic disks, and haven't moved on to flash disks, which is the main point of this thread.
TLDR version: such alignment problem exists, as partitions are not aligned to 4K, thus, the filesystem blocks of 4K each from the start of the partition, are all equally misaligned. Does anybody read the thread before they post?
Using the most compatible fake disk geometry, the first partition should start at a minimum of sector 64. NanoBSD's start at fake sector 63. Sector 0 is not counted here, as it is reserved anyways, flash makers know this, and account accordingly. Or, if you want to trust Jim without fact checking, there is no sector 0, either way, I don't care, the math is the same:
63 * 512 = 32256, and 32256 / 4096 = 7.875. Or, not evenly divisible by 4K.
64 * 512 = 32768, and 32768 / 4096 = 8. Or, evenly divisible by 4K.
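The same check, plus rounding a start sector up to the next aligned one, can be written as a short Python sketch. It assumes 512 byte sectors counted from the start of the device; the function names are made up, and the 1MB case is included only because that boundary was mentioned earlier in the thread.

```python
SECTOR_BYTES = 512  # assumed logical sector size


def is_aligned(start_sector: int, boundary_bytes: int = 4096) -> bool:
    """True if the byte offset of start_sector is a multiple of boundary_bytes."""
    return (start_sector * SECTOR_BYTES) % boundary_bytes == 0


def next_aligned_sector(start_sector: int, boundary_bytes: int = 4096) -> int:
    """Smallest sector >= start_sector whose byte offset is aligned."""
    sectors_per_boundary = boundary_bytes // SECTOR_BYTES
    # Ceiling division, then scale back up to a sector number.
    return -(-start_sector // sectors_per_boundary) * sectors_per_boundary


for sector in (63, 64):
    offset = sector * SECTOR_BYTES
    print(sector, offset, offset / 4096, is_aligned(sector))

print(next_aligned_sector(63))               # 64   (4K boundary)
print(next_aligned_sector(63, 1024 * 1024))  # 2048 (1MB boundary)
```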
Remember, Jim says this is extremely important for alignment. It is step 1 after all.
Thank you and good night.
If you need references, I will be happy to post a set of links for fact checking every single word I've typed.
-
Seriously, this is ALL info I've covered in the OP of this thread. I can only say the same exact thing, so many different ways. Please read it. If you don't understand, fine, ask questions, or fact check.
If you don't want this fixed, like, ever. Or you just love the idea of manually rebuilding full NanoBSD images yourself to fix the alignment issue, just make posts about how wrong I am, without any legitimate reason as to why, and totally fuck this thread to the point that reasonable people don't even want to read through it.
-
Some of what I snipped out in the interest of (some) brevity probably left that making less sense, I added part of it back in there in the previous post.
-
What your quote ends with, is exactly what I've been saying, and what I posted in response to it still applies exactly the same.
-
Ok, my bad, one added thing doesn't check out, which is the 2K flash erase block size said to be "common". So common in fact I've never ever seen it that small, unless we are getting into SSDs.
Cheap flash, that we should be focused on, has an erase block size of 4K or larger. Some USB flash drives have been reported to use minimum erase blocks as large as 1MB, also all noted in my first post. Yeah, generally people just throw these away because the performance is so horrible, but it's good to let people know they might be dealing with such a device.
All of this can be found using a simple dd raw write test script, I can post my version of it if anyone is interested. SD cards on a native (non USB) interface report this value to Linux where it can easily be read, not sure about FreeBSD. A native SD interface would be one such as found on most Android devices, which handily is already running Linux.
-
Honestly, who on the pfSense dev team doesn't know about 512 byte, 512e, and 4K magnetic discs.
No one, but if you get Jim on his soap box… :)
If you need references, I will be happy to post a set of links for fact checking every single word I've typed.
Please do. Better yet, post real world results of "wrong" vs. "right" as relevant to our embedded images.
And don't get all worked up with me because doktornotor is a dick.