NanoBSD Image Block Alignment [MisAlignment]


  • Banned

    Can we fix the general alignment of the partitions in the flashable NanoBSD images?

    Anyone running NanoBSD on any type of flash is suffering (some more than others) from write amplification. Packages have always taken many times longer to install than they should, and now that the remount kernel patch has been pulled in 2.2.3, even simple remount read only operations take nearly 30 full seconds, during which everything but routing becomes unresponsive, even SSH.

    A massive help would be to at least change the start of the MBR partition from block 63 to block 64, as OpenBSD has done…

    From: http://www.openbsd.org/faq/faq14.html#disklabel
    (fdisk platforms) Leave first track free: On platforms using fdisk(8), you should leave the first logical track unused, both in disklabel(8) and in fdisk(8). On "modern" computers (i.e., almost everything that will run OpenBSD), the exact amount doesn't really matter, though for performance reasons on the newest disks, having partitions aligned at 4k boundaries is good for performance. For this reason, OpenBSD now defaults to starting the first partition at block 64 instead of 63.

    And also change the size of the free gap in between NanoBSD slices, from 63 to 64 blocks, to keep the second slice aligned in the same manner.

    This would not only align properly to new 4K-sector hard disks, but to most SSDs and a MUCH larger portion of flash in general. My CF cards have a 4K minimum erase block size. This is not universal, but it is far more common than smaller erase block sizes, and the current 512-byte alignment, as a flash erase block size, is completely nonexistent.

    As of now (2.2.3-RELEASE), the start of the initial MBR partition (which absolutely everything else aligns from) is at block 63. Multiply that by the 512 bytes that most devices report regardless, and you get 32256 bytes, or 31.5K. That block-aligns to NOTHING but devices that actually use 512-byte sectors at the physical level, which, as we know, will eventually be nothing, as 4K spinning drives and SSDs completely eliminate physically 512-byte storage devices.
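    The arithmetic here is easy to sanity-check. A tiny sketch (Python, purely illustrative; the 512-byte logical sector and 4K erase block are the figures from this post):

```python
SECTOR = 512        # logical sector size most devices report
ERASE_BLOCK = 4096  # common minimum flash erase block size

def is_aligned(start_lba, erase_block=ERASE_BLOCK, sector=SECTOR):
    """True if a partition starting at this LBA sits on an erase-block boundary."""
    return (start_lba * sector) % erase_block == 0

print(is_aligned(63))  # False: 63 * 512 = 32256 bytes = 7.875 erase blocks
print(is_aligned(64))  # True:  64 * 512 = 32768 bytes = 8 erase blocks
```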

    Some flash devices have far more controller cache than others, and thus handle the resulting write amplification far better. But cheaper devices, like CF cards, SD cards, and USB flash drives (the large majority of what NanoBSD runs from), do not have much controller cache at all, so they suffer massive slowdowns and hangs, like having to wait 30 seconds for a remount read only to happen, or 45 minutes for packages to install.

    Write amplification brings not only the slowdowns, but wear magnification equal to the amount of write amplification caused. Obvious, I know, but not always until you really think about what is happening at once on every layer (physical -> logical -> partition -> filesystem). If one is misaligned, they are all misaligned.

    On my system for example, this is causing writes to happen at an average speed of around 0.14MB/sec as seen by iostat, which then translates to a real-world filesystem write speed of… wait for it... 0.04MB/sec, because when you write to a filesystem, much more than just the file data has to be written: file tables, metadata, and so on all need to be written, and those writes are ALL misaligned and write amplified. This is all on CF-based flash that is capable of writing at 4.4MB/sec if you simply erase-block-align the writes. This goes for all 3 slices, so it happens with every single config slice write as well.

    If you really want to all but guarantee flash erase block alignment, the best layout is to start the first MBR partition at block 8192, or 4MB from the start of the disk, and make the gap between slices 8192 blocks to match. This may be another discussion entirely for some, but I think it's better to give the best, most universal solution than to omit it entirely. Almost all cheap flash controllers are optimized in some way for the second 4MB of storage, since that is where FAT32, exFAT, and some other filesystems store file tables and other important file data (the most active portion of any storage device). There have also been a few cheap USB flash drives now reported as having 4MB minimum erase blocks. We have yet to see larger than this as far as I know. It is possible we will, but not likely.
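    One nice property of the 8192-block start: a 4MB offset is simultaneously aligned for every power-of-two erase-block size up to 4MB, so you don't need to know which size your particular flash uses. A quick illustrative check (Python; the size list is just a sample of common erase blocks):

```python
SECTOR = 512
START_LBA = 8192  # 4MB from the start of the disk, as suggested above

offset = START_LBA * SECTOR
erase_sizes = [4 << 10, 8 << 10, 128 << 10, 512 << 10, 4 << 20]  # 4K .. 4MB

# 4MB is a multiple of every power-of-two erase block up to 4MB,
# so this one start offset covers all of them.
print(all(offset % size == 0 for size in erase_sizes))  # True
```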

    So, let's get this fixed, shall we?

    Edit: References requested, and added: https://forum.pfsense.org/index.php?topic=95938.msg536261#msg536261


  • Banned

    More of the same difficulties reported long ago by multiple users on different hardware. They end up at the conclusion that using async "fixes" it. It doesn't.
    https://forum.pfsense.org/index.php?topic=70190.0

    What async actually does is allow the kernel to consider writes complete before they are actually complete. The filesystem becomes very unstable because RAM is added as volatile cache, cache that is easily lost on power failure / kernel panic. As stated above, adding cache does work around the problem to a degree, but it is far from a "correct" solution. This was verified very recently by the pfSense devs themselves in recent commits to 2.2.3 and an issue with corrupt user account files; see here for details: https://redmine.pfsense.org/issues/4523

    The short version: async breaks stuff, sync doesn't. So let's fix the sync mode issues with proper alignment, and get the best of both worlds.


  • Banned

    @ky41083:

    This was verified very recently by the pfSense devs themselves in recent commits to 2.2.3 and an issue with corrupt user account files, see here for details: https://redmine.pfsense.org/issues/4523

    And then you get this with the miraculous sync solution: https://redmine.pfsense.org/issues/4803 - go figure.

    Not to mention, nothing like corrupt user account files was observed before. Really. No massive FS corruption reports anywhere. Until 2.2.x recently.

    @ky41083:

    The short version: async breaks stuff, sync doesn't. So lets fix the sync mode issues with proper alignment, and get the best of both worlds.

    Can assure you that no alignment will fix my CFs that take over a minute to /etc/rc.conf_mount_ro; nor will alignment fix the issue for the guy linked above with config.xml corruption, where it takes 3+ minutes. There's a serious kernel bug somewhere:

    
    $ ls -la /cf/conf/config.xml
    -rw-r--r--  1 root  wheel  205753 Jul  1 09:30 /cf/conf/config.xml
    
    

    It does not take minutes to write 0.2 MB to CF on a sane (file)system, with even the worst misalignment you could ever imagine. Even at the 0.04MB/sec figure you pulled out of somewhere, it would only take 5 secs. Not over a minute. Not 3 minutes.



  • I will add a +1 for getting back more sane times for the RW->RO transition on nanoBSD.
    On 2.2.3 I now go to Diags->nanoBSD and set the thing to RW, then do a bunch of changes, then set it back to RO. That way I get just a single transition back to RO. Of course I often forget, press Save on something, wait… and realize "oh, I should have switched to RW! I need to make more than 1 change.".
    And if I forget to switch back to RO at the end then the system is left with pending async writes for goodness knows how long.



  • nano never had any of the corruption issues because it's always run with sync.

    Fixing the ro->rw mount slowness is definitely a priority for 2.2.4. It wasn't as pronounced on the hardware we have as it is for a number of others. We'll review the alignment as well. There's a bug ticket open. https://redmine.pfsense.org/issues/4814


  • Banned

    @doktornotor:

    And then you get this with the miraculous sync solution: https://redmine.pfsense.org/issues/4803 - go figure.

    Yup, that's completely expected, since the flash storage device is busy writing data the entire time we are waiting for the remount read only to complete.

    @doktornotor:

    Not to mention, nothing like corrupt user account files was observed before. Really. No massive FS corruption reports anywhere. Until 2.2.x recently.

    Yup, kernel change (new disk modules) + old pulled kernel patch = current behavior.

    @doktornotor:

    Can assure you that no alignment will fix my CFs that take over a minute to /etc/rc.conf_mount_ro; nor will alignment fix the issue for the guy linked above with config.xml corruption, where it takes 3+ minutes. There's a serious kernel bug somewhere:

    And you know… because you've aligned your pfSense partitions?

    @doktornotor:

    
    $ ls -la /cf/conf/config.xml
    -rw-r--r--  1 root  wheel  205753 Jul  1 09:30 /cf/conf/config.xml
    
    

    It does not take minutes to write 0.2 MB to CF on a sane (file)system, with even the worst misalignment you could ever imagine. Even at the 0.04MB/sec figure you pulled out of somewhere, it would only take 5 secs. Not over a minute. Not 3 minutes.

    Actually, it absolutely can. 0.04MB/sec is a BEST-case, fully consecutive filesystem block write scenario. And you're completely forgetting the remount read only that comes after. Let's take this value, for example's sake. On a CF device that gets 4.4MB/sec writes all day long, we are looking at a decrease in speed of 110x. Apply that to the guy whose config writes are taking 3+ minutes, and we get a time of 1.6 seconds. Still think it won't help him? OK, how about you: your remount time turns into 0.5 seconds. Isn't math fun?
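    Taking the figures in this post at face value (4.4MB/sec aligned vs 0.04MB/sec misaligned; these are my own measurements from the OP, not universal numbers), the arithmetic works out like this:

```python
aligned_speed = 4.4      # MB/s: raw aligned write speed of this CF card
misaligned_speed = 0.04  # MB/s: real-world misaligned filesystem write speed

slowdown = round(aligned_speed / misaligned_speed)
print(slowdown)                  # 110 (the "110x" figure)
print(round(180 / slowdown, 1))  # the 3+ minute config write becomes ~1.6 s
print(round(60 / slowdown, 2))   # a 1-minute remount becomes ~0.55 s
```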

    Now, let's consider that this best case scenario write speed, is not even close to the use case. The use case is typically far slower, taking even longer. When writing config.xml changes (actually not all that bad, if kept mounted read write), remounting the filesystem read only, or really anything but a dd test write, things break down considerably…

    Let's assume the most common erase block size of 4K, and the most common controller cache size of 4K: the controller has to read-erase-write-verifyread each and every 4K flash block, 1 block at a time. This has to happen wherever even 1 single byte of data changes in that 4K flash block. NanoBSD uses a FIFO disk buffer, so writes to the same blocks are not combined in any way.

    So, best case scenario, we want to rewrite the entire config.xml. But we don't, do we? We want to change a few bytes. So the small few-byte write gets passed to the flash device, which (after all the OS-level work) has to read-erase-write-verifyread an ENTIRE 4K block, best case. Worst case, the few bytes we changed are not written consecutively, but broken up into XML sections, some even crossing an erase block boundary because the filesystem is not aligned with the storage device. Now our small few-byte writes turn into multiple read-erase-write-verifyreads of no less than 4K-8K of data each, which one of the cheapest flash controllers there is has to deal with, with only 4K of memory to work with. Welcome to write amplification. Here, we deal with it in many orders of magnitude greater than our best case scenario of a 110x slowdown.
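    The boundary-crossing penalty described above can be sketched numerically. This is an illustration of the principle only (not NanoBSD or controller code): it counts how many 4K erase blocks a small write dirties at a given absolute byte offset.

```python
ERASE_BLOCK = 4096

def erase_blocks_touched(offset, length, erase_block=ERASE_BLOCK):
    """Number of erase blocks a write of `length` bytes at `offset` dirties."""
    first = offset // erase_block
    last = (offset + length - 1) // erase_block
    return last - first + 1

# A 16-byte edit that straddles an erase-block boundary
print(erase_blocks_touched(4090, 16))  # 2 blocks: 8K rewritten for 16 bytes
# The same edit when the bytes stay inside one aligned block
print(erase_blocks_touched(4096, 16))  # 1 block: 4K rewritten for 16 bytes
```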

    Now, toss in a remount read only on top of all this. Sweet. Remounting read only not only has to wait for all of that awesomeness we talked about in the last paragraph to happen, but it comes with its own slew of writes to things like metadata, inodes, and other random "few bytes here, one byte there" writes that are, again, FIFO'd out, so that every single one of them causes a 4K read-erase-write-verifyread, or 8K if also falling on a flash boundary, and we have our delay. A delay so bad, mind you, that they patched the kernel to get rid of it up to this point, with a patch that deliberately discards data.

    The old kernel patch that was removed dumped all the data that a remount read only writes, so the "issue" was gone as far as most were concerned. This is also why that patch causes filesystem instability, and why they will not reapply it for any reason. NanoBSD pfSense was able to get away with this without corruption because the discarded data that would break a "normal" FreeBSD system lives on the RAM disks NanoBSD uses, RAM disks that are never remounted. It is specifically not data stored on the root slice, especially because NanoBSD doesn't expect to be able to write to the root slice 99% of the time, and doesn't.

    And finally, even if this doesn't "solve" the remount read only issue, or filesystem corruption, or even make your teeth whiter, it's just the right way to do things. Flash writes WILL be faster, flash device wear WILL go down, we can enjoy all of the things it DOES fix, and all the future bugs it will prevent.


  • Banned

    @ky41083:

    @doktornotor:

    Can assure you that no alignment will fix my CFs that take over a minute to /etc/rc.conf_mount_ro; nor will alignment fix the issue for the guy linked above with config.xml corruption, where it takes 3+ minutes. There's a serious kernel bug somewhere:

    And you know… because you've aligned your pfSense partitions?

    No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_ro, it still takes damn ages without any config changes done and with nothing to write anywhere.

    Sigh. Reminds me of the debates with the 4K drive guys who were screaming in a similar manner. No, writing to misaligned partitions is NOT 110x slower.


  • Banned

    Reading back through that, it occurs to me that using an I/O scheduler that behaves like Linux's Deadline would help a ton, once things are aligned, that is. Something FIFO-based, but that will combine writes to the same filesystem blocks.

    This would save a ton of code rewriting otherwise needed to write things like small config changes to flash disks optimally, which we wouldn't want to do anyway, because it would create the very issues we are trying to solve here on non-flash-storage-based installs.


  • Banned

    @doktornotor:

    No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_ro, it still takes damn ages without any config changes done and with nothing to write anywhere.

    Right, because package installs are the best case 0.04MB/sec we talked about. Remount read only does indeed write data to disk. Go read the kernel patch that was pulled. And try to keep up.

    @doktornotor:

    Sigh. Reminds me of the debates with the 4K drive guys who were screaming in a similar manner. No, writing to misaligned partitions is NOT 110x slower.

    Sigh indeed. Turns out, those screaming 4K drive guys were right. Do your homework. And flash devices, because they cannot simply overwrite a sector and don't have 128MB of controller cache to work with, are many times more affected by this.

    I know these things because I've not only done the research, but I've done the real world implementation testing to go with it.

    Here, I'll throw you a bone. Pop quiz:

    Q: Why does OpenBSD now align to sector 64, while Windows newer than XP, along with mdadm, LVM2, and Linux from forever ago, ALL align to 1MB?

    A: Because alignment.


  • Banned

    @ky41083:

    @doktornotor:

    No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_ro, it still takes damn ages without any config changes done and with nothing to write anywhere.

    Right, because package installs are the best case 0.04MB/sec we talked about.

    And somehow that does not happen. Also, the subsequent RO remount does not take weeks. Miracle, I guess. You have a gaping hole somewhere in your write amplification theory.

    Look, I'm not against proper alignment. I'm merely tired of this "oh noes, we are misaligned, the sky will fall and universe will collapse into a blackhole" bullshit. Seriously tired. Bye.


  • Banned

    @cmb:

    nano never had any of the corruption issues because it's always run with sync.

    Fixing the ro->rw mount slowness is definitely a priority for 2.2.4. It wasn't as pronounced on the hardware we have as it is for a number of others. We'll review the alignment as well. There's a bug ticket open. https://redmine.pfsense.org/issues/4814

    It would appear that people who understand the issue simply don't agree with you.

    @doktornotor:

    @ky41083:

    @doktornotor:

    No, I know because if that was the case, reinstalling the packages I use would take weeks. Which is clearly NOT the case. I also know because when you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_ro, it still takes damn ages without any config changes done and with nothing to write anywhere.

    Right, because package installs are the best case 0.04MB/sec we talked about.

    And somehow that does not happen. Also, the subsequent RO remount does not take weeks. Miracle, I guess. You have a gaping hole somewhere in your write amplification theory.

    Look, I'm not against proper alignment. I'm merely tired of this "oh noes, we are misaligned, the sky will fall and universe will collapse into a blackhole" bullshit. Seriously tired. Bye.

    Then you should learn how to do math, so people listen to you.

    0.04MB/sec does not take weeks for, let's say, a 5MB package. It would, however, using this thing called math, take roughly 2 minutes. Let's double that for the sake of argument for the download + the install = 4 minutes.

    So where do you get weeks, especially for a simple remount read only?

    We simply see a slowdown, between pfSense 2.1.x and 2.2.x, on the order of roughly 110x on decent CF cards like mine (the same or more for most, less for some lucky / smart flash buyers) in the time it takes to do a simple remount read only. We have always seen this slowdown with non-remounting flash-based writes! Always. We just didn't see the remount read only delay, because it was kernel patched.

    No wonder you're tired, get some sleep kiddo.


  • Banned

    No shit. So writing 200KB takes the same as writing 5MB now? Yeah, get some sleep with your theories and fix your math. Bye.


  • Banned

    @doktornotor:

    No shit. So writing 200KB takes the same as writing 5MB now? Yeah, get some sleep with your theories and fix your math. Bye.

    Not a single word of that makes any sense at all.

    Either show your math, cite your sources, or stop trolling.


  • Banned

    My math is pretty simple. With your 0.04MB/sec "math", or, as you said, "BEST, full consecutive filesystem block write scenario", I could write max. 2.4MB / minute. Shockingly, those 150+ megs worth of packages take some 15 minutes to reinstall on the shitty Alix box, and a large part of that is spent downloading the stuff and configuring it.

    Your "math" also totally fails to explain why "/etc/rc.conf_mount_rw; /etc/rc.conf_mount_ro still takes damn ages without any config changes done and with nothing to write anywhere."

    Sigh. How about finding the real bug?



  • Quoting Jim Thompson with his permission from internal discussion:

    All forms of flash suffer from write amplification. Since there are different forms of flash, some have a larger write amplification issue than others. This is also true for different workloads.

    So that much is true, thus, there is “some” validity to the claims made in that post.

    What is not valid is his recommended path toward a fix.

    Quoting the relevant part of the OpenBSD page he cites:

    (fdisk platforms) Leave first track free: On platforms using fdisk(8), you should leave the first logical track unused, both in disklabel(8) and in fdisk(8). On "modern" computers (i.e., almost everything that will run OpenBSD), the exact amount doesn't really matter, though for performance reasons on the newest disks, having partitions aligned at 4k boundaries is good for performance. For this reason, OpenBSD now defaults to starting the first partition at block 64 instead of 63.

    So first let’s define some terms.

    Ordinary spinning rust hard disks are made of platters that have tracks.  Tracks are the thin concentric circular strips of sectors.
    Cylinders are a collection of tracks, stacked vertically.

    In CHS addressing the sector numbers always start at 1, there is no sector 0.

    The Unix communities employ the term block to refer to a sector or group of sectors.

    The CHS addressing supported in IBM-PC compatible BIOS code used eight bits for, theoretically, up to 256 heads, counted as head 0 up to 255. However, a bug in all versions of DOS up to and including 7.10 causes these operating systems to crash on boot when encountering volumes with 256 heads. Therefore, all compatible BIOSes use mappings with up to 255 heads only, including in virtual 255×63 geometries.

    So, CHS addressing starts at 0/0/1 with a maximal value 1023/255/63

    The fdisk utility normally displays partition table information using 1024-byte blocks, but also uses the word sector to help describe a disk's size in the phrase, 63 sectors per track.

    In other words, the LBA sector number 63 corresponds to cylinder 0, head 1, sector 1 in the CHS format, which is the first sector you can use in the MBR format.  The source of the confusion is that 63 is not (evenly) divisible by 8.

    Now is when we mention that flash-based devices don’t have platters, or heads, or tracks.  They are, quite simply, a group of flash sectors, the whole C/H/S thing flies out the window, and we can address any sector as easily as another.  Moreover, many flash sectors these days are 2K in size.

    Now, remember that CompactFlash doesn't have very good algorithms for dealing with modern systems.  This is one of the reasons we have moved away from it.  A CF device contains an ATA controller and appears to the host device as if it were a hard disk.  It is literally plugged into an IDE channel. CF wear leveling algorithms are proprietary and undocumented; they are "secret sauce".  Some will be better than others.

    It is often believed that disc partitions have to be aligned to cylinder or track boundaries. This is not in fact true and never really has been. There are alignment considerations for disc partitions, but they have nothing to do with cylinders, and they aren't mandatory. Operating systems will still work with misaligned partitions, just more slowly for some (not all) disc unit models.

    The idea that disc partitions have to aligned to cylinder boundaries is nonsense on its face. Millions of people have had discs where the first primary partition began on track zero, sector one, head one with no ill effect whatsoever on operating systems from MS-DOS through Windows NT to OS/2. That was, after all, the default that fdisk/Disk Manager on those operating systems used for almost two decades. At best, the purported alignment requirement would have been a track alignment, with all partitions starting at sector one (Sectors are numbered from one, remember.) of any given track.

    But this is not true, either. No version of any operating system has actually required this. Even MS-DOS was quite happy to have disc partitions starting at sectors other than 1. The only things that have required this have been disc partitioning utilities. There's been a bit of circular logic about this. The disc partitioning utilities enforced the requirement because their authors thought that it was a requirement, but people only thought that it was a requirement because fdisk and the like enforced it. It was what the partitioning utility programs enforced — so the logic went — so it must have been a restriction. In fact it never was, and no operating system itself has any trouble with this.

    The idea of track alignment is daft anyway. It's pointless because it doesn't align things to any valid boundary on the disc unit itself. There's no performance or other benefit, because the physical layout of the partitions on the disc will not be aligned to the actual physical tracks on the disc by aligning them to the software-visible track size.

    • The "tracks" that system softwares see at the ATA command register level aren't actually the real tracks on the disc itself, and haven't been since the advent of zoned bit recording (ZBR) in the early 1990s. Tracks are not, in fact, equally sized across the whole disc with ZBR; even though that's how discs are presented to software via the (old) cylinder+head+sector I/O command interface for ATA disc units.

    • Unlike ATA, the SCSI command level has always operated in terms of logical block numbers, and not in terms of a cylinder+head+sector system in the first place. In the SCSI world, from the start the idea that system software necessarily even knew where the physical track boundaries were was incorrect. Indeed, PC firmwares for SCSI hard discs have to invent disc geometries, largely from thin air, for the benefits of old PC/AT and PC98 firmwares and operating systems that expect discs to be addressed, at the disc unit I/O command level, in terms of a three-dimensional CHS geometry. Alignment to a geometry that's just made up anyway by the machine firmware is just pointless.

    With much fanfare, Microsoft finally eradicated enforcement of this entirely useless and pointless notion from the Windows NT Disk Manager, in 2008 (i.e. with the releases of Windows Vista Service Pack 1 and Windows Server 2008). Indeed, for years before that, since 2003, it had been recommending to Exchange Server and Microsoft SQL Server administrators that they use diskpart to align disc partitions to 4KiB multiples for performance reasons. (Some of the performance reasons given in early years were spurious, since they were based upon the erroneous premise that software-visible track boundaries were also physical track boundaries. But the end result, in light of later hardware developments, was right despite that.)

    Neither FreeBSD nor Linux has caught up. The fdisk utility in FreeBSD and Linux still complains about partitions not aligned to track boundaries.

    There is a disc partition alignment rule that does reflect the actual hardware. It is the rule that partitions be aligned to 4KiB boundaries. However, this rule only makes sense for some hard disc models.  In some hard disc models, the internal sector size has been increased from 512B to 4KiB. At the I/O command level, as system softwares access the disc, the sector size is still 512B. Such discs are known as "512 byte emulation" discs. There are also "4KiB native" discs, where the sector size at the I/O command level is also 4KiB. But it was a while before any but a few operating systems could cope with sector sizes other than 512 bytes at the ATA/SCSI I/O command level, so we got 512e disks for a while.

    What happens on such "512e" discs is that whenever the operating system or the firmware reads a 512B sector, the disc unit itself is actually reading a whole 4KiB and handing the firmware/operating system the appropriate one-eighth; and whenever the firmware/operating system writes a 512B sector, the disc unit is actually reading a whole 4KiB sector, modifying one eighth, and writing the whole 4KiB back again.

    This may seem like a performance killer, as every I/O operation is, under the covers, eight times its apparent size. Fortunately, there's a way to hide the performance cost. This takes advantage of the fact that many operating systems like to do most of their I/O in 4KiB multiples anyway. All paging I/O on x86 operating systems is done in 4KiB multiples, for example, and many operating systems, including Windows, FreeBSD, Linux and Solaris, use the paging mechanism for ordinary file I/O. So the operating system will usually be reading and writing (a multiple of) eight 0.5KiB sectors in a single I/O operation.

    So it's simply necessary to ensure that those eight 512B sectors are contiguous and aligned to an actual 4KiB sector on the disc. The "natural" I/O boundaries used by the operating system must align with the internal, hidden, 4KiB boundaries of the physical disc. The eight 512B sectors in the I/O command must not span two or more 4KiB physical sectors; but must be exactly one 4KiB sector, and in the right order within that sector.

    The way that this 4KiB alignment is achieved is threefold:

    1) Partitions are aligned to 4KiB boundaries relative to the start of the entire disc. The start, and end, of every partition is an integral number of 4KiB sectors from the start of the entire disc.

    2) On-disc data structures within a volume are aligned to 4KiB boundaries relative to the start of their containing partitions. If a disc volume format employs concepts such as "zones", "cylinder groups", and so forth, as volume formats with BSD Unix influences such as UFS and EXT2/3/4 do, they must be integer multiples of 4KiB in size. Neither FAT nor NTFS has such concepts, but FAT volumes similarly have to ensure that the total size of the FATs and reserved sectors at the beginning of a volume is an integer multiple of 4KiB, so that the data clusters following them are aligned to 4KiB multiples.

    3) The volume's space allocation unit ("cluster") size is an integer multiple of 4KiB. You'll find that Windows NT's tools nowadays discourage cluster sizes for FAT and NTFS volumes that are less than 4KiB. (Since cluster sizes are powers of two, larger cluster sizes are always going to be multiples of 4KiB.)  UFS has had a basic allocation unit of 4KiB for several decades.
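    As an illustrative aside to the quoted rules (my own sketch, not part of the quoted text): rule 2 depends on the specific volume format, but rules 1 and 3 reduce to simple divisibility checks.

```python
ALIGN = 4096  # the 4KiB boundary discussed above

def partition_ok(start_byte, cluster_bytes, align=ALIGN):
    """Rules 1 and 3: partition start and cluster size are both 4KiB multiples."""
    return start_byte % align == 0 and cluster_bytes % align == 0

# MBR partition at sector 63 vs 64, with UFS's traditional 4KiB allocation unit
print(partition_ok(63 * 512, 4096))  # False: start is not on a 4KiB boundary
print(partition_ok(64 * 512, 4096))  # True
```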

    TLDR version: no such alignment problem exists.


  • Banned

    @doktornotor:

    My math is pretty simple. With your 0.04MB/sec "math", or, as you said, "BEST, full consecutive filesystem block write scenario", I could write max. 2.4MB / minute. Shockingly, those 150+ megs worth of packages take some 15 minutes to reinstall on the shitty Alix box, and a large part of that is spent downloading the stuff and configuring it.

    Your "math" also totally fails to explain why "/etc/rc.conf_mount_rw; /etc/rc.conf_mount_ro still takes damn ages without any config changes done and with nothing to write anywhere."

    Sigh. How about finding the real bug?

    Some of that would almost make sense if you had the exact same CF card I do, but since you don't, and since I don't run my pfSense install on an ALIX board, nothing about your setup is comparable. You haven't once posted the write speeds you see on your setup, and nobody smart will pay you any attention until you do.


  • Banned

    @cmb:

    TLDR version: no such alignment problem exists.

    For those two decades, all disks used 512-byte blocks, and thus it was impossible to misalign them, as any number of sectors * 512 will indeed divide evenly back into 512. Even mentioning 512-byte-block magnetic drives in this discussion makes no sense whatsoever, as it has nothing to do with it, and shows how little understanding one has of this whole topic.

    Disk sectors actually start at 0, again, hardware common knowledge. Sector 0 is reserved for specific data, which is why you can't start a partition there.

    No OSes required track alignment because drives generally hid this physical information, as they still do, and as flash drives emulate it, without any regard for the actual physical layer.

    The middle lost me; it again has nothing to do with this conversation. The thread is not titled "history and behavior of old ass hardware & software" for a reason.

    If you read what I wrote, we aren't trying to align to disk geometry, we are aligning to 4K or 4MB from sector 1, as 0 is not counted in this alignment, being reserved. Again, why are we posting the history of disk geometry?

    Old Linux fdisk does not complain about track boundaries if you run it with the correct parameters, and newer (like, last 5+ years) Linux fdisk doesn't complain at all. And it does indeed default to 1MB, or more commonly known as sector 2048. Common knowledge.

    Honestly, who on the pfSense dev team doesn't know about 512 byte, 512e, and 4K magnetic discs. And why are we still posting about magnetic disks? Oighhh.

    Ok, sense is finally being made, in the last paragraph. But we are still on magnetic disks, and haven't moved on to flash disks, which is the main point of this thread.

    TLDR version: such alignment problem exists, as partitions are not aligned to 4K, thus, the filesystem blocks of 4K each from the start of the partition, are all equally misaligned. Does anybody read the thread before they post?

    Using the most compatible fake disk geometry, the first partition should start at a minimum of sector 64. NanoBSD images start at fake sector 63. Sector 0 is not counted here, as it is reserved anyway; flash makers know this, and account for it accordingly. Or, if you want to trust Jim without fact checking, there is no sector 0. Either way, I don't care, the math is the same:

    63 * 512 = 32256; 32256 / 4096 = 7.875. Not evenly divisible by 4K.

    64 * 512 = 32768; 32768 / 4096 = 8. Evenly divisible by 4K.
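    The divisibility check above can be sketched as plain shell arithmetic; nothing here is assumed beyond the 512 byte logical sector size and the 63 / 64 start sectors already given:

```shell
# A partition start is 4K-aligned when its byte offset
# (start sector * logical sector size) divides evenly by 4096.
for start in 63 64; do
    offset=$((start * 512))
    if [ $((offset % 4096)) -eq 0 ]; then
        echo "sector $start -> byte $offset: 4K-aligned"
    else
        echo "sector $start -> byte $offset: NOT 4K-aligned"
    fi
done
```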

    Remember, Jim says this is extremely important for alignment. It is step 1 after all.

    Thank you and good night.

    If you need references, I will be happy to post a set of links for fact checking every single word I've typed.


  • Banned

    Seriously, this is ALL info I've covered in the OP of this thread. I can only say the same exact thing, so many different ways. Please read it. If you don't understand, fine, ask questions, or fact check.

    If you don't want this fixed, like, ever. Or you just love the idea of manually rebuilding full NanoBSD images yourself to fix the alignment issue, just make posts about how wrong I am, without any legitimate reason as to why, and totally fuck this thread to the point that reasonable people don't even want to read through it.



  • Some of what I snipped out in the interest of (some) brevity probably left that making less sense; I added part of it back in the previous post.


  • Banned

    What your quote ends with, is exactly what I've been saying, and what I posted in response to it still applies exactly the same.


  • Banned

    Ok, my bad, one added thing doesn't check out: the 2K flash erase block size said to be "common". So common, in fact, that I've never seen one that small, unless we are getting into SSDs.

    Cheap flash, which is what we should be focused on, has an erase block size of 4K or larger. Some USB flash drives have been reported to use minimum erase blocks as large as 1MB, as also noted in my first post. Yeah, generally people just throw these away because the performance is so horrible, but it's good to let people know they might be dealing with such a device.

    All of this can be found using a simple dd raw write test script; I can post my version of it if anyone is interested. SD cards on a native (non-USB) interface report this value to Linux, where it can easily be read; not sure about FreeBSD. A native SD interface would be one such as found on most Android devices, which handily are already running Linux.
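    As a rough illustration of that kind of test (a sketch, not the poster's actual script), a descending-block-size write probe might look like this. The scratch-file default, the block size list, and conv=fsync (GNU dd syntax; FreeBSD's dd differs) are all assumptions; pointing DEV at a real flash device destroys its contents, and on a real device you would also want oflag=direct (GNU dd) to bypass the page cache:

```shell
#!/bin/sh
# Hypothetical sketch of the descending-block-size dd probe described above.
# DEV defaults to a scratch file for a dry run; point it at a raw flash
# device (destroying all data on it!) for a real measurement.
DEV=${1:-$(mktemp)}
total=$((8 * 1024 * 1024))        # write 8MB per pass so passes are comparable
for bs in 4194304 1048576 262144 65536 16384 4096 1024 512; do
    count=$((total / bs))
    printf '%8s bytes/block: ' "$bs"
    # GNU dd prints its throughput on the last line of stderr
    dd if=/dev/zero of="$DEV" bs="$bs" count="$count" conv=fsync 2>&1 | tail -n 1
done
```

    On Linux, SD cards on a native MMC interface also expose the erase geometry directly, e.g. /sys/block/mmcblk0/device/preferred_erase_size, as documented in the kernel's mmc-dev-attrs.txt.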



  • @ky41083:

    Honestly, who on the pfSense dev team doesn't know about 512 byte, 512e, and 4K magnetic discs.

    No one, but if you get Jim on his soap box…  :)

    @ky41083:

    If you need references, I will be happy to post a set of links for fact checking every single word I've typed.

    Please do. Better yet, post real world results of "wrong" vs. "right" as relevant to our embedded images.

    And don't get all worked up with me because doktornotor is a dick.


  • Banned

    @cmb:

    Please do. Better yet, post real world results of "wrong" vs. "right" as relevant to our embedded images.

    The results I have posted are real world, my references will be as well. There are some on the forum I've linked to already, see my second post, and comments under it. Anything that is misaligned, more specifically to sector 63, I would think is relevant, would you agree?

    Someone would have to manually rebuild one of the NanoBSD images to test proper alignment on an actual NanoBSD install. I don't have a good FreeBSD VM to work with right now, or the time, unfortunately. Someone else could do this much more easily / faster.

    The fact that OpenBSD has made this change already, noting performance, is pretty solid as far as I'm concerned.

    @cmb:

    And don't get all worked up with me because doktornotor is a dick.

    Fair enough, lol.

    References in next post…


  • Banned

    References

    Relevant Alignment Info for FAT32 SD. All applies except for the bits about FAT, obviously. Pay attention to the MBR / partition layout using sectors to calculate alignment.
    http://3gfp.com/wp/2014/07/formatting-sd-cards-for-speed-and-lifetime/

    Block Device Attributes I spoke about.
    https://www.kernel.org/doc/Documentation/mmc/mmc-dev-attrs.txt

    Measuring Flash Block Size; I wrote my test script based on this. I just dd write from /dev/zero at smaller block sizes in descending order starting from 8MB, and it works great. The first size that bombs out as being way slower / longer is one below your smallest erase block. This works for ANY flash based storage device.
    http://kim.oyhus.no/FlashBlockSize.html

    Oops, I was off; here we have flash devices using erase block sizes of 8MB. I said much smaller, so the issue can be worse than I laid out.
    https://www.raspberrypi.org/forums/viewtopic.php?t=11258&p=123670

    Speed of USB Flash Devices. So wish they recorded more filesystem details, but still a good reference. You can easily see how the exact same device reads / writes far slower in different circumstances, even with the same format. And which devices to look for when buying.
    http://usbflashspeed.com/

    Edit: FreeBSD Specific References Added

    Good Alignment Info for UFS / ZFS on FreeBSD
    http://ivoras.net/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html

    Discussion about forcing alignment with fdisk, and why it's not updated I suppose (gpart).
    https://forums.freebsd.org/threads/gpart-trying-to-force-mbr-partitions-to-be-cylinder-aligned.36439/

    Awesome example on FreeBSD using an actual drive that reports 4K sectors. This specific example I had not seen before. Remember, our flash drives still report 512 byte sectors, the NanoBSD images will end up aligned to KB or MB boundaries, not specific sectors. And the existing UFS filesystem is already using a 4K fragment size, so the partition is all that's left to fix.
    https://forums.freebsd.org/threads/ufs-sector-and-alignment-explanation.42208/


  • Banned

    Dumb dick question: When you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw then it still takes damn ages without any config changes done and with nothing to write anywhere. How does that fit your write amplification theories? For some mysterious reason still unanswered.  ::)

    Regards,

    Mr. Dumb Dick


  • Banned

    @doktornotor:

    Dumb dick question: When you do /etc/rc.conf_mount_rw; /etc/rc.conf_mount_rw then it still takes damn ages without any config changes done and with nothing to write anywhere. How does that fit your write amplification theories? For some mysterious reason still unanswered.  ::)

    Regards,

    Mr. Dumb Dick

    A. Because you keep referring to documented referenced facts as theories, like an ignorant dick.

    B. Because no one likes teaching ignorant dicks how to use Google.


  • Banned

    Right. The most amplified writes are those where there's nothing to write. Excellent theory.


  • Banned

    @doktornotor:

    Right. The most amplified writes are those where there's nothing to write. Excellent theory.

    I can't count the number of times I've now said: remounting UFS results in writes to disk.

    Nor can I count the number of times I've referenced the source of the old kernel patch as hard proof of this.

    BTW, the answer was actually…

    E: Your Mom


  • Banned

    Ok, seriously now… the reason I'm taking a minute to post something useful...

    If you question the alignment, simply flash a drive with a NanoBSD image, then boot Gparted Live from a disc or USB [YUMI works great]. If the storage device with pfSense on it uses 512 byte sector emulation, the very first [only] MBR partition that it can read will start at sector 63. You can see all of this from the GUI.

    If your device with pfSense NanoBSD on it uses a different size for sector emulation, just multiply whatever that sector size is by the sector number the first MBR partition starts on, and you will get 31.5K.

    Then, look at the very first [of three] FreeBSD slices inside that first MBR partition: it starts at sector 0 of the MBR partition, the very same sector that the MBR partition starts on.
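    On a Linux live system like that, the same check can also be done without the GUI, since the kernel exports each partition's start sector in sysfs. A sketch, where sdX is a placeholder device name (substitute your actual device), falling back to NanoBSD's start sector of 63 just to show the arithmetic:

```shell
#!/bin/sh
# Hypothetical non-GUI version of the Gparted check described above.
# sdX is a placeholder; when the sysfs path doesn't exist we fall back
# to NanoBSD's default start sector of 63.
start=$(cat /sys/block/sdX/sdX1/start 2>/dev/null || echo 63)
offset=$((start * 512))          # the device emulates 512 byte sectors
echo "first MBR partition starts at byte $offset"
if [ $((offset % 4096)) -eq 0 ]; then
    echo "4K-aligned"
else
    echo "misaligned: $((offset % 4096)) bytes past the last 4K boundary"
fi
```

    With NanoBSD's defaults this reports byte 32256, i.e. the 31.5K offset described above, misaligned.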


  • Banned

    Oh, fuck me gently sideways. Yeah, it writes even when there's nothing to write.

    Look: This shit worked until 8.3. Then someone broke it.

    Now, go find the real bug and align your ass instead.


  • Banned

    @doktornotor:

    Oh, fuck me gently sideways. Yeah, it writes even when there's nothing to write.

    Look: This shit worked until 8.3. Then someone broke it.

    Now, go find the real bug and align your ass instead.

    Sounds like I'm too busy aligning your ass to me…

    Anyways, yeah, I've read that thread. They changed some kernel code in 8.3 that added filesystem stability. pfSense devs wrote a patch to unpatch, if you will, 8.3 so it behaved again like 8.2, which the devs currently call an "incorrect" solution, and will not reintroduce that patch. So, forward we go with said bug hunt… ah shit, now you have me giving history lessons.


  • Banned

    Oh yeah, and, for the… (I can't count how many times now), you're forgetting the whole topic of the thread you're posting in. It's not called "fix remount issue" for a reason. That's not what it's about.

    It's about fixing the alignment issue, because it is such a dirt simple thing to fix in the build system. Hell, wouldn't fixing it in the installer also make a great upstream patch to get FreeBSD on the same page as every other OS that's still maintained today? I would be infinitely happy if it simply made packages install at a reasonable speed, or made upgrades install at a reasonable speed, or even made slice duplication happen at a reasonable speed.

    When I have a CF device that I can write to at (let's round) around 4MB/sec, and after flashing it with pfSense NanoBSD, am now writing to it at 0.04MB/sec, there is clearly some slowdown to be made up there. Even doktornodick, or however that's spelled, can't argue against that, not coherently anyway.


  • Banned

    Bored waiting for Windows to update, so I added the following FreeBSD-specific references to articles / threads in the references post:

    Edit: FreeBSD Specific References Added

    Good Alignment Info for UFS / ZFS on FreeBSD
    http://ivoras.net/blog/tree/2011-01-01.freebsd-on-4k-sector-drives.html

    Discussion about forcing alignment with fdisk, and why it's not updated I suppose (gpart).
    https://forums.freebsd.org/threads/gpart-trying-to-force-mbr-partitions-to-be-cylinder-aligned.36439/

    Awesome example on FreeBSD using an actual drive that reports 4K sectors. This specific example I had not seen before. Remember, our flash drives still report 512 byte sectors, the NanoBSD images will end up aligned to KB or MB boundaries, not specific sectors. And the existing UFS filesystem is already using a 4K fragment size, so the partition is all that's left to fix.
    https://forums.freebsd.org/threads/ufs-sector-and-alignment-explanation.42208/

    This one's for you cmb, for the only reasonable discussion that's taken place here. Thank you for a little sanity.


  • Banned

    @ky41083:

    They changed some kernel code in 8.3 that added filesystem stability

    ROFL. Look, we added stability. No regression at all. It's not slower than molasses – it's just very stable now!


  • Banned

    Then find the bug that you are clearly capable of finding, submit it to FreeBSD, and get it patched upstream. pfSense is the wrong place for messy kernel patches.


  • Banned

    The bug was already filed. Considering people have been complaining for just 3+ years, this definitely can wait. The IPv6 fragmentation bug in pf was filed just ~7 years ago – and of course it's still well alive and kicking. Welcome to FreeBSD's kernel land, brace for impact!


  • Banned

    @doktornotor:

    The bug was already filed. Considering people have been complaining for just 3+ years, this definitely can wait. The IPv6 fragmentation bug in pf was filed just ~7 years ago – and of course it's still well alive and kicking. Welcome to FreeBSD's kernel land, brace for impact!

    Hahahaha, epic.

    IPv6 fragmentation of packets smaller than 1280 bytes is against the entire IPv6 protocol, for good reason. And you cite the unstable kernel patch as not being applied upstream.

    All in the same post.

    You absolutely do not belong in this discussion. Any further posts by you will be permanently ignored by me. If anyone else feels you're worthy of a response… never mind, they won't.


  • Banned

    @ky41083:

    IPv6 fragmentation of packets smaller than 1280 bytes is against the entire IPv6 protocol, for good reason.

    Thanks for another "insightful" post… You definitely got the point of the bug.  :o ;D ::)

    @ky41083:

    Any further posts by you will be permanently ignored by me.


  • Banned

    Added info posted in another thread. Also simple testing method found, see end.

    What "fixes" the issue from disk to disk, regardless of type (flash / spinny) and write rate, isn't necessarily the disk controller itself, but how much cache the disk controller has to work with.

    If it has enough cache to absorb all the random writes and spit them out to the physical layer in its own time, it tells the kernel it has all that data, and you don't see hanging issues.

    If it doesn't have enough cache to absorb all the random writes, you wait until it has actually written most of them to the physical layer before it tells the kernel it has all the data, and this is the hang you experience.

    So, it's not a disk speed issue at all, it's that faster / larger / newer disks tend to have better controllers with more cache than slower / smaller / older disks.

    It doesn't help that NanoBSD images are not 4k aligned, so writes take even longer than they should on flash & 4k drives. Hopefully this gets fixed for the 2.3 branch, or we get a NanoBSD installer so we can fix it manually.

    Want to test? Run a pfSense NanoBSD VM on a system with a RAID controller that has a decent-sized cache in write-back mode. Now try it with the controller cache disabled. Now try it with the controller cache disabled and running on a true 4k sector drive. Spoiler: good, bad, worse.

