linux-ext4 - Re: discard and data=writeback

Message-ID: <X+If/kAwiaMdaBtF@mit.edu>
Date:   Tue, 22 Dec 2020 11:34:06 -0500
From:   "Theodore Y. Ts'o" <tytso@....edu>
To:     Matteo Croce <mcroce@...ux.microsoft.com>
Cc:     linux-ext4@...r.kernel.org
Subject: Re: discard and data=writeback

On Tue, Dec 22, 2020 at 03:59:29PM +0100, Matteo Croce wrote:
> 
> I'm issuing sync + sleep(10) after the extraction, so the writes
> should all be flushed.
> Also, I repeated the test three times, with very similar results:

So that means the problem is not due to page cache writeback
interfering with the discards.  So it's most likely that the problem
is due to how the blocks are allocated and laid out when using
data=ordered vs data=writeback.

Some experiments to try next.  After extracting the files with
data=ordered and data=writeback on a freshly formatted file system,
use "e2freefrag" to see how the free space is fragmented.  This will
tell us how the file system is doing from a holistic perspective, in
terms of blocks allocated to the extracted files.  (E2freefrag shows
you the blocks that are *not* allocated, of course, but that is simply
the complement of the blocks that *are* allocated, especially if you
start from an identical known state; hence the use of a freshly
formatted file system.)

Next, we can see what individual files look like with respect to
fragmentation.  This can be done by running filefrag on all of the
files, e.g.:

       find . -type f -print0  | xargs -0 filefrag
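
On a large tree that command prints one line per file, which is hard
to compare between two mounts by eye.  Below is a minimal sketch of a
summarizer, assuming the default (non-verbose) filefrag output of the
form "PATH: N extents found".  The function names and the sample lines
are hypothetical and only there for illustration; note that an extent
count is only a rough proxy for fragment size.

```python
import re

# Assumed default filefrag line format: "<path>: <N> extent(s) found".
# The greedy path group lets paths that themselves contain ": " match.
LINE_RE = re.compile(r"^(?P<path>.*): (?P<count>\d+) extents? found$")


def parse_filefrag(lines):
    """Map each path to its extent count, skipping unrecognized lines."""
    counts = {}
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            counts[match.group("path")] = int(match.group("count"))
    return counts


def summarize(counts, top=10):
    """Return file count, total extents, mean extents per file, and the
    `top` most fragmented files."""
    worst = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]
    total = sum(counts.values())
    mean = total / len(counts) if counts else 0.0
    return {"files": len(counts), "extents": total,
            "mean": mean, "worst": worst}


# Hypothetical sample lines, not real measurements.
sample = [
    "./src/main.c: 1 extent found",
    "./data/blob.bin: 7 extents found",
]
print(summarize(parse_filefrag(sample)))
```

To use it on real data, feed the lines of the filefrag output to
parse_filefrag() once for the data=ordered run and once for the
data=writeback run, and compare the two summaries.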

Another way to get similar (although not identical) information is to
run "e2fsck -E fragcheck" on the file system.  The difference between
the two matters most on ext3 file systems without extents and
flex_bg, since filefrag tries to take into account metadata blocks
such as indirect blocks and extent tree blocks, and e2fsck -E
fragcheck does not; but it's good enough for getting a gestalt of the
files' overall fragmentation.

Note that as long as the average fragment size is at least a megabyte
or two, some fragmentation really isn't that much of a problem from a
real-world performance perspective.  People can get way too invested
in trying to reach perfection with 100% fragmentation-free files.
The problem with doing this at the expense of all else is that you can
end up making the overall free space fragmentation worse as the file
system ages, at which point file system performance really drops
through the floor as the file system approaches 100%, or even 80-90%,
full, especially on HDD's.  For SSD's, fragmentation doesn't matter
quite so much, unless the average fragment size is *really* small, and
when you are discarding freed blocks.

Even if the files are showing no substantial difference in
fragmentation, and the free space is equally A-OK with respect to
fragmentation, the other possibility is that the *layout* of the
blocks is such that the order in which they are deleted using rm -rf
ends up being less friendly from a discard perspective.  This can
happen if
the directory hierarchy is big enough, and/or the journal size is
small enough, that the rm -rf requires multiple journal transactions
to complete.  That's because with mount -o discard, we do the discards
after each transaction commit, and it might be that even though the
used blocks are perfectly contiguous, because of the order in which
the files end up getting deleted, we end up needing to discard them in
smaller chunks.

For example, one could imagine a case where you have a million 4k
files, and they are allocated contiguously, but if you get
super-unlucky, such that in the first transaction you delete all of
the odd-numbered files, and in the second transaction you delete all of
the even-numbered files, you might need to do a million 4k discards
--- but if all of the deletes could fit into a single transaction, you
would only need to do a single million block discard operation.
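
The arithmetic in that example can be checked with a toy model.  This
is only a sketch under a simplifying assumption, namely that each
transaction commit issues one discard per maximal contiguous run of
the blocks freed in that transaction; it is not ext4's actual code,
and count_discards is a made-up name.

```python
def count_discards(transactions):
    """Count discard requests when every transaction discards its freed
    blocks as maximal contiguous runs (toy model, not ext4 code)."""
    total = 0
    for freed in transactions:
        prev = None
        for block in sorted(freed):
            # A new run starts whenever this block does not directly
            # follow the previous freed block.
            if prev is None or block != prev + 1:
                total += 1
            prev = block
    return total


N = 1_000_000  # one 4k block per file, allocated contiguously

# Super-unlucky case: odd-numbered files first, then even-numbered ones.
odd_then_even = [range(1, N, 2), range(0, N, 2)]
# Lucky case: every delete fits into a single transaction.
single = [range(N)]

print(count_discards(odd_then_even))  # 1000000
print(count_discards(single))         # 1
```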

Finally, you may want to consider whether mount -o discard really
makes sense.  For most SSD's, especially high-end SSD's, it probably
doesn't make that much difference.  That's because when you overwrite
a sector, the SSD knows (or should know; this might not be true for
some really cheap, crappy low-end flash devices, but on those devices
discard might not be making much of a difference anyway) that the old
contents of the sector are no longer needed.  Hence an overwrite is
effectively an "implied discard".  So long as there is a
sufficient number of free erase blocks, the SSD might be able to keep
up doing the GC for those "implied discards", and so accelerating the
process by sending explicit discards after every journal transaction
might not be necessary.  Or, maybe it's sufficient to run "fstrim"
once a week, on Sunday at 3am local time; or maybe even fstrim once a
night or fstrim once a month.  Your mileage may vary.

It's going to vary from SSD to SSD and from workload to workload, but
you might find that mount -o discard isn't buying you all that much
--- if you run a random write workload, and you don't notice any
performance degradation, and you don't notice an increase in the SSD's
write amplification numbers (if they are provided by your SSD), then
you might very well find that it's not worth it to use mount -o
discard.
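
For the write amplification comparison, the usual definition is the
bytes the SSD wrote to flash divided by the bytes the host asked it to
write, measured over the same interval.  A minimal sketch, assuming
your SSD exposes both counters somewhere (the names and units are
vendor-specific, and the function name and the numbers in the example
call below are hypothetical):

```python
def write_amplification(before, after):
    """Write amplification factor over an interval.

    `before` and `after` are (nand_bytes_written, host_bytes_written)
    counter snapshots taken before and after the workload.
    """
    nand_delta = after[0] - before[0]
    host_delta = after[1] - before[1]
    if host_delta <= 0:
        raise ValueError("host wrote nothing during the interval")
    return nand_delta / host_delta


# Made-up counter values, purely to show the call.
print(write_amplification((100, 50), (400, 150)))  # 3.0
```

Running the same random write workload once with mount -o discard and
once without, and comparing the two factors, is one way to make the
comparison described above concrete.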

I personally don't bother using mount -o discard, and instead
periodically run fstrim, on my personal machines.  Part of that is
because I'm mostly just reading and replying to emails, building
kernels and editing text files, and that is not nearly as stressful on
the FTL as a full-blown random write workload (for example, if you
were running a database supporting a transaction processing workload).

Cheers,

						- Ted