Date:   Tue, 22 Dec 2020 15:53:39 -0700
From:   Andreas Dilger <>
To:     Matteo Croce <>
Cc:     Ext4 <>, Wang Shilong <>,
        "Theodore Y. Ts'o" <tytso@....EDU>
Subject: Re: discard and data=writeback

On Dec 22, 2020, at 9:34 AM, Theodore Y. Ts'o <tytso@....EDU> wrote:
> On Tue, Dec 22, 2020 at 03:59:29PM +0100, Matteo Croce wrote:
>> I'm issuing sync + sleep(10) after the extraction, so the writes
>> should all be flushed.
>> Also, I repeated the test three times, with very similar results:
> So that means the problem is not due to page cache writeback
> interfering with the discards.  So it's most likely that the problem
> is due to how the blocks are allocated and laid out when using
> data=ordered vs data=writeback.
> Some experiments to try next.  After extracting the files with
> data=ordered and data=writeback on a freshly formatted file system,
> use "e2freefrag" to see how the free space is fragmented.  This will
> tell us how the file system is doing from a holistic perspective, in
> terms of blocks allocated to the extracted files.  (E2freefrag is
> showing you the blocks *not* allocated, of course, but that's a mirror
> image dual of the blocks that *are* allocated, especially if you start
> from an identical known state; hence the use of a freshly formatted
> file system.)
> Next, we can see what individual files look like with respect to
> fragmentation.  This can be done by running filefrag on all of the
> files, e.g.:
>       find . -type f -print0  | xargs -0 filefrag
> Another way to get similar (although not identical) information is via
> running "e2fsck -E fragcheck" on a file system.  The difference
> matters most on ext3 file systems without extents and
> flex_bg, since filefrag tries to take into account metadata blocks
> such as indirect blocks and extent tree blocks, and e2fsck -E
> fragcheck does not; but it's good enough for getting a good gestalt
> for the files' overall fragmentation --- and note that as long as the
> average fragment size is at least a megabyte or two, some
> fragmentation really isn't that much of a problem from a real-world
> performance perspective.  People can get way too invested in trying to
> get to perfection with 100% fragmentation-free files.  The problem
> with doing this at the expense of all else is that you can end up
> making the overall free space fragmentation worse as the file system
> ages, at which point the file system performance really dives through
> the floor as the file system approaches 100%, or even 80-90% full,
> especially on HDD's.  For SSD's fragmentation doesn't matter quite so
> much, unless the average fragment size is *really* small, and when you
> are discarding freed blocks.
> Even if the files are showing no substantial difference in
> fragmentation, and the free space is equally A-OK with respect to
> fragmentation, the other possibility is that the *layout* of the blocks is
> such that the order in which they are deleted using rm -rf ends up
> being less friendly from a discard perspective.  This can happen if
> the directory hierarchy is big enough, and/or the journal size is
> small enough, that the rm -rf requires multiple journal transactions
> to complete.  That's because with mount -o discard, we do the discards
> after each transaction commit, and it might be that even though the
> used blocks are perfectly contiguous, because of the order in which
> the files end up getting deleted, we end up needing to discard them in
> smaller chunks.
> For example, one could imagine a case where you have a million 4k
> files, and they are allocated contiguously, but if you get
> super-unlucky, such that in the first transaction you delete all of
> the odd-numbered files, and in second transaction you delete all of
> the even-numbered files, you might need to do a million 4k discards
> --- but if all of the deletes could fit into a single transaction, you
> would only need to do a single million block discard operation.
> Finally, you may want to consider whether mount -o discard
> really makes sense.  For most SSD's, especially high-end SSD's,
> it probably doesn't make that much difference.  That's because when
> you overwrite a sector, the SSD knows (or should know; this might not
> be true of some really cheap, crappy low-end flash devices; but on those
> devices, discard might not be making much of a difference anyway), that
> the old contents of the sector are no longer needed.  Hence an
> overwrite effectively is an "implied discard".  So long as there is a
> sufficient number of free erase blocks, the SSD might be able to keep
> up doing the GC for those "implied discards", and so accelerating the
> process by sending explicit discards after every journal transaction
> might not be necessary.  Or, maybe it's sufficient to run "fstrim"
> every week at Sunday 3am local time; or maybe even fstrim once a night
> or fstrim once a month --- your mileage may vary.
> It's going to vary from SSD to SSD and from workload to workload, but
> you might find that mount -o discard isn't buying you all that much
> --- if you run a random write workload, and you don't notice any
> performance degradation, and you don't notice an increase in the SSD's
> write amplification numbers (if they are provided by your SSD), then
> you might very well find that it's not worth it to use mount -o
> discard.
> I personally don't bother using mount -o discard, and instead
> periodically run fstrim, on my personal machines.  Part of that is
> because I'm mostly just reading and replying to emails, building
> kernels and editing text files, and that is not nearly as stressful on
> the FTL as a full-blown random write workload (for example, if you
> were running a database supporting a transaction processing workload).

The problem (IMHO) with "-o discard" is that if it is only trimming
*blocks* that were deleted, these may be too small to be processed
effectively by the underlying device (e.g. the "super-unlucky" example
above, where interleaved 4KB file deletes result in 1M separate 4KB
trim requests to the device, even though the *space* that is freed by
the unlinks could be handled with far fewer, larger trim requests).
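
To make the arithmetic of the "super-unlucky" case concrete, here is a
toy model in Python.  It is only a sketch under simplifying assumptions:
every file is exactly one 4KB block, discards are merged only among the
blocks freed in the same transaction, and nothing below the file system
merges requests.  The names coalesce() and discards_per_commit() are
made up for illustration and are not ext4 functions.

```python
from typing import Iterable, List, Tuple


def coalesce(blocks: Iterable[int]) -> List[Tuple[int, int]]:
    """Merge freed block numbers into (start, length) extents."""
    extents: List[Tuple[int, int]] = []
    for b in sorted(set(blocks)):
        if extents and extents[-1][0] + extents[-1][1] == b:
            # Block b directly follows the last extent, so extend it.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((b, 1))
    return extents


def discards_per_commit(transactions) -> List[Tuple[int, int]]:
    """Model assumption: discards are issued after each commit and are
    never merged with blocks freed in another transaction."""
    requests: List[Tuple[int, int]] = []
    for freed in transactions:
        requests.extend(coalesce(freed))
    return requests


N = 1_000_000  # one million contiguous single-block (4KB) files

# Odd-numbered files deleted in one transaction, even-numbered in the next.
odd_then_even = [range(1, N, 2), range(0, N, 2)]
# All deletes fit into a single transaction.
single = [range(N)]

print(len(discards_per_commit(odd_then_even)))  # 1000000
print(len(discards_per_commit(single)))         # 1
```

In this model the freed space is identical in both cases; only the
grouping of the deletes into transactions changes the request count.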

There was a discussion previously ("introduce EXT4_BG_WAS_TRIMMED ...")
about leveraging the persistent EXT4_BG_WAS_TRIMMED flag in the group
descriptors, and having "-o discard" only track trim on a per-group
basis rather than its current mode of doing trim on a per-block basis,
and then use the same code internally as fstrim to do a trim of free
blocks in that block group.

Using EXT4_BG_WAS_TRIMMED and tracking *groups* to be trimmed would be
a bit lazier than the current "-o discard" implementation, but would
be more memory efficient, and also more efficient for the device (fewer,
larger trim requests submitted).  It would only need to track groups
that have at least a reasonable amount of free space to be trimmed.  If
the group doesn't have enough free blocks to trim now, it will be checked
again in the future when more blocks are freed.

Cheers, Andreas
