linux-ext4 - Re: Regarding ext4 extent allocation strategy

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Yg8P9uvExJ2B/J3T@B-P7TQMD6M-0146.local>
Date:   Fri, 18 Feb 2022 11:18:14 +0800
From:   Gao Xiang <hsiangkao@...ux.alibaba.com>
To:     "Theodore Y. Ts'o" <tytso@....edu>
Cc:     Shyam Prasad N <nspmangalore@...il.com>,
        David Howells <dhowells@...hat.com>,
        Steve French <smfrench@...il.com>, linux-ext4@...r.kernel.org,
        Jeffle Xu <jefflexu@...ux.alibaba.com>,
        bo.liu@...ux.alibaba.com, tao.peng@...ux.alibaba.com
Subject: Re: Regarding ext4 extent allocation strategy

Hi Ted and David,

On Tue, Jul 13, 2021 at 07:39:16AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> > 
> > Our team in Microsoft, which works on the Linux SMB3 client kernel
> > filesystem has recently been exploring the use of fscache on top of
> > ext4 for caching the network filesystem data for some customer
> > workloads.
> > 
> > However, the maintainer of fscache (David Howells) recently warned us
> > that a few other extent based filesystem developers pointed out a
> > theoretical bug in the current implementation of fscache/cachefiles.
> > It currently does not maintain a separate metadata for the cached data
> > it holds, but instead uses the sparseness of the underlying filesystem
> > to track the ranges of the data that is being cached.
> > The bug that has been pointed out with this is that the underlying
> > filesystems could bridge holes between data ranges with zeroes or
> > punch hole in data ranges that contain zeroes. (@David please add if I
> > missed something).
> > 
> > David has already begun working on the fix to this by maintaining the
> > metadata of the cached ranges in fscache itself.
> > However, since it could take some time for this fix to be approved and
> > then backported by various distros, I'd like to understand if there is
> > a potential problem in using fscache on top of ext4 without the fix.
> > If ext4 doesn't do any such optimizations on the data ranges, or has a
> > way to disable such optimizations, I think we'll be okay to use the
> > older versions of fscache even without the fix mentioned above.
> 
> Yes, the tuning knob you are looking for is:
> 
> What:		/sys/fs/ext4/<disk>/extent_max_zeroout_kb
> Date:		August 2012
> Contact:	"Theodore Ts'o" <tytso@....edu>
> Description:
> 		The maximum number of kilobytes which will be zeroed
> 		out in preference to creating a new uninitialized
> 		extent when manipulating an inode's extent tree.  Note
> 		that using a larger value will increase the
> 		variability of time necessary to complete a random
> 		write operation (since a 4k random write might turn
> 		into a much larger write due to the zeroout
> 		operation).
> 
> (From Documentation/ABI/testing/sysfs-fs-ext4)
> 
> The basic idea here is that with a random workload, with HDD's, the
> cost of writing a 16k random write is not much more than the time to
> write a 4k random write; that is, the cost of HDD seeks dominates.
> There is also a cost in having a many additional entries in the extent
> tree.  So if we have a fallocated region, e.g:
> 
>     +-------------+---+---+---+----------+---+---+---------+
> ... + Uninit (U)  | W | U | W |   Uninit | W | U | Written | ...
>     +-------------+---+---+---+----------+---+---+---------+
> 
> It's more efficient to have the extent tree look like this
> 
>     +-------------+-----------+----------+---+---+---------+
> ... + Uninit (U)  |  Written  |   Uninit | W | U | Written | ...
>     +-------------+-----------+----------+---+---+---------+
> 
> And just simply write zeros to the first "U" in the above figure.
> 
> The default value of extent_max_zeroout_kb is 32k.  This optimization
> can be disabled by setting extent_max_zeroout_kb to 0.  The downside
> of this is a potential degredation of a random write workload (using
> for example the fio benchmark program) on that file system.
> 

As far as I understand what cachefile does, it just truncates a sparse
file with a big hole, and do direct IO _only_ all the time to fill the
holes.

But the description above is all around (un)written extents, which
already have physical blocks allocated, but just without data
initialization. So we could zero out the middle extent and merge
these extents into one bigger written extent.

However, IMO, it's not the case of what the current cachefiles
behavior is... I think rare local fs allocates blocks with direct
i/o due to real holes, zero out and merge extents since at least it
touches disk quota.

David pointed this message yesterday since we're doing on-demand read
feature by using cachefiles as well. But I still fail to understand why
the current cachefile behavior is wrong.

Could you kindly leave more hints about this? Many thanks!

Thanks,
Gao Xiang

> Cheers,
> 
> 						- Ted