Message-ID: <CANT5p=qi8-9iZa0XE70ZaCUdqscKufovjcUAZZPDRmN9W5_uQA@mail.gmail.com>
Date: Tue, 13 Jul 2021 18:27:37 +0530
From: Shyam Prasad N <nspmangalore@...il.com>
To: "Theodore Y. Ts'o" <tytso@....edu>
Cc: David Howells <dhowells@...hat.com>,
Steve French <smfrench@...il.com>, linux-ext4@...r.kernel.org
Subject: Re: Regarding ext4 extent allocation strategy
On Tue, Jul 13, 2021 at 5:09 PM Theodore Y. Ts'o <tytso@....edu> wrote:
>
> On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> >
> > Our team at Microsoft, which works on the Linux SMB3 client kernel
> > filesystem, has recently been exploring the use of fscache on top of
> > ext4 for caching network filesystem data for some customer
> > workloads.
> >
> > However, the maintainer of fscache (David Howells) recently warned us
> > that a few other extent based filesystem developers pointed out a
> > theoretical bug in the current implementation of fscache/cachefiles.
> > It currently does not maintain separate metadata for the cached data
> > it holds, but instead uses the sparseness of the underlying filesystem
> > to track the ranges of the data that is being cached.
> > The bug that has been pointed out with this is that the underlying
> > filesystems could bridge holes between data ranges with zeroes or
> > punch holes in data ranges that contain zeroes. (@David please add if I
> > missed something).
> >
> > David has already begun working on the fix to this by maintaining the
> > metadata of the cached ranges in fscache itself.
> > However, since it could take some time for this fix to be approved and
> > then backported by various distros, I'd like to understand if there is
> > a potential problem in using fscache on top of ext4 without the fix.
> > If ext4 doesn't do any such optimizations on the data ranges, or has a
> > way to disable such optimizations, I think we'll be okay to use the
> > older versions of fscache even without the fix mentioned above.
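To make the concern above concrete: from userspace, the layout that a
sparseness-based cache relies on can be listed with lseek(2) and
SEEK_DATA/SEEK_HOLE. The sketch below is only an illustration, not
cachefiles code. data_ranges() is a made-up helper name, and it assumes
Linux and a filesystem that reports holes.

```python
import errno
import os


def data_ranges(path):
    """Return [(start, end), ...] byte ranges the filesystem reports as data.

    Hypothetical helper for illustration; not part of fscache/cachefiles.
    """
    ranges = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Next offset >= `offset` that the filesystem says holds data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:  # only a hole remains up to EOF
                    break
                raise
            # End of that data run (EOF counts as an implicit hole).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            ranges.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return ranges
```

If the filesystem fills a gap with zeroes, or punches out a block that
contains only zeroes, this list changes even though the cache wrote
exactly the same bytes, which is the failure mode described above.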
>
> Yes, the tuning knob you are looking for is:
>
> What:         /sys/fs/ext4/<disk>/extent_max_zeroout_kb
> Date:         August 2012
> Contact:      "Theodore Ts'o" <tytso@....edu>
> Description:
>               The maximum number of kilobytes which will be zeroed
>               out in preference to creating a new uninitialized
>               extent when manipulating an inode's extent tree.  Note
>               that using a larger value will increase the
>               variability of time necessary to complete a random
>               write operation (since a 4k random write might turn
>               into a much larger write due to the zeroout
>               operation).
>
> (From Documentation/ABI/testing/sysfs-fs-ext4)
>
> The basic idea here is that with a random workload, with HDDs, the
> cost of writing a 16k random write is not much more than the time to
> write a 4k random write; that is, the cost of HDD seeks dominates.
> There is also a cost in having many additional entries in the extent
> tree. So if we have a fallocated region, e.g.:
>
>     +-------------+---+---+---+----------+---+---+---------+
> ... + Uninit (U)  | W | U | W |  Uninit  | W | U | Written | ...
>     +-------------+---+---+---+----------+---+---+---------+
>
> It's more efficient to have the extent tree look like this
>
>     +-------------+-----------+----------+---+---+---------+
> ... + Uninit (U)  |  Written  |  Uninit  | W | U | Written | ...
>     +-------------+-----------+----------+---+---+---------+
>
> and simply write zeros to the first small "U" (the one between the
> first two "W" extents in the first figure).
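The trade-off described above can be sketched as a toy model. This is
not ext4's actual extent code: write_into_uninit() is a hypothetical
name, sizes are in KiB, and the model only covers a write that lands
inside a single uninitialized extent. It assumes that an extent no
larger than the limit is zeroed and marked written as a whole, and that
otherwise the extent is split around the written range.

```python
def write_into_uninit(ext_kb, off_kb, len_kb, max_zeroout_kb):
    """Toy model: extents that result from one write into an uninit extent.

    Returns a list of (state, size_kb) with state "U" (uninit) or "W"
    (written). Illustrative only; not the kernel's algorithm.
    """
    if off_kb < 0 or len_kb <= 0 or off_kb + len_kb > ext_kb:
        raise ValueError("write must fall inside the extent")
    if max_zeroout_kb and ext_kb <= max_zeroout_kb:
        # Cheaper to zero the untouched blocks and keep a single extent.
        return [("W", ext_kb)]
    # Otherwise split into up to three extents around the written range.
    pieces = [("U", off_kb), ("W", len_kb), ("U", ext_kb - off_kb - len_kb)]
    return [(state, size) for state, size in pieces if size > 0]


print(write_into_uninit(16, 4, 4, 32))  # limit 32: one written extent
print(write_into_uninit(16, 4, 4, 0))   # limit 0: extent is split
```

In the first call, 12 KiB that the application never wrote end up as
zero-filled data, which is exactly what a cache that infers its
contents from the extent layout cannot tolerate.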
>
> The default value of extent_max_zeroout_kb is 32k. This optimization
> can be disabled by setting extent_max_zeroout_kb to 0. The downside
> of this is a potential degradation of a random write workload (using
> for example the fio benchmark program) on that file system.
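For reference, a minimal sketch of reading and changing this knob from
a script. The sysfs path is the one quoted above; the helper names and
the sysfs_root parameter are made up for illustration, and writing the
real file normally requires root. The <disk> component is whatever
name the mounted filesystem has under /sys/fs/ext4.

```python
from pathlib import Path


def zeroout_knob(disk, sysfs_root="/sys/fs/ext4"):
    """Path of extent_max_zeroout_kb for one ext4 filesystem."""
    return Path(sysfs_root) / disk / "extent_max_zeroout_kb"


def get_max_zeroout_kb(disk, sysfs_root="/sys/fs/ext4"):
    """Read the current limit, in kilobytes."""
    return int(zeroout_knob(disk, sysfs_root).read_text().strip())


def set_max_zeroout_kb(disk, kb, sysfs_root="/sys/fs/ext4"):
    """Write a new limit; 0 disables the zeroout optimization."""
    if kb < 0:
        raise ValueError("limit must be non-negative")
    zeroout_knob(disk, sysfs_root).write_text(f"{kb}\n")
```

A value written this way is not expected to persist across a remount,
so a deployment would likely need to reapply it after each mount.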
>
> Cheers,
>
> - Ted
Hi Ted,
Thanks for pointing this out. We'll look into the use of this option.
Also, is this parameter respected when a hole is punched in the
middle of an allocated data extent? That is, is there still a possibility
that a punched hole does not result in the data extent being split,
even when extent_max_zeroout_kb is set to 0?
--
Regards,
Shyam