[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOQ4uxg+d2WoPEL2mC5H3d0uxh-_HGw3Bhyrun=z4O2nCg-yNQ@mail.gmail.com>
Date: Fri, 19 Mar 2021 07:51:54 +0200
From: Amir Goldstein <amir73il@...il.com>
To: harshad shirwadkar <harshadshirwadkar@...il.com>
Cc: Ext4 <linux-ext4@...r.kernel.org>, Theodore Tso <tytso@....edu>,
Miklos Szeredi <miklos@...redi.hu>,
overlayfs <linux-unionfs@...r.kernel.org>
Subject: Re: [PATCH] ext4: add rename whiteout support for fast commit
[adding overlayfs list]
On Fri, Mar 19, 2021 at 3:32 AM harshad shirwadkar
<harshadshirwadkar@...il.com> wrote:
>
> Thanks for the review Amir.
>
> Sure changing the subject makes sense.
>
> Also, on further discussions on Ext4 conference call, we also thought
> that with this patch, overlayfs customers would not benefit from fast
> commits much if they call renames often. So, in order to really make
> rename whiteout a fast commit compatible operation, we probably would
> need to add support in fast commit to replay a char device creation
> event (since whiteout object is a char device). That would imply, we
> would need to do careful versioning and would need to burn an on-disk
> feature flag.
>
> An alternative to this would be to have a static whiteout object with
> irrelevant nlink count and to have every rename point to that object
> instead. Based on how we decide to implement that, at max only the
> first rename operation would be fast commit incompatible since that's
> when this object would get created. All the further operations would
> be fast commit compatible. The big benefit of this approach is that
> this way we don't have to add support for char device creation in fast
> commit replay code and thus we don't have to worry about versioning.
>
I'm glad to hear that, Harshad.
Please note that creating a static whiteout object on-disk is one possible
implementation option. Not creating any object on-disk may be even better.
I had suggested the static object approach because it should be pretty
simple to implement and add e2fsprogs support for.
However, if we look at the requirements for RENAME_WHITEOUT,
the resulting directory entry does not actually need to point to any
object on-disk at all.
An alternative implementation would be to create a directory entry
with {d_ino = 0, d_type = DT_WHT}. Lookup of this entry will return
a reference to a singleton read-only ext4 whiteout inode object, which
does not reside on disk, so fast commit is irrelevant in that sense.
i_nlink should be handled carefully, but that should be easier from
doing that for a static on-disk object.
I am not sure how userland tools handle DT_WHT, but I see that
other filesystems can emit this value in theory and man rename(2)
claims that BSD uses DT_WHT, so the common tools should be
able to handle it.
As far as overlayfs is concerned:
1. ovl_lookup() will find an IS_WHITEOUT() inode as usual
2. ovl_dir_read_merged() will need this small patch (below) and will
not access the inode object at all
3. At first, ovl_whiteout() -> vfs_whiteout() can still create a usual chardev
4. Later, we can initiate the overlayfs instance singleton whiteout
reference in ovl_check_rename_whiteout() and ovl_whiteout() will
never get -EMLINK when linking this whiteout object
One other challenge is how to handle users trying to make operations
on the upper layer directly (migrating images etc).
As long as the tools still observe the whiteout as a chadev (with stat(2))
then export and import should work fine (creating a real chardev on import).
If there are tools that try to change inode permissions recursively on the
upper layer (?) there may be a problem with those read-only whiteouts
although the permission of a whiteout is a moot concept.
Thanks,
Amir.
--- a/fs/overlayfs/readdir.c
+++ b/fs/overlayfs/readdir.c
@@ -161,7 +161,7 @@ static struct ovl_cache_entry
*ovl_cache_entry_new(struct ovl_readdir_data *rdd,
if (ovl_calc_d_ino(rdd, p))
p->ino = 0;
p->is_upper = rdd->is_upper;
- p->is_whiteout = false;
+ p->is_whiteout = (d_type == DT_WHT);
if (d_type == DT_CHR) {
p->next_maybe_whiteout = rdd->first_maybe_whiteout;
Powered by blists - more mailing lists