Message-ID: <YbuGLsQy6TSM2xOl@mit.edu>
Date: Thu, 16 Dec 2021 13:32:14 -0500
From: "Theodore Ts'o" <tytso@....edu>
To: Lukas Czerner <lczerner@...hat.com>
Cc: Jan Kara <jack@...e.cz>, "Darrick J. Wong" <djwong@...nel.org>,
Luís Henriques <lhenriques@...e.de>,
Andreas Dilger <adilger.kernel@...ger.ca>,
linux-ext4@...r.kernel.org, linux-kernel@...r.kernel.org,
Jeroen van Wolffelaar <jeroen@...ffelaar.nl>
Subject: Re: [PATCH v2] ext4: set csum seed in tmp inode while migrating to
extents
On Wed, Dec 15, 2021 at 03:12:37PM +0100, Lukas Czerner wrote:
> > Run fsck of course! And then recover from backups :) I know this is sad but
> > the situation is that our migration code just is not crash-safe (if we
> > crash we are going to free blocks that are still used by the migrated
> > inode) and Luis makes it work in case we do not crash (which should be
> > hopefully more common) and documents it does not work in case we crash.
> > So overall I'd call it a win.
> >
> > But maybe we should just remove this online-migration functionality
> > completely from the kernel? That would be also a fine solution for me. I
> > was thinking whether we could somehow make the inode migration crash-safe
> > but I didn't think of anything which would not require on-disk format
> > change...
>
> Since this is not something that anyone can honestly recommend doing
> without a prior backup and a word of warning, I personally would be in favor
> of removing it.
So there are a couple options that we could pursue:
1) We could change the migrate code to stop putting the temporary inode
on the orphan list. If we do this, a crash in the middle of the
migrate will, in the worst case (when the migration isn't completed
within a single jbd2 transaction), result in a leaked inode. That's
not ideal, but it won't lead to user data loss, and e2fsck will recover
the situation by cloning the blocks and leaving the inode in
lost+found.
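
For concreteness, the core of #1 is tiny; something like this (untested
sketch, names from my reading of fs/ext4/migrate.c, and the matching
ext4_orphan_del() on the exit path would have to go away as well):

	/*
	 * In ext4_ext_migrate(): skip linking the temporary inode into
	 * the orphan list.  A crash mid-migrate then at worst leaks
	 * tmp_inode, instead of letting orphan processing free blocks
	 * that the migrated inode may still be referencing.
	 */
	/* ext4_orphan_add(handle, tmp_inode); */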
2) We could try to ensure the migration happens all within a single
transaction: if the extents all fit inside the inode structure, we
allocate a tmp inode for all of the indirect blocks, attach the blocks
to the tmp inode, place the tmp inode on the orphan list, and put all
of that in a single handle; then, in a second handle, we truncate the
tmp inode to release the indirect blocks. If we need to allocate
extent tree blocks, then all of that would need to fit in a single
transaction, which is a bit more complicated, but it is doable.
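
Very roughly, and ignoring the credit estimation and error paths, the
shape of #2 would be something like this (sketch only, not tested):

	handle = ext4_journal_start(inode, EXT4_HT_MIGRATE, credits);
	/* build the extent tree in tmp_inode, swap the block pointers
	 * into the original inode's i_data, and put tmp_inode on the
	 * orphan list, all under this one handle, so after a crash we
	 * see either the old state or the new state, never a mix */
	ext4_journal_stop(handle);

	handle = ext4_journal_start(tmp_inode, EXT4_HT_TRUNCATE, credits);
	/* second handle: truncate tmp_inode to release the now-unused
	 * indirect blocks */
	ext4_journal_stop(handle);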
3) We can simply remove the inode migration feature by removing
EXT4_EXTENTS_FL from EXT4_FL_USER_MODIFIABLE, and changing the
implementation of the EXT4_IOC_MIGRATE ioctl to return EOPNOTSUPP, and
then cleaning up the code paths that are now unreachable.
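
The mechanical part of #3 is small as well; roughly (sketch only):

	/* fs/ext4/ioctl.c */
	case EXT4_IOC_MIGRATE:
		return -EOPNOTSUPP;

	/* ... and drop EXT4_EXTENTS_FL from the EXT4_FL_USER_MODIFIABLE
	 * mask in fs/ext4/ext4.h, so that "chattr +e" no longer reaches
	 * ext4_ext_migrate() */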
The migration feature is clearly less compelling than it was ten years
ago, when ext4 was first introduced, and most enterprise distros have
never supported the feature even though it has existed. Also, on the
plus side, we've never shipped a program to globally migrate a file
system using the ioctl interface.
On the other hand, there may have been user shell scripts that have
done something like "find /mntpt -type f -print0 | xargs -0 chattr +e",
and so option #3 could be construed as "breaking userspace", especially
without a deprecation window.
Furthermore, option #1 is pretty simple to implement, and the chances
of a migration getting spread across two jbd2 commits are actually
pretty low. And if it does happen, there would only be a single inode
that would get its blocks cloned and attached to lost+found.
That being said, if we *did* do option #1, in the long run we'd want
to land a complete solution, which would be either something like
option #2, or allocating a flag to give a hint to e2fsprogs that if it
finds a leaked inode with the flag set on the on-disk inode, all it
needs to do is zero out the inode and be done with it.
So the question is, is it worth it to continue supporting the migrate
feature, or should we just delete all of the migration code and risk
users complaining that we've broken their use case? The chances of
that happening are admittedly low, and Linus's rule that "it's only
breaking userspace if a user complains" means we might very well get
away with it. :-)
- Ted