Date:   Mon, 13 Aug 2018 12:20:48 -0400
From:   "Theodore Y. Ts'o" <tytso@....edu>
To:     Jaco Kroon <jaco@....co.za>
Cc:     linux-ext4 <linux-ext4@...r.kernel.org>
Subject: Re: resizefs shrinking is extremely slow

On Mon, Aug 13, 2018 at 12:52:18PM +0200, Jaco Kroon wrote:
> 
> 1.  Implement a mechanism (ioctl) to avoid allocation over a certain
> block + inode.  Based on what I've seen I doubt this would be overly
> difficult.  Can be "temporary in memory" restriction, or be persisted to
> the superblock.  I'm inclined to opt for the former as this won't
> require disk-layout changes, but please advise (disk layout changes will
> be required for step 2 anyway).

Yes, this is easy, and doing it as a temporary in-memory restriction
makes sense.  The simplest way to do this is on a per-block-group
basis.  We already have a flag which does this when a block or inode
allocation bitmap is corrupt, to avoid further damage to the file
system.  We'll want a separate set of flags, because this is something
that userspace needs to be able to set and unset.
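
To make the shape of that concrete, the userspace side could look
something like the sketch below; the structure, ioctl name, and ioctl
number are all made up for illustration, none of this exists in the
kernel today:

    /*
     * Hypothetical sketch only: neither EXT4_IOC_GROUP_AVOID nor
     * struct ext4_group_avoid exists; on a current kernel the ioctl
     * would simply fail with ENOTTY.
     */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/types.h>

    struct ext4_group_avoid {           /* hypothetical */
        __u32   group;                  /* block group number */
        __u32   set;                    /* 1 = no new allocations, 0 = clear */
    };

    /* hypothetical ioctl number; the real one would be assigned properly */
    #define EXT4_IOC_GROUP_AVOID    _IOW('f', 64, struct ext4_group_avoid)

    int main(int argc, char **argv)
    {
        struct ext4_group_avoid ga = { .group = 1234, .set = 1 };
        int fd;

        if (argc < 2)
            return 1;
        fd = open(argv[1], O_RDONLY);   /* any fd on the file system */
        if (fd < 0 || ioctl(fd, EXT4_IOC_GROUP_AVOID, &ga) < 0) {
            perror("EXT4_IOC_GROUP_AVOID");
            return 1;
        }
        close(fd);
        return 0;
    }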

The tricky policy bit is what to do if there is an allocation failure,
and whether we should advertise this via the statfs(2) system call or
the df(1) command.  That is, should we just return ENOSPC, or should
we make it a "soft avoidance"?  My preference is that it should be a
soft avoidance, so that if the defrag process prevents block groups
from being used, and it crashes leaving those flags set, we don't just
have weird failures.  Or should we have something automatic, such that
if a process exits without clearing the flags, the flags are cleared
automatically?
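
For reference, all df(1) really looks at is the statfs(2) counters, so
the policy question is just whether blocks in avoided groups get
subtracted from f_bfree/f_bavail or not:

    #include <stdio.h>
    #include <sys/vfs.h>

    int main(int argc, char **argv)
    {
        struct statfs st;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <mountpoint>\n", argv[0]);
            return 1;
        }
        if (statfs(argv[1], &st) < 0) {
            perror("statfs");
            return 1;
        }
        /* df(1) computes used/available space from these counters */
        printf("blocks=%llu free=%llu avail=%llu\n",
               (unsigned long long) st.f_blocks,
               (unsigned long long) st.f_bfree,
               (unsigned long long) st.f_bavail);
        return 0;
    }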

I would then suggest that you implement a much more efficient block
movement in fs/ext4/move_extent.c first.  That's because it doesn't
require a format change, and also because fixing this will have the
most amount of benefit; if the block movement can be done on-line,
resize2fs can be run off-line to move the inodes, and that will take
much less time.
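
For reference, the interface move_extent.c exposes today is
EXT4_IOC_MOVE_EXT, and driving it from userspace looks roughly like
the sketch below (the structure isn't exported in a uapi header, so
e4defrag carries its own copy of the definition; treat this as a
sketch with donor setup and most error handling omitted).  The donor
file has to be created and fallocate(2)'d to the desired layout first;
the ioctl then exchanges the donor's blocks with the original file's.

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/types.h>

    /* mirrors the kernel's struct move_extent */
    struct move_extent {
        __u32   reserved;       /* should be zero */
        __u32   donor_fd;       /* donor file descriptor */
        __u64   orig_start;     /* logical block offset in the original file */
        __u64   donor_start;    /* logical block offset in the donor file */
        __u64   len;            /* number of blocks to exchange */
        __u64   moved_len;      /* filled in by the kernel */
    };

    #define EXT4_IOC_MOVE_EXT   _IOWR('f', 15, struct move_extent)

    int main(int argc, char **argv)
    {
        struct move_extent me = { .orig_start = 0, .donor_start = 0, .len = 256 };
        int orig, donor;

        if (argc < 3) {
            fprintf(stderr, "usage: %s <file> <donor>\n", argv[0]);
            return 1;
        }
        /* argv[1]: file whose blocks should move; argv[2]: donor file,
         * preallocated with fallocate(2) so it has the layout we want */
        orig = open(argv[1], O_RDWR);
        donor = open(argv[2], O_RDWR);
        if (orig < 0 || donor < 0) {
            perror("open");
            return 1;
        }
        me.donor_fd = donor;
        if (ioctl(orig, EXT4_IOC_MOVE_EXT, &me) < 0) {
            perror("EXT4_IOC_MOVE_EXT");
            return 1;
        }
        printf("exchanged %llu blocks\n", (unsigned long long) me.moved_len);
        return 0;
    }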

> 2.  Implement "forwarding address" mechanism for inodes (ioctl).  This
> would allow re-allocation of inodes.  This will possibly involve (from
> what I know, possibly more):
> 2.0  documentation changes to document the on-disk format changes,
> including updates to userspace tools to accommodate the new on-disk format.
> 2.1  on open of inode, if inode is forwarding, open forwarded to inode
> instead.
> 2.2  readdir() - not sure if this does a kind of stat on referenced
> inodes by default (readdir(2) implies not, readdir(3) implies some
> filesystems do, including ext4 - d_type field), update references to
> forwarded inodes to reference the forwarded to inode instead (if mounted
> rw).
> 2.3  extra ioctl to clone inode to newly allocated inode, and replace
> forward pointer.

No, readdir doesn't stat the inode.  It will only return the d_type
field if it is present in the directory.
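
Something like this shows it; on ext4 with the filetype feature,
d_type is filled in straight from the directory entry, and no stat(2)
of the referenced inode ever happens:

    #define _DEFAULT_SOURCE     /* for d_type / DT_* with glibc */
    #include <stdio.h>
    #include <dirent.h>

    int main(int argc, char **argv)
    {
        DIR *dir = opendir(argc > 1 ? argv[1] : ".");
        struct dirent *de;

        if (!dir) {
            perror("opendir");
            return 1;
        }
        while ((de = readdir(dir)) != NULL)
            /* d_type comes from the dirent itself; DT_UNKNOWN means
             * the file system didn't record it */
            printf("%8lu  %s  %s\n", (unsigned long) de->d_ino,
                   de->d_type == DT_DIR ? "dir " :
                   de->d_type == DT_REG ? "file" :
                   de->d_type == DT_UNKNOWN ? "?   " : "othr",
                   de->d_name);
        closedir(dir);
        return 0;
    }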

> 3.  Utilize defrag (variant) code to re-allocate extents (will require
> full-system scan of all extent trees to find which extents need to be
> re-allocated) and inodes (this will likely need to be a single logical
> change, with a separate path to move code from e4defrag that can be
> shared, care to be taken if a mount-point is masking files inside a
> directory that represents a mountpoint, or possibly just warn that step
> 3.6 may fail):
> 3.1.  Use (1) to mark that we want to start a filesystem reduce.
> 3.2.  Use ioctl() in 2.3 to re-allocate (forward) all inodes that will
> no longer be available.
> 3.4.  Open the root (/) inode, and initiate a readdir() scan of the
> whole filesystem, serving two purposes:
> 3.4.1.  It will trigger the code in 2.1 and 2.2 to update forwarding
> references.
> 3.4.2.  It will allow us to find files utilizing blocks that needs to be
> re-allocated (get_file_extents as per current e4defrag code), and to
> re-allocate those.
> 3.5.  Scan the extents for bad blocks and free any blocks that will no
> longer be under consideration.
> 3.6.  Scan any other inodes specifically that may have blocks/inodes in
> the upper range.
 
Yes, roughly speaking, as I mentioned in another message which I
cc'ed you on.  One caution here is that e4defrag is in really bad
shape in terms of code maintainability, so cleaning it up before you
start making changes would be a good thing.  Note that once you have
(1) implemented, e4defrag will pretty much do what you want in terms
of evacuating data blocks for regular files out of the file system.
The only change you'll need to make is the criteria for operating on
the file --- instead of it being whether or not the file is
fragmented, it would be whether it has blocks that need to be
evacuated.
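
A sketch of what that check could look like, using FIEMAP the same way
e4defrag's get_file_extents() does (the cutoff here is a physical byte
offset; converting from a block number is just a multiply by the block
size).  Error handling is minimal; this only illustrates the changed
selection criterion:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/types.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    #define NEXT 32     /* extents fetched per FIEMAP call */

    /* 1 = has blocks past 'cutoff' bytes, 0 = no, -1 = error */
    static int has_blocks_past(int fd, __u64 cutoff)
    {
        size_t bufsz = sizeof(struct fiemap) +
                       NEXT * sizeof(struct fiemap_extent);
        struct fiemap *fm = calloc(1, bufsz);
        __u64 start = 0;
        int ret = 0;

        if (!fm)
            return -1;
        while (ret == 0) {
            unsigned int i;

            memset(fm, 0, bufsz);
            fm->fm_start = start;
            fm->fm_length = ~0ULL;
            fm->fm_flags = FIEMAP_FLAG_SYNC;
            fm->fm_extent_count = NEXT;
            if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                ret = -1;
                break;
            }
            if (fm->fm_mapped_extents == 0)
                break;          /* nothing (more) mapped */
            for (i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *fe = &fm->fm_extents[i];

                if (fe->fe_physical + fe->fe_length > cutoff)
                    ret = 1;    /* this file's blocks need moving */
                if (fe->fe_flags & FIEMAP_EXTENT_LAST) {
                    free(fm);
                    return ret;
                }
                start = fe->fe_logical + fe->fe_length;
            }
        }
        free(fm);
        return ret;
    }

    int main(int argc, char **argv)
    {
        int fd, r;

        if (argc < 3) {
            fprintf(stderr, "usage: %s <file> <cutoff-bytes>\n", argv[0]);
            return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        r = has_blocks_past(fd, strtoull(argv[2], NULL, 0));
        printf("%s\n", r < 0 ? "error" : r ? "needs moving" : "ok");
        return r < 0;
    }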

Also note that this is not a complete solution for moving all blocks, since
it doesn't take into account

   * directory blocks
   * extended attribute blocks
   * symlink blocks (for non-fast symlinks)
   * extent tree blocks

However, it will cover the vast majority (> 95%) of the blocks, so
again, if you can move the data blocks on-line, running resize2fs
off-line to complete the shrink will still significantly reduce the
downtime of the file system.


> 1.  How do I go about writing (a) sensible test case(s) for this?

Use xfstests.  For a quick introduction to xfstests, and the test
framework, this slide deck might be helpful:

	https://thunk.org/gce-xfstests

> 2.  Based on the above process I expect it may actually be harder to
> optimize the user-space resize than going online, can anyone concur with
> my assessment?  My assessment in part is based on a very rudimentary
> gleaning of the e4defrag code.

Most of the heavy lifting is actually done in the kernel code, so in
terms of "harder" it is probably still the kernel; however, it was
written to leverage as much existing code as possible, which is why
fs/ext4/move_extent.c looks "simple".  To make it a lot faster, so
that I/O is not done block by block, a lot of new code will need to be
written.  Also, as mentioned above, the e4defrag code needs a lot of
code cleanup.

The userspace resize code is more complex because it doesn't have the
kernel file system code to rely upon.  One of the things which makes
life hard is trying to minimize or eliminate data loss if the resize
operation is aborted.  In the kernel we can rely on the journal; in
resize2fs, we've been using very careful ordering of operations to try
to minimize the risk of data loss on a crash.

> 5.  I only count three extra ioctls that need to be added to kernel
> space.  The rest of the kernel changes affect existing code paths I
> believe.  Is a separate module still worth it, and if it is, how would I
> approach that?

Given the design we're playing around with, a separate module probably
doesn't make sense.  That's because we're implementing the smallest
bits in the kernel, with all of the more complex policy and "command
and control" in userspace.

> 6.  May I simply allocate the next available bit out of the feature set
> for this or is there some central database where this needs to go into
> (ie, step 2.0)?

We check in a bit assignment into the kernel and e2fsprogs, and that's
how we reserve the bit.  There's a lot we can do without tackling the
on-disk format changes, and we should get that done before we
worry about reserving the feature bit --- but that part isn't hard.
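
Just so it's clear what "reserving the bit" amounts to: it's a define
with an agreed-upon value that lands in both trees, plus updates to
the supported-feature masks so that old kernels and old e2fsprogs
refuse to touch a file system using it.  Purely as an illustration
(the name and value here are invented; the real ones get picked when
the bit is reserved):

    /* goes into fs/ext4/ext4.h in the kernel and lib/ext2fs/ext2_fs.h
     * in e2fsprogs; name and value are hypothetical */
    #define EXT4_FEATURE_INCOMPAT_INODE_FORWARD     0x40000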

Finally, implementing a new file system feature can take a *lot* of
time.  As in, it's likely going to take the better part of a year.
Doing the simple parts that also will improve e4defrag makes sense,
and it may be that if the goal is optimizing your time, a partial
solution which improves the eventual off-line shrink may be the best
approach.  At the very least, we can do that and then you can see if
you really want to follow through with a complete on-line shrink
implementation.

Cheers,

						- Ted
