linux-ext4 - Re: resizefs shrinking is extremely slow

Message-Id: <40D96E04-FFAB-41C5-AD0C-CF15FA0078B0@dilger.ca>
Date:   Mon, 13 Aug 2018 12:30:50 -0600
From:   Andreas Dilger <adilger@...ger.ca>
To:     "Theodore Y. Ts'o" <tytso@....edu>
Cc:     Jaco Kroon <jaco@....co.za>,
        linux-ext4 <linux-ext4@...r.kernel.org>
Subject: Re: resizefs shrinking is extremely slow

On Aug 13, 2018, at 10:20 AM, Theodore Y. Ts'o <tytso@....edu> wrote:
> 
> On Mon, Aug 13, 2018 at 12:52:18PM +0200, Jaco Kroon wrote:
>> 
>> 1.  Implement a mechanism (ioctl) to avoid allocation over a certain
>> block + inode.  Based on what I've seen I doubt this would be overly
>> difficult.  Can be "temporary in memory" restriction, or be persisted to
>> the superblock.  I'm inclined to opt for the former as this won't
>> require disk-layout changes, but please advise (disk layout changes will
>> be required for step 2 anyway).
> 
> Yes, this is easy, and doing it as a temporary in-memory restriction is
> fine.  The simplest way to do this is on a per-block group basis.  We
> already have a flag which does this when a block or inode allocation
> bitmap is corrupt, to avoid further damage to the file system.  We'll
> want a separate set of flags because this is something that userspace
> needs to be able to set and unset.

If the goal is shrinking the filesystem, then just having a high watermark
that limits allocations beyond a specific block number would be enough.
The high watermark could be set via /sys/fs/ext4/<devno>/block_high_watermark
or via tune2fs to store a 64-bit value into the superblock to make it
persistent.

We could potentially also make the kernel avoid allocating inodes located
in blocks beyond the high watermark, or have a separate inode_high_watermark
value to limit inode allocation, though this could have a separate issue
of coordination between the block and inode limits?

Just having a block_high_watermark makes the implementation rather easy.
It is checked in block allocation paths[*] instead of ext4_blocks_count()
(I don't think it would be correct to always return block_high_watermark
from ext4_blocks_count()), and in ext4_statfs() to reduce the number of
free/available blocks reported by the filesystem.

[*] at first glance - ext4_inode_to_goal_block(), ext4_set_resv_clusters(),
    ext4_mb_initialize_context(), and reserved_clusters_store()
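
As a rough illustration of the two places the limit would apply (the
allocation path and the statfs reporting), here is a small userspace model
in Python.  It is not kernel code: BlockAllocator, set_high_watermark() and
statfs_free() are hypothetical names for this sketch, and the linear scan
only stands in for the real allocator.

```python
import errno


class BlockAllocator:
    """Toy model of a block allocator with a high watermark (hypothetical)."""

    def __init__(self, blocks_count):
        self.blocks_count = blocks_count    # total blocks in the filesystem
        self.high_watermark = blocks_count  # no restriction by default
        self.in_use = set()

    def set_high_watermark(self, block):
        # Blocks numbered >= block become off-limits for new allocations.
        self.high_watermark = min(block, self.blocks_count)

    def allocate(self, goal=0):
        # The watermark, not blocks_count, bounds the search.
        limit = self.high_watermark
        if goal >= limit:
            goal = 0  # a goal inside the restricted area is ignored
        for offset in range(limit):
            blk = (goal + offset) % limit
            if blk not in self.in_use:
                self.in_use.add(blk)
                return blk
        raise OSError(errno.ENOSPC, "no free blocks below the high watermark")

    def statfs_free(self):
        # Report only the free blocks that can still be allocated.
        limit = self.high_watermark
        used_below = sum(1 for b in self.in_use if b < limit)
        return limit - used_below
```

With 8 blocks and the watermark set to 4, only blocks 0-3 are handed out and
statfs_free() drops to the number of free blocks below the limit, regardless
of how much space is free above it.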

> The tricky policy bit is what to do if there is an allocation failure,
> and should we advertise this via the statfs(2) system call or df(1)
> command.  That is, should we just return ENOSPC, or should we make it
> a "soft avoidance".  My preference is that it should be a soft
> avoidance, so that if the defrag process prevents block groups
> from being used, and it crashes leaving those flags set, we don't just
> have weird failures.  Or should we have something automatic such that
> if a process exits without clearing the flags, the flags are cleared
> automatically?

If the goal is to shrink the filesystem, it isn't clear why we'd want
to allow making this a soft failure?  If this is a desirable feature,
then it should definitely be tunable so that a "hard avoidance" mode
is available to prevent new files from moving into the restricted area.
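
The difference between the two modes can be sketched as a single policy
switch.  This is only a model in Python under assumed names (pick_block()
and its "hard" parameter are hypothetical, not an existing ext4 tunable):

```python
import errno


def pick_block(free_blocks, high_watermark, hard=True):
    """Choose a free block number, honouring the watermark (hypothetical).

    free_blocks is an iterable of free block numbers.
    """
    free = sorted(free_blocks)
    below = [b for b in free if b < high_watermark]
    if below:
        return below[0]
    if not hard and free:
        # Soft avoidance: fall back into the restricted area rather than fail.
        return free[0]
    # Hard avoidance: the restricted area is never used for new allocations.
    raise OSError(errno.ENOSPC, "no allocatable block")
```

In hard mode the caller sees ENOSPC as soon as the unrestricted area is
full; in soft mode allocations spill back above the watermark, which undoes
part of the evacuation work.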

> I would then suggest that you implement a much more efficient block
> movement in fs/ext4/move_extent.c first.  That's because it doesn't
> require a format change, and also because fixing this will have the
> most benefit; if the block movement can be done on-line,
> resize2fs can be run off-line to move the inodes, and that will take
> much less time.
> 

>> 3.  Utilize defrag (variant) code to re-allocate extents (will require
>> full-system scan of all extent trees to find which extents need to be
>> re-allocated) and inodes (this will likely need to be a single logical
>> change, with a separate path to move code from e4defrag that can be
>> shared, care to be taken if a mount-point is masking files inside a
>> directory that represents a mountpoint, or possibly just warn that step
>> 3.6 may fail):
>> 3.1.  Use (1) to mark that we want to start a filesystem reduce.
>> 3.2.  Use ioctl() in 2.3 to re-allocate (forward) all inodes that will
>> no longer be available.
>> 3.4.  Open the root (/) inode, and initiate a readdir() scan of the
>> whole filesystem, serving two purposes:
>> 3.4.1.  It will trigger the code in 2.1 and 2.2 to update forwarding
>> references.
>> 3.4.2.  It will allow us to find files utilizing blocks that need to be
>> re-allocated (get_file_extents as per current e4defrag code), and to
>> re-allocate those.
>> 3.5.  Scan the extents for bad blocks and free any blocks that will no
>> longer be under consideration.
>> 3.6.  Scan any other inodes specifically that may have blocks/inodes in
>> the upper range.
> 
> Yes, roughly speaking, as I mentioned in another message which I
> cc'ed you on.  One caution here is that e4defrag is in really bad shape
> in terms of code maintainability.  So cleaning this up before you start
> making changes would be a good thing.  Note that once you have (1)
> implemented, e4defrag will pretty much do what you want in terms of
> evacuating data blocks for regular files out of the file system.  The only
> change you'll need to make is the criteria for operating on the file
> --- instead of it being whether or not the file is fragmented, it
> would be if it has blocks that need to be evacuated.
> 
> Also note that this is not a complete solution for moving all blocks, since
> it doesn't take into account
> 
>   * directory blocks
>   * extended attribute blocks
>   * symlink blocks (for non-fast symlinks)
>   * extent tree blocks
> 
> However, it will solve the vast majority (> 95%) of the blocks, so
> again, if you can move the data blocks on-line, running resize2fs
> off-line to complete the shrink will still significantly reduce the
> downtime of the file system.
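
The changed selection criterion described above (operate on a file if it has
blocks that need to be evacuated, rather than if it is fragmented) could look
roughly like this in Python.  The Extent tuple and both function names are
assumptions for illustration; how the extent list is obtained (for example
the way e4defrag's get_file_extents does it) is left out.

```python
from typing import NamedTuple


class Extent(NamedTuple):
    logical: int   # first logical block within the file
    physical: int  # first physical block on disk
    length: int    # number of blocks in the extent


def extents_to_evacuate(extents, high_watermark):
    """Return the extents that reach at or beyond the watermark block."""
    return [e for e in extents if e.physical + e.length > high_watermark]


def needs_evacuation(extents, high_watermark):
    """True if any data block of the file lies in the restricted area."""
    return bool(extents_to_evacuate(extents, high_watermark))
```

An extent ending exactly at the watermark (its last block is watermark - 1)
is left alone; one that straddles the watermark is selected as a whole.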

Depending on how patient you are and the churn rate of the filesystem,
you could potentially just set the high watermark(s) and leave the
filesystem alone for days/weeks as in-use inodes/blocks migrate out of
the restricted space.  We could consider something similar for directory
blocks - opportunistically dropping blocks at the end of a directory
(or potentially in the middle if EXT4_FEATURE_INCOMPAT_LARGEDIR is set)
when the last direntry in the block is deleted so that we don't need to
migrate the unused directory blocks.
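
A toy model of the trailing-block case, in Python, assuming a directory is
just a list of blocks that each hold a set of names (delete_entry() is a
hypothetical helper; the mid-directory case is not modelled):

```python
def delete_entry(dir_blocks, name):
    """Remove name and drop directory blocks left empty at the end.

    dir_blocks is a list of sets of names, one set per directory block.
    Returns the number of blocks the directory still occupies.
    """
    for block in dir_blocks:
        if name in block:
            block.remove(name)
            break
    else:
        raise FileNotFoundError(name)
    # Trim empty blocks from the end, always keeping the first block.
    while len(dir_blocks) > 1 and not dir_blocks[-1]:
        dir_blocks.pop()
    return len(dir_blocks)
```

An empty block in the middle stays in place until everything after it is
empty as well, at which point it is trimmed together with the tail.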

Optionally, you could use cp/mv/rsync to speed the movement of files
and directories within the restricted area, and the kernel limits on
allocation ensure that inodes/blocks will only move out of the restricted
area of the filesystem and never into it.  Not as transparent as e4defrag,
but if you aren't using NFS you probably don't care about the inode number
changes.
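
A minimal sketch of that copy-and-rename approach in Python, assuming the
allocation limit is already in place so the new copy lands outside the
restricted area.  rewrite_file() is a hypothetical helper; it does not deal
with hard links, ownership, files that are open elsewhere, or syncing data
to disk before the rename.

```python
import os
import shutil
import tempfile


def rewrite_file(path):
    """Copy path to a new inode in the same directory, then rename over it.

    Returns the inode number of the replacement file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".rewrite-")
    os.close(fd)
    try:
        shutil.copy2(path, tmp)  # copies data, mode and timestamps
        os.replace(tmp, path)    # atomic rename on POSIX filesystems
    except BaseException:
        # Do not leave the temporary copy behind on failure.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return os.stat(path).st_ino
```

The file keeps its name and contents but gets a new inode number, which is
the change an NFS client would notice.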

Cheers, Andreas

>> 1.  How do I go about writing (a) sensible test case(s) for this?
> 
> Use xfstests.  For a quick introduction to xfstests, and the test
> framework, this slide deck might be helpful:
> 
> 	https://thunk.org/gce-xfstests
> 
>> 2.  Based on the above process I expect it may actually be harder to
>> optimize the user-space resize than going online, can anyone concur with
>> my assessment?  My assessment in part is based on a very rudimentary
>> gleaning of the e4defrag code.
> 
> Most of the heavy lifting is actually done in the kernel code, so in
> terms of "harder" it probably still is the kernel, however, it was
> written to leverage as much existing code as possible, which is why
> fs/ext4/move_extent.c looks "simple".  To make it a lot faster so
> that I/O is not done block by block, a lot of new code will need to be
> written.  Also, as mentioned above, the e4defrag code needs a lot of
> code cleanup.
> 
> The userspace resize code is more complex because it doesn't have as much
> other kernel file system code to rely upon.  One of the things which
> makes life hard is trying to minimize or eliminate data loss if the
> resize operation is aborted.  We can rely on the journal in the
> kernel.  In resize2fs, we've been using very careful ordering of
> operations to try to minimize the risk of data loss on a crash.
> 
>> 5.  I only count three extra ioctls that need to be added in kernel
>> space.  The rest of the kernel changes affect existing code paths I
>> believe.  Is a separate module still worth it, and if it is, how would I
>> approach that?
> 
> Given the design we're playing around with, a separate module probably
> doesn't make sense.  That's because we're implementing the smallest
> bits in the kernel, with all of the more complex policy and "Command
> and control" in userspace.
> 
>> 6.  May I simply allocate the next available bit out of the feature set
>> for this or is there some central database where this needs to go into
>> (ie, step 2.0)?
> 
> We check in a bit assignment into the kernel and e2fsprogs, and that's
> how we reserve the bit.  There's a lot we can do without tackling the
> on-disk format changes, and we should get that done before we
> worry about reserving the feature bit --- but that part isn't hard.
> 
> Finally, implementing a new file system feature can take a *lot* of
> time.  As in, it's likely going to take the better part of the year.
> Doing the simple parts that also will improve e4defrag makes sense,
> and it may be that if the goal is optimizing your time, a partial
> solution which improves the eventual off-line shrink may be the best
> approach.  At the very least, we can do that and then you can see if
> you really want to follow through with a complete on-line shrink
> implementation.
> 
> Cheers,
> 
> 						- Ted

