[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <680C3979-C9CA-453C-A973-585E9B880263@dilger.ca>
Date: Tue, 27 Aug 2019 14:53:48 -0600
From: Andreas Dilger <adilger@...ger.ca>
To: harshad shirwadkar <harshadshirwadkar@...il.com>
Cc: Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: Re: [PATCH] ext4: attempt to shrink directory on dentry removal
On Aug 26, 2019, at 3:46 PM, harshad shirwadkar <harshadshirwadkar@...il.com> wrote:
>
> I see, this is an interesting approach. I think we can do this without
> modifying on-disk format and by bearing the cost of 2 extra reads per
> merge. Whenever dentry deletion results in a dirent block that is
> _sufficiently_ empty, we can look at its parent dx block and and find
> its neighbors that can be merged. We can be aggressive and do this for
> every dentry deletion or we could do this whenever the current dirent
> block is half empty or something.
The point of storing the "leaf block fullness" into the high bits of
dx_entry->block is that this information would always be available
during htree traversal, even if the leaf blocks are not in memory.
That would allow the unlink code to make a decision in advance whether
shrinking is going to happen (e.g. leaf block plus one neighbour are
under 50% full).
The overhead would be to keep the "leaf block fullness" value updated
enough to be useful. In this regard, a lower-resolution field (maybe
1 or 2 bits is enough, representing { unset, <= 25% full }, or { unset,
<= 20% full, <= 40% full, > 40% full }. Anything above this threshold
isn't interesting to consider merging, so there is no need to track it.
> By this method we end up reading up to 2 extra blocks (one previous
> and one next) that are not going to be merged. That's the trade-off we
> have to make in order to avoid any changes to on-disk structure (If we
> modify the on-disk structure and store the fullness in the dx block,
> we would read only the blocks that need to be merged).
Right. As Ted wrote, we could do the shrinking optimistically if the
dx_entry field is "unset", but the blocks are in memory. However, in
when the leaf block is first read into memory, we could also update the
dx_entry fullness when it is sanity checked, so that we don't need to
keep checking whether the leaf block is empty or not. The dx_entry
update could be done only in memory and written opportunistically to
disk if the htree index block is modified. At worst, if the dx_entry
fullness is incorrect the merge could fail and the block is not freed.
> The same logic can also be applied to intermediate dx nodes as well.
> After every merge operation, we'll have a set of blocks that need to
> be freed. Once we know what blocks we can free, we can use Ted's idea
> of swapping them with the last block, one by one.
>
> Since merging approach also requires a way to free up directory
> blocks, I think we could first get a patch in that can free up
> directory blocks by swapping with the last block. Once we have that
> then we could implement merging.
Yes, definitely. I'm not suggesting that we don't need the ability
to free leaf/index blocks first. I just wanted to point out that we
will not see any blocks being freed until the directory is almost
completely (99%) empty. For testing purposes you could use 1KB blocks
and 250-character names (which means at most 3 entries per block), and
we would start seeing blocks being freed once the directory was below
about 30% full, but this does not represent a common use case.
Cheers, Andreas
> On Sun, Aug 25, 2019 at 10:07 PM Andreas Dilger <adilger@...ger.ca> wrote:
>>
>> On Aug 21, 2019, at 12:27 PM, Harshad Shirwadkar <harshadshirwadkar@...il.com> wrote:
>>>
>>> On every dentry deletion, this patch traverses directory inode in
>>> reverse direction and frees up empty dirent blocks until it finds a
>>> non-empty dirent block. We leverage the fact that we never clear
>>> dentry name when we delete dentrys by merging them with the previous
>>> one. So, even though the dirent block has only fake dentry which spans
>>> across the entire block, we can use the name in this dead entry to
>>> perform dx lookup and find intermediate dx node blocks as well as
>>> offset inside these blocks.
>>
>>
>> One high-level limitation with this implementation is that it is unlikely
>> to remove any directory blocks until the directory is almost entirely
>> empty since "rm -r" will return entries in hash order, which does not
>> match the order that the leaf blocks are allocated in the file. Even
>> worse, if files in the directory are not deleted in hash order, no leaf
>> block will be completely empty until about 99% of the files have been
>> deleted - assume 24-byte filenames in 4096-byte blocks means up to 128
>> entries per block, typically 3/4 full, or 1/96 entries will be left in
>> each block before it becomes empty.
>>
>> One option that was discussed in the past is to use the high 4 bits
>> of dx_entry->block (i.e. the opposite of dx_get_block()) to store the
>> "fullness" of each block (in 1/16th of a block, or 256-byte increments
>> for 4096-byte blocks) and try to merge entries into an adjacent block
>> if it becomes mostly empty (e.g. if the current block plus the neighbour
>> are below 50% full). That allows removing blocks much earlier as the
>> directory shrinks, rather than waiting until each block is completely
>> empty. A fullness of "0" would mean "unset", since we don't set it
>> yet, and once this feature is available there would never be a block
>> that is entirely empty.
>>
>>> As of now, we only support non-indexed directories and indexed
>>> directories with no intermediate dx nodes. This technique can also be
>>> used to remove intermediate dx nodes. But it needs a little more
>>> interesting logic to make that happen since we don't store directory
>>> entry name in intermediate nodes.
>>>
>>> Ran kvm-xfstests smoke test-suite and verified that there are no
>>> failures. Also, verified that when all the files are deleted in a
>>> directory, directory shrinks to either 4k for non-indexed directories
>>> or 8k for indexed directories with 1 level.
>>>
>>> This patch is an improvement over previous patch that I sent out
>>> earlier this month. So, if this patch looks alright, then we can drop
>>> the other shrinking patch.
>>>
>>
>>> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@...il.com>
>>>
>>> ---
>>> This patch supersedes the other directory shrinking patch sent in Aug
>>> 2019 ("ext4: shrink directory when last block is empty").
>>>
>>> fs/ext4/namei.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 176 insertions(+)
>>>
>>>
>>>
>>> +static inline bool is_empty_dirent_block(struct inode *dir,
>>> + struct buffer_head *bh)
>>> +{
>>> + struct ext4_dir_entry_2 *de = (struct ext4_dir_entry_2 *)bh->b_data;
>>> + int csum_size = 0;
>>> +
>>> + if (ext4_has_metadata_csum(dir->i_sb))
>>> + csum_size = sizeof(struct ext4_dir_entry_tail);
>>> +
>>> + return ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize) ==
>>> + dir->i_sb->s_blocksize - csum_size && de->inode == 0;
>>> +}
>>
>> This may not always detect empty directory blocks properly, because
>> ext4_generic_delete_entry() will only merge deleted entries with the
>> previous entry. It at least appears possible that if entries are not
>> deleted in the proper order (e.g. in reverse of the order they are
>> listed in the directory) there may be multiple empty entries in a block,
>> and the above check will fail.
>>
>> Instead, this checks should walk all entries in a block and return false
>> if any one of them has a non-zero de->inode. In the common case there
>> may be only a single entry, or the first entry will be used, so it
>> should be fairly quick to decide that the block cannot be removed.
>>
>> Another option is to change ext4_generic_delete_entry() to also try
>> to merge with the immediately following entry to ensure that an empty
>> block always has rec_len of the full blocksize. However, I think this
>> is probably not a worthwhile effort since it would be better to support
>> removing blocks that are partly empty rather than entirely empty.
>>
>>> @@ -2510,6 +2684,8 @@ static int ext4_delete_entry(handle_t *handle,
>>> if (unlikely(err))
>>> goto out;
>>>
>>> + ext4_try_dir_shrink(handle, dir);
>>> +
>>> return 0;
>>
>> I think it would be inefficient to try shrinking the directory after
>> _every_ directory entry is removed. Instead, there should be some
>> way to determine here if ext4_generic_delete_entry() removed the last
>> entry from the directory block, and only shrink in that case.
>>
>> Cheers, Andreas
>>
>>
>>
>>
>>
Cheers, Andreas
Download attachment "signature.asc" of type "application/pgp-signature" (874 bytes)
Powered by blists - more mailing lists