Message-ID: <516627f6-4568-4d4d-bfc2-0fcf6b870ad8@wdc.com>
Date: Mon, 15 Jul 2024 11:38:23 +0000
From: Johannes Thumshirn <Johannes.Thumshirn@....com>
To: Filipe Manana <fdmanana@...nel.org>, Johannes Thumshirn <jth@...nel.org>
CC: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba
<dsterba@...e.com>, "linux-btrfs@...r.kernel.org"
<linux-btrfs@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, Qu Wenruo <wqu@...e.com>
Subject: Re: [PATCH v3 1/3] btrfs: don't hold dev_replace rwsem over whole of
btrfs_map_block
On 15.07.24 13:29, Filipe Manana wrote:
> On Fri, Jul 12, 2024 at 8:49 AM Johannes Thumshirn <jth@...nel.org> wrote:
>>
>> From: Johannes Thumshirn <johannes.thumshirn@....com>
>>
>> Don't hold the dev_replace rwsem for the entirety of btrfs_map_block().
>>
>> It is only needed to protect
>> a) calls to find_live_mirror() and
>> b) calling into handle_ops_on_dev_replace().
>>
>> But there is no need to hold the rwsem for any kind of set_io_stripe()
>> calls.
>>
>> So relax taking the dev_replace rwsem to only protect both cases and check
>> if the device replace status has changed in the meantime, for which we have
>> to re-do the find_live_mirror() calls.
>>
>> This fixes a deadlock on raid-stripe-tree where device replace performs a
>> scrub operation, which in turn calls into btrfs_map_block() to find the
>> physical location of the block.
>>
>> Cc: Filipe Manana <fdmanana@...e.com>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@....com>
>> Reviewed-by: Josef Bacik <josef@...icpanda.com>
>> Reviewed-by: Qu Wenruo <wqu@...e.com>
>> ---
>> fs/btrfs/volumes.c | 28 +++++++++++++++++-----------
>> 1 file changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index fcedc43ef291..4209419244a1 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6650,14 +6650,9 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>> max_len = btrfs_max_io_len(map, map_offset, &io_geom);
>> *length = min_t(u64, map->chunk_len - map_offset, max_len);
>>
>> +again:
>> down_read(&dev_replace->rwsem);
>> dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
>> - /*
>> - * Hold the semaphore for read during the whole operation, write is
>> - * requested at commit time but must wait.
>> - */
>> - if (!dev_replace_is_ongoing)
>> - up_read(&dev_replace->rwsem);
>>
>> switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
>> case BTRFS_BLOCK_GROUP_RAID0:
>> @@ -6695,6 +6690,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>> "stripe index math went horribly wrong, got stripe_index=%u, num_stripes=%u",
>> io_geom.stripe_index, map->num_stripes);
>> ret = -EINVAL;
>> + up_read(&dev_replace->rwsem);
>> goto out;
>> }
>>
>> @@ -6710,6 +6706,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>> */
>> num_alloc_stripes += 2;
>>
>> + up_read(&dev_replace->rwsem);
>> +
>> /*
>> * If this I/O maps to a single device, try to return the device and
>> * physical block information on the stack instead of allocating an
>> @@ -6782,6 +6780,18 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>> goto out;
>> }
>>
>> + /*
>> + * Check if something changed the dev_replace state since
>> + * we've checked it for the last time and if so redo the whole
>> + * mapping operation.
>> + */
>> + down_read(&dev_replace->rwsem);
>> + if (dev_replace_is_ongoing !=
>> + btrfs_dev_replace_is_ongoing(dev_replace)) {
>> + up_read(&dev_replace->rwsem);
>> + goto again;
>
> We previously allocated bioc, so before the goto we have to free it
> (call btrfs_put_bioc(bioc)), otherwise we'll leak it as after the goto
> we end up allocating a new one.
>
> Otherwise it looks fine, thanks.
>
Good catch, will update.
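
For reference, a minimal userspace sketch of the leak described above and of the put-before-retry fix. This is not the btrfs code: struct ctx, ctx_alloc(), ctx_put(), map_block() and leaked_after_run() are hypothetical stand-ins for the bioc refcounting and the "goto again" retry, and the dev_replace state change is simulated with a counter instead of a real rwsem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted I/O context such as the bioc. */
struct ctx {
	int refs;
};

static int live_ctx;		/* contexts currently allocated */
static bool replace_ongoing;	/* simulated dev_replace state */
static int flips_left;		/* simulated concurrent state changes */

static struct ctx *ctx_alloc(void)
{
	struct ctx *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->refs = 1;
	live_ctx++;
	return c;
}

static void ctx_put(struct ctx *c)
{
	assert(c && c->refs > 0);
	if (--c->refs == 0) {
		free(c);
		live_ctx--;
	}
}

/* Simulate another task changing the state while no lock is held. */
static void maybe_flip(void)
{
	if (flips_left > 0) {
		flips_left--;
		replace_ongoing = !replace_ongoing;
	}
}

/*
 * Snapshot the state, allocate, then re-check the state. If it changed,
 * retry from the top. Without the put, every retry leaks one context.
 */
static struct ctx *map_block(bool put_before_retry)
{
	struct ctx *c;
	bool seen;

again:
	seen = replace_ongoing;		/* first check */
	c = ctx_alloc();
	if (!c)
		return NULL;
	maybe_flip();			/* window where the state may change */
	if (seen != replace_ongoing) {	/* second check */
		if (put_before_retry)
			ctx_put(c);	/* drop the stale context first */
		goto again;
	}
	return c;
}

/* Run one mapping, release the result, report contexts still allocated. */
int leaked_after_run(bool put_before_retry, int flips)
{
	struct ctx *c;

	live_ctx = 0;
	replace_ongoing = false;
	flips_left = flips;
	c = map_block(put_before_retry);
	if (c)
		ctx_put(c);
	return live_ctx;
}
```

With two simulated state changes, leaked_after_run(false, 2) leaves two contexts allocated, one per retry, while leaked_after_run(true, 2) leaves none.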