linux-kernel - Re: [PATCH v3 1/3] btrfs: don't hold dev_replace rwsem over whole of btrfs_map

Message-ID: <516627f6-4568-4d4d-bfc2-0fcf6b870ad8@wdc.com>
Date: Mon, 15 Jul 2024 11:38:23 +0000
From: Johannes Thumshirn <Johannes.Thumshirn@....com>
To: Filipe Manana <fdmanana@...nel.org>, Johannes Thumshirn <jth@...nel.org>
CC: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba
	<dsterba@...e.com>, "linux-btrfs@...r.kernel.org"
	<linux-btrfs@...r.kernel.org>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, Qu Wenruo <wqu@...e.com>
Subject: Re: [PATCH v3 1/3] btrfs: don't hold dev_replace rwsem over whole of
 btrfs_map_block

On 15.07.24 13:29, Filipe Manana wrote:
> On Fri, Jul 12, 2024 at 8:49 AM Johannes Thumshirn <jth@...nel.org> wrote:
>>
>> From: Johannes Thumshirn <johannes.thumshirn@....com>
>>
>> Don't hold the dev_replace rwsem for the entirety of btrfs_map_block().
>>
>> It is only needed to protect
>> a) calls to find_live_mirror() and
>> b) calling into handle_ops_on_dev_replace().
>>
>> But there is no need to hold the rwsem for any kind of set_io_stripe()
>> calls.
>>
>> So relax taking the dev_replace rwsem to only protect both cases and check
>> if the device replace status has changed in the meantime, for which we have
>> to re-do the find_live_mirror() calls.
>>
>> This fixes a deadlock on raid-stripe-tree where device replace performs a
>> scrub operation, which in turn calls into btrfs_map_block() to find the
>> physical location of the block.
>>
>> Cc: Filipe Manana <fdmanana@...e.com>
>> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@....com>
>> Reviewed-by: Josef Bacik <josef@...icpanda.com>
>> Reviewed-by: Qu Wenruo <wqu@...e.com>
>> ---
>>   fs/btrfs/volumes.c | 28 +++++++++++++++++-----------
>>   1 file changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index fcedc43ef291..4209419244a1 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -6650,14 +6650,9 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>>          max_len = btrfs_max_io_len(map, map_offset, &io_geom);
>>          *length = min_t(u64, map->chunk_len - map_offset, max_len);
>>
>> +again:
>>          down_read(&dev_replace->rwsem);
>>          dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
>> -       /*
>> -        * Hold the semaphore for read during the whole operation, write is
>> -        * requested at commit time but must wait.
>> -        */
>> -       if (!dev_replace_is_ongoing)
>> -               up_read(&dev_replace->rwsem);
>>
>>          switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
>>          case BTRFS_BLOCK_GROUP_RAID0:
>> @@ -6695,6 +6690,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>>                             "stripe index math went horribly wrong, got stripe_index=%u, num_stripes=%u",
>>                             io_geom.stripe_index, map->num_stripes);
>>                  ret = -EINVAL;
>> +               up_read(&dev_replace->rwsem);
>>                  goto out;
>>          }
>>
>> @@ -6710,6 +6706,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>>                   */
>>                  num_alloc_stripes += 2;
>>
>> +       up_read(&dev_replace->rwsem);
>> +
>>          /*
>>           * If this I/O maps to a single device, try to return the device and
>>           * physical block information on the stack instead of allocating an
>> @@ -6782,6 +6780,18 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>>                  goto out;
>>          }
>>
>> +       /*
>> +        * Check if something changed the dev_replace state since
>> +        * we've checked it for the last time and if so redo the whole
>> +        * mapping operation.
>> +        */
>> +       down_read(&dev_replace->rwsem);
>> +       if (dev_replace_is_ongoing !=
>> +           btrfs_dev_replace_is_ongoing(dev_replace)) {
>> +               up_read(&dev_replace->rwsem);
>> +               goto again;
> 
> We previously allocated bioc, so before the goto we have to free it
> (call btrfs_put_bioc(bioc)), otherwise we'll leak it as after the goto
> we end up allocating a new one.
> 
> Otherwise it looks fine, thanks.
> 

Good catch, will update.
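
For readers following the thread, the retry pattern from the patch and the
leak pointed out in review can be modelled in plain userspace C. This is only
a sketch under stated assumptions, not the kernel code and not the updated
patch: every name in it (sketch_rwsem, io_ctx, map_block_sketch, and so on) is
a hypothetical stand-in, the "rwsem" is a single-threaded counter rather than
a real lock, and the concurrent state change is simulated by a test hook. The
point it illustrates is the one raised above: the object allocated during a
pass has to be released before jumping back to the retry label, otherwise each
retry leaks one allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical single-threaded stand-in for a reader/writer semaphore. */
struct sketch_rwsem { int readers; };

static void sketch_down_read(struct sketch_rwsem *s) { s->readers++; }
static void sketch_up_read(struct sketch_rwsem *s)
{
	assert(s->readers > 0);	/* catches unbalanced unlocks */
	s->readers--;
}

/* Hypothetical model of the device-replace state. */
struct replace_state {
	struct sketch_rwsem rwsem;
	bool ongoing;
	int flips_left;	/* test hook: pending simulated state changes */
};

/* Hypothetical stand-in for the per-I/O context allocated while mapping. */
struct io_ctx { int mirror; };

static int live_allocs;	/* counts outstanding allocations to expose leaks */

static struct io_ctx *ctx_alloc(void)
{
	struct io_ctx *ctx = calloc(1, sizeof(*ctx));

	if (ctx)
		live_allocs++;
	return ctx;
}

static void ctx_put(struct io_ctx *ctx)
{
	free(ctx);
	live_allocs--;
}

/* Simulates another task changing the state while the lock is dropped. */
static void simulate_concurrent_change(struct replace_state *rs)
{
	assert(rs->rwsem.readers == 0);	/* only valid in the unlocked window */
	if (rs->flips_left > 0) {
		rs->ongoing = !rs->ongoing;
		rs->flips_left--;
	}
}

static struct io_ctx *map_block_sketch(struct replace_state *rs, int *retries)
{
	struct io_ctx *ctx;
	bool ongoing;

	*retries = 0;
again:
	/* Sample the state and pick a mirror under the lock, then drop it. */
	sketch_down_read(&rs->rwsem);
	ongoing = rs->ongoing;
	sketch_up_read(&rs->rwsem);

	/* Allocation and stripe setup run without the lock held. */
	ctx = ctx_alloc();
	if (!ctx)
		return NULL;
	ctx->mirror = ongoing ? 1 : 0;

	simulate_concurrent_change(rs);

	/* Re-check: if the state changed in the meantime, redo the mapping. */
	sketch_down_read(&rs->rwsem);
	if (ongoing != rs->ongoing) {
		sketch_up_read(&rs->rwsem);
		ctx_put(ctx);	/* release before retrying, or this pass leaks */
		(*retries)++;
		goto again;
	}
	/* Replace-specific handling would run here, still under the lock. */
	sketch_up_read(&rs->rwsem);
	return ctx;
}
```

Calling map_block_sketch() on a state with flips_left set to 2 should retry
twice and return with exactly one live allocation. Removing the ctx_put() call
before the goto leaves live_allocs at 3 in the same scenario, which is the
leak described in the review.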