[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL3q7H45Ym_QHPYaregfVvUDzaVpm5i62G8==yNQ3Bfd63Ffmw@mail.gmail.com>
Date: Wed, 14 Aug 2024 14:31:00 +0100
From: Filipe Manana <fdmanana@...nel.org>
To: Johannes Thumshirn <jth@...nel.org>
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>,
linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org,
Filipe Manana <fdmanana@...e.com>, Johannes Thumshirn <johannes.thumshirn@....com>
Subject: Re: [PATCH] btrfs: relax dev_replace rwsem usage on scrub with rst
On Wed, Aug 14, 2024 at 1:46 PM Johannes Thumshirn <jth@...nel.org> wrote:
>
> From: Johannes Thumshin <johannes.thumshirn@....com>
>
> Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of
> the RAID stripe-tree, we get the following splat from lockdep:
>
> BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started
>
> ============================================
> WARNING: possible recursive locking detected
> 6.11.0-rc3+ #592 Not tainted
> --------------------------------------------
> btrfs/4203 is trying to acquire lock:
> ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
> but task is already holding lock:
> ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
> other info that might help us debug this:
> Possible unsafe locking scenario:
>
> CPU0
> ----
> lock(&fs_info->dev_replace.rwsem);
> lock(&fs_info->dev_replace.rwsem);
>
> *** DEADLOCK ***
Is this really the full splat?
There should be a stack trace showing that the problem happens when
btrfs_map_block() is called within the scrub/dev replace code, no?
>
> May be due to missing lock nesting notation
>
> 1 lock held by btrfs/4203:
> #0: ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
> This fixes a deadlock on RAID stripe-tree where device replace performs a
> scrub operation, which in turn calls into btrfs_map_block() to find the
> physical location of the block.
>
> Cc: Filipe Manana <fdmanana@...e.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@....com>
> ---
>
>
> Signed-off-by: Johannes Thumshirn <jth@...nel.org>
> ---
> fs/btrfs/volumes.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 4b9b647a7e29..e5bd2bee912d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6459,6 +6459,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> int dev_replace_is_ongoing = 0;
> u16 num_alloc_stripes;
> u64 max_len;
> + bool rst;
The name is a bit confusing, something like "update_rst" is more
meaningful and clearly indicates it's a boolean/condition.
>
> ASSERT(bioc_ret);
>
> @@ -6475,6 +6476,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> if (io_geom.mirror_num > num_copies)
> return -EINVAL;
>
> + rst = btrfs_need_stripe_tree_update(fs_info, map->type);
> +
> map_offset = logical - map->start;
> io_geom.raid56_full_stripe_start = (u64)-1;
> max_len = btrfs_max_io_len(map, map_offset, &io_geom);
> @@ -6597,13 +6600,19 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> * For all other non-RAID56 profiles, just copy the target
> * stripe into the bioc.
> */
> + if (rst && dev_replace_is_ongoing)
> + up_read(&dev_replace->rwsem);
> for (int i = 0; i < io_geom.num_stripes; i++) {
> ret = set_io_stripe(fs_info, logical, length,
> &bioc->stripes[i], map, &io_geom);
So, why is this safe? The change log doesn't mention anything about
the chosen fix.
So even if this is called while we are not in the device replace code,
btrfs_need_stripe_tree_update() can return true.
In that case we unlock the device replace semaphore and can result in
a use-after-free on a device, like this:
1) btrfs_map_block() called while not in the device replace code
callchain, and there's a device replace for device X running in
parallel;
2) btrfs_need_stripe_tree_update() returns true;
3) we unlock device replace semaphore;
4) we call set_io_stripe() which makes bioc point to device X, which
is the old device (the one being replaced);
5) before we read lock the device replace semaphore at
btrfs_map_block(), the device replace finishes and frees device X;
6) the bioc still points to device X... and then it's used for doing IO later
Can't this happen? I don't see why not.
If this is safe we should have an explanation in the changelog about
the details.
Thanks.
> +
> if (ret < 0)
> break;
> io_geom.stripe_index++;
> }
> + if (rst && dev_replace_is_ongoing)
> + down_read(&dev_replace->rwsem);
> +
> }
>
> if (ret) {
>
> ---
> base-commit: 4ce21d87ae81a86b488e0d326682883485317dcb
> change-id: 20240814-dev_replace_rwsem-new-91c160579246
>
> Best regards,
> --
> Johannes Thumshirn <jth@...nel.org>
>
>
Powered by blists - more mailing lists