linux-kernel - Re: [PATCH] btrfs: relax dev_replace rwsem usage on scrub with rst

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAL3q7H45Ym_QHPYaregfVvUDzaVpm5i62G8==yNQ3Bfd63Ffmw@mail.gmail.com>
Date: Wed, 14 Aug 2024 14:31:00 +0100
From: Filipe Manana <fdmanana@...nel.org>
To: Johannes Thumshirn <jth@...nel.org>
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>, 
	linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Filipe Manana <fdmanana@...e.com>, Johannes Thumshirn <johannes.thumshirn@....com>
Subject: Re: [PATCH] btrfs: relax dev_replace rwsem usage on scrub with rst

On Wed, Aug 14, 2024 at 1:46 PM Johannes Thumshirn <jth@...nel.org> wrote:
>
> From: Johannes Thumshin <johannes.thumshirn@....com>
>
> Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of
> the RAID stripe-tree, we get the following splat from lockdep:
>
>  BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started
>
>  ============================================
>  WARNING: possible recursive locking detected
>  6.11.0-rc3+ #592 Not tainted
>  --------------------------------------------
>  btrfs/4203 is trying to acquire lock:
>  ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
>  but task is already holding lock:
>  ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
>  other info that might help us debug this:
>   Possible unsafe locking scenario:
>
>         CPU0
>         ----
>    lock(&fs_info->dev_replace.rwsem);
>    lock(&fs_info->dev_replace.rwsem);
>
>   *** DEADLOCK ***

Is this really the full splat?
There should be a stack trace showing that the problem happens when
btrfs_map_block() is called within the scrub/dev replace code, no?


>
>   May be due to missing lock nesting notation
>
>  1 lock held by btrfs/4203:
>   #0: ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
>
> This fixes a deadlock on RAID stripe-tree where device replace performs a
> scrub operation, which in turn calls into btrfs_map_block() to find the
> physical location of the block.
>
> Cc: Filipe Manana <fdmanana@...e.com>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@....com>
> ---
>
>
> Signed-off-by: Johannes Thumshirn <jth@...nel.org>
> ---
>  fs/btrfs/volumes.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 4b9b647a7e29..e5bd2bee912d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6459,6 +6459,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>         int dev_replace_is_ongoing = 0;
>         u16 num_alloc_stripes;
>         u64 max_len;
> +       bool rst;

The name is a bit confusing, something like "update_rst" is more
meaningful and clearly indicates it's a boolean/condition.

>
>         ASSERT(bioc_ret);
>
> @@ -6475,6 +6476,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>         if (io_geom.mirror_num > num_copies)
>                 return -EINVAL;
>
> +       rst = btrfs_need_stripe_tree_update(fs_info, map->type);
> +
>         map_offset = logical - map->start;
>         io_geom.raid56_full_stripe_start = (u64)-1;
>         max_len = btrfs_max_io_len(map, map_offset, &io_geom);
> @@ -6597,13 +6600,19 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
>                  * For all other non-RAID56 profiles, just copy the target
>                  * stripe into the bioc.
>                  */
> +               if (rst && dev_replace_is_ongoing)
> +                       up_read(&dev_replace->rwsem);
>                 for (int i = 0; i < io_geom.num_stripes; i++) {
>                         ret = set_io_stripe(fs_info, logical, length,
>                                             &bioc->stripes[i], map, &io_geom);

So, why is this safe? The change log doesn't mention anything about
the chosen fix.

So even if this is called while we are not in the device replace code,
btrfs_need_stripe_tree_update() can return true.
In that case we unlock the device replace semaphore and can result in
a use-after-free on a device, like this:

1) btrfs_map_block() called while not in the device replace code
callchain, and there's a device replace for device X running in
parallel;

2) btrfs_need_stripe_tree_update() returns true;

3) we unlock device replace semaphore;

4) we call set_io_stripe() which makes bioc point to device X, which
is the old device (the one being replaced);

5) before we read lock the device replace semaphore at
btrfs_map_block(), the device replace finishes and frees device X;

6) the bioc still points to device X... and then it's used for doing IO later

Can't this happen? I don't see why not.
If this is safe we should have an explanation in the changelog about
the details.

Thanks.


> +
>                         if (ret < 0)
>                                 break;
>                         io_geom.stripe_index++;
>                 }
> +               if (rst && dev_replace_is_ongoing)
> +                       down_read(&dev_replace->rwsem);
> +
>         }
>
>         if (ret) {
>
> ---
> base-commit: 4ce21d87ae81a86b488e0d326682883485317dcb
> change-id: 20240814-dev_replace_rwsem-new-91c160579246
>
> Best regards,
> --
> Johannes Thumshirn <jth@...nel.org>
>
>