Message-ID: <CAL3q7H5L-j6Pe2zBmd07K1MyRXbEOO6C=-SB93PdoOQ+4spSOA@mail.gmail.com>
Date: Wed, 14 Aug 2024 15:12:47 +0100
From: Filipe Manana <fdmanana@...nel.org>
To: Johannes Thumshirn <jth@...nel.org>
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>,
linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org,
Filipe Manana <fdmanana@...e.com>, Johannes Thumshirn <johannes.thumshirn@....com>
Subject: Re: [PATCH] btrfs: relax dev_replace rwsem usage on scrub with rst
On Wed, Aug 14, 2024 at 2:31 PM Filipe Manana <fdmanana@...nel.org> wrote:
>
> On Wed, Aug 14, 2024 at 1:46 PM Johannes Thumshirn <jth@...nel.org> wrote:
> >
> > From: Johannes Thumshirn <johannes.thumshirn@....com>
> >
> > When running fstests' btrfs/011 with MKFS_OPTIONS="-O rst" to force the
> > usage of the RAID stripe-tree, we get the following splat from lockdep:
> >
> > BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started
> >
> > ============================================
> > WARNING: possible recursive locking detected
> > 6.11.0-rc3+ #592 Not tainted
> > --------------------------------------------
> > btrfs/4203 is trying to acquire lock:
> > ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> > but task is already holding lock:
> > ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> > other info that might help us debug this:
> > Possible unsafe locking scenario:
> >
> > CPU0
> > ----
> > lock(&fs_info->dev_replace.rwsem);
> > lock(&fs_info->dev_replace.rwsem);
> >
> > *** DEADLOCK ***
>
> Is this really the full splat?
> There should be a stack trace showing that the problem happens when
> btrfs_map_block() is called within the scrub/dev replace code, no?
>
>
> >
> > May be due to missing lock nesting notation
> >
> > 1 lock held by btrfs/4203:
> > #0: ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> > This fixes a deadlock on RAID stripe-tree where device replace performs a
> > scrub operation, which in turn calls into btrfs_map_block() to find the
> > physical location of the block.
> >
> > Cc: Filipe Manana <fdmanana@...e.com>
> > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@....com>
> > ---
> >
> >
> > Signed-off-by: Johannes Thumshirn <jth@...nel.org>
> > ---
> > fs/btrfs/volumes.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 4b9b647a7e29..e5bd2bee912d 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6459,6 +6459,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >  	int dev_replace_is_ongoing = 0;
> >  	u16 num_alloc_stripes;
> >  	u64 max_len;
> > +	bool rst;
>
> The name is a bit confusing, something like "update_rst" is more
> meaningful and clearly indicates it's a boolean/condition.
>
> >
> >  	ASSERT(bioc_ret);
> >
> > @@ -6475,6 +6476,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >  	if (io_geom.mirror_num > num_copies)
> >  		return -EINVAL;
> >
> > +	rst = btrfs_need_stripe_tree_update(fs_info, map->type);
> > +
> >  	map_offset = logical - map->start;
> >  	io_geom.raid56_full_stripe_start = (u64)-1;
> >  	max_len = btrfs_max_io_len(map, map_offset, &io_geom);
> > @@ -6597,13 +6600,19 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >  		 * For all other non-RAID56 profiles, just copy the target
> >  		 * stripe into the bioc.
> >  		 */
> > +		if (rst && dev_replace_is_ongoing)
> > +			up_read(&dev_replace->rwsem);
> >  		for (int i = 0; i < io_geom.num_stripes; i++) {
> >  			ret = set_io_stripe(fs_info, logical, length,
> >  					    &bioc->stripes[i], map, &io_geom);
>
> So, why is this safe? The change log doesn't mention anything about
> the chosen fix.
>
> So even if this is called while we are not in the device replace code,
> btrfs_need_stripe_tree_update() can return true.
> In that case we unlock the device replace semaphore, which can result
> in a use-after-free on a device, like this:
>
> 1) btrfs_map_block() called while not in the device replace code
> callchain, and there's a device replace for device X running in
> parallel;
>
> 2) btrfs_need_stripe_tree_update() returns true;
>
> 3) we unlock device replace semaphore;
>
> 4) we call set_io_stripe() which makes bioc point to device X, which
> is the old device (the one being replaced);
>
> 5) before we read lock the device replace semaphore again at
> btrfs_map_block(), the device replace finishes and frees device X;
>
> 6) the bioc still points to device X... and then it's used for doing IO later
>
> Can't this happen? I don't see why not.
> If this is safe we should have an explanation in the changelog about
> the details.
And, how does this patch actually fix the double locking problem?
Isn't the problem that the replace code ends up calling btrfs_map_block()
while holding a read lock on the semaphore and then btrfs_map_block()
does a read lock on it again?

I would suggest a different fix: make the device replace code store a
pointer (or pid) to the task running device replace, and at
btrfs_map_block() don't take the semaphore if "current" matches that
pointer/pid.

Wouldn't that work? Seems safe and simple to me.
>
> Thanks.
>
>
> > +
> >  			if (ret < 0)
> >  				break;
> >  			io_geom.stripe_index++;
> >  		}
> > +		if (rst && dev_replace_is_ongoing)
> > +			down_read(&dev_replace->rwsem);
> > +
> >  	}
> >
> >  	if (ret) {
> >
> > ---
> > base-commit: 4ce21d87ae81a86b488e0d326682883485317dcb
> > change-id: 20240814-dev_replace_rwsem-new-91c160579246
> >
> > Best regards,
> > --
> > Johannes Thumshirn <jth@...nel.org>
> >
> >