Message-ID: <CAL3q7H5L-j6Pe2zBmd07K1MyRXbEOO6C=-SB93PdoOQ+4spSOA@mail.gmail.com>
Date: Wed, 14 Aug 2024 15:12:47 +0100
From: Filipe Manana <fdmanana@...nel.org>
To: Johannes Thumshirn <jth@...nel.org>
Cc: Chris Mason <clm@...com>, Josef Bacik <josef@...icpanda.com>, David Sterba <dsterba@...e.com>,
linux-btrfs@...r.kernel.org, linux-kernel@...r.kernel.org,
Filipe Manana <fdmanana@...e.com>, Johannes Thumshirn <johannes.thumshirn@....com>
Subject: Re: [PATCH] btrfs: relax dev_replace rwsem usage on scrub with rst
On Wed, Aug 14, 2024 at 2:31 PM Filipe Manana <fdmanana@...nel.org> wrote:
>
> On Wed, Aug 14, 2024 at 1:46 PM Johannes Thumshirn <jth@...nel.org> wrote:
> >
> > From: Johannes Thumshirn <johannes.thumshirn@....com>
> >
> > When running fstests' btrfs/011 with MKFS_OPTIONS="-O rst" to force the
> > usage of the RAID stripe-tree, we get the following splat from lockdep:
> >
> > BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started
> >
> > ============================================
> > WARNING: possible recursive locking detected
> > 6.11.0-rc3+ #592 Not tainted
> > --------------------------------------------
> > btrfs/4203 is trying to acquire lock:
> > ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> > but task is already holding lock:
> > ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> > other info that might help us debug this:
> > Possible unsafe locking scenario:
> >
> > CPU0
> > ----
> > lock(&fs_info->dev_replace.rwsem);
> > lock(&fs_info->dev_replace.rwsem);
> >
> > *** DEADLOCK ***
>
> Is this really the full splat?
> There should be a stack trace showing that the problem happens when
> btrfs_map_block() is called within the scrub/dev replace code, no?
>
>
> >
> > May be due to missing lock nesting notation
> >
> > 1 lock held by btrfs/4203:
> > #0: ffff888103f35c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250
> >
> > This fixes a deadlock on RAID stripe-tree where device replace performs a
> > scrub operation, which in turn calls into btrfs_map_block() to find the
> > physical location of the block.
> >
> > Cc: Filipe Manana <fdmanana@...e.com>
> > Signed-off-by: Johannes Thumshirn <johannes.thumshirn@....com>
> > ---
> >
> >
> > Signed-off-by: Johannes Thumshirn <jth@...nel.org>
> > ---
> > fs/btrfs/volumes.c | 9 +++++++++
> > 1 file changed, 9 insertions(+)
> >
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 4b9b647a7e29..e5bd2bee912d 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6459,6 +6459,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >  	int dev_replace_is_ongoing = 0;
> >  	u16 num_alloc_stripes;
> >  	u64 max_len;
> > +	bool rst;
>
> The name is a bit confusing, something like "update_rst" is more
> meaningful and clearly indicates it's a boolean/condition.
>
> >
> >  	ASSERT(bioc_ret);
> >
> > @@ -6475,6 +6476,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >  	if (io_geom.mirror_num > num_copies)
> >  		return -EINVAL;
> >
> > +	rst = btrfs_need_stripe_tree_update(fs_info, map->type);
> > +
> >  	map_offset = logical - map->start;
> >  	io_geom.raid56_full_stripe_start = (u64)-1;
> >  	max_len = btrfs_max_io_len(map, map_offset, &io_geom);
> > @@ -6597,13 +6600,19 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> >  		 * For all other non-RAID56 profiles, just copy the target
> >  		 * stripe into the bioc.
> >  		 */
> > +		if (rst && dev_replace_is_ongoing)
> > +			up_read(&dev_replace->rwsem);
> >  		for (int i = 0; i < io_geom.num_stripes; i++) {
> >  			ret = set_io_stripe(fs_info, logical, length,
> >  					    &bioc->stripes[i], map, &io_geom);
>
> So, why is this safe? The change log doesn't mention anything about
> the chosen fix.
>
> So even if this is called while we are not in the device replace code,
> btrfs_need_stripe_tree_update() can return true.
> In that case we unlock the device replace semaphore, which can result
> in a use-after-free on a device, like this:
>
> 1) btrfs_map_block() called while not in the device replace code
> callchain, and there's a device replace for device X running in
> parallel;
>
> 2) btrfs_need_stripe_tree_update() returns true;
>
> 3) we unlock device replace semaphore;
>
> 4) we call set_io_stripe() which makes bioc point to device X, which
> is the old device (the one being replaced);
>
> 5) before we read lock the device replace semaphore again at
> btrfs_map_block(), the device replace finishes and frees device X;
>
> 6) the bioc still points to device X... and then it's used for doing IO later
>
> Can't this happen? I don't see why not.
> If this is safe we should have an explanation in the changelog about
> the details.
And, how does this patch actually fix the double locking problem?
Isn't the problem that the replace code ends up calling btrfs_map_block()
while holding a read lock on the semaphore and then btrfs_map_block()
does a read lock on it again?

I would suggest a different fix: make the device replace code store a
pointer (or pid) to the task running device replace, and at
btrfs_map_block() don't take the semaphore if "current" matches that
pointer/pid.

Wouldn't that work? Seems safe and simple to me.
>
> Thanks.
>
>
> > +
> >  			if (ret < 0)
> >  				break;
> >  			io_geom.stripe_index++;
> >  		}
> > +		if (rst && dev_replace_is_ongoing)
> > +			down_read(&dev_replace->rwsem);
> > +
> >  	}
> >
> >  	if (ret) {
> >
> > ---
> > base-commit: 4ce21d87ae81a86b488e0d326682883485317dcb
> > change-id: 20240814-dev_replace_rwsem-new-91c160579246
> >
> > Best regards,
> > --
> > Johannes Thumshirn <jth@...nel.org>
> >
> >