Message-ID: <CAL3q7H7Lhym2F82Fu=VDoD1uvvVFW2q9WVx_pM0jVf+VS=ji8A@mail.gmail.com>
Date: Mon, 15 Jul 2024 14:20:03 +0100
From: Filipe Manana <fdmanana@...nel.org>
To: Johannes Thumshirn <jth@...nel.org>
Cc: Johannes Thumshirn <johannes.thumshirn@....com>, Josef Bacik <josef@...icpanda.com>,
Qu Wenruo <wqu@...e.com>, Filipe Manana <fdmanana@...e.com>, Chris Mason <clm@...com>,
David Sterba <dsterba@...e.com>,
"open list:BTRFS FILE SYSTEM" <linux-btrfs@...r.kernel.org>, open list <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v4] btrfs: don't hold dev_replace rwsem over whole of btrfs_map_block
On Mon, Jul 15, 2024 at 2:13 PM Johannes Thumshirn <jth@...nel.org> wrote:
>
> From: Johannes Thumshirn <johannes.thumshirn@....com>
>
> Don't hold the dev_replace rwsem for the entirety of btrfs_map_block().
>
> It is only needed to protect
> a) calls to find_live_mirror() and
> b) calling into handle_ops_on_dev_replace().
>
> But there is no need to hold the rwsem for any kind of set_io_stripe()
> calls.
>
> So relax taking the dev_replace rwsem to only protect both cases and check
> if the device replace status has changed in the meantime, for which we have
> to re-do the find_live_mirror() calls.
>
> This fixes a deadlock on raid-stripe-tree where device replace performs a
> scrub operation, which in turn calls into btrfs_map_block() to find the
> physical location of the block.
>
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@....com>
> Reviewed-by: Josef Bacik <josef@...icpanda.com>
> Reviewed-by: Qu Wenruo <wqu@...e.com>
>
> ---
> Cc: Filipe Manana <fdmanana@...e.com>
>
> Changes in v4:
> - Free bioc in case we need to redo the mapping
> Link to v3:
> https://lore.kernel.org/linux-btrfs/20240712-b4-rst-updates-v3-1-5cf27dac98a7@kernel.org

Reviewed-by: Filipe Manana <fdmanana@...e.com>

Looks good now, thanks.

> ---
> fs/btrfs/volumes.c | 29 ++++++++++++++++++-----------
> 1 file changed, 18 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index fcedc43ef291..9437e779d020 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6650,14 +6650,9 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> max_len = btrfs_max_io_len(map, map_offset, &io_geom);
> *length = min_t(u64, map->chunk_len - map_offset, max_len);
>
> +again:
> down_read(&dev_replace->rwsem);
> dev_replace_is_ongoing = btrfs_dev_replace_is_ongoing(dev_replace);
> - /*
> - * Hold the semaphore for read during the whole operation, write is
> - * requested at commit time but must wait.
> - */
> - if (!dev_replace_is_ongoing)
> - up_read(&dev_replace->rwsem);
>
> switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> case BTRFS_BLOCK_GROUP_RAID0:
> @@ -6695,6 +6690,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> "stripe index math went horribly wrong, got stripe_index=%u, num_stripes=%u",
> io_geom.stripe_index, map->num_stripes);
> ret = -EINVAL;
> + up_read(&dev_replace->rwsem);
> goto out;
> }
>
> @@ -6710,6 +6706,8 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> */
> num_alloc_stripes += 2;
>
> + up_read(&dev_replace->rwsem);
> +
> /*
> * If this I/O maps to a single device, try to return the device and
> * physical block information on the stack instead of allocating an
> @@ -6782,6 +6780,19 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> goto out;
> }
>
> + /*
> + * Check if something changed the dev_replace state since
> + * we've checked it for the last time and if so, redo the
> + * whole mapping operation.
> + */
> + down_read(&dev_replace->rwsem);
> + if (dev_replace_is_ongoing !=
> + btrfs_dev_replace_is_ongoing(dev_replace)) {
> + btrfs_put_bioc(bioc);
> + up_read(&dev_replace->rwsem);
> + goto again;
> + }
> +
> if (op != BTRFS_MAP_READ)
> io_geom.max_errors = btrfs_chunk_max_errors(map);
>
> @@ -6789,6 +6800,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> op != BTRFS_MAP_READ) {
> handle_ops_on_dev_replace(bioc, dev_replace, logical, &io_geom);
> }
> + up_read(&dev_replace->rwsem);
>
> *bioc_ret = bioc;
> bioc->num_stripes = io_geom.num_stripes;
> @@ -6796,11 +6808,6 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
> bioc->mirror_num = io_geom.mirror_num;
>
> out:
> - if (dev_replace_is_ongoing) {
> - lockdep_assert_held(&dev_replace->rwsem);
> - /* Unlock and let waiting writers proceed */
> - up_read(&dev_replace->rwsem);
> - }
> btrfs_free_chunk_map(map);
> return ret;
> }
> --
> 2.43.0
>
>