Message-ID: <7dc44b53-8dac-d4ff-0af6-4a718967aa26@huaweicloud.com>
Date: Tue, 28 May 2024 21:12:35 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: linan666@...weicloud.com, song@...nel.org
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com, houtao1@...wei.com, yangerkun@...wei.com,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH 2/2] md: fix deadlock between mddev_suspend and flush bio
Hi,
On 2024/05/26 2:52, linan666@...weicloud.com wrote:
> From: Li Nan <linan122@...wei.com>
>
> Deadlock occurs when mddev is being suspended while some flush bio is in
> progress. It is a complex issue.
>
> T1. The first flush is at its final stage: it clears 'mddev->flush_bio'
>     and tries to submit data, but is blocked because mddev is suspended
>     by T4.
> T2. The second flush sets 'mddev->flush_bio' and attempts to queue
>     md_submit_flush_data(), which is already running (T1) and won't
>     execute again if on the same CPU as T1.
> T3. The third flush increments active_io and tries to flush, but is
>     blocked because 'mddev->flush_bio' is not NULL (set by T2).
> T4. mddev_suspend() is called and waits for active_io, which was
>     incremented by T3, to drop to 0.
>
> T1              T2              T3              T4
> (flush 1)       (flush 2)       (flush 3)       (suspend)
> md_submit_flush_data
>  mddev->flush_bio = NULL;
>  .
>  .              md_flush_request
>  .               mddev->flush_bio = bio
>  .               queue submit_flushes
>  .               .
>  .               .              md_handle_request
>  .               .               active_io + 1
>  .               .               md_flush_request
>  .               .                wait !mddev->flush_bio
>  .               .
>  .               .                              mddev_suspend
>  .               .                               wait !active_io
>  .               .
>  .               submit_flushes
>  .               queue_work md_submit_flush_data
>  .               //md_submit_flush_data is already running (T1)
>  .
>  md_handle_request
>   wait resume
>
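The four-way wait described above can be written down as a wait-for graph. The following is only a rough user-space sketch, not kernel code: the task names T1..T4 come from the timeline, while the waits_on[] array and has_wait_cycle() are hypothetical names made up for this illustration. Each entry records which task the blocked task needs in order to make progress.

```c
#include <stdbool.h>

/* Tasks from the timeline: three flushes and one suspend. */
enum { T1, T2, T3, T4, NTASKS };

/*
 * waits_on[i] is the task that task i cannot proceed without,
 * or -1 if task i is not blocked. Returns true if following the
 * edges from some task leads back to that same task.
 */
static bool has_wait_cycle(const int waits_on[], int n)
{
	for (int start = 0; start < n; start++) {
		int cur = start;

		/* A cycle through 'start' closes within n steps. */
		for (int steps = 0; steps < n && cur >= 0; steps++) {
			cur = waits_on[cur];
			if (cur == start)
				return true;
		}
	}
	return false;
}
```

With the edges T1 -> T4 (waits for resume), T4 -> T3 (waits for active_io to drop), T3 -> T2 (waits for flush_bio to be cleared) and T2 -> T1 (its work item is already running), the function reports a cycle. If T1 no longer has to wait for resume, the cycle is gone.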
> The root issue is that active_io is not held across the whole flush
> process. It is decremented before md_submit_flush_data() is queued and
> incremented again soon after md_submit_flush_data() runs:
>     md_flush_request
>       active_io + 1
>       submit_flushes
>         active_io - 1
>     md_submit_flush_data
>       md_handle_request
>         active_io + 1
>         make_request
>         active_io - 1
>
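The accounting above can be modelled in user space to see where the window opens. This is a minimal sketch under loud assumptions: struct md_model, model_get(), model_tryget(), model_put(), old_flow() and new_flow() are all hypothetical names, a plain int stands in for the percpu_ref, and "blocked until resume" is modelled as model_tryget() returning false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant mddev state. */
struct md_model {
	int active_io;		/* plain counter, not a real percpu_ref */
	bool suspended;		/* set once a suspend has started */
};

static void model_get(struct md_model *m)
{
	m->active_io++;
}

/* Models md_handle_request(): cannot take a reference while suspended. */
static bool model_tryget(struct md_model *m)
{
	if (m->suspended)
		return false;	/* the real code would wait for resume */
	m->active_io++;
	return true;
}

static void model_put(struct md_model *m)
{
	assert(m->active_io > 0);
	m->active_io--;
}

/* Old accounting: the reference is dropped and re-taken mid-flush. */
static bool old_flow(bool suspend_mid_flush)
{
	struct md_model m = { 0, false };

	model_get(&m);			/* md_flush_request: active_io + 1 */
	model_put(&m);			/* submit_flushes: active_io - 1 */
	m.suspended = suspend_mid_flush;	/* suspend lands in the gap */
	if (!model_tryget(&m))		/* md_handle_request: active_io + 1 */
		return false;		/* flush data is stuck */
	model_put(&m);			/* after make_request: active_io - 1 */
	return true;
}

/* Patched accounting: one reference is held from start to finish. */
static bool new_flow(bool suspend_mid_flush)
{
	struct md_model m = { 0, false };

	model_get(&m);			/* md_flush_request: active_io + 1 */
	m.suspended = suspend_mid_flush;	/* suspend must wait for the put */
	/* make_request() is called directly, no second reference needed */
	model_put(&m);			/* end of md_submit_flush_data */
	return true;
}
```

In this model old_flow(true) gets stuck at the second acquisition, while new_flow(true) completes because it never has to re-acquire the reference.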
> If active_io is decremented after md_handle_request() instead of within
> submit_flushes(), make_request() can be called directly instead of
> md_handle_request() in md_submit_flush_data(), and active_io will be
> incremented and decremented only once in the whole flush process. The
> deadlock will be fixed.
>
> Additionally, the only difference from before the fix is that there is
> no error-return handling for make_request(). But after the previous
> patch cleaned up md_write_start(), make_request() only returns an error
> in raid5_make_request() by dm-raid, see commit 41425f96d7aa
> ("dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io
> concurrent with reshape"). Since dm always splits data and flush
> operations into two separate ios, the io size of a flush submitted by dm
> is always 0, so make_request() will not be called in
> md_submit_flush_data(). To prevent future modifications from introducing
> issues, add a WARN_ON to ensure that make_request() returns no error in
> this context.
>
> Fixes: fa2bbff7b0b4 ("md: synchronize flush io with array reconfiguration")
> Signed-off-by: Li Nan <linan122@...wei.com>
The patch itself looks correct. However, there was a plan to remove
the flush handling and submit the flush bio directly to the underlying
disks, like dm does. That is because md_flush_request(), which is on the
fast path, grabs a disk-level spinlock, mddev->lock, and this affects
performance.
I'm fine with taking this patch first; I'll leave the decision to Song.
Thanks,
Kuai
> ---
> drivers/md/md.c | 26 +++++++++++++++-----------
> 1 file changed, 15 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 14d6e615bcbb..9bb7e627e57f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -550,13 +550,9 @@ static void md_end_flush(struct bio *bio)
>
>  	rdev_dec_pending(rdev, mddev);
>
> -	if (atomic_dec_and_test(&mddev->flush_pending)) {
> -		/* The pair is percpu_ref_get() from md_flush_request() */
> -		percpu_ref_put(&mddev->active_io);
> -
> +	if (atomic_dec_and_test(&mddev->flush_pending))
>  		/* The pre-request flush has finished */
>  		queue_work(md_wq, &mddev->flush_work);
> -	}
>  }
>
> static void md_submit_flush_data(struct work_struct *ws);
> @@ -587,12 +583,8 @@ static void submit_flushes(struct work_struct *ws)
>  			rcu_read_lock();
>  		}
>  	rcu_read_unlock();
> -	if (atomic_dec_and_test(&mddev->flush_pending)) {
> -		/* The pair is percpu_ref_get() from md_flush_request() */
> -		percpu_ref_put(&mddev->active_io);
> -
> +	if (atomic_dec_and_test(&mddev->flush_pending))
>  		queue_work(md_wq, &mddev->flush_work);
> -	}
>  }
>
> static void md_submit_flush_data(struct work_struct *ws)
> @@ -617,8 +609,20 @@ static void md_submit_flush_data(struct work_struct *ws)
>  		bio_endio(bio);
>  	} else {
>  		bio->bi_opf &= ~REQ_PREFLUSH;
> -		md_handle_request(mddev, bio);
> +
> +		/*
> +		 * make_request() will never return error here, it only
> +		 * returns error in raid5_make_request() by dm-raid.
> +		 * Since dm always splits data and flush operation into
> +		 * two separate io, io size of flush submitted by dm
> +		 * always is 0, make_request() will not be called here.
> +		 */
> +		if (WARN_ON_ONCE(!mddev->pers->make_request(mddev, bio)))
> +			bio_io_error(bio);
>  	}
> +
> +	/* The pair is percpu_ref_get() from md_flush_request() */
> +	percpu_ref_put(&mddev->active_io);
>  }
>
> /*
>