Message-ID: <7dc44b53-8dac-d4ff-0af6-4a718967aa26@huaweicloud.com>
Date: Tue, 28 May 2024 21:12:35 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: linan666@...weicloud.com, song@...nel.org
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
 yi.zhang@...wei.com, houtao1@...wei.com, yangerkun@...wei.com,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH 2/2] md: fix deadlock between mddev_suspend and flush bio

Hi,

On 2024/05/26 2:52, linan666@...weicloud.com wrote:
> From: Li Nan <linan122@...wei.com>
> 
> Deadlock occurs when mddev is being suspended while some flush bio is in
> progress. It is a complex issue.
> 
> T1. the first flush is at the ending stage, it clears 'mddev->flush_bio'
>      and tries to submit data, but is blocked because mddev is suspended
>      by T4.
> T2. the second flush sets 'mddev->flush_bio', and attempts to queue
>      md_submit_flush_data(), which is already running (T1) and won't
>      execute again if on the same CPU as T1.
> T3. the third flush inc active_io and tries to flush, but is blocked because
>      'mddev->flush_bio' is not NULL (set by T2).
> T4. mddev_suspend() is called and waits for active_io dec to 0 which is inc
>      by T3.
> 
>    T1		T2		T3		T4
>    (flush 1)	(flush 2)	(flush 3)	(suspend)
>    md_submit_flush_data
>     mddev->flush_bio = NULL;
>     .
>     .	 	md_flush_request
>     .	  	 mddev->flush_bio = bio
>     .	  	 queue submit_flushes
>     .		 .
>     .		 .		md_handle_request
>     .		 .		 active_io + 1
>     .		 .		 md_flush_request
>     .		 .		  wait !mddev->flush_bio
>     .		 .
>     .		 .				mddev_suspend
>     .		 .				 wait !active_io
>     .		 .
>     .		 submit_flushes
>     .		 queue_work md_submit_flush_data
>     .		 //md_submit_flush_data is already running (T1)
>     .
>     md_handle_request
>      wait resume
> 
> The root issue is the non-atomic inc/dec of active_io during the flush
> process. active_io is dec before md_submit_flush_data() is queued, and
> inc again soon after md_submit_flush_data() runs.
>    md_flush_request
>      active_io + 1
>      submit_flushes
>        active_io - 1
>        md_submit_flush_data
>          md_handle_request
>          active_io + 1
>            make_request
>          active_io - 1
> 
> If active_io is dec after md_handle_request() instead of within
> submit_flushes(), make_request() can be called directly instead of
> md_handle_request() in md_submit_flush_data(), and active_io will
> only inc and dec once in the whole flush process. Deadlock will be
> fixed.
> 
> Additionally, the only difference from the previous behavior is that
> there is no handling of an error returned by make_request(). But after
> the previous patch cleaned up md_write_start(), make_request() only
> returns an error in raid5_make_request() for dm-raid, see commit
> 41425f96d7aa ("dm-raid456, md/raid456: fix a deadlock for dm-raid456
> while io concurrent with reshape"). Since dm always splits data and
> flush operations into two separate ios, the io size of a flush submitted
> by dm is always 0, and make_request() will not be called in
> md_submit_flush_data(). To prevent future modifications from introducing
> issues, add WARN_ON to ensure make_request() returns no error in this
> context.
> 
> Fixes: fa2bbff7b0b4 ("md: synchronize flush io with array reconfiguration")
> Signed-off-by: Li Nan <linan122@...wei.com>
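
For reference, the active_io accounting described in the quoted commit
message can be modelled in userspace. This is only a toy sketch, not
kernel code: a plain int stands in for the percpu_ref mddev->active_io,
and the names old_flow, new_flow, final_refs(), midflight_zero_count()
and check_flows() are hypothetical, made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy userspace model (not kernel code): each array lists the changes a
 * single flush applies to a plain counter standing in for mddev->active_io.
 */
const int old_flow[] = { +1, -1, +1, -1 }; /* get, put, re-get, put */
const int new_flow[] = { +1, -1 };         /* one get, one put */

/* Hypothetical helper: counter value after all steps have run. */
int final_refs(const int *deltas, size_t n)
{
	int refs = 0;

	for (size_t i = 0; i < n; i++)
		refs += deltas[i];
	return refs;
}

/*
 * Hypothetical helper: how many times the counter reaches zero before the
 * last step. Each hit means the flush must take the reference again later.
 */
int midflight_zero_count(const int *deltas, size_t n)
{
	int refs = 0, zeros = 0;

	for (size_t i = 0; i < n; i++) {
		refs += deltas[i];
		if (refs == 0 && i + 1 < n)
			zeros++;
	}
	return zeros;
}

/* Self-check of the two sequences above. */
void check_flows(void)
{
	assert(final_refs(old_flow, 4) == 0);
	assert(final_refs(new_flow, 2) == 0);
	assert(midflight_zero_count(old_flow, 4) == 1);
	assert(midflight_zero_count(new_flow, 2) == 0);
}
```

Both sequences end balanced, but only the old one drops its reference
in the middle of the flush and has to take it again, which is the point
where a concurrent mddev_suspend() can get in the way.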

The patch itself looks correct. However, there was a plan to remove
the flush handling and submit the flush bio directly to the underlying
disks, as dm does, because md_flush_request(), which is in the fast
path, grabs the disk-level spinlock mddev->lock, and that hurts
performance.

I'm fine with taking this patch first; I'll leave the decision to Song.

Thanks,
Kuai


> ---
>   drivers/md/md.c | 26 +++++++++++++++-----------
>   1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 14d6e615bcbb..9bb7e627e57f 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -550,13 +550,9 @@ static void md_end_flush(struct bio *bio)
>   
>   	rdev_dec_pending(rdev, mddev);
>   
> -	if (atomic_dec_and_test(&mddev->flush_pending)) {
> -		/* The pair is percpu_ref_get() from md_flush_request() */
> -		percpu_ref_put(&mddev->active_io);
> -
> +	if (atomic_dec_and_test(&mddev->flush_pending))
>   		/* The pre-request flush has finished */
>   		queue_work(md_wq, &mddev->flush_work);
> -	}
>   }
>   
>   static void md_submit_flush_data(struct work_struct *ws);
> @@ -587,12 +583,8 @@ static void submit_flushes(struct work_struct *ws)
>   			rcu_read_lock();
>   		}
>   	rcu_read_unlock();
> -	if (atomic_dec_and_test(&mddev->flush_pending)) {
> -		/* The pair is percpu_ref_get() from md_flush_request() */
> -		percpu_ref_put(&mddev->active_io);
> -
> +	if (atomic_dec_and_test(&mddev->flush_pending))
>   		queue_work(md_wq, &mddev->flush_work);
> -	}
>   }
>   
>   static void md_submit_flush_data(struct work_struct *ws)
> @@ -617,8 +609,20 @@ static void md_submit_flush_data(struct work_struct *ws)
>   		bio_endio(bio);
>   	} else {
>   		bio->bi_opf &= ~REQ_PREFLUSH;
> -		md_handle_request(mddev, bio);
> +
> +		/*
> +		 * make_request() will never return error here, it only
> +		 * returns error in raid5_make_request() by dm-raid.
> +		 * Since dm always splits data and flush operation into
> +		 * two separate io, io size of flush submitted by dm
> +		 * always is 0, make_request() will not be called here.
> +		 */
> +		if (WARN_ON_ONCE(!mddev->pers->make_request(mddev, bio)))
> +			bio_io_error(bio);
>   	}
> +
> +	/* The pair is percpu_ref_get() from md_flush_request() */
> +	percpu_ref_put(&mddev->active_io);
>   }
>   
>   /*
>