linux-kernel - Re: [PATCH v5 01/14] md: don't ignore suspended array in md_check


Message-ID: <CALTww2-ZhRBJOD3jXs=xKFaD=iR=dtoC9h2rUQi5Stpi+tJ9Bw@mail.gmail.com>
Date: Fri, 16 Feb 2024 14:58:40 +0800
From: Xiao Ni <xni@...hat.com>
To: Yu Kuai <yukuai1@...weicloud.com>
Cc: mpatocka@...hat.com, heinzm@...hat.com, blazej.kucman@...ux.intel.com, 
	agk@...hat.com, snitzer@...nel.org, dm-devel@...ts.linux.dev, song@...nel.org, 
	yukuai3@...wei.com, jbrassow@....redhat.com, neilb@...e.de, shli@...com, 
	akpm@...l.org, linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org, 
	yi.zhang@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH v5 01/14] md: don't ignore suspended array in md_check_recovery()

On Thu, Feb 1, 2024 at 5:30 PM Yu Kuai <yukuai1@...weicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@...wei.com>
>
> mddev_suspend() never stops sync_thread, so it doesn't make sense to
> ignore a suspended array in md_check_recovery(); doing so can prevent
> sync_thread from being unregistered.
>
> After commit f52f5c71f3d4 ("md: fix stopping sync thread"), the following
> hang can be triggered by the test shell/integrity-caching.sh:

Hi Kuai

After applying this patch, the test still hangs in mddev_suspend. The
deadlock may be fixed by other patches in this set, but this patch alone
does not fix it. If so, the commit message is not accurate.

>
> 1) suspend the array:
> raid_postsuspend
>  mddev_suspend
>
> 2) stop the array:
> raid_dtr
>  md_stop
>   __md_stop_writes
>    stop_sync_thread
>     set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>     md_wakeup_thread_directly(mddev->sync_thread);
>     wait_event(..., !test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
>
> 3) sync thread done:
> md_do_sync
>  set_bit(MD_RECOVERY_DONE, &mddev->recovery);
>  md_wakeup_thread(mddev->thread);
>
> 4) daemon thread can't unregister sync thread:
> md_check_recovery
>  if (mddev->suspended)
>    return; -> return directly
>  md_reap_sync_thread
>  clear_bit(MD_RECOVERY_RUNNING, &mddev->recovery);
>  -> MD_RECOVERY_RUNNING can't be cleared, hence step 2 hangs;
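
To make the four steps above concrete, here is a minimal userspace sketch in C.
It is only a model under stated assumptions: the struct, the flag constants and
the function names (model_mddev, model_check_recovery, model_stop_completes) are
hypothetical stand-ins rather than the code in drivers/md/md.c, and reaping the
sync thread is reduced to clearing two bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; they only mirror the names used in the steps above. */
enum {
	MODEL_RECOVERY_RUNNING = 1u << 0,
	MODEL_RECOVERY_INTR    = 1u << 1,
	MODEL_RECOVERY_DONE    = 1u << 2,
};

/* Hypothetical, heavily reduced stand-in for struct mddev. */
struct model_mddev {
	bool suspended;
	unsigned int recovery;
};

/*
 * Model of step 4: the daemon thread reaps a finished sync thread.
 * skip_when_suspended selects the behaviour before the patch (true)
 * or after it (false).
 */
void model_check_recovery(struct model_mddev *m, bool skip_when_suspended)
{
	assert(m != NULL);

	if (skip_when_suspended && m->suspended)
		return;	/* pre-patch: return before the sync thread is reaped */

	if (m->recovery & MODEL_RECOVERY_DONE)
		m->recovery &= ~(MODEL_RECOVERY_RUNNING | MODEL_RECOVERY_DONE);
}

/* Replays steps 1-4; returns true if the waiter in step 2 could proceed. */
bool model_stop_completes(bool skip_when_suspended)
{
	struct model_mddev m = { .suspended = false,
				 .recovery = MODEL_RECOVERY_RUNNING };

	m.suspended = true;				/* step 1: suspend the array */
	m.recovery |= MODEL_RECOVERY_INTR;		/* step 2: ask sync thread to stop */
	m.recovery |= MODEL_RECOVERY_DONE;		/* step 3: sync thread is done */
	model_check_recovery(&m, skip_when_suspended);	/* step 4: daemon thread runs */

	return !(m.recovery & MODEL_RECOVERY_RUNNING);
}
```

In this model, model_stop_completes(true) leaves the RUNNING bit set, which
corresponds to the wait_event() in step 2 never returning, while
model_stop_completes(false) clears it. The model says nothing about whether
mddev_suspend itself can block.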

I added some debug logs while stopping the dm-raid device with the lvremove
command. The steps you describe run sequentially, not asynchronously. The
call path is:
dev_remove->dm_destroy->__dm_destroy->dm_table_postsuspend_targets(raid_postsuspend)
-> dm_table_destroy(raid_dtr). It looks like mddev_suspend is waiting
for active_io to drop to zero.
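
As a rough illustration of that last observation, the sketch below models a
suspend that can only finish once in-flight I/O has drained. Everything in it
is an assumption made for illustration: model_array, model_io_begin,
model_io_end and model_try_suspend are hypothetical names, and the real
mddev_suspend sleeps rather than returning a status.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the array state relevant to suspend. */
struct model_array {
	bool suspended;
	int active_io;	/* number of I/Os currently in flight */
};

/* New I/O is admitted only while the array is not suspended. */
bool model_io_begin(struct model_array *a)
{
	if (a->suspended)
		return false;
	a->active_io++;
	return true;
}

/* Completes one in-flight I/O. */
void model_io_end(struct model_array *a)
{
	assert(a->active_io > 0);
	a->active_io--;
}

/*
 * Non-blocking stand-in for suspend: marks the array suspended and
 * returns true only if no I/O is left in flight. A false return models
 * the caller having to keep waiting.
 */
bool model_try_suspend(struct model_array *a)
{
	a->suspended = true;
	return a->active_io == 0;
}
```

If one I/O is started and never completed, model_try_suspend keeps returning
false, which is the shape of the hang I see in the logs.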

Best Regards
Xiao

> This problem is not specific to dm-raid; fix it by no longer ignoring
> suspended arrays in md_check_recovery(). Follow-up patches will
> improve dm-raid to freeze the sync thread during suspend.
>
> Reported-by: Mikulas Patocka <mpatocka@...hat.com>
> Closes: https://lore.kernel.org/all/8fb335e-6d2c-dbb5-d7-ded8db5145a@redhat.com/
> Fixes: 68866e425be2 ("MD: no sync IO while suspended")
> Fixes: f52f5c71f3d4 ("md: fix stopping sync thread")
> Signed-off-by: Yu Kuai <yukuai3@...wei.com>
> ---
>  drivers/md/md.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2266358d8074..07b80278eaa5 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -9469,9 +9469,6 @@ static void md_start_sync(struct work_struct *ws)
>   */
>  void md_check_recovery(struct mddev *mddev)
>  {
> -       if (READ_ONCE(mddev->suspended))
> -               return;
> -
>         if (mddev->bitmap)
>                 md_bitmap_daemon_work(mddev);
>
> --
> 2.39.2
>