Message-ID: <CALTww2-BmudurHbsbbqBMq+KgZs+hokqOJnovS5KDGEidHqZzA@mail.gmail.com>
Date: Mon, 4 Mar 2024 09:25:55 +0800
From: Xiao Ni <xni@...hat.com>
To: Yu Kuai <yukuai1@...weicloud.com>
Cc: zkabelac@...hat.com, agk@...hat.com, snitzer@...nel.org, 
	mpatocka@...hat.com, dm-devel@...ts.linux.dev, song@...nel.org, 
	heinzm@...hat.com, neilb@...e.de, jbrassow@...hat.com, 
	linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org, yi.zhang@...wei.com, 
	yangerkun@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH -next 0/9] dm-raid, md/raid: fix v6.7 regressions part2

On Mon, Mar 4, 2024 at 9:24 AM Yu Kuai <yukuai1@...weicloud.com> wrote:
>
> Hi,
>
> > On 2024/03/04 9:07, Yu Kuai wrote:
> > Hi,
> >
> > On 2024/03/03 21:16, Xiao Ni wrote:
> >> Hi all
> >>
> >> There is an error report from the lvm regression tests. The failing case is
> >> lvconvert-raid-reshape-stripes-load-reload.sh. I saw this error when I
> >> tried to fix the dm-raid regression problems too. In my patch set, after
> >> reverting ad39c08186f8a0f221337985036ba86731d6aafe (md: Don't register
> >> sync_thread for reshape directly), this problem doesn't appear.
> >

Hi Kuai
> > How often did you see this test fail? I've been running the tests for over
> > two days now, for 30+ rounds, and this test never fails in my VM.

I ran it 5 times just now and it failed twice.

>
> From a quick look, there is still a path in raid10 where
> MD_RECOVERY_FROZEN can be cleared, so in theory this problem can be
> triggered. Can you test the following patch on top of this set?
> I'll keep running the test myself.

Sure, I'll give the result later.

Regards
Xiao
>
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a5f8419e2df1..7ca29469123a 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -4575,7 +4575,8 @@ static int raid10_start_reshape(struct mddev *mddev)
>          return 0;
>
>   abort:
> -       mddev->recovery = 0;
> +       if (mddev->gendisk)
> +               mddev->recovery = 0;
>          spin_lock_irq(&conf->device_lock);
>          conf->geo = conf->prev;
>          mddev->raid_disks = conf->geo.raid_disks;
>
> Thanks,
> Kuai
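
A userspace-only sketch of the effect of the hunk above (illustration only,
not part of the patch; the bit value is made up for the demo, and it assumes
a dm-raid mddev has no gendisk, which seems to be what the guard relies on):
MD_RECOVERY_FROZEN is one bit in the mddev->recovery bitmap, so the
unconditional "mddev->recovery = 0" on the abort path also drops the frozen
state dm-raid sets before suspend, while the gendisk guard limits the
wholesale clear to native md arrays.

/* Illustration only, not kernel code: shows that the guarded clear keeps
 * the frozen bit when there is no gendisk (the dm-raid case).  The bit
 * index is arbitrary for the demo; the exact value is irrelevant here. */
#include <stdio.h>

#define MD_RECOVERY_FROZEN 9UL

int main(void)
{
        unsigned long recovery = 1UL << MD_RECOVERY_FROZEN; /* frozen by dm-raid */
        int have_gendisk = 0;        /* dm-raid: mddev->gendisk is NULL */

        if (have_gendisk)            /* patched abort path: native md only */
                recovery = 0;        /* clears every recovery flag at once */

        printf("MD_RECOVERY_FROZEN still set: %d\n",
               !!(recovery & (1UL << MD_RECOVERY_FROZEN)));
        return 0;
}

With have_gendisk set (native md), the behaviour is unchanged from before
the patch; with it clear, the frozen bit survives the failed reshape start.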
> >
> > Thanks,
> > Kuai
> >
> >>
> >> I put the log in the attachment.
> >>
> >> On Fri, Mar 1, 2024 at 6:03 PM Yu Kuai <yukuai1@...weicloud.com> wrote:
> >>>
> >>> From: Yu Kuai <yukuai3@...wei.com>
> >>>
> >>> link to part1:
> >>> https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@mail.gmail.com/
> >>>
> >>>
> >>> Part 1 contains fixes for deadlocks while stopping sync_thread.
> >>>
> >>> This set contains fixes for:
> >>>   - reshape starting unexpectedly and causing data corruption (patches 1, 5, 6);
> >>>   - a deadlock when reshape runs concurrently with IO (patch 8);
> >>>   - a lockdep warning (patch 9).
> >>>
> >>> I've been running the lvm2 tests for a few rounds now with the following script:
> >>>
> >>> for t in `ls test/shell`; do
> >>>          if grep -q raid "test/shell/$t"; then
> >>>                  make check T=shell/$t
> >>>          fi
> >>> done
> >>>
> >>> There are no deadlocks and no fs corruption now; however, there are
> >>> still four failing tests:
> >>>
> >>> ###       failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
> >>> ###       failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
> >>> ###       failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
> >>> ###       failed: [ndev-vanilla] shell/lvextend-raid.sh
> >>>
> >>> And the failure reason is the same for all of them:
> >>>
> >>> ## ERROR: The test started dmeventd (147856) unexpectedly
> >>>
> >>> I have no clue yet, and it seems other folks don't have this issue.
> >>>
> >>> Yu Kuai (9):
> >>>    md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
> >>>    md: export helpers to stop sync_thread
> >>>    md: export helper md_is_rdwr()
> >>>    md: add a new helper reshape_interrupted()
> >>>    dm-raid: really frozen sync_thread during suspend
> >>>    md/dm-raid: don't call md_reap_sync_thread() directly
> >>>    dm-raid: add a new helper prepare_suspend() in md_personality
> >>>    dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io
> >>>      concurrent with reshape
> >>>    dm-raid: fix lockdep waring in "pers->hot_add_disk"
> >>>
> >>>   drivers/md/dm-raid.c | 93 ++++++++++++++++++++++++++++++++++----------
> >>>   drivers/md/md.c      | 73 ++++++++++++++++++++++++++--------
> >>>   drivers/md/md.h      | 38 +++++++++++++++++-
> >>>   drivers/md/raid5.c   | 32 ++++++++++++++-
> >>>   4 files changed, 196 insertions(+), 40 deletions(-)
> >>>
> >>> --
> >>> 2.39.2
> >>>
> >
> >
> > .
> >
>

