Message-ID: <c87f249a-2bfd-edd2-887d-87413bd044d7@huaweicloud.com>
Date: Mon, 4 Mar 2024 19:52:52 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Xiao Ni <xni@...hat.com>, Yu Kuai <yukuai1@...weicloud.com>
Cc: zkabelac@...hat.com, agk@...hat.com, snitzer@...nel.org,
mpatocka@...hat.com, dm-devel@...ts.linux.dev, song@...nel.org,
heinzm@...hat.com, neilb@...e.de, jbrassow@...hat.com,
linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH -next 0/9] dm-raid, md/raid: fix v6.7 regressions part2
Hi,
On 2024/03/04 19:06, Xiao Ni wrote:
> On Mon, Mar 4, 2024 at 4:27 PM Xiao Ni <xni@...hat.com> wrote:
>>
>> On Mon, Mar 4, 2024 at 9:25 AM Xiao Ni <xni@...hat.com> wrote:
>>>
>>> On Mon, Mar 4, 2024 at 9:24 AM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>>> On 2024/03/04 9:07, Yu Kuai wrote:
>>>>> Hi,
>>>>>
>>>>>> On 2024/03/03 21:16, Xiao Ni wrote:
>>>>>> Hi all
>>>>>>
>>>>>> There is an error report from the lvm regression tests. The case is
>>>>>> lvconvert-raid-reshape-stripes-load-reload.sh. I saw this error when I
>>>>>> tried to fix the dmraid regression problems too. In my patch set, after
>>>>>> reverting ad39c08186f8a0f221337985036ba86731d6aafe (md: Don't register
>>>>>> sync_thread for reshape directly), this problem doesn't appear.
>>>>>
>>>
>>> Hi Kuai
>>>>> How often did you see this test fail? I'm running the tests for over
>>>>> two days now, for 30+ rounds, and this test never fails in my VM.
>>>
>>> I ran it 5 times just now and it failed 2 times.
>>>
>>>>
>>>> Taking a quick look, there is still a path in raid10 where
>>>> MD_RECOVERY_FROZEN can be cleared, so in theory this problem can be
>>>> triggered. Can you test the following patch on top of this set?
>>>> I'll keep running the test myself.
>>>
>>> Sure, I'll give the result later.
>>
>> Hi all
>>
>> It's not stable to reproduce this. After applying this raid10 patch it
>> failed once in 28 runs. Without the raid10 patch, it failed once in 30
>> runs, but it failed frequently this morning.
>
> Hi all
>
> After running the test 152 times with kernel 6.6, the problem can appear
> too. So it goes back to the state of 6.6; this patch set just makes the
> problem appear more quickly.
I verified in my VM that, after testing 100+ times, this problem can be
triggered with both v6.6 and v6.8-rc5 + this set.
I think we can merge this patchset and figure out later why the test can
fail.
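
For reference, here is a minimal standalone sketch of the idea behind the
raid10_start_reshape() hunk quoted below. The mock struct and the bit value
are placeholders of mine, not the kernel's, and the assumption is that a
dm-raid mddev has no gendisk while a native md array does: zeroing all of
mddev->recovery in the abort path also drops MD_RECOVERY_FROZEN, which
dm-raid relies on while suspended, so the clearing is skipped when there is
no gendisk.

#include <stdio.h>

/* Mock stand-ins, only to model the flag logic; not the kernel structs. */
#define MOCK_MD_RECOVERY_FROZEN 9	/* placeholder bit index */

struct mock_mddev {
	unsigned long recovery;		/* recovery flag bits */
	void *gendisk;			/* NULL for dm-raid, set for native md */
};

/* Abort path: only a native md array may drop every recovery flag. */
static void reshape_abort(struct mock_mddev *mddev)
{
	if (mddev->gendisk)
		mddev->recovery = 0;
	/* dm-raid: keep MD_RECOVERY_FROZEN so no new sync_thread starts. */
}

int main(void)
{
	struct mock_mddev dm = {
		.recovery = 1UL << MOCK_MD_RECOVERY_FROZEN,
		.gendisk = NULL,
	};

	reshape_abort(&dm);
	printf("FROZEN still set for dm-raid: %lu\n",
	       (dm.recovery >> MOCK_MD_RECOVERY_FROZEN) & 1);
	return 0;
}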
Thanks,
Kuai
>
> Best Regards
> Xiao
>
>
>>
>> Regards
>> Xiao
>>>
>>> Regards
>>> Xiao
>>>>
>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>> index a5f8419e2df1..7ca29469123a 100644
>>>> --- a/drivers/md/raid10.c
>>>> +++ b/drivers/md/raid10.c
>>>> @@ -4575,7 +4575,8 @@ static int raid10_start_reshape(struct mddev *mddev)
>>>> return 0;
>>>>
>>>> abort:
>>>> - mddev->recovery = 0;
>>>> + if (mddev->gendisk)
>>>> + mddev->recovery = 0;
>>>> spin_lock_irq(&conf->device_lock);
>>>> conf->geo = conf->prev;
>>>> mddev->raid_disks = conf->geo.raid_disks;
>>>>
>>>> Thanks,
>>>> Kuai
>>>>>
>>>>> Thanks,
>>>>> Kuai
>>>>>
>>>>>>
>>>>>> I put the log in the attachment.
>>>>>>
>>>>>> On Fri, Mar 1, 2024 at 6:03 PM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>>>>>>
>>>>>>> From: Yu Kuai <yukuai3@...wei.com>
>>>>>>>
>>>>>>> link to part1:
>>>>>>> https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@mail.gmail.com/
>>>>>>>
>>>>>>>
>>>>>>> part1 contains fixes for deadlocks when stopping sync_thread
>>>>>>>
>>>>>>> This set contains fixes for:
>>>>>>> - reshape can start unexpectedly, causing data corruption, patches 1,5,6;
>>>>>>> - deadlocks when reshape runs concurrently with IO, patch 8;
>>>>>>> - a lockdep warning, patch 9;
>>>>>>>
>>>>>>> I'm running the lvm2 tests with the following script for a few rounds now,
>>>>>>>
>>>>>>> for t in `ls test/shell`; do
>>>>>>> if cat test/shell/$t | grep raid &> /dev/null; then
>>>>>>> make check T=shell/$t
>>>>>>> fi
>>>>>>> done
>>>>>>>
>>>>>>> There are no deadlocks and no fs corruption now; however, there are
>>>>>>> still four failed tests:
>>>>>>>
>>>>>>> ### failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
>>>>>>> ### failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
>>>>>>> ### failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
>>>>>>> ### failed: [ndev-vanilla] shell/lvextend-raid.sh
>>>>>>>
>>>>>>> And failed reasons are the same:
>>>>>>>
>>>>>>> ## ERROR: The test started dmeventd (147856) unexpectedly
>>>>>>>
>>>>>>> I have no clue yet, and it seems other folks don't have this issue.
>>>>>>>
>>>>>>> Yu Kuai (9):
>>>>>>> md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
>>>>>>> md: export helpers to stop sync_thread
>>>>>>> md: export helper md_is_rdwr()
>>>>>>> md: add a new helper reshape_interrupted()
>>>>>>> dm-raid: really frozen sync_thread during suspend
>>>>>>> md/dm-raid: don't call md_reap_sync_thread() directly
>>>>>>> dm-raid: add a new helper prepare_suspend() in md_personality
>>>>>>> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io
>>>>>>> concurrent with reshape
>>>>>>> dm-raid: fix lockdep waring in "pers->hot_add_disk"
>>>>>>>
>>>>>>> drivers/md/dm-raid.c | 93 ++++++++++++++++++++++++++++++++++----------
>>>>>>> drivers/md/md.c | 73 ++++++++++++++++++++++++++--------
>>>>>>> drivers/md/md.h | 38 +++++++++++++++++-
>>>>>>> drivers/md/raid5.c | 32 ++++++++++++++-
>>>>>>> 4 files changed, 196 insertions(+), 40 deletions(-)
>>>>>>>
>>>>>>> --
>>>>>>> 2.39.2
>>>>>>>
>>>>>
>>>>>
>>>>> .
>>>>>
>>>>
>
> .
>