Message-ID: <35feaa54-db9e-f0d6-d5a5-a10a45bb90a5@huaweicloud.com>
Date: Mon, 4 Mar 2024 09:23:56 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Yu Kuai <yukuai1@...weicloud.com>, Xiao Ni <xni@...hat.com>
Cc: zkabelac@...hat.com, agk@...hat.com, snitzer@...nel.org,
 mpatocka@...hat.com, dm-devel@...ts.linux.dev, song@...nel.org,
 heinzm@...hat.com, neilb@...e.de, jbrassow@...hat.com,
 linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
 yi.zhang@...wei.com, yangerkun@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH -next 0/9] dm-raid, md/raid: fix v6.7 regressions part2

Hi,

On 2024/03/04 9:07, Yu Kuai wrote:
> Hi,
> 
> On 2024/03/03 21:16, Xiao Ni wrote:
>> Hi all
>>
>> There is an error report from the lvm regression tests. The case is
>> lvconvert-raid-reshape-stripes-load-reload.sh. I saw this error when I
>> tried to fix the dm-raid regression problems too. In my patch set, after
>> reverting ad39c08186f8a0f221337985036ba86731d6aafe (md: Don't register
>> sync_thread for reshape directly), this problem doesn't appear.
> 
> How often did you see this test fail? I've been running the tests for
> over two days now, for 30+ rounds, and this test has never failed in my VM.

Taking a quick look, there is still a path in raid10 where
MD_RECOVERY_FROZEN can be cleared, so in theory this problem can be
triggered. Can you test the following patch on top of this set?
I'll keep running the test myself.

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a5f8419e2df1..7ca29469123a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4575,7 +4575,8 @@ static int raid10_start_reshape(struct mddev *mddev)
         return 0;

  abort:
-       mddev->recovery = 0;
+       if (mddev->gendisk)
+               mddev->recovery = 0;
         spin_lock_irq(&conf->device_lock);
         conf->geo = conf->prev;
         mddev->raid_disks = conf->geo.raid_disks;
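
To illustrate what goes wrong, here is a userspace sketch of the flag
handling, not kernel code; the bit values and helper names are made up
for illustration only:

#include <stdio.h>

/* Illustrative bit values; not the kernel's actual positions. */
#define MD_RECOVERY_RUNNING	(1UL << 0)
#define MD_RECOVERY_FROZEN	(1UL << 1)

/* Old abort path: wholesale reset drops FROZEN as well. */
static unsigned long abort_old(unsigned long recovery)
{
	(void)recovery;
	return 0;
}

/* Patched abort path: for dm-raid (mddev->gendisk == NULL),
 * leave recovery untouched so FROZEN survives the failed reshape. */
static unsigned long abort_new(unsigned long recovery, int have_gendisk)
{
	return have_gendisk ? 0 : recovery;
}

int main(void)
{
	/* dm-raid suspend sets FROZEN; a reshape then starts and aborts. */
	unsigned long recovery = MD_RECOVERY_FROZEN | MD_RECOVERY_RUNNING;

	printf("old abort keeps FROZEN: %s\n",
	       abort_old(recovery) & MD_RECOVERY_FROZEN ? "yes" : "no");
	printf("new abort keeps FROZEN: %s\n",
	       abort_new(recovery, 0) & MD_RECOVERY_FROZEN ? "yes" : "no");
	return 0;
}

With FROZEN cleared on the old path, md is free to register a new
sync_thread while dm-raid still believes the array is frozen, which
appears to be the window this test can hit.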

Thanks,
Kuai
> 
> Thanks,
> Kuai
> 
>>
>> I put the log in the attachment.
>>
>> On Fri, Mar 1, 2024 at 6:03 PM Yu Kuai <yukuai1@...weicloud.com> wrote:
>>>
>>> From: Yu Kuai <yukuai3@...wei.com>
>>>
>>> link to part1: 
>>> https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@mail.gmail.com/ 
>>>
>>>
>>> part1 contains fixes for deadlocks when stopping sync_thread
>>>
>>> This set contains fixes for:
>>>   - reshape starting unexpectedly, causing data corruption (patches 1, 5, 6);
>>>   - deadlocks when reshape runs concurrently with IO (patch 8);
>>>   - a lockdep warning (patch 9).
>>>
>>> I've been running the lvm2 tests with the following script for a few rounds now:
>>>
>>> for t in `ls test/shell`; do
>>>          if cat test/shell/$t | grep raid &> /dev/null; then
>>>                  make check T=shell/$t
>>>          fi
>>> done
>>>
>>> There are no deadlocks and no fs corruption now; however, there are
>>> still four failed tests:
>>>
>>> ###       failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
>>> ###       failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
>>> ###       failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
>>> ###       failed: [ndev-vanilla] shell/lvextend-raid.sh
>>>
>>> And the failure reasons are the same:
>>>
>>> ## ERROR: The test started dmeventd (147856) unexpectedly
>>>
>>> I have no clue yet, and it seems other folks don't have this issue.
>>>
>>> Yu Kuai (9):
>>>    md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
>>>    md: export helpers to stop sync_thread
>>>    md: export helper md_is_rdwr()
>>>    md: add a new helper reshape_interrupted()
>>>    dm-raid: really frozen sync_thread during suspend
>>>    md/dm-raid: don't call md_reap_sync_thread() directly
>>>    dm-raid: add a new helper prepare_suspend() in md_personality
>>>    dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io
>>>      concurrent with reshape
>>>    dm-raid: fix lockdep waring in "pers->hot_add_disk"
>>>
>>>   drivers/md/dm-raid.c | 93 ++++++++++++++++++++++++++++++++++----------
>>>   drivers/md/md.c      | 73 ++++++++++++++++++++++++++--------
>>>   drivers/md/md.h      | 38 +++++++++++++++++-
>>>   drivers/md/raid5.c   | 32 ++++++++++++++-
>>>   4 files changed, 196 insertions(+), 40 deletions(-)
>>>
>>> -- 
>>> 2.39.2
>>>

