Message-ID: <1398a108-90ab-3790-eb43-faeaacda2c99@huaweicloud.com>
Date:   Fri, 26 May 2023 10:55:39 +0800
From:   Yu Kuai <yukuai1@...weicloud.com>
To:     Li Nan <linan666@...weicloud.com>,
        Yu Kuai <yukuai1@...weicloud.com>, song@...nel.org,
        shli@...com, allenpeng@...ology.com, alexwu@...ology.com,
        bingjingc@...ology.com, neilb@...e.de
Cc:     linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
        yi.zhang@...wei.com, houtao1@...wei.com, yangerkun@...wei.com,
        "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH 2/3] md/raid10: fix incorrect done of recovery

Hi,

On 2023/05/25 22:00, Li Nan wrote:
> 
> 
> On 2023/5/22 21:54, Yu Kuai wrote:
>> Hi,
>>
>> On 2023/05/22 19:54, linan666@...weicloud.com wrote:
>>> From: Li Nan <linan122@...wei.com>
>>>
>>> Recovery will give up and do chunks_skipped++ in
>>> raid10_sync_request() if there are bad blocks, and it will return
>>> max_sector once chunks_skipped >= geo.raid_disks. At that point the
>>> recovery has failed and the data is inconsistent, but the user thinks
>>> the recovery is done, which is wrong.
>>>
>>> Fix it by setting the mirror's recovery_disabled so that a spare
>>> device won't be added to it here.
>>>
>>> Signed-off-by: Li Nan <linan122@...wei.com>
>>> ---
>>>   drivers/md/raid10.c | 16 +++++++++++++++-
>>>   1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>> index e21502c03b45..70cc87c7ee57 100644
>>> --- a/drivers/md/raid10.c
>>> +++ b/drivers/md/raid10.c
>>> @@ -3303,6 +3303,7 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>>>       int chunks_skipped = 0;
>>>       sector_t chunk_mask = conf->geo.chunk_mask;
>>>       int page_idx = 0;
>>> +    int error_disk = -1;
>>>       /*
>>>        * Allow skipping a full rebuild for incremental assembly
>>> @@ -3386,7 +3387,18 @@ static sector_t raid10_sync_request(struct mddev *mddev, sector_t sector_nr,
>>>           return reshape_request(mddev, sector_nr, skipped);
>>>       if (chunks_skipped >= conf->geo.raid_disks) {
>>> -        /* if there has been nothing to do on any drive,
>>> +        pr_err("md/raid10:%s: %s fail\n", mdname(mddev),
>>> +            test_bit(MD_RECOVERY_SYNC, &mddev->recovery) ?  "resync" : "recovery");
>>
>> This line exceeds 80 columns, and so do the following ones.
>>> +        if (error_disk >= 0 && !test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) {
>>
>> Resync has the same problem, right?
>>
> 
> Yes. But I have no idea how to fix it. Neither calling md_error on the
> disk nor setting recovery_disabled is a good solution. So, just print an
> error message for now. Do you have any ideas?

I'll look into this. In the meantime, I don't suggest applying this
patch, because it is just a temporary solution that only fixes half of
the problem.

Thanks,
Kuai

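
For readers not familiar with the recovery_disabled mechanism the patch builds
on, below is a minimal user-space sketch of the idea (simplified stand-in
structs and function names, not the kernel's raid10 code): the array keeps a
generation counter, a mirror whose rebuild cannot complete records the current
value, and the hot-add path skips any slot whose recorded value still matches,
so a spare is not re-added to a slot that is known to fail recovery.

/* Minimal sketch of the recovery_disabled generation-counter idea.
 * Types and names are simplified stand-ins, not the kernel's raid10 code. */
#include <stdio.h>
#include <stdbool.h>

#define RAID_DISKS 4

struct mirror {
	bool has_rdev;           /* slot currently holds a device */
	int  recovery_disabled;  /* generation at which recovery was disabled */
};

struct array {
	int recovery_disabled;   /* bumped whenever recovery must be disabled */
	struct mirror mirrors[RAID_DISKS];
};

/* Recovery of @slot hit unreadable blocks and cannot complete. */
static void recovery_failed(struct array *a, int slot)
{
	/* Remember the current generation so add_disk() skips this slot. */
	a->mirrors[slot].recovery_disabled = a->recovery_disabled;
}

/* Try to place a spare; slots whose recorded generation matches are skipped. */
static int add_disk(struct array *a)
{
	for (int i = 0; i < RAID_DISKS; i++) {
		struct mirror *p = &a->mirrors[i];

		if (p->recovery_disabled == a->recovery_disabled)
			continue;             /* recovery known to fail here */
		if (!p->has_rdev) {
			p->has_rdev = true;
			return i;             /* spare accepted in this slot */
		}
	}
	return -1;                            /* no usable slot */
}

int main(void)
{
	struct array a = { .recovery_disabled = 1 };

	for (int i = 0; i < RAID_DISKS; i++)
		a.mirrors[i].has_rdev = true;

	a.mirrors[2].has_rdev = false;        /* slot 2 lost its device */
	recovery_failed(&a, 2);               /* and rebuilding it just failed */

	printf("add_disk -> %d\n", add_disk(&a));  /* -1: slot 2 is skipped */

	a.recovery_disabled++;                /* a new generation starts later */
	printf("add_disk -> %d\n", add_disk(&a));  /* 2: slot 2 may be retried */
	return 0;
}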