linux-kernel - Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata write fails

Message-ID: <f9a22cef-0596-485c-b573-90d27bd3af36@huaweicloud.com>
Date: Fri, 15 Aug 2025 09:26:18 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Kenta Akagi <k@...l.me>, Yu Kuai <yukuai1@...weicloud.com>,
 Song Liu <song@...nel.org>, Mariusz Tkaczyk <mtkaczyk@...nel.org>
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH] md/raid1,raid10: don't broken array on failfast metadata
 write fails

Hi,

On 2025/08/14 23:54, Kenta Akagi wrote:
> On 2025/08/13 9:59, Yu Kuai wrote:
>> Hi,
>>
>> On 2025/08/12 17:01, Kenta Akagi wrote:
>>> It is not intended for the array to fail when a metadata write with
>>> MD_FAILFAST fails.
>>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>>> when md_error is called on the last device in RAID1/10,
>>> the MD_BROKEN flag is set on the array.
>>> Because of this, a failfast metadata write failure will
>>> put the array into the "broken" state.
>>>
>>> If rdev is not Faulty even after calling md_error,
>>> the rdev is the last device, and there is nothing except
>>> MD_BROKEN that prevents writes to the array.
>>> Therefore, by clearing MD_BROKEN, the array will not become
>>> "broken" after a failfast metadata write failure.
>>
>> I don't understand this. I think MD_BROKEN is expected here: the last
>> rdev had an IO error while updating metadata, so the array is now broken
>> and can only be read afterwards. Allowing this broken array to be used
>> read-write might cause more severe problems, such as data loss.
>>
> Thank you for reviewing.
> 
> I think that, only when the bio has the MD_FAILFAST flag,
> a metadata write failure on the last rdev should not mark the
> array as broken at that point.
> 
> This is because a metadata write with MD_FAILFAST is retried after
> failure as follows:
> 
> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
> 
> 2. In md_super_wait, which is called by the function that
> executed md_super_write and waits for completion,
> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
> 
> 3. The caller of md_super_wait (such as md_update_sb)
> receives a negative return value and then retries md_super_write.
> 
> 4. The md_super_write function, which is called again to perform
> the same metadata write, issues a write bio
> without MD_FAILFAST this time, because the rdev has the LastDev flag.
> 
> When a bio from super_written without MD_FAILFAST fails,
> the array is truly broken, and MD_BROKEN should be set.
> 
> A failfast bio, for example in the case of nvme-tcp,
> will fail immediately if the connection to the target is
> lost for a few seconds and the device enters a reconnecting
> state - even though it would recover if given a few seconds.
> This behavior is exactly as intended by the design of failfast.
> 
> However, md treats a super_write operation that fails with failfast as fatal.
> For example, if an initiator - that is, a machine loading the md module -
> loses all connections for a few seconds, the array becomes
> broken and subsequent writes are no longer possible.
> This is the issue I am currently facing, and which this patch aims to fix.
> 
> Should I add more context to the commit message? Please advise.

Yes, please explain in detail in the commit message.
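
For reference, the retry sequence in steps 1-4 quoted above can be
modeled in userspace roughly as follows. This is only a sketch under
stated assumptions, not kernel code: the model_* names, the fake
device with its "reconnecting" and "dead" states, and the clear_broken
switch (standing in for the clear_bit() added by the patch) are all
hypothetical. The booleans merely stand in for the LastDev,
MD_SB_NEED_REWRITE and MD_BROKEN bits, and completion handling is run
inline instead of from a bio end_io callback.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical fake device: "reconnecting" fails only failfast writes
 * (a non-failfast write is assumed to wait and succeed); "dead" fails all. */
struct model_dev {
	bool reconnecting;
	bool dead;
};

/* Hypothetical model of an array with one remaining member. */
struct model_array {
	struct model_dev dev;
	bool last_dev;         /* stands in for the LastDev rdev flag */
	bool sb_need_rewrite;  /* stands in for MD_SB_NEED_REWRITE */
	bool broken;           /* stands in for MD_BROKEN */
	bool clear_broken;     /* true: model the clear_bit() from the patch */
	int writes_issued;
};

/* Issue one metadata write and handle its completion inline. */
static void model_super_write(struct model_array *a)
{
	/* Step 4: once LastDev is set, the write is issued without failfast. */
	bool failfast = !a->last_dev;
	bool failed = a->dev.dead || (failfast && a->dev.reconnecting);

	a->writes_issued++;
	if (!failed)
		return;

	/* Models md_error() on the last device: it is not marked Faulty,
	 * but the array is flagged broken. */
	a->broken = true;
	if (failfast) {
		if (a->clear_broken)
			a->broken = false;
		a->sb_need_rewrite = true;  /* step 1 */
		a->last_dev = true;
	}
}

/* Step 2: report -EAGAIN if a rewrite was requested. */
static int model_super_wait(struct model_array *a)
{
	if (a->sb_need_rewrite) {
		a->sb_need_rewrite = false;
		return -EAGAIN;
	}
	return 0;
}

/* Step 3: the caller retries the write while the wait reports an error. */
static void model_update_sb(struct model_array *a)
{
	do {
		model_super_write(a);
	} while (model_super_wait(a) < 0);
}
```

In this model, a reconnecting device with clear_broken unset gets two
writes, the second one succeeds, and broken still ends up set, which is
the situation described above. With clear_broken set, broken ends up
clear for a reconnecting device, while a dead device still ends up
broken because the non-failfast retry fails.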
> 
> Thanks,
> AKAGI
> 
>> Thanks,
>> Kuai
>>
>>>
>>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>>> Signed-off-by: Kenta Akagi <k@...l.me>
>>> ---
>>>    drivers/md/md.c | 1 +
>>>    drivers/md/md.h | 2 +-
>>>    2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index ac85ec73a409..3ec4abf02fa0 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -1002,6 +1002,7 @@ static void super_written(struct bio *bio)
>>>            md_error(mddev, rdev);
>>>            if (!test_bit(Faulty, &rdev->flags)
>>>                && (bio->bi_opf & MD_FAILFAST)) {
>>> +            clear_bit(MD_BROKEN, &mddev->flags);

Also, I feel a better way is to set MD_BROKEN only if the last rdev
failed; setting it in the middle and then clearing it is weird.

Thanks,
Kuai

>>>                set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>>                set_bit(LastDev, &rdev->flags);
>>>            }
>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>> index 51af29a03079..2f87bcc5d834 100644
>>> --- a/drivers/md/md.h
>>> +++ b/drivers/md/md.h
>>> @@ -332,7 +332,7 @@ struct md_cluster_operations;
>>>     *                   resync lock, need to release the lock.
>>>     * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
>>>     *                calls to md_error() will never cause the array to
>>> - *                become failed.
>>> + *                become failed while fail_last_dev is not set.
>>>     * @MD_HAS_PPL:  The raid array has PPL feature set.
>>>     * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
>>>     * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
>>>
>>
>>
> .
>