linux-kernel - Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast metadata write fails

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fb752529-6802-4ef9-aeb3-9b04ba86ef5f@huaweicloud.com>
Date: Mon, 18 Aug 2025 10:48:36 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Yu Kuai <yukuai1@...weicloud.com>, Kenta Akagi <k@...l.me>,
 Song Liu <song@...nel.org>, Mariusz Tkaczyk <mtkaczyk@...nel.org>,
 Guoqing Jiang <jgq516@...il.com>
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast
 metadata write fails

Hi,

在 2025/08/18 10:05, Yu Kuai 写道:
> Hi,
> 
> 在 2025/08/18 1:27, Kenta Akagi 写道:
>> A super_write IO failure with MD_FAILFAST must not cause the array
>> to fail.
>>
>> Because a failfast bio may fail even when the rdev is not broken,
>> so IO must be retried rather than failing the array when a metadata
>> write with MD_FAILFAST fails on the last rdev.
> 
> Why just last rdev? If failfast can fail when the rdev is not broken, I
> feel we should retry for all the rdev.

BTW, I couldn't figure out the reason, why failfast is added for the
meta write. I do feel just remove this flag for metadata write will fix
this problem.

Thanks,
Kuai

>>
>> A metadata write with MD_FAILFAST is retried after failure as
>> follows:
>>
>> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>>
>> 2. In md_super_wait, which is called by the function that
>> executed md_super_write and waits for completion,
>> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
>>
>> 3. The caller of md_super_wait (such as md_update_sb)
>> receives a negative return value and then retries md_super_write.
>>
>> 4. The md_super_write function, which is called to perform
>> the same metadata write, issues a write bio without MD_FAILFAST
>> this time.
>>
>> When a write from super_written without MD_FAILFAST fails,
>> the array may broken, and MD_BROKEN should be set.
>>
>> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
>> calling md_error on the last rdev in RAID1/10 always sets
>> the MD_BROKEN flag on the array.
>> As a result, when failfast IO fails on the last rdev, the array
>> immediately becomes failed.
>>
>> This commit prevents MD_BROKEN from being set when a super_write with
>> MD_FAILFAST fails on the last rdev, ensuring that the array does
>> not become failed due to failfast IO failures.
>>
>> Failfast IO failures on any rdev except the last one are not retried
>> and are marked as Faulty immediately. This minimizes array IO latency
>> when an rdev fails.
>>
>> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
>> Signed-off-by: Kenta Akagi <k@...l.me>
>> ---
>>   drivers/md/md.c     |  9 ++++++---
>>   drivers/md/md.h     |  7 ++++---
>>   drivers/md/raid1.c  | 12 ++++++++++--
>>   drivers/md/raid10.c | 12 ++++++++++--
>>   4 files changed, 30 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index ac85ec73a409..61a8188849a3 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -999,14 +999,17 @@ static void super_written(struct bio *bio)
>>       if (bio->bi_status) {
>>           pr_err("md: %s gets error=%d\n", __func__,
>>                  blk_status_to_errno(bio->bi_status));
>> +        if (bio->bi_opf & MD_FAILFAST)
>> +            set_bit(FailfastIOFailure, &rdev->flags);
> 
> I think it's better to retry the bio with the flag cleared, then all
> underlying procedures can stay the same.
> 
> Thanks,
> Kuai
> 
>>           md_error(mddev, rdev);
>>           if (!test_bit(Faulty, &rdev->flags)
>>               && (bio->bi_opf & MD_FAILFAST)) {
>> +            pr_warn("md: %s: Metadata write will be repeated to %pg\n",
>> +                mdname(mddev), rdev->bdev);
>>               set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>> -            set_bit(LastDev, &rdev->flags);
>>           }
>>       } else
>> -        clear_bit(LastDev, &rdev->flags);
>> +        clear_bit(FailfastIOFailure, &rdev->flags);
>>       bio_put(bio);
>> @@ -1048,7 +1051,7 @@ void md_super_write(struct mddev *mddev, struct 
>> md_rdev *rdev,
>>       if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
>>           test_bit(FailFast, &rdev->flags) &&
>> -        !test_bit(LastDev, &rdev->flags))
>> +        !test_bit(FailfastIOFailure, &rdev->flags))
>>           bio->bi_opf |= MD_FAILFAST;
>>       atomic_inc(&mddev->pending_writes);
>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>> index 51af29a03079..cf989aca72ad 100644
>> --- a/drivers/md/md.h
>> +++ b/drivers/md/md.h
>> @@ -281,9 +281,10 @@ enum flag_bits {
>>                    * It is expects that no bad block log
>>                    * is present.
>>                    */
>> -    LastDev,        /* Seems to be the last working dev as
>> -                 * it didn't fail, so don't use FailFast
>> -                 * any more for metadata
>> +    FailfastIOFailure,    /* A device that failled a metadata write
>> +                 * with failfast.
>> +                 * error_handler must not fail the array
>> +                 * if last device has this flag.
>>                    */
>>       CollisionCheck,        /*
>>                    * check if there is collision between raid1
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 408c26398321..fc7195e58f80 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, 
>> struct mddev *mddev)
>>    *    - recovery is interrupted.
>>    *    - &mddev->degraded is bumped.
>>    *
>> - * @rdev is marked as &Faulty excluding case when array is failed and
>> - * &mddev->fail_last_dev is off.
>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>> + * failed in failfast and will be retried, so the @mddev did not fail.
>> + *
>> + * @rdev is marked as &Faulty excluding any cases:
>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>    */
>>   static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>   {
>> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, 
>> struct md_rdev *rdev)
>>       if (test_bit(In_sync, &rdev->flags) &&
>>           (conf->raid_disks - mddev->degraded) == 1) {
>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>> +            return;
>> +        }
>>           set_bit(MD_BROKEN, &mddev->flags);
>>           if (!mddev->fail_last_dev) {
>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>> index b60c30bfb6c7..ff105a0dcd05 100644
>> --- a/drivers/md/raid10.c
>> +++ b/drivers/md/raid10.c
>> @@ -1995,8 +1995,12 @@ static int enough(struct r10conf *conf, int 
>> ignore)
>>    *    - recovery is interrupted.
>>    *    - &mddev->degraded is bumped.
>>    *
>> - * @rdev is marked as &Faulty excluding case when array is failed and
>> - * &mddev->fail_last_dev is off.
>> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
>> + * failed in failfast, so the @mddev did not fail.
>> + *
>> + * @rdev is marked as &Faulty excluding any cases:
>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>    */
>>   static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
>>   {
>> @@ -2006,6 +2010,10 @@ static void raid10_error(struct mddev *mddev, 
>> struct md_rdev *rdev)
>>       spin_lock_irqsave(&conf->device_lock, flags);
>>       if (test_bit(In_sync, &rdev->flags) && !enough(conf, 
>> rdev->raid_disk)) {
>> +        if (test_bit(FailfastIOFailure, &rdev->flags)) {
>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>> +            return;
>> +        }
>>           set_bit(MD_BROKEN, &mddev->flags);
>>           if (!mddev->fail_last_dev) {
>>
> 
> .
>