Message-ID: <51efe62a-6190-1fd5-7f7b-b17c3d1af54b@huaweicloud.com>
Date: Mon, 18 Aug 2025 10:05:06 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Kenta Akagi <k@...l.me>, Song Liu <song@...nel.org>,
Mariusz Tkaczyk <mtkaczyk@...nel.org>, Guoqing Jiang <jgq516@...il.com>
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH v2 1/3] md/raid1,raid10: don't broken array on failfast
metadata write fails
Hi,
On 2025/08/18 1:27, Kenta Akagi wrote:
> A super_write IO failure with MD_FAILFAST must not cause the array
> to fail.
>
> A failfast bio may fail even when the rdev is not broken, so IO
> must be retried rather than failing the array when a metadata
> write with MD_FAILFAST fails on the last rdev.
Why only the last rdev? If a failfast bio can fail when the rdev is not
broken, I feel we should retry for all rdevs.
>
> A metadata write with MD_FAILFAST is retried after failure as
> follows:
>
> 1. In super_written, MD_SB_NEED_REWRITE is set in sb_flags.
>
> 2. In md_super_wait, which is called by the function that
> executed md_super_write and waits for completion,
> -EAGAIN is returned because MD_SB_NEED_REWRITE is set.
>
> 3. The caller of md_super_wait (such as md_update_sb)
> receives a negative return value and then retries md_super_write.
>
> 4. The md_super_write function, which is called to perform
> the same metadata write, issues a write bio without MD_FAILFAST
> this time.
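The four steps above can be sketched as a small userspace model. This is only an illustration under stated assumptions: every `model_*` name and struct below is hypothetical and merely mimics the kernel flow described in the quoted text (the `need_rewrite` field stands in for MD_SB_NEED_REWRITE, `avoid_failfast` for the per-rdev flag that suppresses MD_FAILFAST). Locking, bios, and the md_error call are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace model only: these names mimic, but are not, kernel symbols. */
enum { MODEL_FAILFAST = 1u << 0 };

struct model_dev {
	bool failfast_fails;  /* transient: only writes with FAILFAST fail */
	bool device_broken;   /* every write fails */
	bool avoid_failfast;  /* stand-in for the per-rdev "no failfast" flag */
};

struct model_array {
	bool need_rewrite;    /* stand-in for MD_SB_NEED_REWRITE */
	bool array_broken;    /* stand-in for MD_BROKEN */
	int writes_issued;
	int failfast_writes;
};

/* Steps 1 and 4: issue the metadata write and handle its completion. */
static void model_super_write(struct model_array *a, struct model_dev *d)
{
	unsigned int opf = d->avoid_failfast ? 0 : MODEL_FAILFAST;
	bool failed = d->device_broken ||
		      ((opf & MODEL_FAILFAST) && d->failfast_fails);

	a->writes_issued++;
	if (opf & MODEL_FAILFAST)
		a->failfast_writes++;

	if (!failed) {
		d->avoid_failfast = false;
	} else if (opf & MODEL_FAILFAST) {
		/* Failfast failure: request a rewrite without failfast. */
		a->need_rewrite = true;
		d->avoid_failfast = true;
	} else {
		/* A plain write failed: the device really is bad. */
		a->array_broken = true;
	}
}

/* Step 2: report -EAGAIN once if a rewrite was requested. */
static int model_super_wait(struct model_array *a)
{
	if (a->need_rewrite) {
		a->need_rewrite = false;
		return -EAGAIN;
	}
	return 0;
}

/* Step 3: the caller repeats the write while the wait asks for it. */
static int model_update_sb(struct model_array *a, struct model_dev *d)
{
	do {
		model_super_write(a, d);
	} while (model_super_wait(a) < 0);

	return a->array_broken ? -EIO : 0;
}
```

In this model a device that fails only failfast writes costs one extra write and leaves the array intact, while a device that fails the plain retry as well marks the array broken.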
>
> When a write from super_written without MD_FAILFAST fails,
> the array may be broken, and MD_BROKEN should be set.
>
> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
> calling md_error on the last rdev in RAID1/10 always sets
> the MD_BROKEN flag on the array.
> As a result, when failfast IO fails on the last rdev, the array
> immediately becomes failed.
>
> This commit prevents MD_BROKEN from being set when a super_write with
> MD_FAILFAST fails on the last rdev, ensuring that the array does
> not become failed due to failfast IO failures.
>
> Failfast IO failures on any rdev except the last one are not retried
> and are marked as Faulty immediately. This minimizes array IO latency
> when an rdev fails.
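For reference, the decision this patch gives the error handler can be modelled in userspace as follows. This is a simplified sketch that assumes the shape of raid1_error shown in the diff; the `demo_*` structs and function are hypothetical, and locking, the Blocked flag, and recovery handling are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the rdev and mddev state used by the check. */
struct demo_rdev {
	bool in_sync;
	bool faulty;
	bool failfast_io_failure;  /* models the proposed FailfastIOFailure */
};

struct demo_array {
	int raid_disks;
	int degraded;
	bool broken;               /* models MD_BROKEN */
	bool fail_last_dev;
};

static void demo_raid1_error(struct demo_array *m, struct demo_rdev *r)
{
	/* Is this the last in-sync device of the mirror? */
	if (r->in_sync && (m->raid_disks - m->degraded) == 1) {
		/* Failfast metadata failure: leave the array untouched. */
		if (r->failfast_io_failure)
			return;

		m->broken = true;
		if (!m->fail_last_dev)
			return;
	}

	/* Any other device is dropped from the array right away. */
	if (r->in_sync) {
		r->in_sync = false;
		m->degraded++;
	}
	r->faulty = true;
}
```

With two healthy members, an error on one marks it faulty and degrades the array. On the last member, the outcome depends on whether the failfast flag is set.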
>
> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
> Signed-off-by: Kenta Akagi <k@...l.me>
> ---
> drivers/md/md.c | 9 ++++++---
> drivers/md/md.h | 7 ++++---
> drivers/md/raid1.c | 12 ++++++++++--
> drivers/md/raid10.c | 12 ++++++++++--
> 4 files changed, 30 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index ac85ec73a409..61a8188849a3 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -999,14 +999,17 @@ static void super_written(struct bio *bio)
> if (bio->bi_status) {
> pr_err("md: %s gets error=%d\n", __func__,
> blk_status_to_errno(bio->bi_status));
> + if (bio->bi_opf & MD_FAILFAST)
> + set_bit(FailfastIOFailure, &rdev->flags);
I think it's better to retry the bio with the failfast flag cleared;
then all the underlying error handling can stay the same.
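One possible reading of this suggestion, as a userspace sketch only: all `sketch_*` names are hypothetical, and how a bio would actually be resubmitted from the completion path in the kernel is not shown here. The point is that the error handler only runs once a write without failfast has failed, so it needs no new flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; none of these are kernel symbols. */
enum { SKETCH_FAILFAST = 1u << 0 };

struct sketch_bio {
	unsigned int opf;
	int submissions;
};

struct sketch_dev {
	bool failfast_fails;  /* transient: only failfast writes fail */
	bool device_broken;   /* every write fails */
	bool error_reported;  /* stands in for a call to the error handler */
};

/* Returns true when the write succeeds. */
static bool sketch_submit(struct sketch_dev *d, struct sketch_bio *b)
{
	b->submissions++;
	if (d->device_broken)
		return false;
	if ((b->opf & SKETCH_FAILFAST) && d->failfast_fails)
		return false;
	return true;
}

static bool sketch_super_write(struct sketch_dev *d, struct sketch_bio *b)
{
	if (sketch_submit(d, b))
		return true;

	if (b->opf & SKETCH_FAILFAST) {
		/* Retry the same write once with the flag cleared. */
		b->opf &= ~SKETCH_FAILFAST;
		if (sketch_submit(d, b))
			return true;
	}

	/* Only a failure without failfast reaches the error handler. */
	d->error_reported = true;
	return false;
}
```

A device that fails only failfast writes is retried transparently, and a genuinely broken device still reaches the unchanged error path after the second attempt.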
Thanks,
Kuai
> md_error(mddev, rdev);
> if (!test_bit(Faulty, &rdev->flags)
> && (bio->bi_opf & MD_FAILFAST)) {
> + pr_warn("md: %s: Metadata write will be repeated to %pg\n",
> + mdname(mddev), rdev->bdev);
> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> - set_bit(LastDev, &rdev->flags);
> }
> } else
> - clear_bit(LastDev, &rdev->flags);
> + clear_bit(FailfastIOFailure, &rdev->flags);
>
> bio_put(bio);
>
> @@ -1048,7 +1051,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
>
> if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
> test_bit(FailFast, &rdev->flags) &&
> - !test_bit(LastDev, &rdev->flags))
> + !test_bit(FailfastIOFailure, &rdev->flags))
> bio->bi_opf |= MD_FAILFAST;
>
> atomic_inc(&mddev->pending_writes);
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 51af29a03079..cf989aca72ad 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -281,9 +281,10 @@ enum flag_bits {
> * It is expects that no bad block log
> * is present.
> */
> - LastDev, /* Seems to be the last working dev as
> - * it didn't fail, so don't use FailFast
> - * any more for metadata
> + FailfastIOFailure, /* A device that failed a metadata write
> + * with failfast.
> + * error_handler must not fail the array
> + * if last device has this flag.
> */
> CollisionCheck, /*
> * check if there is collision between raid1
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 408c26398321..fc7195e58f80 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1746,8 +1746,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
> * - recovery is interrupted.
> * - &mddev->degraded is bumped.
> *
> - * @rdev is marked as &Faulty excluding case when array is failed and
> - * &mddev->fail_last_dev is off.
> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
> + * failed in failfast and will be retried, so the @mddev did not fail.
> + *
> + * @rdev is marked as &Faulty excluding any cases:
> + * - when @mddev is failed and &mddev->fail_last_dev is off
> + * - when @rdev is last device and &FailfastIOFailure flag is set
> */
> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> {
> @@ -1758,6 +1762,10 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>
> if (test_bit(In_sync, &rdev->flags) &&
> (conf->raid_disks - mddev->degraded) == 1) {
> + if (test_bit(FailfastIOFailure, &rdev->flags)) {
> + spin_unlock_irqrestore(&conf->device_lock, flags);
> + return;
> + }
> set_bit(MD_BROKEN, &mddev->flags);
>
> if (!mddev->fail_last_dev) {
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index b60c30bfb6c7..ff105a0dcd05 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1995,8 +1995,12 @@ static int enough(struct r10conf *conf, int ignore)
> * - recovery is interrupted.
> * - &mddev->degraded is bumped.
> *
> - * @rdev is marked as &Faulty excluding case when array is failed and
> - * &mddev->fail_last_dev is off.
> + * If @rdev is marked with &FailfastIOFailure, it means that super_write
> + * failed in failfast, so the @mddev did not fail.
> + *
> + * @rdev is marked as &Faulty excluding any cases:
> + * - when @mddev is failed and &mddev->fail_last_dev is off
> + * - when @rdev is last device and &FailfastIOFailure flag is set
> */
> static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
> {
> @@ -2006,6 +2010,10 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
> spin_lock_irqsave(&conf->device_lock, flags);
>
> if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
> + if (test_bit(FailfastIOFailure, &rdev->flags)) {
> + spin_unlock_irqrestore(&conf->device_lock, flags);
> + return;
> + }
> set_bit(MD_BROKEN, &mddev->flags);
>
> if (!mddev->fail_last_dev) {
>