Message-ID: <0106019b22c4e290-8379f848-c04a-4737-8fcc-baf9aa897164-000000@ap-northeast-1.amazonses.com>
Date: Mon, 15 Dec 2025 16:08:07 +0000
From: Kenta Akagi <k@...l.me>
To: Xiao Ni <xni@...hat.com>
Cc: song@...nel.org, yukuai3@...wei.com, mtkaczyk@...nel.org,
jgq516@...il.com, linux-raid@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on
failfast io failure
Hi Xiao,
Sorry for the late reply.
On 2025/11/19 12:13, Xiao Ni wrote:
> Hi Kenta
>
> I know the patch set is already at v5 now. I'm looking through all the emails to understand how it evolved, because v5 is much more complicated than the original version. I understand the problem already: we don't want to set MD_BROKEN for a failfast normal/metadata IO. Can't it work to check FailFast before setting MD_BROKEN, like this:
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 202e510f73a4..7acac7eef514 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1755,12 +1755,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>
> if (test_bit(In_sync, &rdev->flags) &&
> (conf->raid_disks - mddev->degraded) == 1) {
> - set_bit(MD_BROKEN, &mddev->flags);
>
> - if (!mddev->fail_last_dev) {
> + if (!mddev->fail_last_dev || test_bit(FailFast, &rdev->flags)) {
> conf->recovery_disabled = mddev->recovery_disabled;
> return;
> }
> +
> + set_bit(MD_BROKEN, &mddev->flags);
> }
> set_bit(Blocked, &rdev->flags);
> if (test_and_clear_bit(In_sync, &rdev->flags))
>
> I know this way mostly can't resolve the problem, because you, Linan and Kuai have already talked a lot about it. But I don't understand why this change can't work.
I may have been too concerned with preserving the behavior seen from user space, and I think this method is good.

We may need to accept that raid1/raid10 will never set MD_BROKEN when an rdev has failfast enabled. MD_BROKEN was introduced to raid1/raid10 in 2023, and I believe there are almost no cases where people expect the array to become "broken" when the last rdev misbehaves. So, rather than complicating the error handling for failfast, it seems better to simply not set MD_BROKEN when failfast is configured.

I'll send the next patch using this approach.
Thanks!
Akagi
>
> Best Regards
>
> Xiao
>
> On Fri, Aug 29, 2025 at 12:39 AM Kenta Akagi <k@...l.me> wrote:
>
> This commit ensures that an MD_FAILFAST IO failure does not put
> the array into a broken state.
>
> When failfast is enabled on rdev in RAID1 or RAID10,
> the array may be flagged MD_BROKEN in the following cases.
> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>
> The MD_FAILFAST bio error handler always calls md_error on IO failure,
> under the assumption that raid{1,10}_error will neither fail
> the last rdev nor break the array.
> After commit 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10"),
> calling md_error on the 'last' rdev in RAID1/10 always sets
> the MD_BROKEN flag on the array.
> As a result, when failfast IO fails on the last rdev, the array
> immediately becomes failed.
>
> Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
> an edge case; however, it can occur if rdevs in a non-degraded
> array share the same path and that path is lost, or if a metadata
> write is triggered in a degraded array and fails due to failfast.
>
> When a failfast metadata write fails, it is retried through the
> following sequence. If a metadata write without failfast fails,
> the array will be marked with MD_BROKEN.
>
> 1. MD_SB_NEED_REWRITE is set in sb_flags by super_written.
> 2. md_super_wait, called by the function executing md_super_write,
> returns -EAGAIN due to MD_SB_NEED_REWRITE.
> 3. The caller of md_super_wait (e.g., md_update_sb) receives the
> negative return value and retries md_super_write.
> 4. md_super_write issues the metadata write again,
> this time without MD_FAILFAST.
>
> Fixes: 9631abdbf406 ("md: Set MD_BROKEN for RAID1 and RAID10")
> Signed-off-by: Kenta Akagi <k@...l.me>
> ---
> drivers/md/md.c | 14 +++++++++-----
> drivers/md/md.h | 13 +++++++------
> drivers/md/raid1.c | 18 ++++++++++++++++--
> drivers/md/raid10.c | 21 ++++++++++++++++++---
> 4 files changed, 50 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index ac85ec73a409..547b88e253f7 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -999,14 +999,18 @@ static void super_written(struct bio *bio)
> if (bio->bi_status) {
> pr_err("md: %s gets error=%d\n", __func__,
> blk_status_to_errno(bio->bi_status));
> + if (bio->bi_opf & MD_FAILFAST)
> + set_bit(FailfastIOFailure, &rdev->flags);
> md_error(mddev, rdev);
> if (!test_bit(Faulty, &rdev->flags)
> && (bio->bi_opf & MD_FAILFAST)) {
> + pr_warn("md: %s: Metadata write will be repeated to %pg\n",
> + mdname(mddev), rdev->bdev);
> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> - set_bit(LastDev, &rdev->flags);
> }
> - } else
> - clear_bit(LastDev, &rdev->flags);
> + } else {
> + clear_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> + }
>
> bio_put(bio);
>
> @@ -1048,7 +1052,7 @@ void md_super_write(struct mddev *mddev, struct md_rdev *rdev,
>
> if (test_bit(MD_FAILFAST_SUPPORTED, &mddev->flags) &&
> test_bit(FailFast, &rdev->flags) &&
> - !test_bit(LastDev, &rdev->flags))
> + !test_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags))
> bio->bi_opf |= MD_FAILFAST;
>
> atomic_inc(&mddev->pending_writes);
> @@ -1059,7 +1063,7 @@ int md_super_wait(struct mddev *mddev)
> {
> /* wait for all superblock writes that were scheduled to complete */
> wait_event(mddev->sb_wait, atomic_read(&mddev->pending_writes)==0);
> - if (test_and_clear_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags))
> + if (test_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags))
> return -EAGAIN;
> return 0;
> }
> diff --git a/drivers/md/md.h b/drivers/md/md.h
> index 51af29a03079..52c9fc759015 100644
> --- a/drivers/md/md.h
> +++ b/drivers/md/md.h
> @@ -281,9 +281,10 @@ enum flag_bits {
> * It is expects that no bad block log
> * is present.
> */
> - LastDev, /* Seems to be the last working dev as
> - * it didn't fail, so don't use FailFast
> - * any more for metadata
> + FailfastIOFailure, /* rdev with failfast IO failure
> + * but md_error not yet completed.
> + * If the last rdev has this flag,
> + * error_handler must not fail the array
> */
> CollisionCheck, /*
> * check if there is collision between raid1
> @@ -331,8 +332,8 @@ struct md_cluster_operations;
> * @MD_CLUSTER_RESYNC_LOCKED: cluster raid only, which means node, already took
> * resync lock, need to release the lock.
> * @MD_FAILFAST_SUPPORTED: Using MD_FAILFAST on metadata writes is supported as
> - * calls to md_error() will never cause the array to
> - * become failed.
> + * calls to md_error() with FailfastIOFailure will
> + * never cause the array to become failed.
> * @MD_HAS_PPL: The raid array has PPL feature set.
> * @MD_HAS_MULTIPLE_PPLS: The raid array has multiple PPLs feature set.
> * @MD_NOT_READY: do_md_run() is active, so 'array_state', ust not report that
> @@ -360,7 +361,7 @@ enum mddev_sb_flags {
> MD_SB_CHANGE_DEVS, /* Some device status has changed */
> MD_SB_CHANGE_CLEAN, /* transition to or from 'clean' */
> MD_SB_CHANGE_PENDING, /* switch from 'clean' to 'active' in progress */
> - MD_SB_NEED_REWRITE, /* metadata write needs to be repeated */
> + MD_SB_NEED_REWRITE, /* metadata write needs to be repeated, do not use failfast */
> };
>
> #define NR_SERIAL_INFOS 8
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 408c26398321..8a61fd93b3ff 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
> (bio->bi_opf & MD_FAILFAST) &&
> /* We never try FailFast to WriteMostly devices */
> !test_bit(WriteMostly, &rdev->flags)) {
> + set_bit(FailfastIOFailure, &rdev->flags);
> md_error(r1_bio->mddev, rdev);
> }
>
> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
> * - recovery is interrupted.
> * - &mddev->degraded is bumped.
> *
> - * @rdev is marked as &Faulty excluding case when array is failed and
> - * &mddev->fail_last_dev is off.
> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
> + * then @mddev and @rdev will not be marked as failed.
> + *
> + * @rdev is marked as &Faulty excluding any cases:
> + * - when @mddev is failed and &mddev->fail_last_dev is off
> + * - when @rdev is last device and &FailfastIOFailure flag is set
> */
> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> {
> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>
> if (test_bit(In_sync, &rdev->flags) &&
> (conf->raid_disks - mddev->degraded) == 1) {
> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
> + spin_unlock_irqrestore(&conf->device_lock, flags);
> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
> + "last device but ignoring it\n",
> + mdname(mddev), rdev->bdev);
> + return;
> + }
> set_bit(MD_BROKEN, &mddev->flags);
>
> if (!mddev->fail_last_dev) {
> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
> if (test_bit(FailFast, &rdev->flags)) {
> /* Don't try recovering from here - just fail it
> * ... unless it is the last working device of course */
> + set_bit(FailfastIOFailure, &rdev->flags);
> md_error(mddev, rdev);
> if (test_bit(Faulty, &rdev->flags))
> /* Don't try to read from here, but make sure
> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
> fix_read_error(conf, r1_bio);
> unfreeze_array(conf);
> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
> + set_bit(FailfastIOFailure, &rdev->flags);
> md_error(mddev, rdev);
> } else {
> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index b60c30bfb6c7..530ad6503189 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
> dec_rdev = 0;
> if (test_bit(FailFast, &rdev->flags) &&
> (bio->bi_opf & MD_FAILFAST)) {
> + set_bit(FailfastIOFailure, &rdev->flags);
> md_error(rdev->mddev, rdev);
> }
>
> @@ -1995,8 +1996,12 @@ static int enough(struct r10conf *conf, int ignore)
> * - recovery is interrupted.
> * - &mddev->degraded is bumped.
> *
> - * @rdev is marked as &Faulty excluding case when array is failed and
> - * &mddev->fail_last_dev is off.
> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
> + * then @mddev and @rdev will not be marked as failed.
> + *
> + * @rdev is marked as &Faulty excluding any cases:
> + * - when @mddev is failed and &mddev->fail_last_dev is off
> + * - when @rdev is last device and &FailfastIOFailure flag is set
> */
> static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
> {
> @@ -2006,6 +2011,13 @@ static void raid10_error(struct mddev *mddev, struct md_rdev *rdev)
> spin_lock_irqsave(&conf->device_lock, flags);
>
> if (test_bit(In_sync, &rdev->flags) && !enough(conf, rdev->raid_disk)) {
> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
> + spin_unlock_irqrestore(&conf->device_lock, flags);
> + pr_warn_ratelimited("md/raid10:%s: Failfast IO failure on %pg, "
> + "last device but ignoring it\n",
> + mdname(mddev), rdev->bdev);
> + return;
> + }
> set_bit(MD_BROKEN, &mddev->flags);
>
> if (!mddev->fail_last_dev) {
> @@ -2413,6 +2425,7 @@ static void sync_request_write(struct mddev *mddev, struct r10bio *r10_bio)
> continue;
> } else if (test_bit(FailFast, &rdev->flags)) {
> /* Just give up on this device */
> + set_bit(FailfastIOFailure, &rdev->flags);
> md_error(rdev->mddev, rdev);
> continue;
> }
> @@ -2865,8 +2878,10 @@ static void handle_read_error(struct mddev *mddev, struct r10bio *r10_bio)
> freeze_array(conf, 1);
> fix_read_error(conf, mddev, r10_bio);
> unfreeze_array(conf);
> - } else
> + } else {
> + set_bit(FailfastIOFailure, &rdev->flags);
> md_error(mddev, rdev);
> + }
>
> rdev_dec_pending(rdev, mddev);
> r10_bio->state = 0;
> --
> 2.50.1
>
>