Message-ID: <5687cdaf-b8f8-2e88-6440-74867a35c330@huaweicloud.com>
Date: Mon, 8 Sep 2025 09:27:57 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Kenta Akagi <k@...l.me>, Yu Kuai <yukuai1@...weicloud.com>,
Li Nan <linan666@...weicloud.com>, Song Liu <song@...nel.org>,
Mariusz Tkaczyk <mtkaczyk@...nel.org>, Guoqing Jiang <jgq516@...il.com>
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast
io failure
Hi,
On 2025/09/05 23:07, Kenta Akagi wrote:
>
> On 2025/09/02 1:48, Kenta Akagi wrote:
>>
>
> Hi,
>
> By the way, I found two more issues.
> Both can be reproduced by causing a Failfast IO failure using the procedure below.
> I wrote patches for them. Would it make sense to send these as v4?
>
> * raid1,raid10: IO error returned even when all writes succeed in narrow_write_error
> When only one rdev remains, a failfast write bio is retried in
> narrow_write_error() via handle_write_completed(). Even if all retries succeed,
> R1BIO_Uptodate is not set. As a result, the upper layer gets BLK_STS_IOERR
> (see the snippet below).
>
> This behavior appears to be intentional, under the assumption that at least one sector is bad.
> However, this behavior is undesirable for failfast retries.
> In addition, although less likely, even on a normal HDD or SSD,
> if all split writes succeed, the upper layer would still receive
> BLK_STS_IOERR despite the write having ultimately completed successfully.
> Of course, this does not apply if the other rdev writes succeed.
>
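> For reference, the status propagation is roughly this: when the r1_bio
> completes, raid1's call_bio_endio() does (paraphrased from raid1.c):
>
> 	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
> 		r1_bio->master_bio->bi_status = BLK_STS_IOERR;
> 	bio_endio(r1_bio->master_bio);
>
> so if nothing ever sets R1BIO_Uptodate, the master bio fails even though
> every retried chunk succeeded.
>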
> I'm not exactly sure how to handle the following points.
> - Can a retried write be marked R1BIO_Uptodate after one failure in non-failfast?
> - Should it only be marked Uptodate for failfast?
If the retry succeeds and no badblocks are set, we should set the IO
uptodate; returning an error to the user is not good.
> - Should failfast retries be handled outside narrow_write_error?
Please don't do this unless we have to.
>
> For now, I attempt to set R1BIO_Uptodate whenever narrow_write_error() succeeds
> and there are no bad blocks, regardless of MD_FAILFAST.
>
> * raid10: no retry scheduled for failfast read failure
> raid10_end_read_request lacks a path to retry when a FailFast IO fails.
> As a result, when Failfast Read IOs fail on all rdevs,
> the upper layer receives BLK_STS_IOERR without a retry.
>
> Looking at the two commits below, it seems only raid10_end_read_request
> lacks the failfast read retry handling, while raid1_end_read_request has it.
> In RAID1, the retry works as expected.
>
> I don't know why raid10_end_read_request lacks this, but it is probably
> just a simple oversight.
>
> - 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
> - 2e52d449bcec ("md/raid1: add failfast handling for reads.")
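>
> For comparison, raid1_end_read_request has (roughly) this branch, which is
> exactly what raid10_end_read_request is missing:
>
> 	} else if (test_bit(FailFast, &rdev->flags) &&
> 		   test_bit(R1BIO_FailFast, &r1_bio->state))
> 		/* This was a fail-fast read so we definitely
> 		 * want to retry */
> 		;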
>
Yes, this looks correct.
Thanks,
Kuai
> Here's how to reproduce:
>
> # prepare nvmet/nvme-tcp and md array
> sh-5.2# cat << 'EOF' > loopback-nvme.sh
>> set -eu
> nqn="nqn.2025-08.io.example:nvmet-test-$1"
> back=$2
> cd /sys/kernel/config/nvmet/
> mkdir subsystems/$nqn
> echo 1 > subsystems/${nqn}/attr_allow_any_host
> mkdir subsystems/${nqn}/namespaces/1
> echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
> echo 1 > subsystems/${nqn}/namespaces/1/enable
> ports="ports/1"
> if [ ! -d $ports ]; then
> mkdir $ports
> cd $ports
> echo 127.0.0.1 > addr_traddr
> echo tcp > addr_trtype
> echo 4420 > addr_trsvcid
> echo ipv4 > addr_adrfam
> cd ../../
> fi
> ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
> nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
>> EOF
> sh-5.2# chmod +x loopback-nvme.sh
> sh-5.2# modprobe -a nvme-tcp nvmet-tcp
> sh-5.2# truncate -s 1g a.img b.img
> sh-5.2# losetup --show -f a.img
> /dev/loop0
> sh-5.2# losetup --show -f b.img
> /dev/loop1
> sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
> connecting to device: nvme0
> sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
> connecting to device: nvme1
> sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
> --failfast /dev/nvme0n1 --failfast /dev/nvme1n1
> ...
> mdadm: array /dev/md0 started.
>
> # run fio
> sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
> --bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1
>
> # The issue can be reproduced by blocking nvme traffic during fio
> sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
> sh-5.2# sleep 10; # twice the default KATO value
> sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP
>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 408c26398321..ce4dff63f50a 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> static bool narrow_write_error(struct r1bio *r1_bio, int i)
> {
> struct mddev *mddev = r1_bio->mddev;
> @@ -2519,13 +2521,16 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
> sector_t sector;
> int sectors;
> int sect_to_write = r1_bio->sectors;
> - bool ok = true;
> + bool write_ok = true;
> + bool setbad_ok = true;
> + bool bbl_enabled = !(rdev->badblocks.shift < 0);
>
> - if (rdev->badblocks.shift < 0)
> - return false;
> + if (bbl_enabled)
> + block_sectors = roundup(1 << rdev->badblocks.shift,
> + bdev_logical_block_size(rdev->bdev) >> 9);
> + else
> + block_sectors = 1;
>
> - block_sectors = roundup(1 << rdev->badblocks.shift,
> - bdev_logical_block_size(rdev->bdev) >> 9);
> sector = r1_bio->sector;
> sectors = ((sector + block_sectors)
> & ~(sector_t)(block_sectors - 1))
> @@ -2543,18 +2554,25 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
> bio_trim(wbio, sector - r1_bio->sector, sectors);
> wbio->bi_iter.bi_sector += rdev->data_offset;
>
> - if (submit_bio_wait(wbio) < 0)
> + if (submit_bio_wait(wbio) < 0) {
> /* failure! */
> - ok = rdev_set_badblocks(rdev, sector,
> - sectors, 0)
> - && ok;
> + write_ok = false;
> + if (bbl_enabled)
> + setbad_ok = rdev_set_badblocks(rdev, sector,
> + sectors, 0)
> + && setbad_ok;
> + }
>
> bio_put(wbio);
> sect_to_write -= sectors;
> sector += sectors;
> sectors = block_sectors;
> }
> - return ok;
> +
> + if (!write_ok &&
> + (!setbad_ok || !bbl_enabled))
> + return false;
> + return true;
> }
>
> static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
> @@ -2585,26 +2603,34 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
> int m, idx;
> bool fail = false;
>
> - for (m = 0; m < conf->raid_disks * 2 ; m++)
> - if (r1_bio->bios[m] == IO_MADE_GOOD) {
> - struct md_rdev *rdev = conf->mirrors[m].rdev;
> + for (m = 0; m < conf->raid_disks * 2 ; m++) {
> + struct md_rdev *rdev = conf->mirrors[m].rdev;
> + struct bio *bio = r1_bio->bios[m];
> + if (bio == IO_MADE_GOOD) {
> rdev_clear_badblocks(rdev,
> r1_bio->sector,
> r1_bio->sectors, 0);
> rdev_dec_pending(rdev, conf->mddev);
> - } else if (r1_bio->bios[m] != NULL) {
> + } else if (bio != NULL) {
> /* This drive got a write error. We need to
> * narrow down and record precise write
> * errors.
> */
> fail = true;
> - if (!narrow_write_error(r1_bio, m))
> - md_error(conf->mddev,
> - conf->mirrors[m].rdev);
> +			if (!narrow_write_error(r1_bio, m)) {
> + md_error(conf->mddev, rdev);
> /* an I/O failed, we can't clear the bitmap */
> - rdev_dec_pending(conf->mirrors[m].rdev,
> - conf->mddev);
> +			} else if (test_bit(In_sync, &rdev->flags) &&
> + !test_bit(Faulty, &rdev->flags) &&
> + rdev_has_badblock(rdev,
> + r1_bio->sector,
> + r1_bio->sectors) == 0) {
> + set_bit(R1BIO_Uptodate, &r1_bio->state);
> + }
> + rdev_dec_pending(rdev, conf->mddev);
> }
> + }
> if (fail) {
> spin_lock_irq(&conf->device_lock);
> list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index b60c30bfb6c7..7145daf1543b 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -399,6 +399,11 @@ static void raid10_end_read_request(struct bio *bio)
> * wait for the 'master' bio.
> */
> set_bit(R10BIO_Uptodate, &r10_bio->state);
> + } else if (test_bit(FailFast, &rdev->flags) &&
> + test_bit(R10BIO_FailFast, &r10_bio->state)) {
> + /* This was a fail-fast read so we definitely
> + * want to retry */
> + ;
> } else if (!raid1_should_handle_error(bio)) {
> uptodate = 1;
> } else {
>
>
>>
>> On 2025/09/01 16:48, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/09/01 12:22, Kenta Akagi wrote:
>>>>
>>>>
>>>>> On 2025/09/01 12:22, Li Nan wrote:
>>>>>
>>>>>
>>>>>>> On 2025/8/31 2:10, Kenta Akagi wrote:
>>>>>>
>>>>>>
>>>>>>>>> On 2025/08/30 17:48, Li Nan wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 2025/8/29 20:21, Kenta Akagi wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 2025/08/29 11:54, Li Nan wrote:
>>>>>>>>>
>>>>>>>>>> On 2025/8/29 0:32, Kenta Akagi wrote:
>>>>>>>>>> This commit ensures that an MD_FAILFAST IO failure does not put
>>>>>>>>>> the array into a broken state.
>>>>>>>>>>
>>>>>>>>>> When failfast is enabled on rdev in RAID1 or RAID10,
>>>>>>>>>> the array may be flagged MD_BROKEN in the following cases.
>>>>>>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>>>>>>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>>>>>>>>>
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>>>>>>> index 408c26398321..8a61fd93b3ff 100644
>>>>>>>>>> --- a/drivers/md/raid1.c
>>>>>>>>>> +++ b/drivers/md/raid1.c
>>>>>>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>>>>>>> (bio->bi_opf & MD_FAILFAST) &&
>>>>>>>>>> /* We never try FailFast to WriteMostly devices */
>>>>>>>>>> !test_bit(WriteMostly, &rdev->flags)) {
>>>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>> md_error(r1_bio->mddev, rdev);
>>>>>>>>>> }
>>>>>>>>>> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>>>>>>> * - recovery is interrupted.
>>>>>>>>>> * - &mddev->degraded is bumped.
>>>>>>>>>> *
>>>>>>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>>>>>>> - * &mddev->fail_last_dev is off.
>>>>>>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>>>>>>>>>> + * then @mddev and @rdev will not be marked as failed.
>>>>>>>>>> + *
>>>>>>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>>>>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off
>>>>>>>>>> + * - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>>>>>> */
>>>>>>>>>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>>>> {
>>>>>>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>>>> if (test_bit(In_sync, &rdev->flags) &&
>>>>>>>>>> (conf->raid_disks - mddev->degraded) == 1) {
>>>>>>>>>> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>>>>>>>>>> + spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>>>>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>>>>>>>>>> + "last device but ignoring it\n",
>>>>>>>>>> + mdname(mddev), rdev->bdev);
>>>>>>>>>> + return;
>>>>>>>>>> + }
>>>>>>>>>> set_bit(MD_BROKEN, &mddev->flags);
>>>>>>>>>> if (!mddev->fail_last_dev) {
>>>>>>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>>>>>>>> if (test_bit(FailFast, &rdev->flags)) {
>>>>>>>>>> /* Don't try recovering from here - just fail it
>>>>>>>>>> * ... unless it is the last working device of course */
>>>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>> md_error(mddev, rdev);
>>>>>>>>>> if (test_bit(Faulty, &rdev->flags))
>>>>>>>>>> /* Don't try to read from here, but make sure
>>>>>>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>>>>>>>>> fix_read_error(conf, r1_bio);
>>>>>>>>>> unfreeze_array(conf);
>>>>>>>>>> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>> md_error(mddev, rdev);
>>>>>>>>>> } else {
>>>>>>>>>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>>>>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>>>>>>> index b60c30bfb6c7..530ad6503189 100644
>>>>>>>>>> --- a/drivers/md/raid10.c
>>>>>>>>>> +++ b/drivers/md/raid10.c
>>>>>>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>>>>>>>>> dec_rdev = 0;
>>>>>>>>>> if (test_bit(FailFast, &rdev->flags) &&
>>>>>>>>>> (bio->bi_opf & MD_FAILFAST)) {
>>>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>> md_error(rdev->mddev, rdev);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thank you for the patch. There may be an issue with 'test_and_clear'.
>>>>>>>>>
>>>>>>>>> If two write IOs go to the same rdev, MD_BROKEN may be set as below:
>>>>>>>>
>>>>>>>>> IO1                                   IO2
>>>>>>>>> set FailfastIOFailure
>>>>>>>>>                                       set FailfastIOFailure
>>>>>>>>> md_error
>>>>>>>>>  raid1_error
>>>>>>>>>   test_and_clear FailfastIOFailure
>>>>>>>>>                                       md_error
>>>>>>>>>                                        raid1_error
>>>>>>>>>                                        //FailfastIOFailure is cleared
>>>>>>>>>                                        set MD_BROKEN
>>>>>>>>>
>>>>>>>>> Maybe we should check whether FailfastIOFailure is already set before
>>>>>>>>> setting it. It also needs to be considered in metadata writes.
>>>>>>>> Thank you for reviewing.
>>>>>>>>
>>>>>>>> I agree, this seems to be as you described.
>>>>>>>> So, should it be implemented as follows?
>>>>>>>>
>>>>>>>> bool old = false;
>>>>>>>> do {
>>>>>>>> 	spin_lock_irqsave(&conf->device_lock, flags);
>>>>>>>> 	old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>> 	spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>>> } while (old);
>>>>>>>>
>>>>>>>> However, since I am concerned about potential deadlocks,
>>>>>>>> I am considering two alternative approaches:
>>>>>>>>
>>>>>>>> * Add an atomic_t counter to md_rdev to track failfast IO failures
>>>>>>>>   (a rough sketch follows below).
>>>>>>>>
>>>>>>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing
>>>>>>>> error handling of Failfast and non-Failfast IOs appears to be rare.
>>>>>>>> In any case, the final outcome would be the same, i.e. the array
>>>>>>>> ends up with MD_BROKEN. Therefore, I think this should not cause
>>>>>>>> issues. I think the same applies to test_and_set_bit.
>>>>>>>>
>>>>>>>> IO1                    IO2                  IO3
>>>>>>>> FailfastIOFailure      Normal IOFailure     FailfastIOFailure
>>>>>>>> atomic_inc
>>>>>>>> md_error                                    atomic_inc
>>>>>>>>  raid1_error
>>>>>>>>   atomic_dec //2to1
>>>>>>>>                        md_error
>>>>>>>>                         raid1_error         md_error
>>>>>>>>                          atomic_dec //1to0   raid1_error
>>>>>>>>                                               atomic_dec //0
>>>>>>>>                                               set MD_BROKEN
>>>>>>>>
>>>>>>>> * Alternatively, create a separate error handler,
>>>>>>>> e.g. md_error_failfast(), that clearly does not fail the array.
>>>>>>>>
>>>>>>>> This approach would require somewhat larger changes and may not
>>>>>>>> be very elegant, but it seems to be a reliable way to ensure
>>>>>>>> MD_BROKEN is never set at the wrong timing.
>>>>>>>>
>>>>>>>> Which of these three approaches would you consider preferable?
>>>>>>>> I would appreciate your feedback.
>>>>>>>>
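>>>>>>>> A rough sketch of the atomic_t idea (untested, just for discussion; the
>>>>>>>> field name failfast_pending is made up for illustration):
>>>>>>>>
>>>>>>>> 	/* failfast completion paths, before calling md_error() */
>>>>>>>> 	atomic_inc(&rdev->failfast_pending);
>>>>>>>> 	md_error(mddev, rdev);
>>>>>>>>
>>>>>>>> 	/* in raid1_error() */
>>>>>>>> 	bool failfast = atomic_dec_if_positive(&rdev->failfast_pending) >= 0;
>>>>>>>> 	...
>>>>>>>> 	if (test_bit(In_sync, &rdev->flags) &&
>>>>>>>> 	    (conf->raid_disks - mddev->degraded) == 1) {
>>>>>>>> 		if (failfast) {
>>>>>>>> 			/* failfast IO failure on the last rdev: don't fail the array */
>>>>>>>> 			spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>>> 			return;
>>>>>>>> 		}
>>>>>>>> 		set_bit(MD_BROKEN, &mddev->flags);
>>>>>>>> 	}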
>>>>>>>>
>>>>>>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED
>>>>>>>> when the array is degraded.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Akagi
>>>>>>>>
>>>>>>>
>>>>>>> I took a closer look at the FailFast code and found a few issues, using
>>>>>>> RAID1 as an example:
>>>>>>>
>>>>>>> For normal read/write IO, FailFast is only triggered when another disk
>>>>>>> is available, as seen in read_balance() and raid1_write_request().
>>>>>>> In raid1_error(), MD_BROKEN is set only when no other disks are available.
>>>>>>
>>>>>> Hi,
>>>>>> Agree, I think so too.
>>>>>>
>>>>>>> So, the FailFast for normal read/write is not triggered in the scenario you
>>>>>>> described in the cover letter.
>>>>>>
>>>>>> This corresponds to the case described in the commit message of PATCH v3 1/3.
>>>>>> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
>>>>>> an edge case; however, it can occur if rdevs in a non-degraded
>>>>>> array share the same path and that path is lost, or if a metadata
>>>>>> write is triggered in a degraded array and fails due to failfast."
>>>>>>
>>>>>> To describe it in more detail, the flow is as follows:
>>>>>>
>>>>>> Prerequisites:
>>>>>>
>>>>>> - Both rdevs are in-sync
>>>>>> - Both rdevs have in-flight MD_FAILFAST bios
>>>>>> - Both rdevs depend on the same lower-level path
>>>>>> (e.g., nvme-tcp over a single Ethernet interface)
>>>>>>
>>>>>> Sequence:
>>>>>>
>>>>>> - Bios with REQ_FAILFAST_DEV fail (e.g., due to a temporary network outage);
>>>>>> in the case of nvme-tcp:
>>>>>> - The Ethernet connection is lost for over 5 seconds on the node where md is running
>>>>>> - Then the connection is restored. I don't know the details of the nvme-tcp
>>>>>> implementation, but it seems that failfast IOs finish only after the connection is back.
>>>>>> - All failfast bios fail, raid1_end_write_request is called.
>>>>>> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev.
>>>>>> - md_error() on the last rdev sets MD_BROKEN on the array (fail_last_dev=1 is unlikely).
>>>>>> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding.
>>>>>> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur.
>>>>>>
>>>>>
>>>>> Thanks for your patient explanation. I understand. Maybe we need a separate
>>>>> error-handling path for failfast. How about adding an extra parameter to md_error()?
>>>>
>>>> Thank you for reviewing.
>>>>
>>>> I am thinking of proceeding as follows.
>>>> md_error is an EXPORT_SYMBOL; I think it is undesirable to change the ABI of this function.
>>>>
>>>
>>> It doesn't matter if it's an exported symbol; we should just keep the code
>>> as simple as possible.
>>>> ...
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index ac85ec73a409..855cddeb0c09 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread);
>>>>
>>>> -void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail)
>>>> {
>>>> @@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> return;
>>>> - mddev->pers->error_handler(mddev, rdev);
>>>> + mddev->pers->error_handler(mddev, rdev, nofail);
>>>>
>>>> @@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> }
>>>> +
>>>> +void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> +{
>>>> + return _md_error(mddev, rdev, false);
>>>> +}
>>>> EXPORT_SYMBOL(md_error);
>>>>
>>>> +void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev)
>>>> +{
>>>> + WARN_ON(mddev->pers->head.id != ID_RAID1 &&
>>>> + mddev->pers->head.id != ID_RAID10);
>>>> + return _md_error(mddev, rdev, true);
>>>> +}
>>>> +EXPORT_SYMBOL(md_error_failfast);
>>>> +
>>>
>>> I would prefer we add a common procedure to fix this problem.
>>>
>>> How about a first patch to serialize all the md_error() calls, and then
>>> add a new helper md_bio_failure_error(mddev, rdev, bio), called when a
>>> bio fails. In this helper:
>>>
>>> 1) if bio is not failfast, call md_error() and return true; otherwise:
>>> 2) if rdev contains the last data copy, return false directly; the caller
>>> should check the return value and retry, otherwise:
>>> 3) call md_error() and return true;
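>>>
>>> Something like this, just to illustrate (rdev_is_last_copy() is only a
>>> placeholder for the "last data copy" check, not an existing helper):
>>>
>>> bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev,
>>> 			  struct bio *bio)
>>> {
>>> 	/* a failfast IO on the last copy: let the caller retry instead */
>>> 	if ((bio->bi_opf & MD_FAILFAST) && rdev_is_last_copy(mddev, rdev))
>>> 		return false;
>>>
>>> 	md_error(mddev, rdev);
>>> 	return true;
>>> }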
>>
>> Hi,
>> I think this approach has some issues. There are cases where md_error is
>> called only when MD_FAILFAST is set.
>>
>> One example is the processing below in raid1_end_write_request.
>>
>>> Then, for raid1, the callers will look like:
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 1baaf52c603c..c6d150e9f1a7 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -1003,9 +1003,7 @@ static void super_written(struct bio *bio)
>>> if (bio->bi_status) {
>>> pr_err("md: %s gets error=%d\n", __func__,
>>> blk_status_to_errno(bio->bi_status));
>>> - md_error(mddev, rdev);
>>> - if (!test_bit(Faulty, &rdev->flags)
>>> - && (bio->bi_opf & MD_FAILFAST)) {
>>> + if (!md_bio_failure_error(mddev, rdev, bio)) {
>>> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>> set_bit(LastDev, &rdev->flags);
>>> }
>>>
>>> @@ -466,20 +472,11 @@ static void raid1_end_write_request(struct bio *bio)
>>> set_bit(MD_RECOVERY_NEEDED, &
>>> conf->mddev->recovery);
>>>
>>> - if (test_bit(FailFast, &rdev->flags) &&
>>> - (bio->bi_opf & MD_FAILFAST) &&
>>> /* We never try FailFast to WriteMostly devices */
>>> - !test_bit(WriteMostly, &rdev->flags)) {
>>> - md_error(r1_bio->mddev, rdev);
>>> - }
>>> -
>>> - /*
>>> - * When the device is faulty, it is not necessary to
>>> - * handle write error.
>>> - */
>>> - if (!test_bit(Faulty, &rdev->flags))
>>> + if(!test_bit(WriteMostly, &rdev->flags) &&
>>> + !md_bio_failure_error(mddev, rdev, bio)) {
>>> set_bit(R1BIO_WriteError, &r1_bio->state);
>>> - else {
>>> + } else {
>>> /* Finished with this branch */
>>> r1_bio->bios[mirror] = NULL;
>>> to_put = bio;
>>
>> In the current raid1_end_write_request implementation,
>> - md_error is called only in the Failfast case.
>> - Afterwards, if the rdev is not Faulty (that is, not Failfast,
>> or Failfast but the last rdev, which originally was not expected
>> to set MD_BROKEN in RAID1), R1BIO_WriteError is set.
>> In the suggested implementation, it seems that a non-Failfast write
>> failure will immediately mark the rdev as Faulty, without retries.
>>
>> This could be avoided by testing MD_FAILFAST before calling the
>> new helper md_bio_failure_error, but I believe duplicating the
>> same check in both caller/callee would be undesirable.
>>
>> Should we try to avoid modifying pers->error_handler?
>> One possible alternative approach is as follows.
>>
>> - serialize calls to md_error regardless of whether they are Failfast or not
>> - in raid{1,10}_error:
>>   - mark the remaining copy (rdev) with the LastDev flag
>>   - clear MD_FAILFAST_SUPPORTED to prevent super_write from using Failfast
>> - super_written will simply set MD_SB_NEED_REWRITE without calling
>>   md_error when the bio is MD_FAILFAST and the rdev is LastDev.
>>
>> After the changes, I believe it is rare for super_written to be called with an error on
>> multiple rdevs due to failfast. super_write is triggered by errors from normal failfast
>> IO and invoked via MD_SB_CHANGE_DEVS through the serialized raid1_error. Since
>> MD_FAILFAST_SUPPORTED is cleared, metadata writes occur without failfast.
>>
>> It's not exactly a common procedure, but as it doesn't add functions to md.c,
>> I think this approach is preferable to adding md_error_failfast().
>>
>> ...
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 1baaf52c603c..ba524fa96091 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1003,14 +1003,15 @@ static void super_written(struct bio *bio)
>> if (bio->bi_status) {
>> pr_err("md: %s gets error=%d\n", __func__,
>> blk_status_to_errno(bio->bi_status));
>> - md_error(mddev, rdev);
>> - if (!test_bit(Faulty, &rdev->flags)
>> + if (test_bit(LastDev, &rdev->flags)
>> && (bio->bi_opf & MD_FAILFAST)) {
>> + pr_warn("md: %s: Metadata write will be repeated to %pg\n",
>> + mdname(mddev), rdev->bdev);
>> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>> - set_bit(LastDev, &rdev->flags);
>> + } else {
>> + md_error(mddev, rdev);
>> }
>> - } else
>> - clear_bit(LastDev, &rdev->flags);
>> + }
>>
>> bio_put(bio);
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 408c26398321..a52c5277add7 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>> (bio->bi_opf & MD_FAILFAST) &&
>> /* We never try FailFast to WriteMostly devices */
>> !test_bit(WriteMostly, &rdev->flags)) {
>> - md_error(r1_bio->mddev, rdev);
>> + raid1_md_error_failfast(r1_bio->mddev, rdev);
>> }
>>
>> /*
>> @@ -1733,6 +1733,27 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>> seq_printf(seq, "]");
>> }
>>
>> +static void _raid1_md_error(struct mddev *mddev, struct md_rdev *rdev, bool failfast){
>> + struct r1conf *conf = mddev->private;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&conf->new_lock_for_md_error, flags);
>> + if (failfast)
>> + set_bit(FailfastIOFailure, &rdev->flags);
>> + md_error(mddev, rdev);
>> + if (failfast)
>> + WARN_ON(!test_and_clear_bit(FailfastIOFailure, &rdev->flags));
>> + spin_unlock_irqrestore(&conf->new_lock_for_md_error, flags);
>> +}
>> +
>> +static void raid1_md_error(struct mddev *mddev, struct md_rdev *rdev){
>> + return _raid1_md_error(mddev, rdev, false);
>> +}
>> +
>> +static void raid1_md_error_failfast(struct mddev *mddev, struct md_rdev *rdev){
>> + return _raid1_md_error(mddev, rdev, true);
>> +}
>> +
>> /**
>> * raid1_error() - RAID1 error handler.
>> * @mddev: affected md device.
>> @@ -1758,6 +1783,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>
>> if (test_bit(In_sync, &rdev->flags) &&
>> (conf->raid_disks - mddev->degraded) == 1) {
>> + if (test_bit(FailfastIOFailure, &rdev->flags)) {
>> + spin_unlock_irqrestore(&conf->device_lock, flags);
>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>> + "last device but ignoring it\n",
>> + mdname(mddev), rdev->bdev);
>> + return;
>> + }
>> set_bit(MD_BROKEN, &mddev->flags);
>>
>> if (!mddev->fail_last_dev) {
>> @@ -1767,8 +1799,16 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>> }
>> }
>> set_bit(Blocked, &rdev->flags);
>> - if (test_and_clear_bit(In_sync, &rdev->flags))
>> + if (test_and_clear_bit(In_sync, &rdev->flags)) {
>> mddev->degraded++;
>> + clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
>> + for (i = 0; i < conf->raid_disks; i++) {
>> + struct md_rdev *rdev2 = conf->mirrors[i].rdev;
>> + if (rdev2 && rdev != rdev2 &&
>> + test_bit(In_sync, &rdev2->flags))
>> + set_bit(LastDev, &rdev2->flags);
>> + }
>> + }
>> set_bit(Faulty, &rdev->flags);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> /*
>> @@ -2118,7 +2158,7 @@ static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector,
>> }
>> /* need to record an error - either for the block or the device */
>> if (!rdev_set_badblocks(rdev, sector, sectors, 0))
>> - md_error(rdev->mddev, rdev);
>> + raid1_md_error(rdev->mddev, rdev);
>> return 0;
>> }
>> ...
>>
>>
>> Thanks,
>> Akagi
>>
>>> @@ -2630,7 +2627,6 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>> */
>>>
>>> bio = r1_bio->bios[r1_bio->read_disk];
>>> - bio_put(bio);
>>> r1_bio->bios[r1_bio->read_disk] = NULL;
>>>
>>> rdev = conf->mirrors[r1_bio->read_disk].rdev;
>>> @@ -2639,19 +2635,18 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>> freeze_array(conf, 1);
>>> fix_read_error(conf, r1_bio);
>>> unfreeze_array(conf);
>>> - } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>> - md_error(mddev, rdev);
>>> - } else {
>>> + } else if (mddev->ro == 0 &&
>>> + !md_bio_failure_error(mddev, rdev, bio)) {
>>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>> }
>>>
>>> + bio_put(bio);
>>> rdev_dec_pending(rdev, conf->mddev);
>>> sector = r1_bio->sector;
>>> - bio = r1_bio->master_bio;
>>>
>>> /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
>>> r1_bio->state = 0;
>>> - raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
>>> +	raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
>>> allow_barrier(conf, sector);
>>> }
>>>
>>>
>>>> /* seq_file implementation /proc/mdstat */
>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>>> index 51af29a03079..6ca1aea630ce 100644
>>>> --- a/drivers/md/md.h
>>>> +++ b/drivers/md/md.h
>>>> @@ -758,3 +758,3 @@ struct md_personality
>>>> */
>>>> - void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
>>>> + void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>>>> int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
>>>> @@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev);
>>>> extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
>>>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>>>> extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
>>>> +extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev);
>>>> extern void md_finish_reshape(struct mddev *mddev);
>>>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>>>> index f1d8811a542a..8aea51227a96 100644
>>>> --- a/drivers/md/raid0.c
>>>> +++ b/drivers/md/raid0.c
>>>> @@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
>>>>
>>>> -static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev,
>>>> + bool nofail __maybe_unused)
>>>> {
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index 408c26398321..d93275899e9e 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>> * @rdev: member device to fail.
>>>> + * @nofail: when set, @mddev and @rdev must not be failed even if @rdev is the last device
>>>> *
>>>> @@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>> *
>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>> - * &mddev->fail_last_dev is off.
>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off
>>>> + * - when @rdev is last device and @nofail is true
>>>> */
>>>> -static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> +static void raid1_error(struct mddev *mddev, struct md_rdev *rdev,
>>>> + bool nofail)
>>>> {
>>>> @@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> (conf->raid_disks - mddev->degraded) == 1) {
>>>> + if (nofail) {
>>>> + spin_unlock_irqrestore(&conf->device_lock, flags);
>>>> + pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, "
>>>> + "last device but ignoring it\n",
>>>> + mdname(mddev), rdev->bdev);
>>>> + return;
>>>> + }
>>>> set_bit(MD_BROKEN, &mddev->flags);
>>>> ...
>>>>
>>>>> Kuai, do you have any suggestions?
>>>>>
>>>>>>> Normal writes may call md_error() in narrow_write_error. Normal reads do
>>>>>>> not execute md_error() on the last disk.
>>>>>>>
>>>>>>> Perhaps you should get more information to confirm how MD_BROKEN is set in
>>>>>>> normal read/write IO.
>>>>>>
>>>>>> Should I add the above sequence of events to the cover letter, or commit message?
>>>>>>
>>>>>
>>>>> I think we should mention this in the commit message.
>>>>
>>>> Understood. I will explicitly describe this in the commit message in v4.
>>>>
>>>> Thanks,
>>>> Akagi
>>>>
>>>>>> Thanks,
>>>>>> Akagi
>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> Nan
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> .
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Nan
>>>>>
>>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>
>
> .
>