linux-kernel - Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <03f3e836-97b8-458d-881b-900e512dbada@mgml.me>
Date: Sat, 6 Sep 2025 00:07:08 +0900
From: Kenta Akagi <k@...l.me>
To: Yu Kuai <yukuai1@...weicloud.com>, Li Nan <linan666@...weicloud.com>,
        Song Liu <song@...nel.org>, Mariusz Tkaczyk <mtkaczyk@...nel.org>,
        Guoqing Jiang <jgq516@...il.com>
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
        "yukuai (C)" <yukuai3@...wei.com>, Kenta Akagi <k@...l.me>
Subject: Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast
 io failure


On 2025/09/02 1:48, Kenta Akagi wrote:
> 

Hi,

By the way, I found two more issues.
This can be reproduced by causing a Failfast IO failure using the attached procedure.
I wrote patches for them. Would it make sense to send these as v4?

* raid1,raid10: IO error returned even when all writes succeed in narrow_write_error
When only one rdev remains, a failfast write bio is retried in
narrow_write_error() via handle_write_completed(). Even if all retries succeed, 
R1BIO_Uptodate is not set. As a result, the upper layer get BLK_STS_IOERR.

This behavior appears to be intentional, under the assumption of at least one bad sector.
However, this behavior is undesirable for failfast retries.
In addition, although less likely, even on a normal HDD or SSD,
if all split writes succeed, the upper layer would still receive
BLK_STS_IOERR despite the write having ultimately completed successfully.
Of course, this does not apply if the other rdev writes succeed.

I'm not exactly sure how to handle the following point.
- Can a retried write be marked R1BIO_Uptodate after one failure in non-failfast?
  - Should it only be marked Uptodate for failfast?
- Should failfast retries be handled outside narrow_write_error?

For now, I attempt to set R1BIO_Uptodate whenever narrow_write_error() succeeds
and there are no bad blocks, regardless of MD_FAILFAST.

* raid10: no retry scheduled for failfast read failure
raid10_end_read_request lacks a path to retry when a FailFast IO fails.
As a result, when Failfast Read IOs fail on all rdevs,
the upper layer receives BLK_STS_IOERR without a retry,

Looking at the two commits below, it seems only raid10_end_read_request
lacks the failfast read retry handling, while raid1_end_read_request has it.
In RAID1, the retry works as expected.

I don't know why raid10_end_read_request lacks this, but it is probably
just a simple oversight.

- 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
- 2e52d449bcec ("md/raid1: add failfast handling for reads.")

Here's how to reproduce :

# prepare nvmet/nvme-tcp and md array
sh-5.2# cat << 'EOF' > loopback-nvme.sh
> set -eu
nqn="nqn.2025-08.io.example:nvmet-test-$1"
back=$2
cd /sys/kernel/config/nvmet/
mkdir subsystems/$nqn
echo 1 > subsystems/${nqn}/attr_allow_any_host
mkdir subsystems/${nqn}/namespaces/1
echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
echo 1 > subsystems/${nqn}/namespaces/1/enable
ports="ports/1"
if [ ! -d $ports ]; then
        mkdir $ports
        cd $ports
        echo 127.0.0.1 > addr_traddr
        echo tcp       > addr_trtype
        echo 4420      > addr_trsvcid
        echo ipv4      > addr_adrfam
        cd ../../
fi
ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
> EOF
sh-5.2# chmod +x loopback-nvme.sh
sh-5.2# modprobe -a nvme-tcp nvmet-tcp
sh-5.2# truncate -s 1g a.img b.img
sh-5.2# losetup --show -f a.img
/dev/loop0
sh-5.2# losetup --show -f b.img
/dev/loop1
sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
connecting to device: nvme0
sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
connecting to device: nvme1
sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
--failfast /dev/nvme0n1 --failfast /dev/nvme1n1
...
mdadm: array /dev/md0 started.

# run fio
sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
--bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1

# It can reproduce the issue by block nvme traffic during fio
sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
sh-5.2# sleep 10; # twice the default KATO value
sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP


diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 408c26398321..ce4dff63f50a 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
 static bool narrow_write_error(struct r1bio *r1_bio, int i)
 {
        struct mddev *mddev = r1_bio->mddev;
@@ -2519,13 +2521,16 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
        sector_t sector;
        int sectors;
        int sect_to_write = r1_bio->sectors;
-       bool ok = true;
+       bool write_ok = true;
+       bool setbad_ok = true;
+       bool bbl_enabled = !(rdev->badblocks.shift < 0);

-       if (rdev->badblocks.shift < 0)
-               return false;
+       if (bbl_enabled)
+               block_sectors = roundup(1 << rdev->badblocks.shift,
+                                       bdev_logical_block_size(rdev->bdev) >> 9);
+       else
+               block_sectors = 1;

-       block_sectors = roundup(1 << rdev->badblocks.shift,
-                               bdev_logical_block_size(rdev->bdev) >> 9);
        sector = r1_bio->sector;
        sectors = ((sector + block_sectors)
                   & ~(sector_t)(block_sectors - 1))
@@ -2543,18 +2554,25 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
                bio_trim(wbio, sector - r1_bio->sector, sectors);
                wbio->bi_iter.bi_sector += rdev->data_offset;

-               if (submit_bio_wait(wbio) < 0)
+               if (submit_bio_wait(wbio) < 0) {
                        /* failure! */
-                       ok = rdev_set_badblocks(rdev, sector,
-                                               sectors, 0)
-                               && ok;
+                       write_ok = false;
+                       if (bbl_enabled)
+                               setbad_ok = rdev_set_badblocks(rdev, sector,
+                                                       sectors, 0)
+                                       && setbad_ok;
+               }

                bio_put(wbio);
                sect_to_write -= sectors;
                sector += sectors;
                sectors = block_sectors;
        }
-       return ok;
+
+       if (!write_ok &&
+           (!setbad_ok || !bbl_enabled))
+               return false;
+       return true;
 }

 static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
@@ -2585,26 +2603,34 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
        int m, idx;
        bool fail = false;

-       for (m = 0; m < conf->raid_disks * 2 ; m++)
-               if (r1_bio->bios[m] == IO_MADE_GOOD) {
-                       struct md_rdev *rdev = conf->mirrors[m].rdev;
+       for (m = 0; m < conf->raid_disks * 2 ; m++) {
+               struct md_rdev *rdev = conf->mirrors[m].rdev;
+               struct bio *bio = r1_bio->bios[m];
+               if (bio == IO_MADE_GOOD) {
                        rdev_clear_badblocks(rdev,
                                             r1_bio->sector,
                                             r1_bio->sectors, 0);
                        rdev_dec_pending(rdev, conf->mddev);
-               } else if (r1_bio->bios[m] != NULL) {
+               } else if (bio != NULL) {
                        /* This drive got a write error.  We need to
                         * narrow down and record precise write
                         * errors.
                         */
                        fail = true;
-                       if (!narrow_write_error(r1_bio, m))
-                               md_error(conf->mddev,
-                                        conf->mirrors[m].rdev);
+                       if (!narrow_write_error(r1_bio, m)){
+                               md_error(conf->mddev, rdev);
                                /* an I/O failed, we can't clear the bitmap */
-                       rdev_dec_pending(conf->mirrors[m].rdev,
-                                        conf->mddev);
+                       } else if(test_bit(In_sync, &rdev->flags) &&
+                               !test_bit(Faulty, &rdev->flags) &&
+                               rdev_has_badblock(rdev,
+                                       r1_bio->sector,
+                                       r1_bio->sectors) == 0) {
+                               set_bit(R1BIO_Uptodate, &r1_bio->state);
+                       }
+                       rdev_dec_pending(rdev, conf->mddev);
                }
+       }
        if (fail) {
                spin_lock_irq(&conf->device_lock);
                list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b60c30bfb6c7..7145daf1543b 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -399,6 +399,11 @@ static void raid10_end_read_request(struct bio *bio)
                 * wait for the 'master' bio.
                 */
                set_bit(R10BIO_Uptodate, &r10_bio->state);
+       } else if (test_bit(FailFast, &rdev->flags) &&
+                test_bit(R10BIO_FailFast, &r10_bio->state)) {
+               /* This was a fail-fast read so we definitely
+                * want to retry */
+               ;
        } else if (!raid1_should_handle_error(bio)) {
                uptodate = 1;
        } else {


> 
> On 2025/09/01 16:48, Yu Kuai wrote:
>> Hi,
>>
>> 在 2025/09/01 12:22, Kenta Akagi 写道:
>>>
>>>
>>> On 2025/09/01 12:22, Li Nan wrote:
>>>>
>>>>
>>>> 在 2025/8/31 2:10, Kenta Akagi 写道:
>>>>>
>>>>>
>>>>> On 2025/08/30 17:48, Li Nan wrote:
>>>>>>
>>>>>>
>>>>>> 在 2025/8/29 20:21, Kenta Akagi 写道:
>>>>>>>
>>>>>>>
>>>>>>> On 2025/08/29 11:54, Li Nan wrote:
>>>>>>>>
>>>>>>>> 在 2025/8/29 0:32, Kenta Akagi 写道:
>>>>>>>>> This commit ensures that an MD_FAILFAST IO failure does not put
>>>>>>>>> the array into a broken state.
>>>>>>>>>
>>>>>>>>> When failfast is enabled on rdev in RAID1 or RAID10,
>>>>>>>>> the array may be flagged MD_BROKEN in the following cases.
>>>>>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>>>>>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>>>>>> index 408c26398321..8a61fd93b3ff 100644
>>>>>>>>> --- a/drivers/md/raid1.c
>>>>>>>>> +++ b/drivers/md/raid1.c
>>>>>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>>>>>>                  (bio->bi_opf & MD_FAILFAST) &&
>>>>>>>>>                  /* We never try FailFast to WriteMostly devices */
>>>>>>>>>                  !test_bit(WriteMostly, &rdev->flags)) {
>>>>>>>>> +            set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>                  md_error(r1_bio->mddev, rdev);
>>>>>>>>>              }
>>>>>>>>>      @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>>>>>>       *    - recovery is interrupted.
>>>>>>>>>       *    - &mddev->degraded is bumped.
>>>>>>>>>       *
>>>>>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>>>>>> - * &mddev->fail_last_dev is off.
>>>>>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>>>>>>>>> + * then @mddev and @rdev will not be marked as failed.
>>>>>>>>> + *
>>>>>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>>>>>> + *    - when @mddev is failed and &mddev->fail_last_dev is off
>>>>>>>>> + *    - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>>>>>       */
>>>>>>>>>      static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>>>      {
>>>>>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>>>            if (test_bit(In_sync, &rdev->flags) &&
>>>>>>>>>              (conf->raid_disks - mddev->degraded) == 1) {
>>>>>>>>> +        if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>>>>>>>>> +            spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>>>> +            pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>>>>>>>>> +                "last device but ignoring it\n",
>>>>>>>>> +                mdname(mddev), rdev->bdev);
>>>>>>>>> +            return;
>>>>>>>>> +        }
>>>>>>>>>              set_bit(MD_BROKEN, &mddev->flags);
>>>>>>>>>                if (!mddev->fail_last_dev) {
>>>>>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>>>>>>>          if (test_bit(FailFast, &rdev->flags)) {
>>>>>>>>>              /* Don't try recovering from here - just fail it
>>>>>>>>>               * ... unless it is the last working device of course */
>>>>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>              md_error(mddev, rdev);
>>>>>>>>>              if (test_bit(Faulty, &rdev->flags))
>>>>>>>>>                  /* Don't try to read from here, but make sure
>>>>>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>>>>>>>>              fix_read_error(conf, r1_bio);
>>>>>>>>>              unfreeze_array(conf);
>>>>>>>>>          } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>>>>>>>> +        set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>              md_error(mddev, rdev);
>>>>>>>>>          } else {
>>>>>>>>>              r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>>>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>>>>>> index b60c30bfb6c7..530ad6503189 100644
>>>>>>>>> --- a/drivers/md/raid10.c
>>>>>>>>> +++ b/drivers/md/raid10.c
>>>>>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>>>>>>>>                  dec_rdev = 0;
>>>>>>>>>                  if (test_bit(FailFast, &rdev->flags) &&
>>>>>>>>>                      (bio->bi_opf & MD_FAILFAST)) {
>>>>>>>>> +                set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>                      md_error(rdev->mddev, rdev);
>>>>>>>>>                  }
>>>>>>>>>      
>>>>>>>>
>>>>>>>> Thank you for the patch. There may be an issue with 'test_and_clear'.
>>>>>>>>
>>>>>>>> If two write IO go to the same rdev, MD_BROKEN may be set as below:
>>>>>>>
>>>>>>>> IO1                    IO2
>>>>>>>> set FailfastIOFailure
>>>>>>>>                        set FailfastIOFailure
>>>>>>>>     md_error
>>>>>>>>      raid1_error
>>>>>>>>       test_and_clear FailfastIOFailur
>>>>>>>>                           md_error
>>>>>>>>                          raid1_error
>>>>>>>>                           //FailfastIOFailur is cleared
>>>>>>>>                           set MD_BROKEN
>>>>>>>>
>>>>>>>> Maybe we should check whether FailfastIOFailure is already set before
>>>>>>>> setting it. It also needs to be considered in metadata writes.
>>>>>>> Thank you for reviewing.
>>>>>>>
>>>>>>> I agree, this seems to be as you described.
>>>>>>> So, should it be implemented as follows?
>>>>>>>
>>>>>>> bool old=false;
>>>>>>> do{
>>>>>>>     spin_lock_irqsave(&conf->device_lock, flags);
>>>>>>>     old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>     spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>> }while(old);
>>>>>>>
>>>>>>> However, since I am concerned about potential deadlocks,
>>>>>>> so I am considering two alternative approaches:
>>>>>>>
>>>>>>> * Add an atomic_t counter to md_rdev to track failfast IO failures.
>>>>>>>
>>>>>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing
>>>>>>> error handling of Failfast and non-Failfast IOs appears to be rare.
>>>>>>> In any case, the final outcome would be the same, i.e. the array
>>>>>>> ends up with MD_BROKEN. Therefore, I think this should not cause
>>>>>>> issues. I think the same applies to test_and_set_bit.
>>>>>>>
>>>>>>> IO1                    IO2                    IO3
>>>>>>> FailfastIOFailure      Normal IOFailure       FailfastIOFailure
>>>>>>> atomic_inc
>>>>>>>                                                    md_error                                     atomic_inc
>>>>>>>      raid1_error
>>>>>>>       atomic_dec //2to1
>>>>>>>                           md_error
>>>>>>>                            raid1_error           md_error
>>>>>>>                             atomic_dec //1to0     raid1_error
>>>>>>>                                                    atomic_dec //0
>>>>>>>                                                      set MD_BROKEN
>>>>>>>
>>>>>>> * Alternatively, create a separate error handler,
>>>>>>>      e.g. md_error_failfast(), that clearly does not fail the array.
>>>>>>>
>>>>>>> This approach would require somewhat larger changes and may not
>>>>>>> be very elegant, but it seems to be a reliable way to ensure
>>>>>>> MD_BROKEN is never set at the wrong timing.
>>>>>>>
>>>>>>> Which of these three approaches would you consider preferable?
>>>>>>> I would appreciate your feedback.
>>>>>>>
>>>>>>>
>>>>>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED
>>>>>>> when the array is degraded.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Akagi
>>>>>>>
>>>>>>
>>>>>> I took a closer look at the FailFast code and found a few issues, using
>>>>>> RAID1 as an example:
>>>>>>
>>>>>> For normal read/write IO, FailFast is only triggered when there is another
>>>>>> disk is available, as seen in read_balance() and raid1_write_request().
>>>>>> In raid1_error(), MD_BROKEN is set only when no other disks are available.
>>>>>
>>>>> Hi,
>>>>> Agree, I think so too.
>>>>>
>>>>>> So, the FailFast for normal read/write is not triggered in the scenario you
>>>>>> described in cover-letter.
>>>>>
>>>>> This corresponds to the case described in the commit message of PATCH v3 1/3.
>>>>> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
>>>>> an edge case; however, it can occur if rdevs in a non-degraded
>>>>> array share the same path and that path is lost, or if a metadata
>>>>> write is triggered in a degraded array and fails due to failfast."
>>>>>
>>>>> To describe it in more detail, the flow is as follows:
>>>>>
>>>>> Prerequisites:
>>>>>
>>>>> - Both rdevs are in-sync
>>>>> - Both rdevs have in-flight MD_FAILFAST bios
>>>>> - Both rdevs depend on the same lower-level path
>>>>>     (e.g., nvme-tcp over a single Ethernet interface)
>>>>>
>>>>> Sequence:
>>>>>
>>>>> - A bios with REQ_FAILFAST_DEV fails (e.g., due to a temporary network outage),
>>>>>     in the case of nvme-tcp:
>>>>>      - The Ethernet connection is lost on the node where md is running over 5 seconds
>>>>>      - Then the connection is restored. Idk the details of nvme-tcp implementation,
>>>>>        but it seems that failfast IOs finish only after the connection is back.
>>>>> - All failfast bios fail, raid1_end_write_request is called.
>>>>> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev.
>>>>> - md_error() on the last rdev sets MD_BROKEN on the array - fail_last_dev=1 is unlikely.
>>>>> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding.
>>>>> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur.
>>>>>
>>>>
>>>> Thanks for your patient explanation. I understand. Maybe we need a separate
>>>> error-handling path for failfast. How about adding an extra parameter to md_error()?
>>>
>>> Thank you for reviewing.
>>>
>>> I am thinking of proceeding like as follows.
>>> md_error is EXPORT_SYMBOL. I think that it is undesirable to change the ABI of this function.
>>>
>>
>> It doesn't matter if it's a exported symbol, we should just keep code as
>> simple as possible.
>>> ...
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index ac85ec73a409..855cddeb0c09 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread);
>>>
>>> -void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail)
>>>   {
>>> @@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>                  return;
>>> -       mddev->pers->error_handler(mddev, rdev);
>>> +       mddev->pers->error_handler(mddev, rdev, nofail);
>>>
>>> @@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>   }
>>> +
>>> +void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>> +{
>>> +       return _md_error(mddev, rdev, false);
>>> +}
>>>   EXPORT_SYMBOL(md_error);
>>>
>>> +void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev)
>>> +{
>>> +       WARN_ON(mddev->pers->head.id != ID_RAID1 &&
>>> +               mddev->pers->head.id != ID_RAID10);
>>> +       return _md_error(mddev, rdev, true);
>>> +}
>>> +EXPORT_SYMBOL(md_error_failfast);
>>> +
>>
>> I will prefer we add a common procedures to fix this problme.
>>
>> How about the first patch to serialize all the md_error(), and then
>> and a new helper md_bio_failue_error(mddev, rdev, bio), called when
>> bio is failed, in this helper:
>>
>> 1) if bio is not failfast, call md_error() and return true; otherwise:
>> 2) if rdev contain the last data copy, return false directly, caller
>> should check return value and retry, otherwise:
>> 3) call md_error() and return true;
> 
> Hi,
> I think this approach has some issues. There are cases where md_error is
> called only when MD_FAILFAST is set.
> 
> One example is the processing below in raid1_end_write_request.
> 
>> Then, for raid1, the callers will look like:
>>
>> iff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 1baaf52c603c..c6d150e9f1a7 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1003,9 +1003,7 @@ static void super_written(struct bio *bio)
>>         if (bio->bi_status) {
>>                 pr_err("md: %s gets error=%d\n", __func__,
>>                        blk_status_to_errno(bio->bi_status));
>> -               md_error(mddev, rdev);
>> -               if (!test_bit(Faulty, &rdev->flags)
>> -                   && (bio->bi_opf & MD_FAILFAST)) {
>> +               if (!md_bio_failure_error(mddev, rdev, bio)) {
>>                         set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>                         set_bit(LastDev, &rdev->flags);
>>                 }
>>
>> @@ -466,20 +472,11 @@ static void raid1_end_write_request(struct bio *bio)
>>                         set_bit(MD_RECOVERY_NEEDED, &
>>                                 conf->mddev->recovery);
>>
>> -               if (test_bit(FailFast, &rdev->flags) &&
>> -                   (bio->bi_opf & MD_FAILFAST) &&
>>                     /* We never try FailFast to WriteMostly devices */
>> -                   !test_bit(WriteMostly, &rdev->flags)) {
>> -                       md_error(r1_bio->mddev, rdev);
>> -               }
>> -
>> -               /*
>> -                * When the device is faulty, it is not necessary to
>> -                * handle write error.
>> -                */
>> -               if (!test_bit(Faulty, &rdev->flags))
>> +               if(!test_bit(WriteMostly, &rdev->flags) &&
>> +                  !md_bio_failure_error(mddev, rdev, bio)) {
>>                         set_bit(R1BIO_WriteError, &r1_bio->state);
>> -               else {
>> +               } else {
>>                         /* Finished with this branch */
>>                         r1_bio->bios[mirror] = NULL;
>>                         to_put = bio;
> 
> In the current raid1_end_write_request implementation, 
> - md_error is called only in the Failfast case.
> - Afterwards, if the rdev is not Faulty (that is, not Failfast, 
>   or Failfast but the last rdev — which originally was not expected 
>   MD_BROKEN in RAID1), R1BIO_WriteError is set.
> In the suggested implementation, it seems that a non-Failfast write 
> failure will immediately mark the rdev as Faulty, without retries.
> 
> This could be avoided by testing MD_FAILFAST before call the 
> new helper md_bio_failure_error, but I believe duplicating the 
> same check in both caller/callee would be undesirable.
> 
> Should we try to avoid modifying pers->error_handler?
> One possible alternative approach is as follows.
> 
> - serialize calls to md_error regardless of whether Failfast or not
> - raid{1,10}_error is:
>   - The remaining copy (rdev) is marked with the LastDev flag
>   - clear MD_FAILFAST_SUPPORTED for prohibit super_write using Failfast
> - super_written will simply put MD_SB_NEED_REWRITE without calling 
>   md_error when MD_FAILFAST bio and LastDev rdev.
> 
> After the changes, I believe it is rare for super_written to be called with error on
> multiple rdevs due to failfast. super_write is caused by errors from normal failfast 
> IO and invoked via MD_SB_CHANGE_DEVS through the serialized raid1_error. Since
> MD_FAILFAST_SUPPORTED is cleared, metadata writes occur without failfast.
> 
> It's not exactly a common procedure, but as it doesn't add functions to md.c, 
> I think this approach is preferable to adding md_error_failfast().
> 
> ...
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1baaf52c603c..ba524fa96091 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1003,14 +1003,15 @@ static void super_written(struct bio *bio)
>         if (bio->bi_status) {
>                 pr_err("md: %s gets error=%d\n", __func__,
>                        blk_status_to_errno(bio->bi_status));
> -               md_error(mddev, rdev);
> -               if (!test_bit(Faulty, &rdev->flags)
> +               if (test_bit(LastDev, &rdev->flags)
>                     && (bio->bi_opf & MD_FAILFAST)) {
> +                       pr_warn("md: %s: Metadata write will be repeated to %pg\n",
> +                               mdname(mddev), rdev->bdev);
>                         set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
> -                       set_bit(LastDev, &rdev->flags);
> +               } else {
> +                       md_error(mddev, rdev);
>                 }
> -       } else
> -               clear_bit(LastDev, &rdev->flags);
> +       }
> 
>         bio_put(bio);
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 408c26398321..a52c5277add7 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>                     (bio->bi_opf & MD_FAILFAST) &&
>                     /* We never try FailFast to WriteMostly devices */
>                     !test_bit(WriteMostly, &rdev->flags)) {
> -                       md_error(r1_bio->mddev, rdev);
> +                       raid1_md_error_failfast(r1_bio->mddev, rdev);
>                 }
> 
>                 /*
> @@ -1733,6 +1733,27 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>         seq_printf(seq, "]");
>  }
> 
> +static void _raid1_md_error(struct mddev *mddev, struct md_rdev *rdev, bool failfast){
> +       struct r1conf *conf = mddev->private;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&conf->new_lock_for_md_error, flags);
> +       if (failfast)
> +               set_bit(FailfastIOFailure, &rdev->flags);
> +       md_error(mddev, rdev);
> +       if (failfast)
> +               WARN_ON(!test_and_clear_bit(FailfastIOFailure, &rdev->flags));
> +       spin_unlock_irqrestore(&conf->new_lock_for_md_error, flags);
> +}
> +
> +static void raid1_md_error(struct mddev *mddev, struct md_rdev *rdev){
> +       return _raid1_md_error(mddev, rdev, false);
> +}
> +
> +static void raid1_md_error_failfast(struct mddev *mddev, struct md_rdev *rdev){
> +       return _raid1_md_error(mddev, rdev, true);
> +}
> +
>  /**
>   * raid1_error() - RAID1 error handler.
>   * @mddev: affected md device.
> @@ -1758,6 +1783,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
> 
>         if (test_bit(In_sync, &rdev->flags) &&
>             (conf->raid_disks - mddev->degraded) == 1) {
> +               if (test_bit(FailfastIOFailure, &rdev->flags)) {
> +                       spin_unlock_irqrestore(&conf->device_lock, flags);
> +                       pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
> +                               "last device but ignoring it\n",
> +                               mdname(mddev), rdev->bdev);
> +                       return;
> +               }
>                 set_bit(MD_BROKEN, &mddev->flags);
> 
>                 if (!mddev->fail_last_dev) {
> @@ -1767,8 +1799,16 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>                 }
>         }
>         set_bit(Blocked, &rdev->flags);
> -       if (test_and_clear_bit(In_sync, &rdev->flags))
> +       if (test_and_clear_bit(In_sync, &rdev->flags)) {
>                 mddev->degraded++;
> +               clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
> +               for (i = 0; i < conf->raid_disks; i++) {
> +                       struct md_rdev *rdev2 = conf->mirrors[i].rdev;
> +                       if (rdev2 && rdev != rdev2 &&
> +                           test_bit(In_sync, &rdev2->flags))
> +                               set_bit(LastDev, &rdev2->flags);
> +               }
> +       }
>         set_bit(Faulty, &rdev->flags);
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>         /*
> @@ -2118,7 +2158,7 @@ static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector,
>         }
>         /* need to record an error - either for the block or the device */
>         if (!rdev_set_badblocks(rdev, sector, sectors, 0))
> -               md_error(rdev->mddev, rdev);
> +               raid1_md_error(rdev->mddev, rdev);
>         return 0;
>  }
> ...
> 
> 
> Thanks,
> Akagi
> 
>> @@ -2630,7 +2627,6 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>          */
>>
>>         bio = r1_bio->bios[r1_bio->read_disk];
>> -       bio_put(bio);
>>         r1_bio->bios[r1_bio->read_disk] = NULL;
>>
>>         rdev = conf->mirrors[r1_bio->read_disk].rdev;
>> @@ -2639,19 +2635,18 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>                 freeze_array(conf, 1);
>>                 fix_read_error(conf, r1_bio);
>>                 unfreeze_array(conf);
>> -       } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>> -               md_error(mddev, rdev);
>> -       } else {
>> +       } else if (mddev->ro == 0 &&
>> +                  !md_bio_failure_error(mddev, rdev, bio)) {
>>                 r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>         }
>>
>> +       bio_put(bio);
>>         rdev_dec_pending(rdev, conf->mddev);
>>         sector = r1_bio->sector;
>> -       bio = r1_bio->master_bio;
>>
>>         /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
>>         r1_bio->state = 0;
>> -       raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
>> +       raid1_read_request(mddev, r1_bio->maxter_bio, r1_bio->sectors, r1_bio);
>>         allow_barrier(conf, sector);
>>  }
>>
>>
>>>   /* seq_file implementation /proc/mdstat */
>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>> index 51af29a03079..6ca1aea630ce 100644
>>> --- a/drivers/md/md.h
>>> +++ b/drivers/md/md.h
>>> @@ -758,3 +758,3 @@ struct md_personality
>>>           */
>>> -       void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
>>> +       void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>>>          int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
>>> @@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev);
>>>   extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
>>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>>>   extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
>>> +extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev);
>>>   extern void md_finish_reshape(struct mddev *mddev);
>>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>>> index f1d8811a542a..8aea51227a96 100644
>>> --- a/drivers/md/raid0.c
>>> +++ b/drivers/md/raid0.c
>>> @@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
>>>
>>> -static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
>>> +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev,
>>> +       bool nofail __maybe_unused)
>>>   {
>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>> index 408c26398321..d93275899e9e 100644
>>> --- a/drivers/md/raid1.c
>>> +++ b/drivers/md/raid1.c
>>> @@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>    * @rdev: member device to fail.
>>> + * @nofail: @mdev and @rdev must not fail even if @rdev is the last when @nofail set
>>>    *
>>> @@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>    *
>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>> - * &mddev->fail_last_dev is off.
>>> + * @rdev is marked as &Faulty excluding any cases:
>>> + *     - when @mddev is failed and &mddev->fail_last_dev is off
>>> + *     - when @rdev is last device and @nofail is true
>>>    */
>>> -static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>> +static void raid1_error(struct mddev *mddev, struct md_rdev *rdev,
>>> +       bool nofail)
>>>   {
>>> @@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>              (conf->raid_disks - mddev->degraded) == 1) {
>>> +               if (nofail) {
>>> +                       spin_unlock_irqrestore(&conf->device_lock, flags);
>>> +                       pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, "
>>> +                               "last device but ignoring it\n",
>>> +                               mdname(mddev), rdev->bdev);
>>> +                       return;
>>> +               }
>>>                  set_bit(MD_BROKEN, &mddev->flags);
>>> ...
>>>
>>>> Kuai, do you have any suggestions?
>>>>
>>>>>> Normal writes may call md_error() in narrow_write_error. Normal reads do
>>>>>> not execute md_error() on the last disk.
>>>>>>
>>>>>> Perhaps you should get more information to confirm how MD_BROKEN is set in
>>>>>> normal read/write IO.
>>>>>
>>>>> Should I add the above sequence of events to the cover letter, or commit message?
>>>>>
>>>>
>>>> I think we should mention this in the commit message.
>>>
>>> Understood. I will explicitly describe this in the commit message in v4.
>>>
>>> Thanks,
>>> Akagi
>>>
>>>>> Thanks,
>>>>> Akagi
>>>>>
>>>>>> -- 
>>>>>> Thanks,
>>>>>> Nan
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> .
>>>>
>>>> -- 
>>>> Thanks,
>>>> Nan
>>>>
>>>>
>>>
>>> .
>>>
>>
>>
>