Message-ID: <5687cdaf-b8f8-2e88-6440-74867a35c330@huaweicloud.com>
Date: Mon, 8 Sep 2025 09:27:57 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Kenta Akagi <k@...l.me>, Yu Kuai <yukuai1@...weicloud.com>,
Li Nan <linan666@...weicloud.com>, Song Liu <song@...nel.org>,
Mariusz Tkaczyk <mtkaczyk@...nel.org>, Guoqing Jiang <jgq516@...il.com>
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH v3 1/3] md/raid1,raid10: Do not set MD_BROKEN on failfast
io failure
Hi,
On 2025/09/05 23:07, Kenta Akagi wrote:
>
> On 2025/09/02 1:48, Kenta Akagi wrote:
>>
>
> Hi,
>
> By the way, I found two more issues.
> Both can be reproduced by causing a Failfast IO failure using the procedure below.
> I wrote patches for them. Would it make sense to send these as v4?
>
> * raid1,raid10: IO error returned even when all writes succeed in narrow_write_error
> When only one rdev remains, a failfast write bio is retried in
> narrow_write_error() via handle_write_completed(). Even if all retries succeed,
> R1BIO_Uptodate is not set. As a result, the upper layer gets BLK_STS_IOERR
> (see the snippet below).
>
> This behavior appears to be intentional, under the assumption that at least one sector is bad.
> However, this behavior is undesirable for failfast retries.
> In addition, although less likely, even on a normal HDD or SSD,
> if all split writes succeed, the upper layer would still receive
> BLK_STS_IOERR despite the write having ultimately completed successfully.
> Of course, this does not apply if the other rdev writes succeed.
>
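> For reference, the status propagation is roughly this: when the r1_bio
> completes, raid1's call_bio_endio() does (paraphrased from raid1.c):
>
> 	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
> 		r1_bio->master_bio->bi_status = BLK_STS_IOERR;
> 	bio_endio(r1_bio->master_bio);
>
> so if nothing ever sets R1BIO_Uptodate, the master bio fails even though
> every retried chunk succeeded.
>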
> I'm not exactly sure how to handle the following points.
> - Can a retried write be marked R1BIO_Uptodate after one failure in non-failfast?
> - Should it only be marked Uptodate for failfast?
If the retry succeeds and no badblocks are set, we should set the IO
uptodate; returning an error to the user is not good.
> - Should failfast retries be handled outside narrow_write_error?
Please don't do this unless we have to.
>
> For now, I attempt to set R1BIO_Uptodate whenever narrow_write_error() succeeds
> and there are no bad blocks, regardless of MD_FAILFAST.
>
> * raid10: no retry scheduled for failfast read failure
> raid10_end_read_request lacks a path to retry when a FailFast IO fails.
> As a result, when Failfast Read IOs fail on all rdevs,
> the upper layer receives BLK_STS_IOERR without a retry.
>
> Looking at the two commits below, it seems only raid10_end_read_request
> lacks the failfast read retry handling, while raid1_end_read_request has it.
> In RAID1, the retry works as expected.
>
> I don't know why raid10_end_read_request lacks this, but it is probably
> just a simple oversight.
>
> - 8d3ca83dcf9c ("md/raid10: add failfast handling for reads.")
> - 2e52d449bcec ("md/raid1: add failfast handling for reads.")
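>
> For comparison, raid1_end_read_request has (roughly) this branch, which is
> exactly what raid10_end_read_request is missing:
>
> 	} else if (test_bit(FailFast, &rdev->flags) &&
> 		   test_bit(R1BIO_FailFast, &r1_bio->state))
> 		/* This was a fail-fast read so we definitely
> 		 * want to retry */
> 		;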
>
Yes, this looks correct.
Thanks,
Kuai
> Here's how to reproduce:
>
> # prepare nvmet/nvme-tcp and md array
> sh-5.2# cat << 'EOF' > loopback-nvme.sh
>> set -eu
> nqn="nqn.2025-08.io.example:nvmet-test-$1"
> back=$2
> cd /sys/kernel/config/nvmet/
> mkdir subsystems/$nqn
> echo 1 > subsystems/${nqn}/attr_allow_any_host
> mkdir subsystems/${nqn}/namespaces/1
> echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
> echo 1 > subsystems/${nqn}/namespaces/1/enable
> ports="ports/1"
> if [ ! -d $ports ]; then
> mkdir $ports
> cd $ports
> echo 127.0.0.1 > addr_traddr
> echo tcp > addr_trtype
> echo 4420 > addr_trsvcid
> echo ipv4 > addr_adrfam
> cd ../../
> fi
> ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
> nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
>> EOF
> sh-5.2# chmod +x loopback-nvme.sh
> sh-5.2# modprobe -a nvme-tcp nvmet-tcp
> sh-5.2# truncate -s 1g a.img b.img
> sh-5.2# losetup --show -f a.img
> /dev/loop0
> sh-5.2# losetup --show -f b.img
> /dev/loop1
> sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
> connecting to device: nvme0
> sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
> connecting to device: nvme1
> sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
> --failfast /dev/nvme0n1 --failfast /dev/nvme1n1
> ...
> mdadm: array /dev/md0 started.
>
> # run fio
> sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
> --bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1
>
> # The issue can be reproduced by blocking nvme traffic during fio
> sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
> sh-5.2# sleep 10; # twice the default KATO value
> sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP
>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 408c26398321..ce4dff63f50a 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> static bool narrow_write_error(struct r1bio *r1_bio, int i)
> {
> struct mddev *mddev = r1_bio->mddev;
> @@ -2519,13 +2521,16 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
> sector_t sector;
> int sectors;
> int sect_to_write = r1_bio->sectors;
> - bool ok = true;
> + bool write_ok = true;
> + bool setbad_ok = true;
> + bool bbl_enabled = !(rdev->badblocks.shift < 0);
>
> - if (rdev->badblocks.shift < 0)
> - return false;
> + if (bbl_enabled)
> + block_sectors = roundup(1 << rdev->badblocks.shift,
> + bdev_logical_block_size(rdev->bdev) >> 9);
> + else
> + block_sectors = 1;
>
> - block_sectors = roundup(1 << rdev->badblocks.shift,
> - bdev_logical_block_size(rdev->bdev) >> 9);
> sector = r1_bio->sector;
> sectors = ((sector + block_sectors)
> & ~(sector_t)(block_sectors - 1))
> @@ -2543,18 +2554,25 @@ static bool narrow_write_error(struct r1bio *r1_bio, int i)
> bio_trim(wbio, sector - r1_bio->sector, sectors);
> wbio->bi_iter.bi_sector += rdev->data_offset;
>
> - if (submit_bio_wait(wbio) < 0)
> + if (submit_bio_wait(wbio) < 0) {
> /* failure! */
> - ok = rdev_set_badblocks(rdev, sector,
> - sectors, 0)
> - && ok;
> + write_ok = false;
> + if (bbl_enabled)
> + setbad_ok = rdev_set_badblocks(rdev, sector,
> + sectors, 0)
> + && setbad_ok;
> + }
>
> bio_put(wbio);
> sect_to_write -= sectors;
> sector += sectors;
> sectors = block_sectors;
> }
> - return ok;
> +
> + if (!write_ok &&
> + (!setbad_ok || !bbl_enabled))
> + return false;
> + return true;
> }
>
> static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
> @@ -2585,26 +2603,34 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
> int m, idx;
> bool fail = false;
>
> - for (m = 0; m < conf->raid_disks * 2 ; m++)
> - if (r1_bio->bios[m] == IO_MADE_GOOD) {
> - struct md_rdev *rdev = conf->mirrors[m].rdev;
> + for (m = 0; m < conf->raid_disks * 2 ; m++) {
> + struct md_rdev *rdev = conf->mirrors[m].rdev;
> + struct bio *bio = r1_bio->bios[m];
> + if (bio == IO_MADE_GOOD) {
> rdev_clear_badblocks(rdev,
> r1_bio->sector,
> r1_bio->sectors, 0);
> rdev_dec_pending(rdev, conf->mddev);
> - } else if (r1_bio->bios[m] != NULL) {
> + } else if (bio != NULL) {
> /* This drive got a write error. We need to
> * narrow down and record precise write
> * errors.
> */
> fail = true;
> - if (!narrow_write_error(r1_bio, m))
> - md_error(conf->mddev,
> - conf->mirrors[m].rdev);
> +			if (!narrow_write_error(r1_bio, m)) {
> + md_error(conf->mddev, rdev);
> /* an I/O failed, we can't clear the bitmap */
> - rdev_dec_pending(conf->mirrors[m].rdev,
> - conf->mddev);
> +			} else if (test_bit(In_sync, &rdev->flags) &&
> + !test_bit(Faulty, &rdev->flags) &&
> + rdev_has_badblock(rdev,
> + r1_bio->sector,
> + r1_bio->sectors) == 0) {
> + set_bit(R1BIO_Uptodate, &r1_bio->state);
> + }
> + rdev_dec_pending(rdev, conf->mddev);
> }
> + }
> if (fail) {
> spin_lock_irq(&conf->device_lock);
> list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index b60c30bfb6c7..7145daf1543b 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -399,6 +399,11 @@ static void raid10_end_read_request(struct bio *bio)
> * wait for the 'master' bio.
> */
> set_bit(R10BIO_Uptodate, &r10_bio->state);
> + } else if (test_bit(FailFast, &rdev->flags) &&
> + test_bit(R10BIO_FailFast, &r10_bio->state)) {
> + /* This was a fail-fast read so we definitely
> + * want to retry */
> + ;
> } else if (!raid1_should_handle_error(bio)) {
> uptodate = 1;
> } else {
>
>
>>
>> On 2025/09/01 16:48, Yu Kuai wrote:
>>> Hi,
>>>
>>> On 2025/09/01 12:22, Kenta Akagi wrote:
>>>>
>>>>
>>>>> On 2025/09/01 12:22, Li Nan wrote:
>>>>>
>>>>>
>>>>>>> On 2025/8/31 2:10, Kenta Akagi wrote:
>>>>>>
>>>>>>
>>>>>>>>> On 2025/08/30 17:48, Li Nan wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On 2025/8/29 20:21, Kenta Akagi wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 2025/08/29 11:54, Li Nan wrote:
>>>>>>>>>
>>>>>>>>>> On 2025/8/29 0:32, Kenta Akagi wrote:
>>>>>>>>>> This commit ensures that an MD_FAILFAST IO failure does not put
>>>>>>>>>> the array into a broken state.
>>>>>>>>>>
>>>>>>>>>> When failfast is enabled on rdev in RAID1 or RAID10,
>>>>>>>>>> the array may be flagged MD_BROKEN in the following cases.
>>>>>>>>>> - If MD_FAILFAST IOs to multiple rdevs fail simultaneously
>>>>>>>>>> - If an MD_FAILFAST metadata write to the 'last' rdev fails
>>>>>>>>>
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>>>>>>>> index 408c26398321..8a61fd93b3ff 100644
>>>>>>>>>> --- a/drivers/md/raid1.c
>>>>>>>>>> +++ b/drivers/md/raid1.c
>>>>>>>>>> @@ -470,6 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>>>>>>>>>> (bio->bi_opf & MD_FAILFAST) &&
>>>>>>>>>> /* We never try FailFast to WriteMostly devices */
>>>>>>>>>> !test_bit(WriteMostly, &rdev->flags)) {
>>>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>> md_error(r1_bio->mddev, rdev);
>>>>>>>>>> }
>>>>>>>>>> @@ -1746,8 +1747,12 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>>>>>>>> * - recovery is interrupted.
>>>>>>>>>> * - &mddev->degraded is bumped.
>>>>>>>>>> *
>>>>>>>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>>>>>>>> - * &mddev->fail_last_dev is off.
>>>>>>>>>> + * If @rdev has &FailfastIOFailure and it is the 'last' rdev,
>>>>>>>>>> + * then @mddev and @rdev will not be marked as failed.
>>>>>>>>>> + *
>>>>>>>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>>>>>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off
>>>>>>>>>> + * - when @rdev is last device and &FailfastIOFailure flag is set
>>>>>>>>>> */
>>>>>>>>>> static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>>>> {
>>>>>>>>>> @@ -1758,6 +1763,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>>>>>>>> if (test_bit(In_sync, &rdev->flags) &&
>>>>>>>>>> (conf->raid_disks - mddev->degraded) == 1) {
>>>>>>>>>> + if (test_and_clear_bit(FailfastIOFailure, &rdev->flags)) {
>>>>>>>>>> + spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>>>>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>>>>>>>>>> + "last device but ignoring it\n",
>>>>>>>>>> + mdname(mddev), rdev->bdev);
>>>>>>>>>> + return;
>>>>>>>>>> + }
>>>>>>>>>> set_bit(MD_BROKEN, &mddev->flags);
>>>>>>>>>> if (!mddev->fail_last_dev) {
>>>>>>>>>> @@ -2148,6 +2160,7 @@ static int fix_sync_read_error(struct r1bio *r1_bio)
>>>>>>>>>> if (test_bit(FailFast, &rdev->flags)) {
>>>>>>>>>> /* Don't try recovering from here - just fail it
>>>>>>>>>> * ... unless it is the last working device of course */
>>>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>> md_error(mddev, rdev);
>>>>>>>>>> if (test_bit(Faulty, &rdev->flags))
>>>>>>>>>> /* Don't try to read from here, but make sure
>>>>>>>>>> @@ -2652,6 +2665,7 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>>>>>>>>> fix_read_error(conf, r1_bio);
>>>>>>>>>> unfreeze_array(conf);
>>>>>>>>>> } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>> md_error(mddev, rdev);
>>>>>>>>>> } else {
>>>>>>>>>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>>>>>>>>> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
>>>>>>>>>> index b60c30bfb6c7..530ad6503189 100644
>>>>>>>>>> --- a/drivers/md/raid10.c
>>>>>>>>>> +++ b/drivers/md/raid10.c
>>>>>>>>>> @@ -488,6 +488,7 @@ static void raid10_end_write_request(struct bio *bio)
>>>>>>>>>> dec_rdev = 0;
>>>>>>>>>> if (test_bit(FailFast, &rdev->flags) &&
>>>>>>>>>> (bio->bi_opf & MD_FAILFAST)) {
>>>>>>>>>> + set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>>>> md_error(rdev->mddev, rdev);
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thank you for the patch. There may be an issue with 'test_and_clear'.
>>>>>>>>>
>>>>>>>>> If two write IOs go to the same rdev, MD_BROKEN may be set as below:
>>>>>>>>
>>>>>>>>> IO1                                   IO2
>>>>>>>>> set FailfastIOFailure
>>>>>>>>>                                       set FailfastIOFailure
>>>>>>>>> md_error
>>>>>>>>>  raid1_error
>>>>>>>>>   test_and_clear FailfastIOFailure
>>>>>>>>>                                       md_error
>>>>>>>>>                                        raid1_error
>>>>>>>>>                                        //FailfastIOFailure is cleared
>>>>>>>>>                                        set MD_BROKEN
>>>>>>>>>
>>>>>>>>> Maybe we should check whether FailfastIOFailure is already set before
>>>>>>>>> setting it. It also needs to be considered in metadata writes.
>>>>>>>> Thank you for reviewing.
>>>>>>>>
>>>>>>>> I agree, this seems to be as you described.
>>>>>>>> So, should it be implemented as follows?
>>>>>>>>
>>>>>>>> bool old = false;
>>>>>>>> do {
>>>>>>>> 	spin_lock_irqsave(&conf->device_lock, flags);
>>>>>>>> 	old = test_and_set_bit(FailfastIOFailure, &rdev->flags);
>>>>>>>> 	spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>>> } while (old);
>>>>>>>>
>>>>>>>> However, since I am concerned about potential deadlocks,
>>>>>>>> I am considering two alternative approaches:
>>>>>>>>
>>>>>>>> * Add an atomic_t counter to md_rdev to track failfast IO failures
>>>>>>>>   (a rough sketch follows below).
>>>>>>>>
>>>>>>>> This may set MD_BROKEN at a slightly incorrect timing, but mixing
>>>>>>>> error handling of Failfast and non-Failfast IOs appears to be rare.
>>>>>>>> In any case, the final outcome would be the same, i.e. the array
>>>>>>>> ends up with MD_BROKEN. Therefore, I think this should not cause
>>>>>>>> issues. I think the same applies to test_and_set_bit.
>>>>>>>>
>>>>>>>> IO1                    IO2                  IO3
>>>>>>>> FailfastIOFailure      Normal IOFailure     FailfastIOFailure
>>>>>>>> atomic_inc
>>>>>>>> md_error                                    atomic_inc
>>>>>>>>  raid1_error
>>>>>>>>   atomic_dec //2to1
>>>>>>>>                        md_error
>>>>>>>>                         raid1_error         md_error
>>>>>>>>                          atomic_dec //1to0   raid1_error
>>>>>>>>                                               atomic_dec //0
>>>>>>>>                                               set MD_BROKEN
>>>>>>>>
>>>>>>>> * Alternatively, create a separate error handler,
>>>>>>>> e.g. md_error_failfast(), that clearly does not fail the array.
>>>>>>>>
>>>>>>>> This approach would require somewhat larger changes and may not
>>>>>>>> be very elegant, but it seems to be a reliable way to ensure
>>>>>>>> MD_BROKEN is never set at the wrong timing.
>>>>>>>>
>>>>>>>> Which of these three approaches would you consider preferable?
>>>>>>>> I would appreciate your feedback.
>>>>>>>>
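>>>>>>>> A rough sketch of the atomic_t idea (untested, just for discussion; the
>>>>>>>> field name failfast_pending is made up for illustration):
>>>>>>>>
>>>>>>>> 	/* failfast completion paths, before calling md_error() */
>>>>>>>> 	atomic_inc(&rdev->failfast_pending);
>>>>>>>> 	md_error(mddev, rdev);
>>>>>>>>
>>>>>>>> 	/* in raid1_error() */
>>>>>>>> 	bool failfast = atomic_dec_if_positive(&rdev->failfast_pending) >= 0;
>>>>>>>> 	...
>>>>>>>> 	if (test_bit(In_sync, &rdev->flags) &&
>>>>>>>> 	    (conf->raid_disks - mddev->degraded) == 1) {
>>>>>>>> 		if (failfast) {
>>>>>>>> 			/* failfast IO failure on the last rdev: don't fail the array */
>>>>>>>> 			spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>>>>> 			return;
>>>>>>>> 		}
>>>>>>>> 		set_bit(MD_BROKEN, &mddev->flags);
>>>>>>>> 	}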
>>>>>>>>
>>>>>>>> For metadata writes, I plan to clear_bit MD_FAILFAST_SUPPORTED
>>>>>>>> when the array is degraded.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Akagi
>>>>>>>>
>>>>>>>
>>>>>>> I took a closer look at the FailFast code and found a few issues, using
>>>>>>> RAID1 as an example:
>>>>>>>
>>>>>>> For normal read/write IO, FailFast is only triggered when another disk
>>>>>>> is available, as seen in read_balance() and raid1_write_request().
>>>>>>> In raid1_error(), MD_BROKEN is set only when no other disks are available.
>>>>>>
>>>>>> Hi,
>>>>>> Agree, I think so too.
>>>>>>
>>>>>>> So, the FailFast for normal read/write is not triggered in the scenario you
>>>>>>> described in the cover letter.
>>>>>>
>>>>>> This corresponds to the case described in the commit message of PATCH v3 1/3.
>>>>>> "Normally, MD_FAILFAST IOs are not issued to the 'last' rdev, so this is
>>>>>> an edge case; however, it can occur if rdevs in a non-degraded
>>>>>> array share the same path and that path is lost, or if a metadata
>>>>>> write is triggered in a degraded array and fails due to failfast."
>>>>>>
>>>>>> To describe it in more detail, the flow is as follows:
>>>>>>
>>>>>> Prerequisites:
>>>>>>
>>>>>> - Both rdevs are in-sync
>>>>>> - Both rdevs have in-flight MD_FAILFAST bios
>>>>>> - Both rdevs depend on the same lower-level path
>>>>>> (e.g., nvme-tcp over a single Ethernet interface)
>>>>>>
>>>>>> Sequence:
>>>>>>
>>>>>> - Bios with REQ_FAILFAST_DEV fail (e.g., due to a temporary network outage);
>>>>>> in the case of nvme-tcp:
>>>>>> - The Ethernet connection is lost for over 5 seconds on the node where md is running
>>>>>> - Then the connection is restored. I don't know the details of the nvme-tcp
>>>>>> implementation, but it seems that failfast IOs finish only after the connection is back.
>>>>>> - All failfast bios fail, raid1_end_write_request is called.
>>>>>> - md_error() marks one rdev Faulty; the other rdev becomes the 'last' rdev.
>>>>>> - md_error() on the last rdev sets MD_BROKEN on the array (fail_last_dev=1 is unlikely).
>>>>>> - The write is retried via handle_write_finished -> narrow_write_error, usually succeeding.
>>>>>> - MD_BROKEN remains set, leaving the array in a state where no further writes can occur.
>>>>>>
>>>>>
>>>>> Thanks for your patient explanation. I understand. Maybe we need a separate
>>>>> error-handling path for failfast. How about adding an extra parameter to md_error()?
>>>>
>>>> Thank you for reviewing.
>>>>
>>>> I am thinking of proceeding as follows.
>>>> md_error is an EXPORT_SYMBOL; I think it is undesirable to change the ABI of this function.
>>>>
>>>
>>> It doesn't matter if it's an exported symbol; we should just keep the code
>>> as simple as possible.
>>>> ...
>>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>>> index ac85ec73a409..855cddeb0c09 100644
>>>> --- a/drivers/md/md.c
>>>> +++ b/drivers/md/md.c
>>>> @@ -8197,3 +8197,3 @@ EXPORT_SYMBOL(md_unregister_thread);
>>>>
>>>> -void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail)
>>>> {
>>>> @@ -8204,3 +8204,3 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> return;
>>>> - mddev->pers->error_handler(mddev, rdev);
>>>> + mddev->pers->error_handler(mddev, rdev, nofail);
>>>>
>>>> @@ -8222,4 +8222,26 @@ void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> }
>>>> +
>>>> +void md_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> +{
>>>> + return _md_error(mddev, rdev, false);
>>>> +}
>>>> EXPORT_SYMBOL(md_error);
>>>>
>>>> +void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev)
>>>> +{
>>>> + WARN_ON(mddev->pers->head.id != ID_RAID1 &&
>>>> + mddev->pers->head.id != ID_RAID10);
>>>> + return _md_error(mddev, rdev, true);
>>>> +}
>>>> +EXPORT_SYMBOL(md_error_failfast);
>>>> +
>>>
>>> I would prefer we add a common procedure to fix this problem.
>>>
>>> How about a first patch to serialize all the md_error() calls, and then
>>> add a new helper md_bio_failure_error(mddev, rdev, bio), called when a
>>> bio fails. In this helper:
>>>
>>> 1) if bio is not failfast, call md_error() and return true; otherwise:
>>> 2) if rdev contains the last data copy, return false directly; the caller
>>> should check the return value and retry, otherwise:
>>> 3) call md_error() and return true;
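>>>
>>> Something like this, just to illustrate (rdev_is_last_copy() is only a
>>> placeholder for the "last data copy" check, not an existing helper):
>>>
>>> bool md_bio_failure_error(struct mddev *mddev, struct md_rdev *rdev,
>>> 			  struct bio *bio)
>>> {
>>> 	/* a failfast IO on the last copy: let the caller retry instead */
>>> 	if ((bio->bi_opf & MD_FAILFAST) && rdev_is_last_copy(mddev, rdev))
>>> 		return false;
>>>
>>> 	md_error(mddev, rdev);
>>> 	return true;
>>> }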
>>
>> Hi,
>> I think this approach has some issues. There are cases where md_error is
>> called only when MD_FAILFAST is set.
>>
>> One example is the processing below in raid1_end_write_request.
>>
>>> Then, for raid1, the callers will look like:
>>>
>>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>>> index 1baaf52c603c..c6d150e9f1a7 100644
>>> --- a/drivers/md/md.c
>>> +++ b/drivers/md/md.c
>>> @@ -1003,9 +1003,7 @@ static void super_written(struct bio *bio)
>>> if (bio->bi_status) {
>>> pr_err("md: %s gets error=%d\n", __func__,
>>> blk_status_to_errno(bio->bi_status));
>>> - md_error(mddev, rdev);
>>> - if (!test_bit(Faulty, &rdev->flags)
>>> - && (bio->bi_opf & MD_FAILFAST)) {
>>> + if (!md_bio_failure_error(mddev, rdev, bio)) {
>>> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>>> set_bit(LastDev, &rdev->flags);
>>> }
>>>
>>> @@ -466,20 +472,11 @@ static void raid1_end_write_request(struct bio *bio)
>>> set_bit(MD_RECOVERY_NEEDED, &
>>> conf->mddev->recovery);
>>>
>>> - if (test_bit(FailFast, &rdev->flags) &&
>>> - (bio->bi_opf & MD_FAILFAST) &&
>>> /* We never try FailFast to WriteMostly devices */
>>> - !test_bit(WriteMostly, &rdev->flags)) {
>>> - md_error(r1_bio->mddev, rdev);
>>> - }
>>> -
>>> - /*
>>> - * When the device is faulty, it is not necessary to
>>> - * handle write error.
>>> - */
>>> - if (!test_bit(Faulty, &rdev->flags))
>>> + if(!test_bit(WriteMostly, &rdev->flags) &&
>>> + !md_bio_failure_error(mddev, rdev, bio)) {
>>> set_bit(R1BIO_WriteError, &r1_bio->state);
>>> - else {
>>> + } else {
>>> /* Finished with this branch */
>>> r1_bio->bios[mirror] = NULL;
>>> to_put = bio;
>>
>> In the current raid1_end_write_request implementation,
>> - md_error is called only in the Failfast case.
>> - Afterwards, if the rdev is not Faulty (that is, not Failfast,
>> or Failfast but the last rdev, which originally was not expected
>> to set MD_BROKEN in RAID1), R1BIO_WriteError is set.
>> In the suggested implementation, it seems that a non-Failfast write
>> failure will immediately mark the rdev as Faulty, without retries.
>>
>> This could be avoided by testing MD_FAILFAST before calling the
>> new helper md_bio_failure_error, but I believe duplicating the
>> same check in both caller/callee would be undesirable.
>>
>> Should we try to avoid modifying pers->error_handler?
>> One possible alternative approach is as follows.
>>
>> - serialize calls to md_error regardless of whether they are Failfast or not
>> - in raid{1,10}_error:
>>   - mark the remaining copy (rdev) with the LastDev flag
>>   - clear MD_FAILFAST_SUPPORTED to prevent super_write from using Failfast
>> - super_written will simply set MD_SB_NEED_REWRITE without calling
>>   md_error when the bio is MD_FAILFAST and the rdev is LastDev.
>>
>> After the changes, I believe it is rare for super_written to be called with an error on
>> multiple rdevs due to failfast. super_write is triggered by errors from normal failfast
>> IO and invoked via MD_SB_CHANGE_DEVS through the serialized raid1_error. Since
>> MD_FAILFAST_SUPPORTED is cleared, metadata writes occur without failfast.
>>
>> It's not exactly a common procedure, but as it doesn't add functions to md.c,
>> I think this approach is preferable to adding md_error_failfast().
>>
>> ...
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 1baaf52c603c..ba524fa96091 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1003,14 +1003,15 @@ static void super_written(struct bio *bio)
>> if (bio->bi_status) {
>> pr_err("md: %s gets error=%d\n", __func__,
>> blk_status_to_errno(bio->bi_status));
>> - md_error(mddev, rdev);
>> - if (!test_bit(Faulty, &rdev->flags)
>> + if (test_bit(LastDev, &rdev->flags)
>> && (bio->bi_opf & MD_FAILFAST)) {
>> + pr_warn("md: %s: Metadata write will be repeated to %pg\n",
>> + mdname(mddev), rdev->bdev);
>> set_bit(MD_SB_NEED_REWRITE, &mddev->sb_flags);
>> - set_bit(LastDev, &rdev->flags);
>> + } else {
>> + md_error(mddev, rdev);
>> }
>> - } else
>> - clear_bit(LastDev, &rdev->flags);
>> + }
>>
>> bio_put(bio);
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>> index 408c26398321..a52c5277add7 100644
>> --- a/drivers/md/raid1.c
>> +++ b/drivers/md/raid1.c
>> @@ -470,7 +470,7 @@ static void raid1_end_write_request(struct bio *bio)
>> (bio->bi_opf & MD_FAILFAST) &&
>> /* We never try FailFast to WriteMostly devices */
>> !test_bit(WriteMostly, &rdev->flags)) {
>> - md_error(r1_bio->mddev, rdev);
>> + raid1_md_error_failfast(r1_bio->mddev, rdev);
>> }
>>
>> /*
>> @@ -1733,6 +1733,27 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>> seq_printf(seq, "]");
>> }
>>
>> +static void _raid1_md_error(struct mddev *mddev, struct md_rdev *rdev, bool failfast){
>> + struct r1conf *conf = mddev->private;
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&conf->new_lock_for_md_error, flags);
>> + if (failfast)
>> + set_bit(FailfastIOFailure, &rdev->flags);
>> + md_error(mddev, rdev);
>> + if (failfast)
>> + WARN_ON(!test_and_clear_bit(FailfastIOFailure, &rdev->flags));
>> + spin_unlock_irqrestore(&conf->new_lock_for_md_error, flags);
>> +}
>> +
>> +static void raid1_md_error(struct mddev *mddev, struct md_rdev *rdev){
>> + return _raid1_md_error(mddev, rdev, false);
>> +}
>> +
>> +static void raid1_md_error_failfast(struct mddev *mddev, struct md_rdev *rdev){
>> + return _raid1_md_error(mddev, rdev, true);
>> +}
>> +
>> /**
>> * raid1_error() - RAID1 error handler.
>> * @mddev: affected md device.
>> @@ -1758,6 +1783,13 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>
>> if (test_bit(In_sync, &rdev->flags) &&
>> (conf->raid_disks - mddev->degraded) == 1) {
>> + if (test_bit(FailfastIOFailure, &rdev->flags)) {
>> + spin_unlock_irqrestore(&conf->device_lock, flags);
>> + pr_warn_ratelimited("md/raid1:%s: Failfast IO failure on %pg, "
>> + "last device but ignoring it\n",
>> + mdname(mddev), rdev->bdev);
>> + return;
>> + }
>> set_bit(MD_BROKEN, &mddev->flags);
>>
>> if (!mddev->fail_last_dev) {
>> @@ -1767,8 +1799,16 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>> }
>> }
>> set_bit(Blocked, &rdev->flags);
>> - if (test_and_clear_bit(In_sync, &rdev->flags))
>> + if (test_and_clear_bit(In_sync, &rdev->flags)) {
>> mddev->degraded++;
>> + clear_bit(MD_FAILFAST_SUPPORTED, &mddev->flags);
>> + for (i = 0; i < conf->raid_disks; i++) {
>> + struct md_rdev *rdev2 = conf->mirrors[i].rdev;
>> + if (rdev2 && rdev != rdev2 &&
>> + test_bit(In_sync, &rdev2->flags))
>> + set_bit(LastDev, &rdev2->flags);
>> + }
>> + }
>> set_bit(Faulty, &rdev->flags);
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> /*
>> @@ -2118,7 +2158,7 @@ static int r1_sync_page_io(struct md_rdev *rdev, sector_t sector,
>> }
>> /* need to record an error - either for the block or the device */
>> if (!rdev_set_badblocks(rdev, sector, sectors, 0))
>> - md_error(rdev->mddev, rdev);
>> + raid1_md_error(rdev->mddev, rdev);
>> return 0;
>> }
>> ...
>>
>>
>> Thanks,
>> Akagi
>>
>>> @@ -2630,7 +2627,6 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>> */
>>>
>>> bio = r1_bio->bios[r1_bio->read_disk];
>>> - bio_put(bio);
>>> r1_bio->bios[r1_bio->read_disk] = NULL;
>>>
>>> rdev = conf->mirrors[r1_bio->read_disk].rdev;
>>> @@ -2639,19 +2635,18 @@ static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
>>> freeze_array(conf, 1);
>>> fix_read_error(conf, r1_bio);
>>> unfreeze_array(conf);
>>> - } else if (mddev->ro == 0 && test_bit(FailFast, &rdev->flags)) {
>>> - md_error(mddev, rdev);
>>> - } else {
>>> + } else if (mddev->ro == 0 &&
>>> + !md_bio_failure_error(mddev, rdev, bio)) {
>>> r1_bio->bios[r1_bio->read_disk] = IO_BLOCKED;
>>> }
>>>
>>> + bio_put(bio);
>>> rdev_dec_pending(rdev, conf->mddev);
>>> sector = r1_bio->sector;
>>> - bio = r1_bio->master_bio;
>>>
>>> /* Reuse the old r1_bio so that the IO_BLOCKED settings are preserved */
>>> r1_bio->state = 0;
>>> - raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
>>> +	raid1_read_request(mddev, r1_bio->master_bio, r1_bio->sectors, r1_bio);
>>> allow_barrier(conf, sector);
>>> }
>>>
>>>
>>>> /* seq_file implementation /proc/mdstat */
>>>> diff --git a/drivers/md/md.h b/drivers/md/md.h
>>>> index 51af29a03079..6ca1aea630ce 100644
>>>> --- a/drivers/md/md.h
>>>> +++ b/drivers/md/md.h
>>>> @@ -758,3 +758,3 @@ struct md_personality
>>>> */
>>>> - void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev);
>>>> + void (*error_handler)(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>>>> int (*hot_add_disk) (struct mddev *mddev, struct md_rdev *rdev);
>>>> @@ -903,3 +903,5 @@ extern void md_write_end(struct mddev *mddev);
>>>> extern void md_done_sync(struct mddev *mddev, int blocks, int ok);
>>>> +void _md_error(struct mddev *mddev, struct md_rdev *rdev, bool nofail);
>>>> extern void md_error(struct mddev *mddev, struct md_rdev *rdev);
>>>> +extern void md_error_failfast(struct mddev *mddev, struct md_rdev *rdev);
>>>> extern void md_finish_reshape(struct mddev *mddev);
>>>> diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
>>>> index f1d8811a542a..8aea51227a96 100644
>>>> --- a/drivers/md/raid0.c
>>>> +++ b/drivers/md/raid0.c
>>>> @@ -637,3 +637,4 @@ static void raid0_status(struct seq_file *seq, struct mddev *mddev)
>>>>
>>>> -static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> +static void raid0_error(struct mddev *mddev, struct md_rdev *rdev,
>>>> + bool nofail __maybe_unused)
>>>> {
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index 408c26398321..d93275899e9e 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -1739,2 +1739,3 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>> * @rdev: member device to fail.
>>>> + * @nofail: when set, @mddev and @rdev must not be failed even if @rdev is the last device
>>>> *
>>>> @@ -1748,6 +1749,8 @@ static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>>>> *
>>>> - * @rdev is marked as &Faulty excluding case when array is failed and
>>>> - * &mddev->fail_last_dev is off.
>>>> + * @rdev is marked as &Faulty excluding any cases:
>>>> + * - when @mddev is failed and &mddev->fail_last_dev is off
>>>> + * - when @rdev is last device and @nofail is true
>>>> */
>>>> -static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> +static void raid1_error(struct mddev *mddev, struct md_rdev *rdev,
>>>> + bool nofail)
>>>> {
>>>> @@ -1760,2 +1763,9 @@ static void raid1_error(struct mddev *mddev, struct md_rdev *rdev)
>>>> (conf->raid_disks - mddev->degraded) == 1) {
>>>> + if (nofail) {
>>>> + spin_unlock_irqrestore(&conf->device_lock, flags);
>>>> + pr_warn_ratelimited("md/raid1:%s: IO failure on %pg, "
>>>> + "last device but ignoring it\n",
>>>> + mdname(mddev), rdev->bdev);
>>>> + return;
>>>> + }
>>>> set_bit(MD_BROKEN, &mddev->flags);
>>>> ...
>>>>
>>>>> Kuai, do you have any suggestions?
>>>>>
>>>>>>> Normal writes may call md_error() in narrow_write_error. Normal reads do
>>>>>>> not execute md_error() on the last disk.
>>>>>>>
>>>>>>> Perhaps you should get more information to confirm how MD_BROKEN is set in
>>>>>>> normal read/write IO.
>>>>>>
>>>>>> Should I add the above sequence of events to the cover letter, or commit message?
>>>>>>
>>>>>
>>>>> I think we should mention this in the commit message.
>>>>
>>>> Understood. I will explicitly describe this in the commit message in v4.
>>>>
>>>> Thanks,
>>>> Akagi
>>>>
>>>>>> Thanks,
>>>>>> Akagi
>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> Nan
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> .
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> Nan
>>>>>
>>>>>
>>>>
>>>> .
>>>>
>>>
>>>
>>
>
> .
>