Message-ID: <20260105144025.12478-1-k@mgml.me>
Date: Mon, 5 Jan 2026 23:40:23 +0900
From: Kenta Akagi <k@...l.me>
To: Song Liu <song@...nel.org>, Yu Kuai <yukuai@...as.com>,
Shaohua Li <shli@...com>, Mariusz Tkaczyk <mtkaczyk@...nel.org>,
Xiao Ni <xni@...hat.com>
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
Kenta Akagi <k@...l.me>
Subject: [PATCH v6 0/2] Don't set MD_BROKEN on failfast bio failure
Changes from V5:
- Prevent md from being marked broken when the rdev has FailFast set, regardless of the bio's flags.
Thanks to Xiao for the advice:
https://lore.kernel.org/linux-raid/CALTww2_nJqyA99cG9YNarXEB4wimFK=pKy=qrxdkfB60PaUa1w@mail.gmail.com/#t
- Drop the preparation refactor, flag rename, and error logging
improvement commits
Changes from V4:
- Use device_lock to serialize md_error() instead of adding a new
spinlock_t.
- Rename new function md_bio_failure_error() to md_cond_error().
- Add helper function pers->should_error() to determine whether to fail
the rdev on a failfast bio failure, instead of using the LastDev flag.
- Avoid changing the behavior of the LastDev flag.
- Drop fix for R{1,10}BIO_Uptodate not being set despite successful
retry; this will be sent separately after Nan's refactor.
- Drop fix for the message 'Operation continuing on 0 devices'; it is
outside the scope of this patch and will be sent separately.
- Improve logging when metadata writing fails.
- Rename LastDev to RetryingSBWrite.
Changes from V3:
- The error handling in md_error() is now serialized, and a new helper
function, md_bio_failure_error, has been introduced.
- MD_FAILFAST bio failures are now processed by md_bio_failure_error
instead of signaling via FailfastIOFailure.
- RAID10: Fix missing reschedule of failfast read bio failure
- Regardless of failfast, writes that succeed on retry in
narrow_write_error are now returned to the upper layer as successful
Changes from V2:
- Prevent the array from being marked broken for all Failfast I/O,
not just metadata I/O.
- Reflecting the review, update raid{1,10}_error to clear
FailfastIOFailure so that devices are properly marked Faulty.
Changes from V1:
- Avoid setting MD_BROKEN instead of clearing it
- Add pr_crit() when setting MD_BROKEN
- Fix the message that may be shown after all rdevs fail:
"Operation continuing on 0 devices"
v5: https://lore.kernel.org/linux-raid/20251027150433.18193-1-k@mgml.me/
v4: https://lore.kernel.org/linux-raid/20250915034210.8533-1-k@mgml.me/
v3: https://lore.kernel.org/linux-raid/20250828163216.4225-1-k@mgml.me/
v2: https://lore.kernel.org/linux-raid/20250817172710.4892-1-k@mgml.me/
v1: https://lore.kernel.org/linux-raid/20250812090119.153697-1-k@mgml.me/
When multiple MD_FAILFAST bios fail simultaneously on Failfast-enabled
rdevs in RAID1/RAID10, the following issues can occur:
* MD_BROKEN is set and the array halts, even though this should not occur
under the intended Failfast design.
* Writes retried through narrow_write_error succeed, but the I/O is still
reported as BLK_STS_IOERR
* NOTE: a fix for this was removed in v5 and will be sent separately:
https://lore.kernel.org/linux-raid/6f0f9730-4bbe-7f3c-1b50-690bb77d5d90@huaweicloud.com/
* RAID10 only: If a Failfast read I/O fails, it is not retried on any
remaining rdev, and as a result, the upper layer receives an I/O error.
Simultaneous bio failures across multiple rdevs are uncommon; however,
rdevs serviced via nvme-tcp can still experience them due to something as
simple as an Ethernet fault. The issue can be reproduced using the
following steps.
# prepare nvmet/nvme-tcp and md array
sh-5.2# cat << 'EOF' > loopback-nvme.sh
set -eu
nqn="nqn.2025-08.io.example:nvmet-test-$1"
back=$2
cd /sys/kernel/config/nvmet/
mkdir subsystems/$nqn
echo 1 > subsystems/${nqn}/attr_allow_any_host
mkdir subsystems/${nqn}/namespaces/1
echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
echo 1 > subsystems/${nqn}/namespaces/1/enable
ports="ports/1"
if [ ! -d $ports ]; then
    mkdir $ports
    cd $ports
    echo 127.0.0.1 > addr_traddr
    echo tcp > addr_trtype
    echo 4420 > addr_trsvcid
    echo ipv4 > addr_adrfam
    cd ../../
fi
ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
EOF
sh-5.2# chmod +x loopback-nvme.sh
sh-5.2# modprobe -a nvme-tcp nvmet-tcp
sh-5.2# truncate -s 1g a.img b.img
sh-5.2# losetup --show -f a.img
/dev/loop0
sh-5.2# losetup --show -f b.img
/dev/loop1
sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
connecting to device: nvme0
sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
connecting to device: nvme1
sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
--failfast /dev/nvme0n1 --failfast /dev/nvme1n1
...
mdadm: array /dev/md0 started.
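# optional sanity check (not part of the original steps): confirm that the
# FailFast flag is set on both members via sysfs; the dev-* names below
# follow the member devices created above
sh-5.2# cat /sys/block/md0/md/dev-nvme0n1/state /sys/block/md0/md/dev-nvme1n1/state
# each line should include "failfast", e.g. "in_sync,failfast"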
# run fio
sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
--bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1 &
# The issue can be reproduced by blocking nvme traffic while fio is running
sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
sh-5.2# sleep 10; # twice the default KATO value
sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP
Patch 1 prevents the array from being marked broken when FailFast is set on the rdev.
Patch 2 adds the missing retry path for Failfast read errors in RAID10.
Kenta Akagi (2):
md: Don't set MD_BROKEN for RAID1 and RAID10 when using FailFast
md/raid10: fix failfast read error not rescheduled
drivers/md/md.c | 6 ++++--
drivers/md/raid1.c | 8 +++++++-
drivers/md/raid10.c | 15 ++++++++++++++-
3 files changed, 25 insertions(+), 4 deletions(-)
--
2.50.1