Message-ID: <20250915034210.8533-1-k@mgml.me>
Date: Mon, 15 Sep 2025 12:42:01 +0900
From: Kenta Akagi <k@...l.me>
To: Song Liu <song@...nel.org>, Yu Kuai <yukuai3@...wei.com>,
	Mariusz Tkaczyk <mtkaczyk@...nel.org>, Shaohua Li <shli@...com>,
	Guoqing Jiang <jgq516@...il.com>
Cc: linux-raid@...r.kernel.org, linux-kernel@...r.kernel.org,
	Kenta Akagi <k@...l.me>
Subject: [PATCH v4 0/9] Don't set MD_BROKEN on failfast bio failure

Changes from V3:
- The error handling in md_error() is now serialized, and a new helper
  function, md_bio_failure_error(), has been introduced.
- MD_FAILFAST bio failures are now handled by md_bio_failure_error()
  instead of being signaled via FailfastIOFailure.
- RAID10: Fix the missing rescheduling of a failfast read bio failure.
- In narrow_write_error, writes that succeed on retry are now returned
  to the upper layer as success, regardless of failfast.

Changes from V2:
- Prevent the array from being marked broken for all Failfast I/Os,
  not just metadata.
- Following the review, update raid{1,10}_error to clear
  FailfastIOFailure so that devices are properly marked Faulty.

Changes from V1:
- Avoid setting MD_BROKEN instead of clearing it
- Add pr_crit() when setting MD_BROKEN
- Fix the message that may be shown after all rdevs have failed:
  "Operation continuing on 0 devices"

v3: https://lore.kernel.org/linux-raid/20250828163216.4225-1-k@mgml.me/
v2: https://lore.kernel.org/linux-raid/20250817172710.4892-1-k@mgml.me/
v1: https://lore.kernel.org/linux-raid/20250812090119.153697-1-k@mgml.me/

When multiple MD_FAILFAST bios fail simultaneously on Failfast-enabled
rdevs in RAID1/RAID10, the following issues can occur:

* MD_BROKEN is set and the array halts, even though this should not occur
  under the intended Failfast design.
* Writes retried through narrow_write_error succeed, but the I/O is still
  reported as BLK_STS_IOERR.
* RAID10 only: If a Failfast read I/O fails, it is not retried on any
  remaining rdev, and as a result, the upper layer receives an I/O error.

Simultaneous bio failures across multiple rdevs are uncommon; however,
rdevs serviced via nvme-tcp can still experience them due to something as
simple as an Ethernet fault. The issue can be reproduced using the
following steps.

# prepare nvmet/nvme-tcp and md array #
sh-5.2# cat << 'EOF' > loopback-nvme.sh
set -eu
nqn="nqn.2025-08.io.example:nvmet-test-$1"
back=$2
cd /sys/kernel/config/nvmet/
mkdir subsystems/$nqn
echo 1 > subsystems/${nqn}/attr_allow_any_host
mkdir subsystems/${nqn}/namespaces/1
echo -n ${back} > subsystems/${nqn}/namespaces/1/device_path
echo 1 > subsystems/${nqn}/namespaces/1/enable
ports="ports/1"
if [ ! -d $ports ]; then
    mkdir $ports
    cd $ports
    echo 127.0.0.1 > addr_traddr
    echo tcp > addr_trtype
    echo 4420 > addr_trsvcid
    echo ipv4 > addr_adrfam
    cd ../../
fi
ln -s /sys/kernel/config/nvmet/subsystems/${nqn} ${ports}/subsystems/
nvme connect -t tcp -n $nqn -a 127.0.0.1 -s 4420
EOF
sh-5.2# chmod +x loopback-nvme.sh
sh-5.2# modprobe -a nvme-tcp nvmet-tcp
sh-5.2# truncate -s 1g a.img b.img
sh-5.2# losetup --show -f a.img
/dev/loop0
sh-5.2# losetup --show -f b.img
/dev/loop1
sh-5.2# ./loopback-nvme.sh 0 /dev/loop0
connecting to device: nvme0
sh-5.2# ./loopback-nvme.sh 1 /dev/loop1
connecting to device: nvme1
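
(Not part of the original steps: an optional check that both namespaces
came up before creating the array; this assumes they appear as
/dev/nvme0n1 and /dev/nvme1n1, matching the mdadm command below.)

# optional: list the connected nvme-tcp namespaces #
sh-5.2# nvme list
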
sh-5.2# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 \
--failfast /dev/nvme0n1 --failfast /dev/nvme1n1
...
mdadm: array /dev/md0 started.
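
(Not part of the original steps: an optional sanity check, before running
fio, that the array is clean and the failfast flag is actually set on both
members; the exact fields shown depend on the mdadm version.)

# optional: confirm the array is clean and failfast is set #
sh-5.2# cat /proc/mdstat
sh-5.2# mdadm --detail /dev/md0
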
# run fio #
sh-5.2# fio --name=test --filename=/dev/md0 --rw=randrw --rwmixread=50 \
--bs=4k --numjobs=9 --time_based --runtime=300s --group_reporting --direct=1

# The issue can be reproduced by blocking nvme traffic during fio #
sh-5.2# iptables -A INPUT -m tcp -p tcp --dport 4420 -j DROP;
sh-5.2# sleep 10; # twice the default KATO value
sh-5.2# iptables -D INPUT -m tcp -p tcp --dport 4420 -j DROP
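
(Also not part of the original steps: one way to observe the outcome after
traffic is restored; the exact messages and reported state depend on the
kernel and mdadm versions.)

# optional: observe the result after restoring traffic #
sh-5.2# dmesg | grep -i 'md/raid1' | tail
sh-5.2# cat /proc/mdstat                 # faulty members are marked (F)
sh-5.2# mdadm --detail /dev/md0 | grep -i -e state -e failed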

Patches 1-3 serve as preparatory changes for patch 4.
Patch 4 prevents an MD_FAILFAST bio failure from causing the array to fail.
Patch 5 reports success to the upper layer, regardless of FAILFAST, when a
write retry in narrow_write_error succeeds without marking a badblock.
Patch 6 fixes an issue where writes are not retried in a no-bbl
configuration when an MD_FAILFAST bio fails on the last rdev.
Patch 7 adds the missing retry path for Failfast read errors in RAID10.
Patches 8-9 modify the pr_crit handling in raid{1,10}_error.

Kenta Akagi (9):
  md/raid1,raid10: Set the LastDev flag when the configuration changes
  md: serialize md_error()
  md: introduce md_bio_failure_error()
  md/raid1,raid10: Don't set MD_BROKEN on failfast bio failure
  md/raid1,raid10: Set R{1,10}BIO_Uptodate when successful retry of a
    failed bio
  md/raid1,raid10: Fix missing retries Failfast write bios on no-bbl
    rdevs
  md/raid10: fix failfast read error not rescheduled
  md/raid1,raid10: Add error message when setting MD_BROKEN
  md/raid1,raid10: Fix: Operation continuing on 0 devices.

 drivers/md/md.c     |  61 ++++++++++++++---
 drivers/md/md.h     |  12 +++-
 drivers/md/raid1.c  | 156 ++++++++++++++++++++++++++++++++------------
 drivers/md/raid10.c | 115 ++++++++++++++++++++++++++------
 4 files changed, 270 insertions(+), 74 deletions(-)

--
2.50.1