Message-ID: <20250127090049.7952-1-dougvj@dougvj.net>
Date: Mon, 27 Jan 2025 02:00:41 -0700
From: Doug V Johnson <dougvj@...gvj.net>
To:
Cc: Doug Johnson <dougvj@...il.com>,
Doug V Johnson <dougvj@...gvj.net>,
Song Liu <song@...nel.org>,
Yu Kuai <yukuai3@...wei.com>,
linux-raid@...r.kernel.org (open list:SOFTWARE RAID (Multiple Disks) SUPPORT),
linux-kernel@...r.kernel.org (open list)
Subject: [PATCH v2 1/2] md/raid5: freeze reshape when encountering a bad read

While adding an additional drive to a raid6 array, the reshape stalled
at about 13% complete and any I/O on the array hung, creating an
effective soft lock. The kernel reported a hung task in the
mdXX_reshape thread, and I had to use magic SysRq to recover because
systemd hung as well.

I first suspected an issue with one of the underlying block devices,
and as a precaution I recovered the data in read-only mode to a new
array. The problem turned out to be in the RAID layer, however, as I
was able to recreate the issue from a superblock dump in sparse files.

After some digging I discovered that I had somehow propagated bad block
list entries to several devices in the array, leaving a few blocks
unreadable. The bad reads were reported correctly in userspace during
the recovery, but at the time it wasn't obvious that they came from bad
block list metadata, which only reinforced my suspicion of hardware
issues.

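The entries can be confirmed from userspace, for example with mdadm's
--examine-badblocks or the per-device bad_blocks sysfs attribute (the
device and array names below are only placeholders):

  # bad block log stored in a member device's metadata
  mdadm --examine-badblocks /dev/sdX1

  # runtime view of a member's bad block list on an assembled array
  cat /sys/block/mdXX/md/dev-sdX1/bad_blocks
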
I was able to reproduce the issue with a minimal test case using small
loopback devices. I put a script for this in a GitHub repository:

https://github.com/dougvj/md_badblock_reshape_stall_test

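The reproduction boils down to roughly the following sketch (not the
exact script from the repository; device names, sizes, sectors, and the
sysfs-based bad block injection are illustrative):

  # create small backing files and attach loop devices
  truncate -s 128M disk{0..3}.img
  for i in 0 1 2 3; do losetup /dev/loop$i disk$i.img; done

  # build a 4-device raid6 array and let the initial sync finish
  mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/loop{0..3}

  # mark the same sectors bad on three members so the data there is
  # unrecoverable (the linked script instead recreates the state from
  # superblock dumps in sparse files)
  echo "2048 8" > /sys/block/md0/md/dev-loop1/bad_blocks
  echo "2048 8" > /sys/block/md0/md/dev-loop2/bad_blocks
  echo "2048 8" > /sys/block/md0/md/dev-loop3/bad_blocks

  # add a fifth device and grow; the reshape stalls when it reaches
  # the bad blocks
  truncate -s 128M disk4.img && losetup /dev/loop4 disk4.img
  mdadm --add /dev/md0 /dev/loop4
  mdadm --grow /dev/md0 --raid-devices=5
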
This patch handles bad reads during a reshape by introducing a
handle_failed_reshape function, modeled on handle_failed_sync. The
function aborts the current stripe by clearing STRIPE_EXPANDING and
STRIPE_EXPAND_READY, sets the MD_RECOVERY_FROZEN bit, reverts the head
of the reshape to the safe position, and reports the situation in
dmesg.

Signed-off-by: Doug V Johnson <dougvj@...gvj.net>
---
 drivers/md/raid5.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5c79429acc64..bc0b0c2540f0 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3738,6 +3738,27 @@ handle_failed_sync(struct r5conf *conf, struct stripe_head *sh,
 	md_done_sync(conf->mddev, RAID5_STRIPE_SECTORS(conf), !abort);
 }
 
+static void
+handle_failed_reshape(struct r5conf *conf, struct stripe_head *sh,
+		      struct stripe_head_state *s)
+{
+	/* Abort the current stripe */
+	clear_bit(STRIPE_EXPANDING, &sh->state);
+	clear_bit(STRIPE_EXPAND_READY, &sh->state);
+	pr_warn_ratelimited("md/raid:%s: read error during reshape at %llu, cannot progress\n",
+			    mdname(conf->mddev),
+			    (unsigned long long)sh->sector);
+	/* Freeze the reshape */
+	set_bit(MD_RECOVERY_FROZEN, &conf->mddev->recovery);
+	/* Revert progress to the safe position */
+	spin_lock_irq(&conf->device_lock);
+	conf->reshape_progress = conf->reshape_safe;
+	spin_unlock_irq(&conf->device_lock);
+	/* Report the failed sync and wake up any waiters */
+	md_done_sync(conf->mddev, 0, 0);
+	wake_up(&conf->wait_for_reshape);
+}
+
 static int want_replace(struct stripe_head *sh, int disk_idx)
 {
 	struct md_rdev *rdev;
@@ -4987,6 +5008,8 @@ static void handle_stripe(struct stripe_head *sh)
 			handle_failed_stripe(conf, sh, &s, disks);
 		if (s.syncing + s.replacing)
 			handle_failed_sync(conf, sh, &s);
+		if (test_bit(STRIPE_EXPANDING, &sh->state))
+			handle_failed_reshape(conf, sh, &s);
 	}
 
 	/* Now we check to see if any write operations have recently
--
2.48.1