Message-ID: <bc342de0-98d2-1733-39cd-cc1999777ff3@molgen.mpg.de>
Date: Mon, 25 Jan 2021 22:32:30 +0100
From: Donald Buczek <buczek@...gen.mpg.de>
To: Guoqing Jiang <guoqing.jiang@...ud.ionos.com>,
Song Liu <song@...nel.org>, linux-raid@...r.kernel.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
it+raid@...gen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
transition
On 25.01.21 09:54, Donald Buczek wrote:
> Dear Guoqing,
>
> a colleague of mine was able to reproduce the issue inside a VM and found a procedure that drives the VM into the issue within minutes (rather than only unreliably after hours on a physical system, as before). This, of course, helped to pinpoint the problem.
>
> My current theory of what is happening is:
>
> - MD_SB_CHANGE_CLEAN and MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>
> - Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check with `echo idle > /sys/block/mdX/md/sync_action`.
>
> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) The dump of struct mddev from the email of 1/19/21 confirms that reconfig_mutex is non-zero.
>
> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329), although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729). This is why raid5d is busy-looping.
>
> - raid5d()->md_check_recovery() is called from the raid5d main loop. One of its duties is to write the superblock if a change is pending. However, to do so it either needs MD_ALLOW_SB_UPDATE to be set or must be able to take the reconfig_mutex (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967, https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006). Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared (see the condensed sketch below).
>
> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>
> Does this make sense?
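>
> For readability, here is a condensed sketch of the code paths at the heart of the loop, as I read them in v5.4.57 (abridged and annotated, not verbatim):
>
> ```
> /* drivers/md/raid5.c, raid5d() main loop (abridged):
>  * handle_active_stripes() keeps returning a positive batch_size, so the
>  * "!batch_size && !released" exit is never taken. */
> while (1) {
>         ...
>         released = release_stripe_list(conf, conf->temp_inactive_list);
>         ...
>         batch_size = handle_active_stripes(conf, ANY_GROUP, NULL,
>                                            conf->temp_inactive_list);
>         if (!batch_size && !released)
>                 break;                        /* never reached here */
>         handled += batch_size;
>
>         if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
>                 spin_unlock_irq(&conf->device_lock);
>                 md_check_recovery(mddev);     /* can't write the sb, see below */
>                 spin_lock_irq(&conf->device_lock);
>         }
> }
>
> /* drivers/md/raid5.c, handle_stripe(): with MD_SB_CHANGE_PENDING set, every
>  * stripe is re-queued instead of processed, so the number of active stripes
>  * never goes down and the sync thread keeps waiting. */
> if (s.handle_bad_blocks ||
>     test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>         set_bit(STRIPE_HANDLE, &sh->state);
>         goto finish;
> }
>
> /* drivers/md/md.c, md_check_recovery(): the superblock would be written
>  * here, but that requires MD_ALLOW_SB_UPDATE (not set) or the
>  * reconfig_mutex (held by action_store()), so MD_SB_CHANGE_PENDING is
>  * never cleared. */
> if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags) {
>         ...
>         md_update_sb(mddev, 0);
> }
> ...
> if (mddev_trylock(mddev)) {                   /* fails: action_store() holds it */
>         ...
>         if (mddev->sb_flags)
>                 md_update_sb(mddev, 0);
>         ...
>         mddev_unlock(mddev);
> }
> ```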
>
> Just for reference, here is the procedure which triggers the issue on the VM (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>
> ```
> #! /bin/bash
>
> # Toggle a "check" resync on and off in a loop.
> (
>     while true; do
>         echo "start check"
>         echo check > /sys/block/md3/md/sync_action
>         sleep 10
>         echo "stop check"
>         echo idle > /sys/block/md3/md/sync_action
>         sleep 2
>     done
> ) &
>
> # Generate continuous file-system write traffic on the array.
> (
>     while true; do
>         dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>         sync /mnt/raid_md3/bigfile
>         rm /mnt/raid_md3/bigfile
>         sleep .1
>     done
> ) &
>
> # Sample the array once per second; when array_state has been stuck at
> # "write-pending" for three samples, the deadlock has occurred: kill the
> # whole process group and exit.
> start="$(date +%s)"
> cd /sys/block/md3/md
> wp_count=0
> while true; do
>     array_state=$(cat array_state)
>     if [ "$array_state" = write-pending ]; then
>         wp_count=$(($wp_count+1))
>     else
>         wp_count=0
>     fi
>     echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>     if [ $wp_count -ge 3 ]; then
>         kill -- -$$
>         exit
>     fi
>     sleep 1
> done
> ```
>
> The time this needs to trigger the issue varies from under a minute to one hour, with five minutes being typical. The output ends like this:
>
> 309 check 6283872 / 8378368 active-idle 4144
> 310 check 6283872 / 8378368 active 1702
> 311 check 6807528 / 8378368 active 4152
> 312 check 7331184 / 8378368 clean 3021
> stop check
> 313 check 7331184 / 8378368 write-pending 3905
> 314 check 7331184 / 8378368 write-pending 3905
> 315 check 7331184 / 8378368 write-pending 3905
> Terminated
>
> If I add
>
> pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
>
> in raid5d after the call to handle_active_stripes and enable that debug location after the deadlock has occurred, I get:
>
> [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
> [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
> [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
> [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>
> If I add
>
> pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>
> at the head of md_check_recovery, I get:
>
> [ 789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
> [ 789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
> [ 789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
> [ 789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
> [ 789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>
> More debug lines in md_check_recovery confirm the control flow (the `if (mddev_trylock(mddev))` block is not taken).
>
> What approach would you suggest to fix this?

I naively tried the following patch, and it seems to fix the problem: the test procedure didn't trigger the deadlock in 10 hours. The idea is to set MD_ALLOW_SB_UPDATE around the shutdown of the sync thread, so that md_check_recovery() can write the superblock (as described above) even while action_store() holds the reconfig_mutex.
D.
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
mddev_lock(mddev) == 0) {
+ set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_reap_sync_thread(mddev);
}
+ clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
mddev_unlock(mddev);
}
} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
--
2.30.0