Message-ID: <bc342de0-98d2-1733-39cd-cc1999777ff3@molgen.mpg.de>
Date: Mon, 25 Jan 2021 22:32:30 +0100
From: Donald Buczek <buczek@...gen.mpg.de>
To: Guoqing Jiang <guoqing.jiang@...ud.ionos.com>,
Song Liu <song@...nel.org>, linux-raid@...r.kernel.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
it+raid@...gen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
transition
On 25.01.21 09:54, Donald Buczek wrote:
> Dear Guoqing,
>
> a colleague of mine was able to reproduce the issue inside a VM and found a procedure that drives the VM into the issue within minutes (rather than only unreliably after hours on a physical system, as before). This, of course, helped to pinpoint the problem.
>
> My current theory of what is happening is:
>
> - MD_SB_CHANGE_CLEAN and MD_SB_CHANGE_PENDING are set by md_write_start() when file-system I/O wants to do a write and the array transitions from "clean" to "active". (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>
> - Before raid5d gets to write the superblock (it's busy processing active stripes because of the sync activity), userspace wants to pause the check with `echo idle > /sys/block/mdX/md/sync_action`.
>
> - action_store() takes the reconfig_mutex before trying to stop the sync thread. (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689) The dump of struct mddev from the email of 1/19/21 confirms that reconfig_mutex is non-zero.
>
> - raid5d is running in its main loop. raid5d()->handle_active_stripes() returns a positive batch size (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329), although raid5d()->handle_active_stripes()->handle_stripe() doesn't process any stripe because of MD_SB_CHANGE_PENDING (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729). This is why raid5d is busy-looping.
>
> - raid5d()->md_check_recovery() is called from the raid5d main loop. One of its duties is to write the superblock if a change is pending. However, to do so it either needs MD_ALLOW_SB_UPDATE to be set or must be able to take the reconfig_mutex (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967, https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006). Neither is true, so the superblock is not written and MD_SB_CHANGE_PENDING is not cleared (see the condensed sketch below).
>
> - (as discussed previously) the sync thread is waiting for the number of active stripes to go down and doesn't terminate. The userspace thread is waiting for the sync thread to terminate.
>
> Does this make sense?
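>
> For readability, here is a condensed sketch of the code paths at the heart of the loop, as I read them in v5.4.57 (abridged and annotated, not verbatim):
>
> ```
> /* drivers/md/raid5.c, raid5d() main loop (abridged):
>  * handle_active_stripes() keeps returning a positive batch_size, so the
>  * "!batch_size && !released" exit is never taken. */
> while (1) {
>         ...
>         released = release_stripe_list(conf, conf->temp_inactive_list);
>         ...
>         batch_size = handle_active_stripes(conf, ANY_GROUP, NULL,
>                                            conf->temp_inactive_list);
>         if (!batch_size && !released)
>                 break;                        /* never reached here */
>         handled += batch_size;
>
>         if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
>                 spin_unlock_irq(&conf->device_lock);
>                 md_check_recovery(mddev);     /* can't write the sb, see below */
>                 spin_lock_irq(&conf->device_lock);
>         }
> }
>
> /* drivers/md/raid5.c, handle_stripe(): with MD_SB_CHANGE_PENDING set, every
>  * stripe is re-queued instead of processed, so the number of active stripes
>  * never goes down and the sync thread keeps waiting. */
> if (s.handle_bad_blocks ||
>     test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>         set_bit(STRIPE_HANDLE, &sh->state);
>         goto finish;
> }
>
> /* drivers/md/md.c, md_check_recovery(): the superblock would be written
>  * here, but that requires MD_ALLOW_SB_UPDATE (not set) or the
>  * reconfig_mutex (held by action_store()), so MD_SB_CHANGE_PENDING is
>  * never cleared. */
> if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags) {
>         ...
>         md_update_sb(mddev, 0);
> }
> ...
> if (mddev_trylock(mddev)) {                   /* fails: action_store() holds it */
>         ...
>         if (mddev->sb_flags)
>                 md_update_sb(mddev, 0);
>         ...
>         mddev_unlock(mddev);
> }
> ```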
>
> Just for reference, here is the procedure which triggers the issue on the VM (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>
> ```
> #! /bin/bash
>
> # Toggle a "check" resync on and off in a loop.
> (
>     while true; do
>         echo "start check"
>         echo check > /sys/block/md3/md/sync_action
>         sleep 10
>         echo "stop check"
>         echo idle > /sys/block/md3/md/sync_action
>         sleep 2
>     done
> ) &
>
> # Generate continuous file-system write traffic on the array.
> (
>     while true; do
>         dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>         sync /mnt/raid_md3/bigfile
>         rm /mnt/raid_md3/bigfile
>         sleep .1
>     done
> ) &
>
> # Sample the array once per second; when array_state has been stuck at
> # "write-pending" for three samples, the deadlock has occurred: kill the
> # whole process group and exit.
> start="$(date +%s)"
> cd /sys/block/md3/md
> wp_count=0
> while true; do
>     array_state=$(cat array_state)
>     if [ "$array_state" = write-pending ]; then
>         wp_count=$(($wp_count+1))
>     else
>         wp_count=0
>     fi
>     echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>     if [ $wp_count -ge 3 ]; then
>         kill -- -$$
>         exit
>     fi
>     sleep 1
> done
> ```
>
> The time this needs to trigger the issue varies from under a minute to one hour, with five minutes being typical. The output ends like this:
>
> 309 check 6283872 / 8378368 active-idle 4144
> 310 check 6283872 / 8378368 active 1702
> 311 check 6807528 / 8378368 active 4152
> 312 check 7331184 / 8378368 clean 3021
> stop check
> 313 check 7331184 / 8378368 write-pending 3905
> 314 check 7331184 / 8378368 write-pending 3905
> 315 check 7331184 / 8378368 write-pending 3905
> Terminated
>
> If I add
>
> pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n", batch_size, released, mddev->sb_flags);
>
> in raid5d after the call to handle_active_stripes and enable that debug location after the deadlock has occurred, I get:
>
> [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
> [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
> [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
> [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>
> If I add
>
> pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n", __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>
> at the head of md_check_recovery, I get:
>
> [ 789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
> [ 789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
> [ 789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
> [ 789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
> [ 789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>
> More debug lines in md_check_recovery confirm the control flow (the `if (mddev_trylock(mddev))` block is not taken).
>
> What approach would you suggest to fix this?

I naively tried the following patch, and it seems to fix the problem: the test procedure didn't trigger the deadlock in 10 hours. The idea is to set MD_ALLOW_SB_UPDATE around the shutdown of the sync thread, so that md_check_recovery() can write the superblock (as described above) even while action_store() holds the reconfig_mutex.
D.
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 2d21c298ffa7..f40429843906 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
mddev_lock(mddev) == 0) {
+ set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
flush_workqueue(md_misc_wq);
if (mddev->sync_thread) {
set_bit(MD_RECOVERY_INTR, &mddev->recovery);
md_reap_sync_thread(mddev);
}
+ clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
mddev_unlock(mddev);
}
} else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
--
2.30.0