Date:   Tue, 26 Jan 2021 01:44:07 +0100
From:   Guoqing Jiang <guoqing.jiang@...ud.ionos.com>
To:     Donald Buczek <buczek@...gen.mpg.de>, Song Liu <song@...nel.org>,
        linux-raid@...r.kernel.org,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        it+raid@...gen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

Hi Donald,

On 1/25/21 22:32, Donald Buczek wrote:
> 
> 
> On 25.01.21 09:54, Donald Buczek wrote:
>> Dear Guoqing,
>>
>> a colleague of mine was able to reproduce the issue inside a VM and
>> found a procedure that drives the VM into the issue within minutes
>> (rather than unreliably after hours, as on the physical system
>> before). This, of course, helped to pinpoint the problem.
>>
>> My current theory of what is happening is:
>>
>> - MD_SB_CHANGE_CLEAN and MD_SB_CHANGE_PENDING are set by
>> md_write_start() when file-system I/O wants to do a write and the
>> array transitions from "clean" to "active" (see the code excerpts
>> after this list).
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8308)
>>
>> - Before raid5d gets to write the superblock (it's busy processing
>> active stripes because of the sync activity), userspace wants to
>> pause the check by `echo idle > /sys/block/mdX/md/sync_action`.
>>
>> - action_store() takes the reconfig_mutex before trying to stop the
>> sync thread.
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L4689)
>> The dump of struct mddev from the email of 1/19/21 confirms that
>> reconfig_mutex is non-zero.
>>
>> - raid5d is running in its main loop.
>> raid5d()->handle_active_stripes() returns a positive batch size
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L6329),
>> although raid5d()->handle_active_stripes()->handle_stripe() doesn't
>> process any stripe because of MD_SB_CHANGE_PENDING
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/raid5.c#L4729).
>> This is why raid5d is busy-looping.
>>
>> - raid5d()->md_check_recovery() is called by the raid5d main loop. One
>> of its duties is to write the superblock if a change is pending.
>> However, to do so, it needs either MD_ALLOW_SB_UPDATE to be set or it
>> must be able to take the reconfig_mutex
>> (https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L8967,
>> https://elixir.bootlin.com/linux/v5.4.57/source/drivers/md/md.c#L9006).
>> Neither is true, so the superblock is not written and
>> MD_SB_CHANGE_PENDING is not cleared.
>>
>> - (as discussed previously) the sync thread is waiting for the number 
>> of active stripes to go down and doesn't terminate. The userspace 
>> thread is waiting for the sync thread to terminate.
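>>
>> To make this concrete, here are the relevant code paths, abbreviated
>> and paraphrased from the v5.4 sources linked above (not verbatim):
>>
>> ```
>> /* md.c, md_write_start(): the first write to a clean array flags a
>>  * superblock update; the write is held back until it is recorded. */
>> if (mddev->in_sync) {
>>         mddev->in_sync = 0;
>>         set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
>>         set_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags);
>>         md_wakeup_thread(mddev->thread);
>> }
>>
>> /* raid5.c, handle_stripe(): while a superblock write is pending,
>>  * bail out, but keep STRIPE_HANDLE set, so the stripe stays queued
>>  * and is picked up (and counted) again on the next pass. */
>> if (s.handle_bad_blocks ||
>>     test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>>         set_bit(STRIPE_HANDLE, &sh->state);
>>         goto finish;
>> }
>>
>> /* raid5.c, raid5d() main loop: batch_size counts stripes taken off
>>  * the list, not stripes that made progress, so it stays positive
>>  * and the loop never breaks. */
>> batch_size = handle_active_stripes(conf, ANY_GROUP, NULL,
>>                                    conf->temp_inactive_list);
>> if (!batch_size && !released)
>>         break;
>>
>> /* md.c, md_check_recovery(): the pending superblock is written only
>>  * if MD_ALLOW_SB_UPDATE is set or reconfig_mutex can be taken; here
>>  * action_store() holds the mutex, so neither path is taken. */
>> if (test_bit(MD_ALLOW_SB_UPDATE, &mddev->flags) && mddev->sb_flags)
>>         md_update_sb(mddev, 0);
>> ...
>> if (mddev_trylock(mddev)) {
>>         /* superblock update happens in here */
>>         ...
>> }
>> ```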
>>
>> Does this make sense?

Yes, exactly! That was my thought too; the scenario is twisted.

So the resync thread is blocked because there are too many active
stripes; raid5d is in a busy loop since SB_CHANGE_PENDING is set, which
means the number of active stripes can't be decreased; and the `echo
idle` command can't make progress because the resync thread is blocked,
while the command still holds the reconfig_mutex, which keeps raid5d in
its busy loop and unable to clear the SB_CHANGE_PENDING flag.

And raid5 could suffer from the same issue, I think.
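
To spell the cycle out (just restating the above):

    action_store("idle")   holds reconfig_mutex and waits for the
                           sync thread to exit
    sync thread            waits for active_stripes to drop; needs
                           handle_stripe() to complete stripes
    handle_stripe()        bails out while MD_SB_CHANGE_PENDING is
                           set; needs md_check_recovery() to write
                           the superblock
    md_check_recovery()    needs reconfig_mutex (or
                           MD_ALLOW_SB_UPDATE); the mutex is held
                           by action_store("idle")

so raid5d spins and nobody makes progress.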

>>
>> Just for reference, I add the procedure which triggers the issue on 
>> the vm (with /dev/md3 mounted on /mnt/raid_md3) and some debug output:
>>
>> ```
>> #! /bin/bash
>>
>> (
>>          while true; do
>>                  echo "start check"
>>                  echo check > /sys/block/md3/md/sync_action
>>                  sleep 10
>>                  echo "stop check"
>>                  echo idle > /sys/block/md3/md/sync_action
>>                  sleep 2
>>          done
>> ) &
>>
>> (
>>          while true; do
>>                  dd bs=1k count=$((5*1024*1024)) if=/dev/zero of=/mnt/raid_md3/bigfile status=none
>>                  sync /mnt/raid_md3/bigfile
>>                  rm /mnt/raid_md3/bigfile
>>                  sleep .1
>>          done
>> ) &
>>
>> start="$(date +%s)"
>> cd /sys/block/md3/md
>> wp_count=0
>> while true; do
>>          array_state=$(cat array_state)
>>          if [ "$array_state" = write-pending ]; then
>>                  wp_count=$(($wp_count+1))
>>          else
>>                  wp_count=0
>>          fi
>>          echo $(($(date +%s)-$start)) $(cat sync_action) $(cat sync_completed) $array_state $(cat stripe_cache_active)
>>          if [ $wp_count -ge 3 ]; then
>>                  kill -- -$$
>>                  exit
>>          fi
>>          sleep 1
>> done
>> ```
>>
>> The time this needs to trigger the issue varies from under a minute
>> to one hour, with 5 minutes being typical. The output ends like this:
>>
>>      309 check 6283872 / 8378368 active-idle 4144
>>      310 check 6283872 / 8378368 active 1702
>>      311 check 6807528 / 8378368 active 4152
>>      312 check 7331184 / 8378368 clean 3021
>>      stop check
>>      313 check 7331184 / 8378368 write-pending 3905
>>      314 check 7331184 / 8378368 write-pending 3905
>>      315 check 7331184 / 8378368 write-pending 3905
>>      Terminated
>>
>> If I add
>>
>>      pr_debug("XXX batch_size %d release %d mdddev->sb_flags %lx\n",
>>               batch_size, released, mddev->sb_flags);
>>
>> in raid5d after the call to handle_active_stripes() and enable the
>> debug location after the deadlock has occurred, I get
>>
>>      [ 3123.939143] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>      [ 3123.939156] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>      [ 3123.939170] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>      [ 3123.939184] [1223] raid5d:6332: XXX batch_size 8 release 0 mdddev->sb_flags 6
>>
>> If I add
>>
>>      pr_debug("XXX 1 %s:%d mddev->flags %08lx mddev->sb_flags %08lx\n",
>>               __FILE__, __LINE__, mddev->flags, mddev->sb_flags);
>>
>> at the head of md_check_recovery, I get:
>>
>>      [  789.555462] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555477] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555491] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555505] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>      [  789.555520] [1191] md_check_recovery:8970: XXX 1 drivers/md/md.c:8970 mddev->flags 00000000 mddev->sb_flags 00000006
>>
>> More debug lines in md_check_recovery confirm the control flow (the
>> `if (mddev_trylock(mddev))` block is not taken).
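>>
>> For completeness, mddev_trylock() (defined in drivers/md/md.h) is
>> just a mutex_trylock() on the reconfig_mutex:
>>
>> ```
>> static inline int mddev_trylock(struct mddev *mddev)
>> {
>>         return mutex_trylock(&mddev->reconfig_mutex);
>> }
>> ```
>>
>> so it has to fail for as long as action_store() holds the mutex.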
>>

It is great that you have a reproducer now!

>> What approach would you suggest to fix this?
> 
> I naively tried the following patch and it seems to fix the problem. The 
> test procedure didn't trigger the deadlock in 10 hours.
> 
> D.
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 2d21c298ffa7..f40429843906 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -4687,11 +4687,13 @@ action_store(struct mddev *mddev, const char *page, size_t len)
>               clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
>           if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) &&
>               mddev_lock(mddev) == 0) {
> +            set_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>               flush_workqueue(md_misc_wq);
>               if (mddev->sync_thread) {
>                   set_bit(MD_RECOVERY_INTR, &mddev->recovery);
>                   md_reap_sync_thread(mddev);
>               }
> +            clear_bit(MD_ALLOW_SB_UPDATE, &mddev->flags);
>               mddev_unlock(mddev);
>           }
>       } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))

Yes, it could break the deadlock, but I am not sure it is the right
way. We only set ALLOW_SB_UPDATE in suspend, which makes sense since
the I/O is quiesced there, but writing "idle" to sync_action can't
guarantee a similar thing. I would prefer to make the resync thread
not wait forever here:

wait_event_lock_irq(conf->wait_for_stripe,
		    !list_empty(conf->inactive_list + hash) &&
		    (atomic_read(&conf->active_stripes)
		     < (conf->max_nr_stripes * 3 / 4)
		     || !test_bit(R5_INACTIVE_BLOCKED,
				  &conf->cache_state)),
		    *(conf->hash_locks + hash));
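
Purely to illustrate that direction (a hypothetical sketch, not the
attached proposal): one could, for example, let the waiter also wake
up once the sync thread has been interrupted, e.g.

wait_event_lock_irq(conf->wait_for_stripe,
		    /* hypothetical extra wake-up condition */
		    test_bit(MD_RECOVERY_INTR, &conf->mddev->recovery) ||
		    (!list_empty(conf->inactive_list + hash) &&
		     (atomic_read(&conf->active_stripes)
		      < (conf->max_nr_stripes * 3 / 4)
		      || !test_bit(R5_INACTIVE_BLOCKED,
				   &conf->cache_state))),
		    *(conf->hash_locks + hash));

with the caller then having to cope with not getting a stripe in that
case.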

So, could you please try the attached?

Thanks,
Guoqing

[Attachment: "raid5-proposal" (text/plain, 5862 bytes)]
