linux-kernel - Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected


Message-ID: <20240314161211.14002-1-dan@danm.net>
Date: Thu, 14 Mar 2024 10:12:11 -0600
From: Dan Moulding <dan@...m.net>
To: yukuai1@...weicloud.com
Cc: dan@...m.net,
	gregkh@...uxfoundation.org,
	junxiao.bi@...cle.com,
	linux-kernel@...r.kernel.org,
	linux-raid@...r.kernel.org,
	regressions@...ts.linux.dev,
	song@...nel.org,
	stable@...r.kernel.org,
	yukuai3@...wei.com
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system; successfully bisected

> How about the following patch?
> 
> Thanks,
> Kuai
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 3ad5f3c7f91e..0b2e6060f2c9 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
> 
>          md_check_recovery(mddev);
> 
> -       blk_start_plug(&plug);
>          handled = 0;
>          spin_lock_irq(&conf->device_lock);
>          while (1) {
> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>                  int batch_size, released;
>                  unsigned int offset;
> 
> +               /*
> +                * md_check_recovery() can't clear sb_flags, usually because of
> +                * 'reconfig_mutex' can't be grabbed, wait for mddev_unlock() to
> +                * wake up raid5d().
> +                */
> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> +                       goto skip;
> +
>                  released = release_stripe_list(conf, conf->temp_inactive_list);
>                  if (released)
>                          clear_bit(R5_DID_ALLOC, &conf->cache_state);
> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>                          spin_lock_irq(&conf->device_lock);
>                  }
>          }
> +skip:
>          pr_debug("%d stripes handled\n", handled);
> -
>          spin_unlock_irq(&conf->device_lock);
>          if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>              mutex_trylock(&conf->cache_size_mutex)) {
> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>                  mutex_unlock(&conf->cache_size_mutex);
>          }
> 
> +       blk_start_plug(&plug);
>          flush_deferred_bios(conf);
> 
>          r5l_flush_stripe_to_raid(conf->log);

I can confirm that this patch also works. I'm unable to reproduce the
hang after applying this instead of the first patch provided by
Junxiao. So it looks like both ways are successful in avoiding the hang.

-- Dan