Message-ID: <db4f5f1b-5eba-2cdb-fad0-7aa725cea508@huaweicloud.com>
Date: Fri, 15 Mar 2024 09:17:56 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Dan Moulding <dan@...m.net>, yukuai1@...weicloud.com
Cc: gregkh@...uxfoundation.org, junxiao.bi@...cle.com,
 linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
 regressions@...ts.linux.dev, song@...nel.org, stable@...r.kernel.org,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system;
 successfully bisected

Hi,

On 2024/03/15 0:12, Dan Moulding wrote:
>> How about the following patch?
>>
>> Thanks,
>> Kuai
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 3ad5f3c7f91e..0b2e6060f2c9 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread)
>>
>>           md_check_recovery(mddev);
>>
>> -       blk_start_plug(&plug);
>>           handled = 0;
>>           spin_lock_irq(&conf->device_lock);
>>           while (1) {
>> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread)
>>                   int batch_size, released;
>>                   unsigned int offset;
>>
>> +               /*
>> +                * md_check_recovery() can't clear sb_flags, usually because of
>> +                * 'reconfig_mutex' can't be grabbed, wait for mddev_unlock() to
>> +                * wake up raid5d().
>> +                */
>> +               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
>> +                       goto skip;
>> +
>>                   released = release_stripe_list(conf, conf->temp_inactive_list);
>>                   if (released)
>>                           clear_bit(R5_DID_ALLOC, &conf->cache_state);
>> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread)
>>                           spin_lock_irq(&conf->device_lock);
>>                   }
>>           }
>> +skip:
>>           pr_debug("%d stripes handled\n", handled);
>> -
>>           spin_unlock_irq(&conf->device_lock);
>>           if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) &&
>>               mutex_trylock(&conf->cache_size_mutex)) {
>> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread)
>>                   mutex_unlock(&conf->cache_size_mutex);
>>           }
>>
>> +       blk_start_plug(&plug);
>>           flush_deferred_bios(conf);
>>
>>           r5l_flush_stripe_to_raid(conf->log);
> 
> I can confirm that this patch also works. I'm unable to reproduce the
> hang after applying this instead of the first patch provided by
> Junxiao. So it looks like both ways are successful in avoiding the hang.
> 

Thanks a lot for the testing! Can you also give the following patch a try?
It drops the change to blk_plug: Dan and Song are worried about
performance degradation, so we need to verify the performance before
considering that patch.

Anyway, I think the following patch can fix this problem as well.

Thanks,
Kuai

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 3ad5f3c7f91e..ae8665be9940 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6728,6 +6728,9 @@ static void raid5d(struct md_thread *thread)
                 int batch_size, released;
                 unsigned int offset;

+               if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+                       goto skip;
+
                 released = release_stripe_list(conf, conf->temp_inactive_list);
                 if (released)
                         clear_bit(R5_DID_ALLOC, &conf->cache_state);
@@ -6766,6 +6769,7 @@ static void raid5d(struct md_thread *thread)
                         spin_lock_irq(&conf->device_lock);
                 }
         }
+skip:
         pr_debug("%d stripes handled\n", handled);

         spin_unlock_irq(&conf->device_lock);
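
For readers who want to see the control flow in isolation, here is a rough
user-space sketch of what the patch above does to the stripe-handling loop.
It is only a model, not kernel code: `struct model_conf`,
`model_raid5d_pass()`, and the `sb_change_pending` field are hypothetical
names standing in for `raid5d()` and the `MD_SB_CHANGE_PENDING` bit in
`mddev->sb_flags`, and locking, plugging, and the real stripe machinery are
left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the state raid5d() consults; not a kernel type. */
struct model_conf {
    bool sb_change_pending; /* models MD_SB_CHANGE_PENDING in mddev->sb_flags */
    int queued_stripes;     /* stripes waiting to be handled */
};

/*
 * One pass of the modelled daemon loop. Returns the number of stripes
 * handled in this pass.
 */
static int model_raid5d_pass(struct model_conf *conf, int batch_size)
{
    int handled = 0;

    assert(conf->queued_stripes >= 0 && batch_size > 0);

    while (1) {
        int batch;

        /*
         * Mirrors the check added by the patch: while a superblock
         * update is still pending, leave the stripes alone and bail
         * out of the loop instead of spinning on them.
         */
        if (conf->sb_change_pending)
            goto skip;

        batch = conf->queued_stripes < batch_size ?
                conf->queued_stripes : batch_size;
        if (batch == 0)
            break;

        conf->queued_stripes -= batch;
        handled += batch;
    }
skip:
    printf("%d stripes handled\n", handled);
    return handled;
}
```

In this model a pass with the flag set handles nothing and leaves the queue
untouched; a later pass with the flag cleared drains it. In the real driver
the later pass is expected to come from mddev_unlock() waking up raid5d(),
as the comment in the earlier version of the patch describes.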


> -- Dan
> .
>