linux-kernel - Re: md_raid: mdX_raid6 looping after sync

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ab757e2b-3ff0-33d9-d30c-61669b738664@huaweicloud.com>
Date:   Wed, 23 Aug 2023 09:22:05 +0800
From:   Yu Kuai <yukuai1@...weicloud.com>
To:     Dragan Stancevic <dragan@...ncevic.com>, song@...nel.org
Cc:     buczek@...gen.mpg.de, guoqing.jiang@...ux.dev,
        it+raid@...gen.mpg.de, linux-kernel@...r.kernel.org,
        linux-raid@...r.kernel.org, msmith626@...il.com,
        yukuai1@...weicloud.com, "yukuai (C)" <yukuai3@...wei.com>,
        "yangerkun@...wei.com" <yangerkun@...wei.com>
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
 transition

Hi,

在 2023/08/23 5:16, Dragan Stancevic 写道:
> On Tue, 3/28/23 17:01 Song Liu wrote:
>> On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith <msmith626@...il.com>
>> wr=
>> ote:
>>   >
>>   > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith
>> <msmith626@...il.com>=
>>    wrote:
>>   > >
>>   > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang
>> <guoqing.jiang@li=
>> nux.dev> wrote:
>>   > > >
>>   > > >
>>   > > >
>>   > > > On 3/14/23 21:25, Marc Smith wrote:
>>   > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang
>>   > > > > <guoqing.jiang@...ud.ionos.com> wrote:
>>   > > > >> Hi Donald,
>>   > > > >>
>>   > > > >> On 2/8/21 19:41, Donald Buczek wrote:
>>   > > > >>> Dear Guoqing,
>>   > > > >>>
>>   > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>   > > > >>>>
>>   > > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
>>   > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>   > > > >>>>>> sync_thread, like this.
>>   > > > >>>>>>
>>   > > > >>>>>>           /* resync has finished, collect result */
>>   > > > >>>>>>           mddev_unlock(mddev);
>>   > > > >>>>>>           md_unregister_thread(&mddev->sync_thread);
>>   > > > >>>>>>           mddev_lock(mddev);
>>   > > > >>>>> As above: While we wait for the sync thread to terminate,
>> would=
>> n't it
>>   > > > >>>>> be a problem, if another user space operation takes the mutex?
>>   > > > >>>> I don't think other places can be blocked while hold mutex,
>> othe=
>> rwise
>>   > > > >>>> these places can cause potential deadlock. Please try above
>> two =
>> lines
>>   > > > >>>> change. And perhaps others have better idea.
>>   > > > >>> Yes, this works. No deadlock after >11000 seconds,
>>   > > > >>>
>>   > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37,
>> 434, 12=
>> 65,
>>   > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>   > > > >> Great. I will send a formal patch with your reported-by and
>> tested=
>> -by.
>>   > > > >>
>>   > > > >> Thanks,
>>   > > > >> Guoqing
>>   > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks
>> like 1/=
>> 2
>>   > > > > of the patches that supposedly resolve this were applied to the
>> sta=
>> ble
>>   > > > > kernels, however, one was omitted due to a regression:
>>   > > > > md: don't unregister sync_thread with reconfig_mutex held
>> (upstream
>>   > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>>   > > > >
>>   > > > > I don't see any follow-up on the thread from June 8th 2022
>> asking f=
>> or
>>   > > > > this patch to be dropped from all stable kernels since it caused a
>>   > > > > regression.
>>   > > > >
>>   > > > > The patch doesn't appear to be present in the current mainline
>> kern=
>> el
>>   > > > > (6.3-rc2) either. So I assume this issue is still present
>> there, or=
>>    it
>>   > > > > was resolved differently and I just can't find the commit/patch.
>>   > > >
>>   > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before
>> rea=
>> p
>>   > > > sync_thread in action_store".
>>   > >
>>   > > Okay, let me try applying that patch... it does not appear to be
>>   > > present in my 5.4.229 kernel source. Thanks.
>>   >
>>   > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
>>   > sync_thread in action_store"' patch on top of vanilla 5.4.229 source
>>   > appears to fix the problem for me -- I can't reproduce the issue with
>>   > the script, and it's been running for >24 hours now. (Previously I was
>>   > able to induce the issue within a matter of minutes.)
>>
>> Hi Marc,
>>
>> Could you please run your reproducer on the md-tmp branch?
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=3Dmd-tmp
>>
>> This contains a different version of the fix by Yu Kuai.
>>
>> Thanks,
>> Song
>>
> 
> Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. The change
> "9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue for me.
> 
> But I did pull the changes from the md-tmp branch you are refering:
> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=3Dmd-tmp
> 
> I was not totally clear on which change exactly to pull, but I pulled
> the following changes:
> 2023-03-28 md: enhance checking in md_check_recovery()md-tmp	Yu Kuai	1 -7/+15
> 2023-03-28 md: wake up 'resync_wait' at last in md_reap_sync_thread()	Yu Kuai	1 -1/+1
> 2023-03-28 md: refactor idle/frozen_sync_thread()	Yu Kuai	2 -4/+22
> 2023-03-28 md: add a mutex to synchronize idle and frozen in action_store()	Yu Kuai	2 -0/+8
> 2023-03-28 md: refactor action_store() for 'idle' and 'frozen'	Yu Kuai	1 -16/+45
> 
> I used to be able to reproduce the lockup within minutes, but with those
> changes the test system has been running for more than 120 hours.
> 
> When you said a "different fix", can you confirm that I grabbed the right
> changes and that I need all 5 of them.

Yes, you grabbed the right changes, and these patches is merged to
linux-next as well.
> 
> And second question was, has this fix been submitted upstream yet?
> If so which kernel version?

This fix is currently in linux-next, and will be applied to v6.6-rc1
soon.

Thanks,
Kuai

> 
> Thank you
> 
> 
> .
>