linux-kernel - Re: md_raid: mdX_raid6 looping after sync

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2061b123-6332-1456-e7c3-b713752527fb@stancevic.com>
Date:   Wed, 23 Aug 2023 10:33:34 -0500
From:   Dragan Stancevic <dragan@...ncevic.com>
To:     Yu Kuai <yukuai1@...weicloud.com>, song@...nel.org
Cc:     buczek@...gen.mpg.de, guoqing.jiang@...ux.dev,
        it+raid@...gen.mpg.de, linux-kernel@...r.kernel.org,
        linux-raid@...r.kernel.org, msmith626@...il.com,
        "yukuai (C)" <yukuai3@...wei.com>,
        "yangerkun@...wei.com" <yangerkun@...wei.com>
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
 transition

Hi Kuai-

On 8/22/23 20:22, Yu Kuai wrote:
> Hi,
> 
> 在 2023/08/23 5:16, Dragan Stancevic 写道:
>> On Tue, 3/28/23 17:01 Song Liu wrote:
>>> On Thu, Mar 16, 2023 at 8:25=E2=80=AFAM Marc Smith <msmith626@...il.com>
>>> wr=
>>> ote:
>>>   >
>>>   > On Tue, Mar 14, 2023 at 10:45=E2=80=AFAM Marc Smith
>>> <msmith626@...il.com>=
>>>    wrote:
>>>   > >
>>>   > > On Tue, Mar 14, 2023 at 9:55=E2=80=AFAM Guoqing Jiang
>>> <guoqing.jiang@li=
>>> nux.dev> wrote:
>>>   > > >
>>>   > > >
>>>   > > >
>>>   > > > On 3/14/23 21:25, Marc Smith wrote:
>>>   > > > > On Mon, Feb 8, 2021 at 7:49=E2=80=AFPM Guoqing Jiang
>>>   > > > > <guoqing.jiang@...ud.ionos.com> wrote:
>>>   > > > >> Hi Donald,
>>>   > > > >>
>>>   > > > >> On 2/8/21 19:41, Donald Buczek wrote:
>>>   > > > >>> Dear Guoqing,
>>>   > > > >>>
>>>   > > > >>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>   > > > >>>>
>>>   > > > >>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>   > > > >>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>   > > > >>>>>> sync_thread, like this.
>>>   > > > >>>>>>
>>>   > > > >>>>>>           /* resync has finished, collect result */
>>>   > > > >>>>>>           mddev_unlock(mddev);
>>>   > > > >>>>>>           md_unregister_thread(&mddev->sync_thread);
>>>   > > > >>>>>>           mddev_lock(mddev);
>>>   > > > >>>>> As above: While we wait for the sync thread to terminate,
>>> would=
>>> n't it
>>>   > > > >>>>> be a problem, if another user space operation takes the 
>>> mutex?
>>>   > > > >>>> I don't think other places can be blocked while hold mutex,
>>> othe=
>>> rwise
>>>   > > > >>>> these places can cause potential deadlock. Please try above
>>> two =
>>> lines
>>>   > > > >>>> change. And perhaps others have better idea.
>>>   > > > >>> Yes, this works. No deadlock after >11000 seconds,
>>>   > > > >>>
>>>   > > > >>> (Time till deadlock from previous runs/seconds: 1723, 37,
>>> 434, 12=
>>> 65,
>>>   > > > >>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>>   > > > >> Great. I will send a formal patch with your reported-by and
>>> tested=
>>> -by.
>>>   > > > >>
>>>   > > > >> Thanks,
>>>   > > > >> Guoqing
>>>   > > > > I'm still hitting this issue with Linux 5.4.229 -- it looks
>>> like 1/=
>>> 2
>>>   > > > > of the patches that supposedly resolve this were applied to 
>>> the
>>> sta=
>>> ble
>>>   > > > > kernels, however, one was omitted due to a regression:
>>>   > > > > md: don't unregister sync_thread with reconfig_mutex held
>>> (upstream
>>>   > > > > commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
>>>   > > > >
>>>   > > > > I don't see any follow-up on the thread from June 8th 2022
>>> asking f=
>>> or
>>>   > > > > this patch to be dropped from all stable kernels since it 
>>> caused a
>>>   > > > > regression.
>>>   > > > >
>>>   > > > > The patch doesn't appear to be present in the current mainline
>>> kern=
>>> el
>>>   > > > > (6.3-rc2) either. So I assume this issue is still present
>>> there, or=
>>>    it
>>>   > > > > was resolved differently and I just can't find the 
>>> commit/patch.
>>>   > > >
>>>   > > > It should be fixed by commit 9dfbdafda3b3"md: unlock mddev 
>>> before
>>> rea=
>>> p
>>>   > > > sync_thread in action_store".
>>>   > >
>>>   > > Okay, let me try applying that patch... it does not appear to be
>>>   > > present in my 5.4.229 kernel source. Thanks.
>>>   >
>>>   > Yes, applying this '9dfbdafda3b3 "md: unlock mddev before reap
>>>   > sync_thread in action_store"' patch on top of vanilla 5.4.229 source
>>>   > appears to fix the problem for me -- I can't reproduce the issue 
>>> with
>>>   > the script, and it's been running for >24 hours now. (Previously 
>>> I was
>>>   > able to induce the issue within a matter of minutes.)
>>>
>>> Hi Marc,
>>>
>>> Could you please run your reproducer on the md-tmp branch?
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=3Dmd-tmp
>>>
>>> This contains a different version of the fix by Yu Kuai.
>>>
>>> Thanks,
>>> Song
>>>
>>
>> Hi Song, I can easily reproduce this issue on 5.10.133 and 5.10.53. 
>> The change
>> "9dfbdafda3b3 "md: unlock mddev before reap" does not fix the issue 
>> for me.
>>
>> But I did pull the changes from the md-tmp branch you are refering:
>> https://git.kernel.org/pub/scm/linux/kernel/git/song/md.git/log/?h=3Dmd-tmp
>>
>> I was not totally clear on which change exactly to pull, but I pulled
>> the following changes:
>> 2023-03-28 md: enhance checking in md_check_recovery()md-tmp    Yu 
>> Kuai    1 -7/+15
>> 2023-03-28 md: wake up 'resync_wait' at last in 
>> md_reap_sync_thread()    Yu Kuai    1 -1/+1
>> 2023-03-28 md: refactor idle/frozen_sync_thread()    Yu Kuai    2 -4/+22
>> 2023-03-28 md: add a mutex to synchronize idle and frozen in 
>> action_store()    Yu Kuai    2 -0/+8
>> 2023-03-28 md: refactor action_store() for 'idle' and 'frozen'    Yu 
>> Kuai    1 -16/+45
>>
>> I used to be able to reproduce the lockup within minutes, but with those
>> changes the test system has been running for more than 120 hours.
>>
>> When you said a "different fix", can you confirm that I grabbed the right
>> changes and that I need all 5 of them.
> 
> Yes, you grabbed the right changes, and these patches is merged to
> linux-next as well.
>>
>> And second question was, has this fix been submitted upstream yet?
>> If so which kernel version?
> 
> This fix is currently in linux-next, and will be applied to v6.6-rc1
> soon.

Thank you, that is great news. I'd like to see this change backported to 
5.10 and 6.1, do you have any plans of backporting to any of the 
previous kernels?

If not, I would like to try to get your changes into 5.10 and 6.1 if 
Greg will accept them.


Four out of five of your changes were a straight cherry-pick into 5.10, 
one needed a minor conflict resolution. But I can definitely confirm 
that your changes fix the lockup issue on 5.10

I am now switching to 6.1 and will test the changes there too.


Thanks


--
Peace can only come as a natural consequence
of universal enlightenment -Dr. Nikola Tesla