Message-ID: <60829bc7-2eb9-f4ca-1a36-d2dbda5b0f3e@huawei.com>
Date:   Wed, 15 Mar 2023 11:02:37 +0800
From:   Yu Kuai <yukuai3@...wei.com>
To:     Guoqing Jiang <guoqing.jiang@...ux.dev>,
        Marc Smith <msmith626@...il.com>
CC:     Donald Buczek <buczek@...gen.mpg.de>, Song Liu <song@...nel.org>,
        <linux-raid@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        <it+raid@...gen.mpg.de>, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
 transition



On 2023/03/14 21:55, Guoqing Jiang wrote:
> 
> 
> On 3/14/23 21:25, Marc Smith wrote:
>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>> <guoqing.jiang@...ud.ionos.com> wrote:
>>> Hi Donald,
>>>
>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>> Dear Guoqing,
>>>>
>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>
>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>> sync_thread, like this.
>>>>>>>
>>>>>>>           /* resync has finished, collect result */
>>>>>>>           mddev_unlock(mddev);
>>>>>>>           md_unregister_thread(&mddev->sync_thread);
>>>>>>>           mddev_lock(mddev);
>>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>>> be a problem, if another user space operation takes the mutex?
>>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>>> these places can cause potential deadlock. Please try above two lines
>>>>> change. And perhaps others have better idea.
>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>
>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>> Great. I will send a formal patch with your reported-by and tested-by.
>>>
>>> Thanks,
>>> Guoqing
>> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
>> of the patches that supposedly resolve this were applied to the stable
>> kernels, however, one was omitted due to a regression:
>> md: don't unregister sync_thread with reconfig_mutex held (upstream
>> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
Hi, Guoqing,

I'm borrowing this thread for discussion: I think this commit might
have a problem in some corner cases:

t1:				t2:
action_store
  mddev_lock
   if (mddev->sync_thread)
    mddev_unlock
    md_unregister_thread
				md_check_recovery
				 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
				 queue_work(md_misc_wq, &mddev->del_work)
    mddev_lock_nointr
    md_reap_sync_thread
    // clear running
  mddev_unlock

t3:
md_start_sync
// running is not set

Our tests reported a problem that could, in theory, be caused by this,
but we can't be sure for now...
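
The interleaving in the diagram can be replayed deterministically in
user space. What follows is only a toy model, not kernel code: struct
toy_mddev, its boolean fields and every function name are hypothetical
stand-ins for reconfig_mutex, MD_RECOVERY_RUNNING, the registered
sync_thread and the queued del_work, and the real functions do much more
than these stubs.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy user-space model of the fields involved; this is not kernel code. */
struct toy_mddev {
    bool locked;          /* stands in for reconfig_mutex */
    bool running;         /* stands in for MD_RECOVERY_RUNNING */
    bool sync_thread;     /* true while the old sync_thread is registered */
    bool del_work_queued; /* stands in for the queued del_work */
};

static void toy_lock(struct toy_mddev *m)
{
    assert(!m->locked); /* single-threaded replay: the lock must be free */
    m->locked = true;
}

static void toy_unlock(struct toy_mddev *m)
{
    assert(m->locked);
    m->locked = false;
}

/* t1, first half: action_store drops the lock, then unregisters the thread. */
static void t1_unlock_and_unregister(struct toy_mddev *m)
{
    toy_lock(m);
    assert(m->sync_thread);  /* the replay starts with a sync in progress */
    toy_unlock(m);
    m->sync_thread = false;  /* models md_unregister_thread() */
}

/* t2: md_check_recovery runs in the window where t1 does not hold the lock. */
static void t2_md_check_recovery(struct toy_mddev *m)
{
    toy_lock(m);
    if (!m->sync_thread) {
        m->running = true;          /* models set_bit(MD_RECOVERY_RUNNING) */
        m->del_work_queued = true;  /* models queue_work(..., &del_work) */
    }
    toy_unlock(m);
}

/* t1, second half: retake the lock and reap, which clears "running". */
static void t1_relock_and_reap(struct toy_mddev *m)
{
    toy_lock(m);
    m->running = false;  /* models md_reap_sync_thread() clearing RUNNING */
    toy_unlock(m);
}

/* t3: the queued work runs; report whether it still sees "running" set. */
static bool t3_md_start_sync_sees_running(struct toy_mddev *m)
{
    assert(m->del_work_queued);
    m->del_work_queued = false;
    return m->running;
}

/* Replay t1 -> t2 -> t1 -> t3 in the order shown in the diagram. */
bool replay_race(void)
{
    struct toy_mddev m = { .locked = false, .running = true,
                           .sync_thread = true, .del_work_queued = false };

    t1_unlock_and_unregister(&m);
    t2_md_check_recovery(&m);
    t1_relock_and_reap(&m);
    return t3_md_start_sync_sees_running(&m);
}
```

In this model replay_race() returns false: the work queued by t2 runs
after t1 has already cleared the flag that t2 set.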

We thought about how to fix this: instead of calling
md_unregister_thread() here to wait synchronously for sync_thread to
finish, we could do this asynchronously, as md_set_readonly() and
do_md_stop() do.
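
As a rough illustration of that idea, here is a toy user-space model,
assuming "asynchronously" means that the requester only flags the
interruption and waits for the running flag to clear, while reaping the
old sync and starting a new one both stay in one place. struct toy_md
and all function names are hypothetical, and the real
md_check_recovery() checks more conditions than this stub.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy user-space model; field and function names are illustrative only. */
struct toy_md {
    bool running;      /* stands in for MD_RECOVERY_RUNNING */
    bool intr;         /* stands in for MD_RECOVERY_INTR */
    bool done;         /* stands in for MD_RECOVERY_DONE */
    bool sync_thread;  /* true while a sync_thread is registered */
    bool work_queued;  /* stands in for the queued del_work */
};

/* Requester: only flag the interruption; no unregister or reap here. */
static void request_stop(struct toy_md *m)
{
    if (m->running)
        m->intr = true;
}

/* Requester's wait condition, in the spirit of waiting for !RUNNING. */
static bool stop_complete(const struct toy_md *m)
{
    return !m->running;
}

/* Sync thread: notices the interruption and marks itself done. */
static void sync_thread_step(struct toy_md *m)
{
    if (m->sync_thread && m->intr)
        m->done = true;
}

/* Daemon: the single place that reaps an old sync or starts a new one. */
static void daemon_step(struct toy_md *m)
{
    if (m->sync_thread) {
        if (m->done) {
            m->sync_thread = false;
            m->running = m->intr = m->done = false;
        }
        return; /* never start a new sync in the same pass */
    }
    m->running = true;
    m->work_queued = true;
}

/* Queued work: report whether "running" is still set when it executes. */
static bool start_sync_sees_running(struct toy_md *m)
{
    assert(m->work_queued);
    m->work_queued = false;
    return m->running;
}

/* Replay a stop request followed by the daemon starting a new sync. */
bool replay_async_stop(void)
{
    struct toy_md m = { .running = true, .sync_thread = true };

    request_stop(&m);
    daemon_step(&m);           /* old sync not done yet: nothing happens */
    assert(!stop_complete(&m));
    sync_thread_step(&m);
    daemon_step(&m);           /* reaps the old sync, clears "running" */
    assert(stop_complete(&m));
    daemon_step(&m);           /* starts a new sync and queues the work */
    return start_sync_sees_running(&m);
}
```

Here replay_async_stop() returns true, because the flag is set and
cleared by the same actor and so cannot be cleared underneath the newly
queued work. Whether the real code paths allow this simplification is
exactly the open question.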

What do you think?

Thanks,
Kuai
>>
>> I don't see any follow-up on the thread from June 8th 2022 asking for
>> this patch to be dropped from all stable kernels since it caused a
>> regression.
>>
>> The patch doesn't appear to be present in the current mainline kernel
>> (6.3-rc2) either. So I assume this issue is still present there, or it
>> was resolved differently and I just can't find the commit/patch.
> 
> It should be fixed by commit 9dfbdafda3b3 "md: unlock mddev before reap
> sync_thread in action_store".
> 
> Thanks,
> Guoqing