[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <60829bc7-2eb9-f4ca-1a36-d2dbda5b0f3e@huawei.com>
Date: Wed, 15 Mar 2023 11:02:37 +0800
From: Yu Kuai <yukuai3@...wei.com>
To: Guoqing Jiang <guoqing.jiang@...ux.dev>,
Marc Smith <msmith626@...il.com>
CC: Donald Buczek <buczek@...gen.mpg.de>, Song Liu <song@...nel.org>,
<linux-raid@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
<it+raid@...gen.mpg.de>, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle"
transition
在 2023/03/14 21:55, Guoqing Jiang 写道:
>
>
> On 3/14/23 21:25, Marc Smith wrote:
>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>> <guoqing.jiang@...ud.ionos.com> wrote:
>>> Hi Donald,
>>>
>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>> Dear Guoqing,
>>>>
>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>
>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>> sync_thread, like this.
>>>>>>>
>>>>>>> /* resync has finished, collect result */
>>>>>>> mddev_unlock(mddev);
>>>>>>> md_unregister_thread(&mddev->sync_thread);
>>>>>>> mddev_lock(mddev);
>>>>>> As above: While we wait for the sync thread to terminate, wouldn't it
>>>>>> be a problem, if another user space operation takes the mutex?
>>>>> I don't think other places can be blocked while hold mutex, otherwise
>>>>> these places can cause potential deadlock. Please try above two lines
>>>>> change. And perhaps others have better idea.
>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>
>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>> Great. I will send a formal patch with your reported-by and tested-by.
>>>
>>> Thanks,
>>> Guoqing
>> I'm still hitting this issue with Linux 5.4.229 -- it looks like 1/2
>> of the patches that supposedly resolve this were applied to the stable
>> kernels, however, one was omitted due to a regression:
>> md: don't unregister sync_thread with reconfig_mutex held (upstream
>> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
Hi, Guoqing,
Just borrow this thread to discuss, I think this commit might have
problem in some corner cases:
t1: t2:
action_store
mddev_lock
if (mddev->sync_thread)
mddev_unlock
md_unregister_thread
md_check_recovery
set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
queue_work(md_misc_wq, &mddev->del_work)
mddev_lock_nointr
md_reap_sync_thread
// clear running
mddev_lock
t3:
md_start_sync
// running is not set
Our test report a problem that can be cause by this in theory, by we
can't be sure for now...
We thought about how to fix this, instead of calling
md_register_thread() here to wait for sync_thread to be done
synchronisely, we do this asynchronously like what md_set_readonly() and
do_md_stop() does.
What do you think?
Thanks,
Kuai
>>
>> I don't see any follow-up on the thread from June 8th 2022 asking for
>> this patch to be dropped from all stable kernels since it caused a
>> regression.
>>
>> The patch doesn't appear to be present in the current mainline kernel
>> (6.3-rc2) either. So I assume this issue is still present there, or it
>> was resolved differently and I just can't find the commit/patch.
>
> It should be fixed by commit 9dfbdafda3b3"md: unlock mddev before reap
> sync_thread in action_store".
>
> Thanks,
> Guoqing
> .
>
Powered by blists - more mailing lists