Message-ID: <9dc19483-de0f-e8c6-bf18-10c33d0a35fd@linux.dev>
Date: Wed, 15 Mar 2023 17:30:21 +0800
From: Guoqing Jiang <guoqing.jiang@...ux.dev>
To: Yu Kuai <yukuai3@...wei.com>, Marc Smith <msmith626@...il.com>
Cc: Donald Buczek <buczek@...gen.mpg.de>, Song Liu <song@...nel.org>,
linux-raid@...r.kernel.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
it+raid@...gen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 3/15/23 11:02, Yu Kuai wrote:
>
>
> On 2023/03/14 21:55, Guoqing Jiang wrote:
>>
>>
>> On 3/14/23 21:25, Marc Smith wrote:
>>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>>> <guoqing.jiang@...ud.ionos.com> wrote:
>>>> Hi Donald,
>>>>
>>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>>> Dear Guoqing,
>>>>>
>>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>>
>>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>>> sync_thread, like this.
>>>>>>>>
>>>>>>>> /* resync has finished, collect result */
>>>>>>>> mddev_unlock(mddev);
>>>>>>>> md_unregister_thread(&mddev->sync_thread);
>>>>>>>> mddev_lock(mddev);
>>>>>>> As above: While we wait for the sync thread to terminate,
>>>>>>> wouldn't it be a problem if another user-space operation takes
>>>>>>> the mutex?
>>>>>> I don't think other places can be blocked while holding the mutex;
>>>>>> otherwise those places could cause a potential deadlock. Please try
>>>>>> the above two-line change. And perhaps others have a better idea.
>>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>>
>>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>>> Great. I will send a formal patch with your reported-by and tested-by.
>>>>
>>>> Thanks,
>>>> Guoqing
>>> I'm still hitting this issue with Linux 5.4.229 -- it looks like one
>>> of the two patches that supposedly resolve this was applied to the
>>> stable kernels; however, the other was omitted due to a regression:
>>> md: don't unregister sync_thread with reconfig_mutex held (upstream
>>> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> Hi, Guoqing,
>
> Just borrowing this thread to discuss: I think this commit might have
> a problem in some corner cases:
> t1:                                  t2:
> action_store
>  mddev_lock
>  if (mddev->sync_thread)
>   mddev_unlock
>   md_unregister_thread
>                                      md_check_recovery
>                                       set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
>                                       queue_work(md_misc_wq, &mddev->del_work)
>   mddev_lock_nointr
>   md_reap_sync_thread
>   // clear running
>  mddev_lock
>
> t3:
> md_start_sync
> // running is not set
What does 'running' mean? MD_RECOVERY_RUNNING?
> Our test reported a problem that, in theory, can be caused by this,
> but we can't be sure for now...
I guess you tried to describe a race between
action_store -> md_register_thread
and
md_start_sync -> md_register_thread
Didn't you already fix them in the series?
[PATCH -next 0/5] md: fix uaf for sync_thread
Sorry, I didn't follow the problem or your series; I might try your
test with the latest mainline kernel if the test is available somewhere.
> We thought about how to fix this: instead of calling
> md_register_thread() here to wait for the sync_thread to be done
> synchronously,
IMO, md_register_thread() just creates and wakes a thread; I'm not sure
why it would wait for sync_thread.
> we do this asynchronously, like what md_set_readonly() and
> do_md_stop() do.
Still, I don't have a clear picture of the problem, so I can't judge it.
Thanks,
Guoqing