Message-ID: <9dc19483-de0f-e8c6-bf18-10c33d0a35fd@linux.dev>
Date: Wed, 15 Mar 2023 17:30:21 +0800
From: Guoqing Jiang <guoqing.jiang@...ux.dev>
To: Yu Kuai <yukuai3@...wei.com>, Marc Smith <msmith626@...il.com>
Cc: Donald Buczek <buczek@...gen.mpg.de>, Song Liu <song@...nel.org>,
linux-raid@...r.kernel.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
it+raid@...gen.mpg.de
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

On 3/15/23 11:02, Yu Kuai wrote:
>
>
> On 2023/03/14 21:55, Guoqing Jiang wrote:
>>
>>
>> On 3/14/23 21:25, Marc Smith wrote:
>>> On Mon, Feb 8, 2021 at 7:49 PM Guoqing Jiang
>>> <guoqing.jiang@...ud.ionos.com> wrote:
>>>> Hi Donald,
>>>>
>>>> On 2/8/21 19:41, Donald Buczek wrote:
>>>>> Dear Guoqing,
>>>>>
>>>>> On 08.02.21 15:53, Guoqing Jiang wrote:
>>>>>>
>>>>>> On 2/8/21 12:38, Donald Buczek wrote:
>>>>>>>> 5. maybe don't hold reconfig_mutex when try to unregister
>>>>>>>> sync_thread, like this.
>>>>>>>>
>>>>>>>> /* resync has finished, collect result */
>>>>>>>> mddev_unlock(mddev);
>>>>>>>> md_unregister_thread(&mddev->sync_thread);
>>>>>>>> mddev_lock(mddev);
>>>>>>> As above: While we wait for the sync thread to terminate,
>>>>>>> wouldn't it be a problem if another user-space operation takes
>>>>>>> the mutex?
>>>>>> I don't think other places can be blocked while holding the mutex;
>>>>>> otherwise those places could cause a potential deadlock. Please try
>>>>>> the above two-line change. And perhaps others have a better idea.
>>>>> Yes, this works. No deadlock after >11000 seconds,
>>>>>
>>>>> (Time till deadlock from previous runs/seconds: 1723, 37, 434, 1265,
>>>>> 3500, 1136, 109, 1892, 1060, 664, 84, 315, 12, 820 )
>>>> Great. I will send a formal patch with your reported-by and tested-by.
>>>>
>>>> Thanks,
>>>> Guoqing
>>> I'm still hitting this issue with Linux 5.4.229 -- it looks like one
>>> of the two patches that supposedly resolve this was applied to the
>>> stable kernels; however, the other was omitted due to a regression:
>>> md: don't unregister sync_thread with reconfig_mutex held (upstream
>>> commit 8b48ec23cc51a4e7c8dbaef5f34ebe67e1a80934)
> Hi, Guoqing,
>
> Just borrowing this thread to discuss: I think this commit might have
> a problem in some corner cases:
> t1:                                  t2:
> action_store
>  mddev_lock
>  if (mddev->sync_thread)
>   mddev_unlock
>   md_unregister_thread
>                                      md_check_recovery
>                                       set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
>                                       queue_work(md_misc_wq, &mddev->del_work)
>   mddev_lock_nointr
>   md_reap_sync_thread
>   // clear running
>  mddev_lock
>
> t3:
> md_start_sync
> // running is not set
What does 'running' mean? MD_RECOVERY_RUNNING?
> Our test reported a problem that, in theory, can be caused by this,
> but we can't be sure for now...
I guess you tried to describe a race between
action_store -> md_register_thread
and
md_start_sync -> md_register_thread
Didn't you already fix them in the series?
[PATCH -next 0/5] md: fix uaf for sync_thread
Sorry, I didn't follow the problem or your series; I might try your
test with the latest mainline kernel if the test is available somewhere.
> We thought about how to fix this: instead of calling
> md_register_thread() here to wait for the sync_thread to be done
> synchronously,
IMO, md_register_thread() just creates and wakes a thread; I'm not sure
why it would wait for sync_thread.
> we do this asynchronously, like what md_set_readonly() and
> do_md_stop() do.
Still, I don't have a clear picture of the problem, so I can't judge it.
Thanks,
Guoqing