linux-kernel - Re: [RFC] workqueue: allow system workqueue be used in memory reclaim

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <a131af22-0a5b-4be1-b77e-8716c63e8883@oracle.com>
Date:   Thu, 9 Nov 2023 15:36:32 -0800
From:   junxiao.bi@...cle.com
To:     Tejun Heo <tj@...nel.org>
Cc:     linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org,
        jiangshanlai@...il.com, song@...nel.org
Subject: Re: [RFC] workqueue: allow system workqueue be used in memory reclaim


On 11/9/23 10:58 AM, Tejun Heo wrote:
> Hello,
>
> On Tue, Nov 07, 2023 at 05:28:21PM -0800, Junxiao Bi wrote:
>> The following deadlock was triggered on Intel IMSM raid1 volumes.
>>
>> The sequence of the event is this:
>>
>> 1. memory reclaim was waiting xfs journal flushing and get stucked by
>> md flush work.
>>
>> 2. md flush work was queued into "md" workqueue, but never get executed,
>> kworker thread can not be created and also the rescuer thread was executing
>> md flush work for another md disk and get stuck because
>> "MD_SB_CHANGE_PENDING" flag was set.
>>
>> 3. That flag should be set by some md write process which was asking to
>> update md superblock to change in_sync status to 0, and then it used
>> kernfs_notify to ask "mdmon" process to update superblock, after that,
>> write process waited that flag to be cleared.
>>
>> 4. But "mdmon" was never wake up, because kernfs_notify() depended on
>> system wide workqueue "system_wq" to do the notify, but since that
>> workqueue doesn't have a rescuer thread, notify will not happen.
> Things like this can't be fixed by adding RECLAIM to system_wq because
> system_wq is shared and someone else might occupy that rescuer thread. The
> flag doesn't guarantee unlimited forward progress. It only guarantees
> forward progress of one work item.
>
> That seems to be where the problem is in #2 in the first place. If a work
> item is required during memory reclaim, it must have guaranteed forward
> progress but it looks like that's waiting for someone else who can end up
> waiting for userspace?
>
> You'll need to untangle the dependencies earlier.
Make sense. Thanks a lot for the comments.
>
> Thanks.
>