lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5ecef6a2-788e-948a-15b9-f3ccc29859a9@huawei.com>
Date:   Thu, 20 Feb 2020 09:16:31 +0800
From:   Yunsheng Lin <linyunsheng@...wei.com>
To:     Leon Romanovsky <leon@...nel.org>
CC:     Jason Gunthorpe <jgg@...pe.ca>, Lang Cheng <chenglang@...wei.com>,
        <dledford@...hat.com>, <davem@...emloft.net>,
        <salil.mehta@...wei.com>, <yisen.zhuang@...wei.com>,
        <linuxarm@...wei.com>, <netdev@...r.kernel.org>,
        <linux-rdma@...r.kernel.org>, Saeed Mahameed <saeedm@...lanox.com>,
        <bhaktipriya96@...il.com>, <tj@...nel.org>,
        Jeff Kirsher <jeffrey.t.kirsher@...el.com>
Subject: Re: [RFC rdma-next] RDMA/core: Add attribute WQ_MEM_RECLAIM to
 workqueue "infiniband"

On 2020/2/19 19:07, Leon Romanovsky wrote:
> On Wed, Feb 19, 2020 at 03:40:59PM +0800, Yunsheng Lin wrote:
>> +cc Bhaktipriya, Tejun and Jeff
>>
>> On 2020/2/19 14:45, Leon Romanovsky wrote:
>>> On Wed, Feb 19, 2020 at 09:13:23AM +0800, Yunsheng Lin wrote:
>>>> On 2020/2/18 23:31, Jason Gunthorpe wrote:
>>>>> On Tue, Feb 18, 2020 at 11:35:35AM +0800, Lang Cheng wrote:
>>>>>> The hns3 driver sets "hclge_service_task" workqueue with
>>>>>> WQ_MEM_RECLAIM flag in order to guarantee forward progress
>>>>>> under memory pressure.
>>>>>
>>>>> Don't do that. WQ_MEM_RECLAIM is only to be used by things interlinked
>>>>> with reclaimed processing.
>>>>>
>>>>> Work on queues marked with WQ_MEM_RECLAIM can't use GFP_KERNEL
>>>>> allocations, can't do certain kinds of sleeps, can't hold certain
>>>>> kinds of locks, etc.
>>
>> By the way, what kind of sleeps and locks can not be done in the work
>> queued to wq marked with WQ_MEM_RECLAIM?
> 
> I didn't see this knowledge documented, but I would assume that
> everything that can block memory reclaim progress should not be
> in such workqueue.
> 
>>
>>>>
>>>> From mlx5 driver, it seems that there is GFP_KERNEL allocations
>>>> on wq marked with WQ_MEM_RECLAIM too:
>>>>
>>>> mlx5e_tx_timeout_work() -> mlx5e_safe_reopen_channels() ->
>>>> mlx5e_safe_switch_channels() -> mlx5e_open_channels()
>>>>
>>>> kcalloc() is called with GFP_KERNEL in mlx5e_open_channels(),
>>>> and mlx5e_tx_timeout_work() is queued with priv->wq, which is
>>>> allocated with WQ_MEM_RECLAIM flags. see:
>>>>
>>>> mlx5e_netdev_init() -> create_singlethread_workqueue()
>>>
>>> There are two reasons for that, first mlx5 driver was written far before
>>> WQ_MEM_RECLAIM usage was clarified, second mlx5 has bugs.
>>>
>>>>
>>>>
>>>> From the comment in kernel/workqueue.c, the work queued with
>>>> wq with WQ_MEM_RECLAIM flag set seems to be executed without
>>>> blocking under some rare case. I still not quite understand
>>>> the comment, and I can not find any doc that point out the
>>>> GFP_KERNEL allocations can not be done in wq with WQ_MEM_RECLAIM
>>>> yet. Is there any doc that mentions that GFP_KERNEL allocations
>>>> can not be done in wq with WQ_MEM_RECLAIM?
>>>
>>> It is whole purpose of WQ_MEM_RECLAIM flag - allow progress in case of
>>> memory pressure. Allocation memory while we are under memory pressure
>>> is an invitation for a disaster.
>>
>> Ok, make sense.
>>
>>>
>>>>
>>>>
>>>> /**
>>>>  * rescuer_thread - the rescuer thread function
>>>>  * @__rescuer: self
>>>>  *
>>>>  * Workqueue rescuer thread function.  There's one rescuer for each
>>>>  * workqueue which has WQ_MEM_RECLAIM set.
>>>>  *
>>>>  * Regular work processing on a pool may block trying to create a new
>>>>  * worker which uses GFP_KERNEL allocation which has slight chance of
>>>>  * developing into deadlock if some works currently on the same queue
>>>>  * need to be processed to satisfy the GFP_KERNEL allocation.  This is
>>>>  * the problem rescuer solves.
>>>>  *
>>>>  * When such condition is possible, the pool summons rescuers of all
>>>>  * workqueues which have works queued on the pool and let them process
>>>>  * those works so that forward progress can be guaranteed.
>>>>  *
>>>>  * This should happen rarely.
>>>>  *
>>>>  * Return: 0
>>>>  */
>>>>
>>>>
>>>> The below is the reason we add the sets "hclge_service_task" workqueue
>>>> with WQ_MEM_RECLAIM through analysing why other ethernet drivers has
>>>> allocated wq with WQ_MEM_RECLAIM flag, I may be wrong about that:
>>>
>>> Many drivers are developed using copy/paste technique, so it is wrong
>>> to assume that "other ethernet drivers" did the right thing.
>>>
>>>>
>>>> hns3 ethernet driver may be used as the low level transport of a
>>>> network file system, memory reclaim data path may depend on the
>>>> worker in hns3 driver to bring back the ethernet link so that it flush
>>>> the some cache to network based disk.
>>>
>>> Unlikely that this "network file system" dependency on ethernet link is correct.
>>
>> Ok, I may be wrong about the above usecase.
>> but the below commit explicitly state that network devices may be used in
>> memory reclaim path.
>>
>> 0a38c17a21a0 ("fm10k: Remove create_workqueue"):
>>
>> fm10k: Remove create_workqueue
>>
>> alloc_workqueue replaces deprecated create_workqueue().
>>
>> A dedicated workqueue has been used since the workitem (viz
>> fm10k_service_task, which manages and runs other subtasks) is involved in
>> normal device operation and requires forward progress under memory
>> pressure.
>>
>> create_workqueue has been replaced with alloc_workqueue with max_active
>> as 0 since there is no need for throttling the number of active work
>> items.
>>
>> Since network devices may be used in memory reclaim path,
>> WQ_MEM_RECLAIM has been set to guarantee forward progress.
>>
>> flush_workqueue is unnecessary since destroy_workqueue() itself calls
>> drain_workqueue() which flushes repeatedly till the workqueue
>> becomes empty. Hence the call to flush_workqueue() has been dropped.
>>
>> Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@...il.com>
>> Acked-by: Tejun Heo <tj@...nel.org>
>> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@...el.com>
>>
>> So:
>> 1. Maybe the above commit log is misleading, and network device driver's
>>    wq does not need the WQ_MEM_RECLAIM flag, then maybe document what can
>>    not be done in the work queued to wq marked with WQ_MEM_RECLAIM, and
>>    remove the WQ_MEM_RECLAIM flag for the wq of network device driver.
> 
> I wouldn't truly count on what is written in commit messages of patch
> series which globally replaced create_workqueue() interface.
> 
>>
>>
>> 2. If the network device driver's wq does need the WQ_MEM_RECLAIM flag, then
>>    hns3 may have tow problems here: WQ_MEM_RECLAIM wq flushing !WQ_MEM_RECLAIM
>>    wq problem and GFP_KERNEL allocations in the work queued to WQ_MEM_RECLAIM wq.
> 
> You are proposing to put WQ_MEM_RECLAIM in infiniband queue and not to
> special queue inside of the driver.

In the commit log, we thought of three ways to avoid the warning.
which is:

1. Allocate the "hclge_service_task" workqueue without
WQ_MEM_RECLAIM flag, which may cause deadlock problem
when hns3 driver is used as the low level transport of
a network file system

2. Do not unregister ib_device during reset process, maybe
only disable accessing to the ib_device using disable_device()
as rdma_dev_change_netns() does.

3. Allocate the "infiniband" workqueue with WQ_MEM_RECLAIM flag.


Actually I prefer the second one, which need IB stack to provide
interface to disable and enable ib_device in the roce device driver
when there is hardware reset happening.

Is the above interface reasonable?

> 
>>
>>>
>>> Thanks
>>>
>>>>
>>>>>
>>>>> Jason
>>>>>
>>>>>
>>>>
>>>
>>> .
>>>
>>
> 
> .
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ