Message-ID: <85145c75-f2f6-a393-daa2-967251cc3443@bytedance.com>
Date:   Wed, 12 Oct 2022 11:14:37 +0800
From:   Abel Wu <wuyun.abel@...edance.com>
To:     Michal Hocko <mhocko@...e.com>,
        Frank van der Linden <fvdl@...gle.com>
Cc:     Zhongkun He <hezhongkun.hzk@...edance.com>, corbet@....net,
        akpm@...ux-foundation.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
        linux-doc@...r.kernel.org
Subject: Re: [RFC] mm: add new syscall pidfd_set_mempolicy()

On 10/12/22 3:29 AM, Michal Hocko wrote:
> On Tue 11-10-22 10:22:23, Frank van der Linden wrote:
>> On Tue, Oct 11, 2022 at 8:00 AM Michal Hocko <mhocko@...e.com> wrote:
>>>
>>> On Mon 10-10-22 09:22:13, Frank van der Linden wrote:
>>>> For consistency with process_madvise(), I would suggest calling it
>>>> process_set_mempolicy.
>>>
>>> This operation has per-thread rather than per-process semantics, so I
>>> do not think your proposed naming is better.
>>
>> True. I suppose you could argue that it should have been
>> pidfd_madvise() then for consistency, but that ship has sailed.
> 
> madvise commands have per-mm semantics. It is set_mempolicy which is
> kinda special: it allows a per-task_struct policy even though the
> actual allocation happens in the same shared address space. To be
> honest I am not really sure why it is this way, because threads aim
> to share memory, so why should they have different memory policies?
> 
> I suspect that the original thinking was that some portions that are
> private to the process want to have different affinities (e.g. stacks
> and dedicated per-cpu heap arenas). E.g. worker pools that want their
> own allocations to be per-cpu local but operate on shared data that
> requires different policies.
> 
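That matches how we'd expect per-thread policies to be used. A minimal
userspace sketch of that pattern (assuming libnuma's <numaif.h> wrapper
for set_mempolicy(); the node number passed in arg is made up):

#include <numaif.h>     /* set_mempolicy(), MPOL_* -- link with -lnuma */
#include <stdlib.h>

/*
 * Worker thread body: bind only *this thread's* future allocations to
 * its local node.  The other threads sharing the address space keep
 * whatever policy they already had.
 */
static void *worker(void *arg)
{
        unsigned long nodemask = 1UL << (long)arg;  /* arg = local node id */

        /* On failure we simply keep the inherited/default policy. */
        set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);

        void *arena = malloc(1 << 20);  /* pages faulted here follow MPOL_BIND */
        /* ... operate on shared data allocated under other policies ... */
        free(arena);
        return NULL;
}
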
>>>> Other than that, this makes sense. To complete
>>>> the set, perhaps a process_mbind() should be added as well. What do
>>>> you think?
>>>
>>> Is there any real use case for this interface? How is the caller
>>> supposed to make per-range decisions without very involved
>>> coordination with the target process?
>>
>> The use case for a potential pidfd_mbind() is basically a combination
>> of what is described in the process_madvise proposal (
>> https://lore.kernel.org/lkml/20200901000633.1920247-1-minchan@kernel.org/
>> ), and what this proposal describes: system management software acting
>> as an orchestrator that has a better overview of the system as a whole
>> (NUMA nodes, memory tiering), and has knowledge of the layout of the
>> processes involved.

This is exactly why we are proposing pidfd/process_set_mempolicy().
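
Roughly what we have in mind on the orchestrator side is sketched below.
The syscall number is just a placeholder, and the argument list simply
mirrors set_mempolicy(2) plus the pidfd and a flags word; the exact
final interface is of course whatever the RFC settles on:

#include <numaif.h>             /* MPOL_* */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_pidfd_set_mempolicy
#define __NR_pidfd_set_mempolicy 451    /* placeholder; use the number the final patch allocates */
#endif

/*
 * Orchestrator side: retarget a managed task's default allocations to
 * node 1 (made-up number, e.g. a slower memory tier) without touching
 * any per-VMA policies the task has set up for itself.
 */
static int demote_task(pid_t pid)
{
        unsigned long nodemask = 1UL << 1;
        int ret, pidfd = syscall(SYS_pidfd_open, pid, 0);

        if (pidfd < 0)
                return -1;
        ret = syscall(__NR_pidfd_set_mempolicy, pidfd, MPOL_BIND,
                      &nodemask, sizeof(nodemask) * 8, 0);
        close(pidfd);
        return ret;
}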

> 
> Well, a per-address-range operation is a completely different beast I
> would say. An external tool would need to a) understand what that
> range is used for (e.g. stack/heap ranges, mmapped shared files like
> libraries, or private mappings) and b) be in sync with memory layout
> modifications done by the application (e.g. that an mmap has been
> issued to back a malloc request). That is quite a lot of understanding
> about the specific process. I would say that with that intimate
> knowledge it is better to be part of the process and make those
> changes from within the process itself.

Agreed, an orchestrator such as system management software may not have
enough knowledge about individual address ranges. I also don't think it
is appropriate for orchestrators to overwrite tasks' mempolicies, since
those are set for a purpose by the apps themselves. So I suggested a
per-mm policy which has a lower priority than the per-task ones.
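
To make the intended precedence concrete, here is a rough sketch of the
lookup order I have in mind (not actual kernel code; mm->mempolicy is
the hypothetical new per-mm field):

/* Sketch only: per-VMA beats per-task, per-task beats per-mm. */
static struct mempolicy *sketch_get_policy(struct task_struct *p,
                                           struct vm_area_struct *vma)
{
        if (vma && vma->vm_policy)              /* mbind() range policy */
                return vma->vm_policy;
        if (p->mempolicy)                       /* set_mempolicy(), per task */
                return p->mempolicy;
        if (p->mm && p->mm->mempolicy)          /* proposed per-mm policy */
                return p->mm->mempolicy;
        return &default_policy;                 /* system default */
}

This way the orchestrator only provides a fallback and never overrides
what the application configured for itself.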

Thanks & BR,
Abel
