[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e166592f-aeb3-4573-bb73-270a2eb90be3@gmail.com>
Date: Thu, 29 May 2025 18:21:55 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Shakeel Butt <shakeel.butt@...ux.dev>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
Jann Horn <jannh@...gle.com>, Arnd Bergmann <arnd@...db.de>,
Christian Brauner <brauner@...nel.org>, SeongJae Park <sj@...nel.org>,
Mike Rapoport <rppt@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
Barry Song <21cnbao@...il.com>, linux-mm@...ck.org,
linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-api@...r.kernel.org, Pedro Falcato <pfalcato@...e.de>
Subject: Re: [DISCUSSION] proposed mctl() API
On 29/05/2025 15:43, Lorenzo Stoakes wrote:
> ## INTRODUCTION
>
> After discussions in various threads (Usama's series adding a new prctl()
> in [0], and a proposal to adapt process_madvise() to do the same -
> conception in [1] and RFC in [2]), it seems fairly clear that it would make
> sense to explore a dedicated API to explicitly allow for actions which
> affect the virtual address space as a whole.
>
> Also, Barry is implementing a feature (currently under RFC) which could
> additionally make use of this API (see [3]).
>
> [0]: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
> [1]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> [2]: https://lore.kernel.org/all/cover.1747686021.git.lorenzo.stoakes@oracle.com/
> [3]: https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@gmail.com/
>
> While madvise() and process_madvise() are useful for altering the
> attributes of VMAs within a virtual address space, it isn't the right fit
> for something that affects the whole address space.
>
> Additionally, a requirement of Usama's proposal (see [0]) is that we have
> the ability to propagate the change in behaviour across fork/exec. This
> further suggests the need for a dedicated interface, as this really sits
> outside the ordinary behaviour of [process_]madvise().
>
> prctl() is too broad and encourages mm code to migrate to kernel/sys.c
> where it is at risk of bit-rotting. It can make it harder/impossible to
> isolate mm logic for testing and logic there might be missed in changes
> moving forward.
>
> It also, like so many kernel interfaces, has 'grown roots out of its pot'
> so to speak - while it started off as an ostensible 'process' manipulation
> interface, prctl() operations manipulate a myriad of task, virtual
> address space and even specific VMA attributes.
>
> At this stage it really is a 'catch-all' for things we simply couldn't fit
> elsewhere.
>
> Therefore, as suggested by the rather excellent Liam Howlett, I propose an
> mm-specific interface that _explicitly_ manipulates attributes of the
> virtual address space as a whole.
>
> I think something that mimics the simplicity of [process_]madvise() makes
> sense - have a limited set of actions that can be taken, and treat them as
> a simple action - a user requests you do XXX to the virtual address space
> (that is, across the mm_struct), and you do it.
>
Hi Lorenzo,
Thanks for writing the proposal, this is awesome!
Whatever the community agrees with, whether its this or prctl, happy to
move forward with either as both should accomplish the usecase proposed.
I will just add some points over here in defence of prctl, this is just for
discussion, and if the community disagrees, completely happy to move forward
with new syscall as well.
When it comes to having mm code in kernel/sys.c, we can just do something
like below that can actually clean it up?
diff --git a/kernel/sys.c b/kernel/sys.c
index 3a2df1bd9f64..bfadc339e2c5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2467,6 +2467,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = 0;
switch (option) {
+ case PR_SET_MM:
+ case PR_GET_THP_DISABLE:
+ case PR_SET_THP_DISABLE:
+ case PR_NEW_MM_THING:
+ error = some_function_in_mm_folder(); // in mm/mctl.c ?
+ break;
case PR_SET_PDEATHSIG:
if (!valid_signal(arg2)) {
error = -EINVAL;
when it comes to prctl becoming a catch-all thing, with above clean up,
we can be a lot more careful to what gets added to the mm side of prctl.
The advantage of this is it avoids having another syscall.
My personal view (which can be wrong :)) is that a new syscall should be
for something major,
and I feel that PR_DEFAULT_MADV_HUGEPAGE and PR_DEFAULT_MADV_NOHUGEPAGE
might be small enough to fit in prctl? but I completely understand
your point of view as well!
> ## INTERFACE
>
> The proposed interface is simply:
>
> int mctl(int pidfd, int action, unsigned int flags);
>
> Since PIDFD_SELF is now available, it is easy to invoke this for the
> current process, while also adding the flexibility of being able to apply
> actions to other processes also.
>
> The function will return 0 on success, -1 on failure, with errno set to the
> error that arose, standard stuff.
>
> The behaviour will be tailored to each action taken.
>
> To begin with, I propose a single flag:
>
> - MCTL_SET_DEFAULT_EXEC - Persists this behaviour across fork/exec.
>
> This again will be tailored - only certain actions will be allowed to set
> this flag, and we will of course assert appropriate capabilities, etc. upon
> its use.
>
Sounds good to me. Just adding this here, the solution will be used in systemd
in exec_invoke, similar to how KSM is done with prctl in [1], so for the THP
solution, we would need MCTL_SET_DEFAULT_EXEC as it would need to be inherited
across fork+exec.
[1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5046
> All actions would, impact every VMA (if adjustments to VMAs are required).
>
> ## SECURITY
>
> Of course, security will be of utmost concern (Jann's input is important
> here :)
>
> We can vary security requirements depending on the action taken.
>
> For an initial version I suggest we simply limit operations which:
>
> - Operate on a remote process
> - Use the MCTL_SET_DEFAULT_EXEC flag
>
> To those tasks which possess the CAP_SYS_ADMIN capability.
>
> This may be too restrictive - be good to get some feedback on this.
>
> I know Jann raised concerns around privileged execution and perhaps it'd be
> useful to see whether this would make more sense for the SET_DEFAULT_EXEC
> case or not.
>
> Usama - would requiring CAP_SYS_ADMIN be egregious to your use case?
>
My knowledge is security is limited, so please bare with me, but I actually
didn't understand the security issue and the need for CAP_SYS_ADMIN for
doing VM_(NO)HUGEPAGE.
A process can already madvise its own VMAs, and this is just doing that
for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be
a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
for that process as well, which again doesnt seem like a security issue
to me.
> ## IMPLEMENTATION
>
> I think that sensibly we'd need to add some new files here, mm/mctl.c,
> include/linux/mctl.h (the latter of providing the MCTL_xxx actions and
> flags).
>
> We could find ways to share code between mm files where appropriate to
> avoid too much duplication.
>
> I suggest that the best way forward, if we were minded to examine how this
> would look in practice, would be for me to implement an RFC that adds the
> interface, and a simple MCTL_SET_NOHUGEPAGE, MCTL_CLEAR_NOHUGEPAGE
> implementation as a proof of concept.
>
> If we wanted to then go ahead with a non-RFC version, this could then form
> a foundation upon which Usama and Barry could implement their features,
> with Usama then able to add MCTL_[SET/CLEAR]_HUGEPAGE and Barry
> MCTL_[SET/CLEAR]_FADE_ON_DEATH.
>
> Obviously I don't mean to presume to suggest how we might proceed here -
> only suggesting this might be a good way of moving forward and getting
> things done as quickly as possible while allowing you guys to move forward
> with your features.
>
> Let me know if this makes sense, alternatively I could try to find a
> relatively benign action to implement as part of the base work, or we could
> simply collaborate to do it all in one series with multiple authors?
>
> ## RFC
>
> The above is all only in effect 'putting ideas out there' so this is
> entirely an RFC in spirit and intent - let me know if this makes sense in
> whole or part :)
>
> Thanks!
>
> Lorenzo
Again thanks for the proposal! Happy to move forward with this or prctl.
Just adding my 2 cents in this email.
Thanks
Usama
Powered by blists - more mailing lists