[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d7ccb47b-7124-45e9-ace0-b0fa49f881ef@lucifer.local>
Date: Fri, 30 May 2025 14:10:55 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Shakeel Butt <shakeel.butt@...ux.dev>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
Jann Horn <jannh@...gle.com>, Arnd Bergmann <arnd@...db.de>,
Christian Brauner <brauner@...nel.org>, SeongJae Park <sj@...nel.org>,
Mike Rapoport <rppt@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
Barry Song <21cnbao@...il.com>, linux-mm@...ck.org,
linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-api@...r.kernel.org, Pedro Falcato <pfalcato@...e.de>
Subject: Re: [DISCUSSION] proposed mctl() API
On Thu, May 29, 2025 at 06:21:55PM +0100, Usama Arif wrote:
>
>
> On 29/05/2025 15:43, Lorenzo Stoakes wrote:
> > ## INTRODUCTION
> >
> > After discussions in various threads (Usama's series adding a new prctl()
> > in [0], and a proposal to adapt process_madvise() to do the same -
> > conception in [1] and RFC in [2]), it seems fairly clear that it would make
> > sense to explore a dedicated API to explicitly allow for actions which
> > affect the virtual address space as a whole.
> >
> > Also, Barry is implementing a feature (currently under RFC) which could
> > additionally make use of this API (see [3]).
> >
> > [0]: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
> > [1]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > [2]: https://lore.kernel.org/all/cover.1747686021.git.lorenzo.stoakes@oracle.com/
> > [3]: https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@gmail.com/
> >
> > While madvise() and process_madvise() are useful for altering the
> > attributes of VMAs within a virtual address space, it isn't the right fit
> > for something that affects the whole address space.
> >
> > Additionally, a requirement of Usama's proposal (see [0]) is that we have
> > the ability to propagate the change in behaviour across fork/exec. This
> > further suggests the need for a dedicated interface, as this really sits
> > outside the ordinary behaviour of [process_]madvise().
> >
> > prctl() is too broad and encourages mm code to migrate to kernel/sys.c
> > where it is at risk of bit-rotting. It can make it harder/impossible to
> > isolate mm logic for testing and logic there might be missed in changes
> > moving forward.
> >
> > It also, like so many kernel interfaces, has 'grown roots out of its pot'
> > so to speak - while it started off as an ostensible 'process' manipulation
> > interface, prctl() operations manipulate a myriad of task, virtual
> > address space and even specific VMA attributes.
> >
> > At this stage it really is a 'catch-all' for things we simply couldn't fit
> > elsewhere.
> >
> > Therefore, as suggested by the rather excellent Liam Howlett, I propose an
> > mm-specific interface that _explicitly_ manipulates attributes of the
> > virtual address space as a whole.
> >
> > I think something that mimics the simplicity of [process_]madvise() makes
> > sense - have a limited set of actions that can be taken, and treat them as
> > a simple action - a user requests you do XXX to the virtual address space
> > (that is, across the mm_struct), and you do it.
> >
>
>
> Hi Lorenzo,
>
> Thanks for writing the proposal, this is awesome!
Thanks :)
>
> Whatever the community agrees with, whether its this or prctl, happy to
> move forward with either as both should accomplish the usecase proposed.
Awesome!
>
> I will just add some points over here in defence of prctl, this is just for
> discussion, and if the community disagrees, completely happy to move forward
> with new syscall as well.
Please do - it's vital we have all sides of the argument!
>
> When it comes to having mm code in kernel/sys.c, we can just do something
> like below that can actually clean it up?
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 3a2df1bd9f64..bfadc339e2c5 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2467,6 +2467,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>
> error = 0;
> switch (option) {
> + case PR_SET_MM:
> + case PR_GET_THP_DISABLE:
> + case PR_SET_THP_DISABLE:
> + case PR_NEW_MM_THING:
> + error = some_function_in_mm_folder(); // in mm/mctl.c ?
> + break;
> case PR_SET_PDEATHSIG:
> if (!valid_signal(arg2)) {
> error = -EINVAL;
>
> when it comes to prctl becoming a catch-all thing, with above clean up,
> we can be a lot more careful to what gets added to the mm side of prctl.
Sure, and I think we were to decide prctl() was the way to go this would be the
way it should look, or at least specifically for this case.
This addresses the bit rot issue but not how hidden the functionality is, what
prctl() generally represents de-facto (that is - a catch-all for obscure stuff),
depending on the interface of the function it might be harder/impossible to
userland-test and of course we have the beautiful vector-interface that means we
can't restrict behaviour in the API in the way we'd ideally like.
I mean overall there is no beautiful solution here.
As Matthew says, the reason we're struggling here is probably more so because
this is just showing up a flaw in how THP is these days - we really ideally need
something automagic where the kernel figures things out for us.
But in the meantime we have this frankly kinda terrible interface that's rather
off/on and yet want to establish policy _ourselves_ when the kernel should be
able to figure it out.
But with that said, we must trade that off with the reality that there is a need
for this kind of fine-grained control now, in this less than ideal world we
currently inhabit :)
>
> The advantage of this is it avoids having another syscall.
> My personal view (which can be wrong :)) is that a new syscall should be
> for something major,
> and I feel that PR_DEFAULT_MADV_HUGEPAGE and PR_DEFAULT_MADV_NOHUGEPAGE
> might be small enough to fit in prctl? but I completely understand
> your point of view as well!
See Andrew's point on the syscall thing, they're cheap ;)
I mean the idea of course is that we can put all future stuff like that here
too.
We could even port some existing prctl() stuff across, in theory. But having 2
ways of doing the same thing would probably suck.
>
> > ## INTERFACE
> >
> > The proposed interface is simply:
> >
> > int mctl(int pidfd, int action, unsigned int flags);
> >
> > Since PIDFD_SELF is now available, it is easy to invoke this for the
> > current process, while also adding the flexibility of being able to apply
> > actions to other processes also.
> >
> > The function will return 0 on success, -1 on failure, with errno set to the
> > error that arose, standard stuff.
> >
> > The behaviour will be tailored to each action taken.
> >
> > To begin with, I propose a single flag:
> >
> > - MCTL_SET_DEFAULT_EXEC - Persists this behaviour across fork/exec.
> >
> > This again will be tailored - only certain actions will be allowed to set
> > this flag, and we will of course assert appropriate capabilities, etc. upon
> > its use.
> >
>
> Sounds good to me. Just adding this here, the solution will be used in systemd
> in exec_invoke, similar to how KSM is done with prctl in [1], so for the THP
> solution, we would need MCTL_SET_DEFAULT_EXEC as it would need to be inherited
> across fork+exec.
Right yeah, this was the intent.
>
> [1] https://github.com/systemd/systemd/blob/2e72d3efafa88c1cb4d9b28dd4ade7c6ab7be29a/src/core/exec-invoke.c#L5046
>
> > All actions would, impact every VMA (if adjustments to VMAs are required).
> >
> > ## SECURITY
> >
> > Of course, security will be of utmost concern (Jann's input is important
> > here :)
> >
> > We can vary security requirements depending on the action taken.
> >
> > For an initial version I suggest we simply limit operations which:
> >
> > - Operate on a remote process
> > - Use the MCTL_SET_DEFAULT_EXEC flag
> >
> > To those tasks which possess the CAP_SYS_ADMIN capability.
> >
> > This may be too restrictive - be good to get some feedback on this.
> >
> > I know Jann raised concerns around privileged execution and perhaps it'd be
> > useful to see whether this would make more sense for the SET_DEFAULT_EXEC
> > case or not.
> >
> > Usama - would requiring CAP_SYS_ADMIN be egregious to your use case?
> >
>
> My knowledge is security is limited, so please bare with me, but I actually
> didn't understand the security issue and the need for CAP_SYS_ADMIN for
> doing VM_(NO)HUGEPAGE.
>
> A process can already madvise its own VMAs, and this is just doing that
> for the entire process. And VM_INIT_DEF_MASK is already set to VM_NOHUGEPAGE
> so it will be inherited by the parent. Just adding VM_HUGEPAGE shouldnt be
> a issue? Inheriting MMF_VM_HUGEPAGE will mean that khugepaged would enter
> for that process as well, which again doesnt seem like a security issue
> to me.
W.R.T. the current process, the Issue is one Jann raised, in relation to
propagation of behaviour to privileged (e.g. setuid) processes.
W.R.T. remote processes, obviously we want to make sure we are permitted to do
so.
>
> > ## IMPLEMENTATION
> >
> > I think that sensibly we'd need to add some new files here, mm/mctl.c,
> > include/linux/mctl.h (the latter of providing the MCTL_xxx actions and
> > flags).
> >
> > We could find ways to share code between mm files where appropriate to
> > avoid too much duplication.
> >
> > I suggest that the best way forward, if we were minded to examine how this
> > would look in practice, would be for me to implement an RFC that adds the
> > interface, and a simple MCTL_SET_NOHUGEPAGE, MCTL_CLEAR_NOHUGEPAGE
> > implementation as a proof of concept.
> >
> > If we wanted to then go ahead with a non-RFC version, this could then form
> > a foundation upon which Usama and Barry could implement their features,
> > with Usama then able to add MCTL_[SET/CLEAR]_HUGEPAGE and Barry
> > MCTL_[SET/CLEAR]_FADE_ON_DEATH.
> >
> > Obviously I don't mean to presume to suggest how we might proceed here -
> > only suggesting this might be a good way of moving forward and getting
> > things done as quickly as possible while allowing you guys to move forward
> > with your features.
> >
> > Let me know if this makes sense, alternatively I could try to find a
> > relatively benign action to implement as part of the base work, or we could
> > simply collaborate to do it all in one series with multiple authors?
> >
> > ## RFC
> >
> > The above is all only in effect 'putting ideas out there' so this is
> > entirely an RFC in spirit and intent - let me know if this makes sense in
> > whole or part :)
> >
> > Thanks!
> >
> > Lorenzo
>
> Again thanks for the proposal! Happy to move forward with this or prctl.
> Just adding my 2 cents in this email.
Thank you! And your input is most appreciated - it's important we get all sides
of the story here!
>
> Thanks
> Usama
>
Cheers, Lorenzo
Powered by blists - more mailing lists