Message-ID: <7a214bee-d184-460f-88d6-2249b9d513ba@lucifer.local>
Date: Wed, 21 May 2025 05:21:19 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
Jann Horn <jannh@...gle.com>, Arnd Bergmann <arnd@...db.de>,
Christian Brauner <brauner@...nel.org>, linux-mm@...ck.org,
linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org,
SeongJae Park <sj@...nel.org>, Usama Arif <usamaarif642@...il.com>
Subject: Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
On Tue, May 20, 2025 at 03:02:09PM -0700, Shakeel Butt wrote:
[snip for clarity]
> I think we are talking past each other (and I blame myself for that). Let
> me try again. (Please keep aside prctl or process_madvise). We need a
> way to change the policy of a process (job) and at the moment we are
> aligned that the job loader (like systemd) will set that policy and load
> the target (fork/exec), so the policy persists across fork/exec. (If
> someone has a better way to set the policy without race, please let us
> know).
Ack, totally agree the kernel currently lacks a cohesive story for this
'adjust POLICY of a process and descendants'; we have cgroups, but they're
more general than a process; we have namespaces, but that's for restricting
resources...
I think we all want the same thing here, ultimately.
>
> My argument is that process_madvise() is not a good interface to set
> that policy because of its address-range-like interface. So, if not
> process_madvise() then what? Should we add a new syscall? (BTW we had
> very similar discussion on process_madvise(DONTNEED) on a remote process
> vs a new syscall i.e. process_mrelease()).
Sure, and generally both proposed interfaces are at least _awkward_; for me
prctl() is a no-go unless we have no other choice, I won't go over my
objections to it yet again (and Liam has also raised his of course).
>
> Adding a new syscall requires that it should be generally useful and
> hopefully have more use-cases. Now going back to the current specific
> use-case where we want to override the hugepage related policy of a job,
> do we expect to use this override forever? I believe this is temporary
> because the only reason we need this is because hugepages are not yet
> ready for prime time (many applications do not handle them well). In
> future when hugepages will be awesome, do we still need this "override
> the hugepage policy" syscall?
As argued previously, I am not so sure it'll be temporary, given the
proposed future 'auto' mode will be a _mode_ and we will need to support
VM_[NO]HUGEPAGE scenarios forever (deep, deep sigh).
Also, if you add it into systemd it definitely won't be, right? There's no
'throwaway' here, and scouring through prctl() (what is actually documented
:), I am not sure anything ever is, frankly.
So the idea is to try to make this as generic as possible and to have it
sit with code it makes sense to sit with.
>
> Now if we can show that this specific functionality is useful for more
> than hugepages then I think a new syscall seems like the best way forward.
> However if we think this functionality is only needed temporarily then
> shoving it in prctl() seems reasonable to me. If we really don't want
> a prctl()-based solution, I would recommend discussing the new syscall
> approach and see if we can come up with a more general solution.
>
So, something Liam mentioned off-list was the beautifully named
'mmadvise()'. Idea being that we have a system call _explicitly for_
mm-wide modifications.
With Barry's series doing a prctl() for something similar, and a whole host
of mm->flags existing for modifying behaviour, it would seem a natural fit.
I could do a respin that does something like this instead.
What's a pity to me re: going away from process_madvise() is losing the
opportunity to be able to modify the, frankly broken, gaps handling and
also being able to do 'best effort' madvise ranges.
But I suppose those can always be separate series... :)
I guess let me work that up so we can see how that looks?
Cheers, Lorenzo