[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4jw2sj6zxi64xmjbfwdthwboeftz2ucmrkvk2vvrf5sx7vxzoq@rotrm765fbfl>
Date: Tue, 20 May 2025 22:16:07 -0400
From: "Liam R. Howlett" <Liam.Howlett@...cle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
Andrew Morton <akpm@...ux-foundation.org>,
David Hildenbrand <david@...hat.com>, Vlastimil Babka <vbabka@...e.cz>,
Jann Horn <jannh@...gle.com>, Arnd Bergmann <arnd@...db.de>,
Christian Brauner <brauner@...nel.org>, linux-mm@...ck.org,
linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org,
SeongJae Park <sj@...nel.org>, Usama Arif <usamaarif642@...il.com>
Subject: Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
* Shakeel Butt <shakeel.butt@...ux.dev> [250520 15:49]:
> On Tue, May 20, 2025 at 07:45:43PM +0100, Lorenzo Stoakes wrote:
> > On Tue, May 20, 2025 at 11:25:05AM -0700, Shakeel Butt wrote:
> > > On Mon, May 19, 2025 at 09:52:37PM +0100, Lorenzo Stoakes wrote:
> > > > REVIEWERS NOTES:
> > > > ================
> > > >
> > > > This is a VERY EARLY version of the idea, it's relatively untested, and I'm
> > > > 'putting it out there' for feedback. Any serious version of this will add a
> > > > bunch of self-tests to assert correct behaviour and I will more carefully
> > > > confirm everything's working.
> > > >
> > > > This is based on discussion arising from Usama's series [0], SJ's input on
> > > > the thread around process_madvise() behaviour [1] (and a subsequent
> > > > response by me [2]) and prior discussion about a new madvise() interface
> > > > [3].
> > > >
> > > > [0]: https://lore.kernel.org/linux-mm/20250515133519.2779639-1-usamaarif642@gmail.com/
> > > > [1]: https://lore.kernel.org/linux-mm/20250517162048.36347-1-sj@kernel.org/
> > > > [2]: https://lore.kernel.org/linux-mm/e3ba284c-3cb1-42c1-a0ba-9c59374d0541@lucifer.local/
> > > > [3]: https://lore.kernel.org/linux-mm/c390dd7e-0770-4d29-bb0e-f410ff6678e3@lucifer.local/
> > > >
> > > > ================
...
> > > >
> > > > 3. PMADV_SET_FORK_EXEC_DEFAULT
> > > >
> > > > It may be desirable for a user to specify that all VMAs mapped in a process
> > > > address space default to having an madvise() behaviour established by
> > > > default, in such a fashion as that this persists across fork/exec.
> > > >
> > > > Since this is a very powerful option that would make no sense for many
> > > > advice modes, we explicitly only permit known-safe flags here (currently
> > > > MADV_HUGEPAGE and MADV_NOHUGEPAGE only).
> > >
> > > Other flags seems general enough but this one is just weird. This is
> > > exactly the scenario for prctl() like interface. You are trying to make
> > > process_madvise() like prctl() and I can see process_madvise() would be
> > > included in all the hate that prctl() receives.
> >
> > I'm really not sure what you mean. prctl() has no rhyme nor reason, so not
> > sure what a 'prctl() like interface' means here, and you're not explaining
> > what you mean by that.
>
> I meant it applies a property at the task or process level and has
> examples where those properties are inherited to children.
Okay, I don't think we should have any illusions of how clear cut either
of these things are. prctl has PR_SET_VMA, which isn't to to do with
the process.
madvise() can be called with a range covering the entire address space
and may or may not fail on gaps (depending on the call).
process_madvise() was to madvise on a process range - including all of
it, so I don't see either as the wrong place. Or, really, the best
place.
>
> >
> > Presumably you mean you find this odd as you feel it sits outside the realm
> > of madvise() behaviour.
>
> The persistence across exec seems weird.
>
> >
> > But I'd suggest it does not - the idea is to align _everything_ with
> > madvise(). Rather than adding an entirely arbitrary function in prctl(), we
> > are going the other way - keeping everything relating to madvise()-like
> > modification of memory _in mm_ and _in madvise()_, rather than bitrotting
> > away in kernel/sys.c.
>
> The above para seems like you are talking about code which can be moved
> to mm.
>
> >
> > So we get alignment in the fact that we're saying 'we establish a _default_
> > madvise flag for a process'.
>
> I think this is an important point. So, we want to introduce a way to
> set a process level property which can be inherited through fork/exec.
> With that in mind, is process_madvise() (or even madvise()) really a
> good interface for it? There is no need for address ranges for such
> use-case.
>
> >
> > We restrict initially to VM_HUGEPAGE and VM_NOHUGEPAGE to a. achieve what
> > you guys at meta want while also opening the door to doing so in future if
> > it makes sense to.
>
> Please drop the "you guys at meta". We should be aiming for what is good
> for all (or most) linux users. Whatever is done here will be
> incorporated in systemd which will be used very widely.
Agreed, the aim is to provide the most flexible interface for the
majority of the users. In my view, there isn't a clear way forward yet:
I don't want to make process_madvise() a dumping ground and I don't want
to make prctl() a bigger mess.
But right now, it seems there is more than one user arguing for this
particular implementation, but in fact there two employees at the same
company and that isn't clear in this conversation to the casual reader.
And the opinions are being cross-referenced as evidence of how others
view it. I have no idea if you two know each other, or even if you know
that you are both at meta - but I think if I chimed in from some other
email address saying how crazy prctl() is over this other way, then
you'd probably feel compelled to say something about where I work (I
hope you would tbh)?
You could add a (meta) to your email address or signature so that it is
implicit that there may be other influences on opinion. I'm not saying
there is an influence on opinion or that this opinion is wrong, but it
is not fully transparent and complicates the conversation.
I don't think it is unreasonable for people to point out that everyone
arguing for one solution is at the same company when it isn't obvious,
do you?
>
> >
> > This couldn't be more different than putting some arbitrary code relating
> > to mm in the 'netherrealm' of prctl().
> >
> >
> > >
> > > Let me ask in a different way. Eventually we want to be in a state where
> > > hugepages works out of the box for all workloads. In that state what
> > > would the need for this flag unless you have use-cases other than
> > > hugepages. To me, there is a general consensus that prctl is a hacky
> > > interface, so having some intermediate solution through prctl until
> > > hugepages are good out of the box seems more reasonable.
(oh my, I'm going to sound like a jerk here.. sorry)
This is a terrible argument. This intermediate solution may outlive us
all.. it may even shorten some of our lives due to stress dealing with
it. Will the removal of the prctl() call coincide with the year of the
Linux desktop? Or maybe the removal of the mmap lock... wait.
The hacky interface of prctl() is one of my main gripes with that
proposal, it's not a reason to go with it.
> >
> > No, fundamentally disagree. We already have MADV_[NO]HUGEPAGE. This will
> > have to be supported. In a future where things are automatic, these may be
> > no-ops in 'auto' mode.
> >
> > But the requirement to have these flags will still exist, the requirement
> > to do madvise() operations will still exist, we're simply expanding this
> > functionality.
> >
> > The problem arises the other way around when we shovel mm stuff in
> > kernel/sys.c.
>
> I think you mixing the location of the code and the interface which will
> remain relevant long term. I don't see process_madvise (or madvise) good
> interface for this specific functionality (i.e. process level property
> that gets inherited through fork/exec).
Process is literally in the name and we are applying an madvise flag
across the entire range.
The inherited through fork/exec is a much stronger reason that I see for
not doing it in process_madvise().
> Now we can add a new syscall for
> this but to me, particularly for hugepage, this functionality is needed
> temporarily until hugepages are good out of the box.
I'm not holding my breath.
> However if there is
> a use-case other than hugepages then yes, why not a new syscall.
>
We also have Barry looking at prctl() for avoiding LRU updates on exit
[1], which I guess is process related.. or page related.. really it is
mm flags related in his implementation, sort of like this feature. But
it will probably only be used by android, so...
[1]. https://lore.kernel.org/all/20250514070820.51793-1-21cnbao@gmail.com/
Thanks,
Liam
Powered by blists - more mailing lists