Message-ID: <dde3a174-e8de-4804-ae5b-a358f0f492dc@lucifer.local>
Date: Thu, 22 May 2025 14:21:24 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Shakeel Butt <shakeel.butt@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>,
"Liam R . Howlett" <Liam.Howlett@...cle.com>,
Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
Arnd Bergmann <arnd@...db.de>, Christian Brauner <brauner@...nel.org>,
linux-mm@...ck.org, linux-arch@...r.kernel.org,
linux-kernel@...r.kernel.org, SeongJae Park <sj@...nel.org>,
Usama Arif <usamaarif642@...il.com>
Subject: Re: [RFC PATCH 0/5] add process_madvise() flags to modify behaviour
TL;DR - action item on below is I'll put together a proposed API (without
code) and cc people here when I've done so, so we can take a look at how
mctl() or mmadvise() or whatever we call it might look :)
On Thu, May 22, 2025 at 03:05:30PM +0200, David Hildenbrand wrote:
> On 21.05.25 19:39, Shakeel Butt wrote:
> > On Wed, May 21, 2025 at 05:49:15PM +0100, Lorenzo Stoakes wrote:
> > [...]
> > > >
> > > > Please let's first get consensus on this before starting the work.
> > >
> > > With respect Shakeel, I'll work on whatever I want, whenever I want.
> >
> > I fail to understand why you would respond like that.
>
> Relax guys ... :) Really nothing to be fighting about.
Agreed...!
>
> Lorenzo has a lot of energy to play with things, to see how it would look. I
> wish I would have that much energy, but I have no idea where it went ...
> (well, okay, I have a suspicion) :P
We have cats rather than kids which might explain a bit ;)
>
> At the same time, I hope (and assume :) ) that Lorenzo will get Usama
> involved in the development once we know what we want.
>
>
> To summarize my current view:
>
> 1) ebpf: most people are not fans of that, and I agree, at least
> for this purpose. If we were talking about making better *placement*
> decisions using ebpf, it would be a different story.
Yeah, I think overall we have a situation that is _bad_ in terms of
interface. We need something more fine-grained, but it's chicken and egg, and
there are genuine needs users have _now_.
So the whole discussion is about this.
>
> 2) prctl(): the unloved child, and I can understand why. Maybe now is
> the right time to stop adding new MM things that feel weird in there.
> Maybe we should already have done that with the KSM toggle (guess who
> was involved in that ;) ).
I won't belabour this point; I might get a reputation as prctl()'s biggest
hater otherwise :P
But one thing I will say is - systemd makes these things permanent (hey,
that KSM thing that breaks VMA merging is literally an option in systemd, I
wasn't aware :)
>
> 3) process_madvise(): I think it's an interesting extension, but
> probably we should just have something that applies to the whole
> address space naturally. At least my take for now.
Yeah that's the point of view I've come to, I mean the point was to try to
make this more generic in a way that _also_ got us improved control over
madvise() - sort of win/win.
But the 'default the process' thing is, as Shakeel and Liam rightly say,
just really out of band or doesn't quite fit this interface.
I may still put forward a patch to add flags for e.g. not breaking on gaps,
but as a separate thing. I still think that'd be valuable (but I'll
provide at least solid self tests to make the point).
>
> 4) new syscall: worth exploring how it would look. I'm especially
> interested in flag options (e.g., SET_DEFAULT_EXEC) and how we could
> make them only apply to selected controls.
Yeah, this is exactly what I want to play with.
>
>
> An API prototype of 4), not necessarily with the code yet, might be
> valuable.
ACK, though I really find it valuable to code things up because so often
you figure out what works by trying to make it work in practice.
This is how guard regions happened for instance, we had a ton of
conversation like this, loads of back and forth, nobody quite knew, then I
wrote some prototype code and it became apparent that this thing was
doable.
I never intend the RFC to be _the work_; rather, it's a 'proof of concept'
for discussion.
However, as we're still fairly vague on the API bit, I think in this case
it'll be valuable to do exactly what you suggest and simply prototype an
API around this without code.
So I'll do that and come up with something as a separate mail, cc'ing
people here.
>
> In general, the "always/madvise/never" policies are really horrible. We
> should instead be prioritizing who gets THPs -- and only disable them for
> selected workloads.
I couldn't agree more.
>
> Because splitting THPs up because a process is not allowed to use them,
> thereby increasing memory fragmentation, is really absolutely suboptimal.
Yes, there's a disconnect here between - a global resource (-ish :P) - and
process requirements.
>
> But we don't have anything better right now.
I feel like all this turmoil brings us closer to longer term solutions, if
perhaps via pain-inspired development (a new programming philosophy I
intend to trademark ;)
>
> So I would hope that we can at least turn the "always/VM_HUGEPAGE" into a
> "prioritize for largest (m)THPs possible" in a distant future.
I suspect we might still require some legacy settings so people don't
panic. Aren't uAPIs fun?
>
> If only changing the semantics of VM_NOHUGEPAGE to mean "deprioritize for
> THPs" couldn't break userfaultfd ... :( But maybe that can be worked around
> in the future somehow (e.g., when we detect userfaultfd usage, not sure,
> ...).
I hate how uffd is implemented (I like the concept of what it provides
though!) on multiple levels. It's crept into so much and the idea it's
putting restrictions on core stuff is just horrid.
I do feel we may want to introduce something new for this though, as
'never' or 'no' suddenly not being no but 'deprioritise' could be pretty
concerning for people.
But on the other hand, this is a resource for the kernel to manage as it
sees fit, so perhaps we shouldn't care...
>
> --
> Cheers,
>
> David / dhildenb
>
Thanks!