[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1a175a2c-8afa-4995-9dec-e3e7cf1efc72@lucifer.local>
Date: Fri, 16 May 2025 11:57:46 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: David Hildenbrand <david@...hat.com>
Cc: Usama Arif <usamaarif642@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
ziy@...dia.com, laoar.shao@...il.com, baolin.wang@...ux.alibaba.com,
Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
kernel-team@...a.com
Subject: Re: [PATCH 1/6] prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the
process
On Fri, May 16, 2025 at 09:45:17AM +0200, David Hildenbrand wrote:
> On 15.05.25 22:35, Lorenzo Stoakes wrote:
> > On Thu, May 15, 2025 at 09:12:13PM +0200, David Hildenbrand wrote:
> > > On 15.05.25 20:08, Lorenzo Stoakes wrote:
> > > > On Thu, May 15, 2025 at 06:11:55PM +0200, David Hildenbrand wrote:
> > > > > > > > So if you're not overriding VM_NOHUGEPAGE, the whole point of this exercise
> > > > > > > > is to override global 'never'?
> > > > > > > >
> > > > > > >
> > > > > > > Again, I am not overriding never.
> > > > > > >
> > > > > > > hugepage_global_always and hugepage_global_enabled will evaluate to false
> > > > > > > and you will not get a hugepage.
> > > > > >
> > > > > > Yeah, again ack, but I kind of hate that we set VM_HUGEPAGE everywhere even
> > > > > > if the policy is never.
> > > > >
> > > > > I think it should behave just as if someone does manually an madvise(). So
> > > > > whatever we do here during an madvise, we should try to do the same thing
> > > > > here.
> > > >
> > > > Ack I agree with this.
> > > >
> > > > It actually simplifies things a LOT to view it this way - we're saying 'by
> > > > default apply madvise(...) to new VMAs'.
> > > >
> > > > Hm I wonder if we could have a more generic version of this...
> > > >
> > > > Note though that we're not _quite_ doing this.
> > > >
> > > > So in hugepage_madvise():
> > > >
> > > > int hugepage_madvise(struct vm_area_struct *vma,
> > > > unsigned long *vm_flags, int advice)
> > > > {
> > > > ...
> > > >
> > > > switch (advice) {
> > > > case MADV_HUGEPAGE:
> > > > *vm_flags &= ~VM_NOHUGEPAGE;
> > > > *vm_flags |= VM_HUGEPAGE;
> > > >
> > > > ...
> > > >
> > > > break;
> > > >
> > > > ...
> > > > }
> > > >
> > > > ...
> > > > }
> > > >
> > > > So here we're actually clearing VM_NOHUGEPAGE and overriding it, but in the
> > > > proposed code we're not.
> > >
> > > Yeah, I think I suggested that, but probably we should just do exactly what
> > > madvise() does.
> >
> > Yes, agreed.
> >
> > Usama - do you have any issue with us switching to how madvise() does it?
> >
> > >
> > > >
> > > > So we're back into confusing territory again :)
> > > >
> > > > I wonder if we could...
> > > >
> > > > 1. Add an MADV_xxx that mimics the desired behaviour here.
> > > >
> > > > 2. Add a generic 'madvise() by default' thing at a process level?
> > > >
> > > > Is this crazy?
> > >
> > > I think that's what I had in mind, just a bit twisted.
> > >
> > > What could work is
> > >
> > > 1) prctl to set the default
> > >
> > > 2) madvise() to adjust all existing VMAs
> > >
> > >
> > > We might have to teach 2) to ignore non-compatible VMAs / holes. Maybe not,
> > > worth an investigation.
> >
> > Yeah, I think it'd _probably_ be ok except on s390 (which can fail, and so
> > we'd have to be able to say - skip on error, carry on).
> >
> > We'll just get an -ENOMEM at the end for the gaps (god how I hate
> > that). Otherwise I don't think MADV_HUGEPAGE actually is really that
> > restrictive.
> >
> > That would simplify :)
> >
> > But I still so hate using prctl()... this might be one of those cases where
> > we simply figure out we have no other choice.
> > > But when you put it as simply as this maybe it's not so bad. With the
> > flags2 gone by fixing this stupid 32-bit limit it's less awful.
> >
> > Perhaps worth seeing what an improved RFC of this series looks like with
> > all the various bits fixed to give an idea.
>
> Yes.
>
> >
> > But you do then wonder if we could make this _generic_ for _any_ madvise(),
> > and how _that_ would look.
> >
> > But perhaps that's insane because many VMAs would simply not be suited to
> > having certain madvise flags set hmm.
>
> Same thinking. I think this is rather special.
>
> In a perfect world not even the madvise(*HUGEPAGE) would exist.
>
> But here we are ... 14 years (wow!) after
This feels like the tale of the kernel :)
>
> commit 0af4e98b6b095c74588af04872f83d333c958c32
> Author: Andrea Arcangeli <aarcange@...hat.com>
> Date: Thu Jan 13 15:46:55 2011 -0800
>
> thp: madvise(MADV_HUGEPAGE)
>
>
>
> (I'm surprised you don't complain about madvise(). IMHO, prctl() is even a
> better interface than catch-all madvise(); a syscall where an advise might
> not be an advise. I saw some funny rants about MADV_DONTNEED on reddit at
> some point ... :) mctrl() would have been clearer, at least for me :D )
No I prefer madvise() massively, I mean yes in a way it's hacky, but prctl() is
the ultimate hack.
So as an interface it's actually kinda fine like 'virtual range X-Y, advise ZZZ
about it'.
(as for naming haha maybe you have a point actually, the 'advice' bit
has always been strange... :)
But.
The actual set of advice is bloody hideous and confusing and I've seen
first hand userspace people get very, very confused about what each thing
does. The naming is horrible, overloaded, overwrought.
And the weird behaviour with gaps is also horrible...
So there's lots to moan about there, but saying prctl() is somehow superior
to the true evil of prctl() is far too far :P
I mean take a look at https://man7.org/linux/man-pages/man2/prctl.2.html
Things like:
PR_SET_MM
PR_SET_VMA
Are super worrying...
>
> --
> Cheers,
>
> David / dhildenb
>
I wonder if we just need a new syscall overall... *puts thinking cap on*.
A galaxy brained idea may be coming to me good sir :P
Watch this space...
Powered by blists - more mailing lists