[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <54b9e8de-8722-4d60-a9ed-3af62467fe7c@lucifer.local>
Date: Thu, 15 May 2025 17:24:17 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Usama Arif <usamaarif642@...il.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, david@...hat.com,
linux-mm@...ck.org, hannes@...xchg.org, shakeel.butt@...ux.dev,
riel@...riel.com, ziy@...dia.com, laoar.shao@...il.com,
baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
npache@...hat.com, ryan.roberts@....com, linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY
On Thu, May 15, 2025 at 04:54:40PM +0100, Usama Arif wrote:
>
>
> On 15/05/2025 16:15, Lorenzo Stoakes wrote:
> > Thanks for coming back to me so quickly, appreciated :)
> >
> > I am reacting in a 'WTF' way here, but it's in proportion to the (at least
> > perceived) magnitude of this change. We really need to be sure this is
> > right.
> >
>
> Lol I had to rewrite my replies a few times to tone them down.
> Hopefully I don't come across as aggressive :)
Haha this is literally totally normal. And something I do a lot (I thik we
all do :). Ask me about the time a friend called immediately prior to me
deleting a sentence that I had realised came across badly, then forgetting
to do so, and then sending the mail... ;)
No it's fine, equally apologies if I seem aggressive, it's not intended, I
am just passionate about _us getting it right_ - and I think the healthiest
approach is to be as direct as possible, which maintaining professionalism
(not withstanding star wars references and comments on how prctl() is
simply the dictionary definition of deep, unmitigated evil).
But it's vital in kernel work to be able to be absolutely robust, in both
directions, while maintaining civility :)
>
>
> > On Thu, May 15, 2025 at 03:50:47PM +0100, Usama Arif wrote:
> >>
> >>
> >> On 15/05/2025 14:55, Lorenzo Stoakes wrote:
> >>> On Thu, May 15, 2025 at 02:33:29PM +0100, Usama Arif wrote:
> >>>> This allows to change the THP policy of a process, according to the value
> >>>> set in arg2, all of which will be inherited during fork+exec:
> >>>
> >>> This is pretty confusing.
> >>>
> >>> It should be something like 'add a new prctl() option that allows...' etc.
> >>>
> >>>> - PR_THP_POLICY_DEFAULT_HUGE: This will set the MMF2_THP_VMA_DEFAULT_HUGE
> >>>> process flag which changes the default of new VMAs to be VM_HUGEPAGE. The
> >>>> call also modifies all existing VMAs that are not VM_NOHUGEPAGE
> >>>> to be VM_HUGEPAGE.
> >>>
> >>> This is referring to implementation detail that doesn't matter for an overview,
> >>> just add a summary here e.g.
> >>>
> >>> PR_THP_POLICY_DEFAULT_HUGE - set VM_HUGEPAGE flag in all VMAs by default,
> >>> including after fork/exec, ignoring global policy.
> >>>
> >>> PR_THP_POLICY_DEFAULT_NOHUGE - clear VM_HUGEPAGE flag in all VMAs by default,
> >>> including after fork/exec, ignoring global policy.
> >>>
> >>> PR_THP_POLICY_DEFAULT_SYSTEM - Eliminate any policy set above.
> >>
> >> Hi Lorenzo,
> >>
> >> Thanks for the review. I will make the cover letter clearer in the next revision.
> >
> > The next version should emphatically be an RFC also, please. Your cover letter
> > should mention you're fundamentally changing mm_struct and VMA logic, and
> > explain why your use cae is so important that that is justified.
> >
>
> Thanks, will make it RFC and add that I am making changes to mm_struct and VMA logic.
Thanks! This isn't sort of 'another way of saying no' by the way, it's
literally because, as is obvious already, we all need to talk about this :)
Once we stabilise on a way forwards then we're good to un-RFC.
>
> >>
> >>>
> >>>> This allows systems where the global policy is set to "madvise"
> >>>> to effectively have THPs always for the process. In an environment
> >>>> where different types of workloads are stacked on the same machine
> >>>> whose global policy is set to "madvise", this will allow workloads
> >>>> that benefit from always having hugepages to do so, without regressing
> >>>> those that don't.
> >>>
> >>> So does this just ignore and override the global policy? I'm not sure I'm
> >>> comfortable with that.
> >>
> >> No. The decision making of when and what order THPs are allowed is not
> >> changed, i.e. there are no changes in __thp_vma_allowable_orders and
> >> thp_vma_allowable_orders. David has the same concern as you and this
> >> current series is implementing what David suggested in
> >> https://lore.kernel.org/all/3f7ba97d-04d5-4ea4-9f08-6ec3584e0d4c@redhat.com/
> >>
> >> It will change the existing VMA (NO)HUGE flags according to
> >> the prctl. For e.g. doing PR_THP_POLICY_DEFAULT_HUGE will not give
> >> a THP when global policy is never.
> >
> > Umm...
> >
> > + case PR_SET_THP_POLICY:
> > + if (arg3 || arg4 || arg5)
> > + return -EINVAL;
> > + if (mmap_write_lock_killable(me->mm))
> > + return -EINTR;
> > + switch (arg2) {
> > + case PR_THP_POLICY_DEFAULT_HUGE:
> > + set_bit(MMF2_THP_VMA_DEFAULT_HUGE, &me->mm->flags2);
> > + process_vmas_thp_default_huge(me->mm);
> > + break;
> > + default:
> >
> >
> > Where's the check against never? You're unconditionally setting VM_HUGEPAGE?
>
> So this was from the discussion with David. My initial implementation in v1,
> messed with the policy evaluation in thp_vma_allowable_orders and __thp_vma_allowable_orders.
>
> The whole point of doing it this way is that you dont mess with the policy evaluation.
>
> hugepage_global_always and hugepage_global_enabled will still evaluate to false
> when never is set and you will not get a hugepage. But more on it below.
Yeah, I mean I kind of hate this, but madvise(..., VM_[NO]HUGEPAGE) already
does this, regardless of whether the VMA is in any way suited to THP or
global policies say otherwise.
So maybe I can put my criticisms of this aside and wonder out loud:
1. We want to make this behave as if we were just setting madvise(..,
MADV_[NO]HUGEPAGE) everywhere.
2. If so, do we then take the semantics to be, in the case of ...HUGEPAGE,
'these VMAs should be considered by khugepaged for collapse if and only
if global policy is set to one of [madvise, always]'.
3. If so, do we also then take the semantics to be, in the case of
...NOHUGEPAGE, 'these VMAs should NOT be considered by khugepaged for
collapse even if global policy is set to [always]'.
I wonder, since this is at mm_struct granularity, why do we even bother
applying this to VMAs?
I guess the reason we are doing this at VMA level rather than mm level is
that we want the user to have the ability to override this per-VMA.
But couldn't we just more efficiently do this by having the khugepaged code
check both mm->flags AND vma->vm_flags, and have some function handle the
override _explcitly_ there? Is there some reason we're not doing it that
way?
Forgive me if this was already discussed.
>
> >
> > You're relying on VM_HUGEPAGE being ignored in this instance? But you're still:
> >
> > 1. Setting VM_HUGEPAGE everywhere (and breaking VMA merging everywhere).
> >
> > 2. Setting MMF2_THP_VMA_DEFAULT_HUGE and making it so PR_GET_THP_POLICY says it
> > has a policy of default huge even if policy is set to never?
> >
> > I'm not ok with that. I'd much rather we do the never check here...
> >
>
> I am ok with that. I can add a check over here that wraps this in:
> if (hugepage_global_enabled())
> ...
>
> > Also see hugepage_madvise(). There's arch-specific code that overrides
> > that, and you're now bypassing that (yes it's for one arch of course but
> > it's still a thing)
> >
>
> Thanks, I will put
> if (mm_has_pgste(vma->vm_mm))
> return 0;
> at the start.
Well, hang on, let's maybe try to reuse this code if we can :) we
definitely don't want to duplicate this.
>
> >>
> >>>
> >>> What about if the the policy is 'never'? Does this override that? That seems
> >>> completely wrong.
> >>
> >> No, it won't override it. hugepage_global_always and hugepage_global_enabled
> >> will still evaluate to false and you wont get a hugepage no matter what prctl
> >> is set.
> >
> > Ack ok I see as above, you're relying on VM_HUGEPAGE enforcing htis.
> >
> > You really need to put stuff like this in the cover letter though!!
> >
>
> Sure will do in the next revision, Thanks.
Thanks!
> >>
> >>>
> >>>> - PR_THP_POLICY_DEFAULT_NOHUGE: This will set the MMF2_THP_VMA_DEFAULT_NOHUGE
> >>>> process flag which changes the default of new VMAs to be VM_NOHUGEPAGE.
> >>>> The call also modifies all existing VMAs that are not VM_HUGEPAGE
> >>>> to be VM_NOHUGEPAGE.
> >>>> This allows systems where the global policy is set to "always"
> >>>> to effectively have THPs on madvise only for the process. In an
> >>>> environment where different types of workloads are stacked on the
> >>>> same machine whose global policy is set to "always", this will allow
> >>>> workloads that benefit from having hugepages on an madvise basis only
> >>>> to do so, without regressing those that benefit from having hugepages
> >>>> always.
> >>>
> >>> Wait, so 'no huge' means 'madvise'? What? This is confusing.
> >>
> >>
> >> I probably made the cover letter confusing :) or maybe need to rename the flags.
> >>
> >> This flag work as follows:
> >>
> >> a) Changes the default flag of new VMAs to be VM_NOHUGEPAGE
> >>
> >> b) Modifies all existing VMAs that are not VM_HUGEPAGE to be VM_NOHUGEPAGE
> >>
> >> c) Is inherited during fork+exec
> >>
> >> I think maybe I should add VMA to the flag names and rename the flags to
> >> PR_THP_POLICY_DEFAULT_VMA_(NO)HUGE ??
> >
> > Please no :) 'VMA' is implicit re: mappings. If you're touching memory
> > mappings you're necessarily touching VMAs.
> >
> > I know some prctl() (a pathway to many abilities some consider to be
> > unnatural) uses 'VMA' in some of the endpoints but generally when referring
> > to specific VMAs no?
> >
> > These namesa are already kinda horrible (yes naming is hard, for everyone,
> > ask me about MADV_POISON/REMEDY) but I think something like:
> >
> > PR_DEFAULT_MADV_HUGEPAGE
> > PR_DEFAULT_MADV_NOHUGEPAGE
> >
> > -ish :)
> >
>
> Sure, happy with that, Thanks.
Thanks!
> >>
> >>>
> >>>> - PR_THP_POLICY_DEFAULT_SYSTEM: This will clear the MMF2_THP_VMA_DEFAULT_HUGE
> >>>> and MMF2_THP_VMA_DEFAULT_NOHUGE process flags.
> >>>>
> >>>> These patches are required in rolling out hugepages in hyperscaler
> >>>> configurations for workloads that benefit from them, where workloads are
> >>>> stacked anda single THP global policy is likely to be used across the entire
> >>>> fleet, and prctl will help override it.
> >>>
> >>> I don't understand this justification whatsoever. What does 'stacked' mean? And
> >>> you're not justifying why you'd override the policy?
> >>
> >> By stacked I just meant different types of workloads running on the same machine.
> >> Lets say we have a single server whose global policy is set to madvise.
> >> You can have a container on that server running some database workload that best
> >> works with madvise.
> >> You can have another container on that same server running some AI workload that would
> >> benefit from having VM_HUGEPAGE set on all new VMAs. We can use prctl
> >> PR_THP_POLICY_DEFAULT_HUGE to get VM_HUGEPAGE set by default on all new VMAs for that
> >> container.
> >>
> >>>
> >>> This series has no actual justificaiton here at all? You really need to provide one.
> >>>
> >>
> >> There was a discussion on the usecases in
> >> https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/
> >>
> >> I tried (and I guess failed :)) to summarize the justification from that thread.
> >
> > It's fine, I have most definitely not been as clear as I could be in series
> > too :>) just need to add a bigger summary.
> >
> > Don't afraid to waffle on... (I know I am not... ;)
> >
> >>
> >> I will try and rephrase it here.
> >>
> >> In hyperscalers, we have a single THP policy for the entire fleet.
> >> We have different types of workloads (e.g. AI/compute/databases/etc)
> >> running on a single server (this is what I meant by 'stacked').
> >> Some of these workloads will benefit from always getting THP at fault (or collapsed
> >> by khugepaged), some of them will benefit by only getting them at madvise.
> >>
> >> This series is useful for 2 usecases:
> >>
> >> 1) global system policy = madvise, while we want some workloads to get THPs
> >> at fault and by khugepaged :- some processes (e.g. AI workloads) benefits from getting
> >> THPs at fault (and collapsed by khugepaged). Other workloads like databases will incur
> >> regression (either a performance regression or they are completely memory bound and
> >> even a very slight increase in memory will cause them to OOM). So what these patches
> >> will do is allow setting prctl(PR_THP_POLICY_DEFAULT_HUGE) on the AI workloads,
> >> (This is how workloads are deployed in our (Meta's/Facebook) fleet at this moment).
> >>
> >> 2) global system policy = always, while we want some workloads to get THPs
> >> only on madvise basis :- Same reason as 1). What these patches
> >> will do is allow setting prctl(PR_THP_POLICY_DEFAULT_NOHUGE) on the database
> >> workloads.
> >> (We hope this is us (Meta) in the near future, if a majority of workloads show that they
> >> benefit from always, we flip the default host setting to "always" across the fleet and
> >> workloads that regress can opt-out and be "madvise".
> >> New services developed will then be tested with always by default. "always" is also the
> >> default defconfig option upstream, so I would imagine this is faced by others as well.)
> >
> > Right, but I'm not sure you're explaining why prctl(), one of the most cursed,
> > neglected and frankly evil (maybe exaggerating :P) APIs in the kernel is the way
> > to do this?
> >
> > You do need to summarise why the suggested idea re: BPF, or cgroups, or whatnot
> > is _totally unworkable_.
> >
> > And why not process_madvise() with MADV_HUGEPAGE?
> >
> > I'm also not sure fork/exec is a great situation to have, because are you sure
> > the workloads stay the same across all fork/execs that you're now propagating?
> >
> > It feels like this should be a cgroup thing, really.
> >
>
> So I actually dont mind the cgroup implementation (that was actually my first
> prototype and after that I saw there was someone who had posted it earlier).
> It was shot down because it wont be hierarchical and doesnt solve it when
> its not being done in a cgroup.
>
> A large proportion of the thread in v1 was discussion with David, Johannes, Zi and
> Yafang (the bpf THP policy author) on different ways of doing this.
Ack yes. I think this whole discussion underlines why it's so important for
you to _summarise_ that discussion. I'm repeating myself from elsewhere but
the cover letter needs something like:
We considered the following alternatives, and none of them were workable:
[list of alternatives]
And ack on cgroups maybe not being quite right due to this not being
strictly hierarchical.
>
> >>
> >> Hope this makes the justification for the patches clearer :)
> >
> > Sure, please add this kind of thing to the cover letter to get fewer 'wtf'
> > reactions :)
> >
> > You're doing something really _big_ and _opinonated_ here though, that's
> > basically fundamentally changing core stuff, so an extended discussion of why
> > you feel it's so important, why other approaches are not workable, why the
> > Sauron-spawned Mordor dwelling prctl() API is the way to go, etc.
> >
> >>
> >>>>
> >>>> v1->v2:
> >>>
> >>> Where was the v1? Is it [0]?
> >>>
> >>> This seems like a massive change compared to that series?
> >>>
> >>> You've renamed it and not referenced the old series, please make sure you link
> >>> it or somehow let somebody see what this is against, because it makes review
> >>> difficult.
> >>>
> >>
> >> Yes its the patch you linked below. Sorry should have linked it in this series.
> >> Its a big change, but it was basically incorporating all feedback from David,
> >> while trying to achieve a similar goal. Will link it in future series.
> >
> > Yeah, again, this should have been an RFC on that basis :)
> >
> >>
> >>> [0]: https://lore.kernel.org/linux-mm/20250507141132.2773275-1-usamaarif642@gmail.com/
> >>>
> >>>> - change from modifying the THP decision making for the process, to modifying
> >>>> VMA flags only. This prevents further complicating the logic used to
> >>>> determine THP order (Thanks David!)
> >>>> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
> >>>> and arg2 to set the policy. (Zi Yan)
> >>>> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
> >>>> - Add selftests and documentation.
> >>>>
> >>>> Usama Arif (6):
> >>>> prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
> >>>> prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
> >>>> prctl: introduce PR_THP_POLICY_SYSTEM for the process
> >>>> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE
> >>>> selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
> >>>> docs: transhuge: document process level THP controls
> >>>>
> >>>> Documentation/admin-guide/mm/transhuge.rst | 40 +++
> >>>> include/linux/huge_mm.h | 4 +
> >>>> include/linux/mm_types.h | 14 +
> >>>> include/uapi/linux/prctl.h | 6 +
> >>>> kernel/fork.c | 1 +
> >>>> kernel/sys.c | 35 +++
> >>>> mm/huge_memory.c | 56 ++++
> >>>> mm/vma.c | 2 +
> >>>> tools/include/uapi/linux/prctl.h | 6 +
> >>>> .../trace/beauty/include/uapi/linux/prctl.h | 6 +
> >>>> tools/testing/selftests/prctl/Makefile | 2 +-
> >>>> tools/testing/selftests/prctl/thp_policy.c | 286 ++++++++++++++++++
> >>>> 12 files changed, 457 insertions(+), 1 deletion(-)
> >>>> create mode 100644 tools/testing/selftests/prctl/thp_policy.c
> >>>>
> >>>> --
> >>>> 2.47.1
> >>>>
> >>
>
Powered by blists - more mailing lists