lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5e4c107f-9db8-4212-99b6-a490406fec77@gmail.com>
Date: Thu, 15 May 2025 15:50:47 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, david@...hat.com,
 linux-mm@...ck.org, hannes@...xchg.org, shakeel.butt@...ux.dev,
 riel@...riel.com, ziy@...dia.com, laoar.shao@...il.com,
 baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com, npache@...hat.com,
 ryan.roberts@....com, linux-kernel@...r.kernel.org,
 linux-doc@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH 0/6] prctl: introduce PR_SET/GET_THP_POLICY



On 15/05/2025 14:55, Lorenzo Stoakes wrote:
> On Thu, May 15, 2025 at 02:33:29PM +0100, Usama Arif wrote:
>> This allows to change the THP policy of a process, according to the value
>> set in arg2, all of which will be inherited during fork+exec:
> 
> This is pretty confusing.
> 
> It should be something like 'add a new prctl() option that allows...' etc.
> 
>> - PR_THP_POLICY_DEFAULT_HUGE: This will set the MMF2_THP_VMA_DEFAULT_HUGE
>>   process flag which changes the default of new VMAs to be VM_HUGEPAGE. The
>>   call also modifies all existing VMAs that are not VM_NOHUGEPAGE
>>   to be VM_HUGEPAGE.
> 
> This is referring to implementation detail that doesn't matter for an overview,
> just add a summary here e.g.
> 
> PR_THP_POLICY_DEFAULT_HUGE - set VM_HUGEPAGE flag in all VMAs by default,
> including after fork/exec, ignoring global policy.
> 
> PR_THP_POLICY_DEFAULT_NOHUGE - clear VM_HUGEPAGE flag in all VMAs by default,
> including after fork/exec, ignoring global policy.
> 
> PR_THP_POLICY_DEFAULT_SYSTEM - Eliminate any policy set above.

Hi Lorenzo,

Thanks for the review. I will make the cover letter clearer in the next revision.

> 
>>   This allows systems where the global policy is set to "madvise"
>>   to effectively have THPs always for the process. In an environment
>>   where different types of workloads are stacked on the same machine
>>   whose global policy is set to "madvise", this will allow workloads
>>   that benefit from always having hugepages to do so, without regressing
>>   those that don't.
> 
> So does this just ignore and override the global policy? I'm not sure I'm
> comfortable with that.

No. The decision making of when and what order THPs are allowed is not
changed, i.e. there are no changes in __thp_vma_allowable_orders and
thp_vma_allowable_orders. David has the same concern as you and this
current series is implementing what David suggested in
https://lore.kernel.org/all/3f7ba97d-04d5-4ea4-9f08-6ec3584e0d4c@redhat.com/

It will change the existing VMA (NO)HUGE flags according to
the prctl. For e.g. doing PR_THP_POLICY_DEFAULT_HUGE will not give
a THP when global policy is never. 

> 
> What about if the the policy is 'never'? Does this override that? That seems
> completely wrong.

No, it won't override it. hugepage_global_always and hugepage_global_enabled
will still evaluate to false and you wont get a hugepage no matter what prctl
is set.

> 
>> - PR_THP_POLICY_DEFAULT_NOHUGE: This will set the MMF2_THP_VMA_DEFAULT_NOHUGE
>>   process flag which changes the default of new VMAs to be VM_NOHUGEPAGE.
>>   The call also modifies all existing VMAs that are not VM_HUGEPAGE
>>   to be VM_NOHUGEPAGE.
>>   This allows systems where the global policy is set to "always"
>>   to effectively have THPs on madvise only for the process. In an
>>   environment where different types of workloads are stacked on the
>>   same machine whose global policy is set to "always", this will allow
>>   workloads that benefit from having hugepages on an madvise basis only
>>   to do so, without regressing those that benefit from having hugepages
>>   always.
> 
> Wait, so 'no huge' means 'madvise'? What? This is confusing.


I probably made the cover letter confusing :) or maybe need to rename the flags.

This flag work as follows: 

a) Changes the default flag of new VMAs to be VM_NOHUGEPAGE

b) Modifies all existing VMAs that are not VM_HUGEPAGE to be VM_NOHUGEPAGE

c) Is inherited during fork+exec

I think maybe I should add VMA to the flag names and rename the flags to
PR_THP_POLICY_DEFAULT_VMA_(NO)HUGE ??

> 
>> - PR_THP_POLICY_DEFAULT_SYSTEM: This will clear the MMF2_THP_VMA_DEFAULT_HUGE
>>   and MMF2_THP_VMA_DEFAULT_NOHUGE process flags.
>>
>> These patches are required in rolling out hugepages in hyperscaler
>> configurations for workloads that benefit from them, where workloads are
>> stacked anda single THP global policy is likely to be used across the entire
>> fleet, and prctl will help override it.
> 
> I don't understand this justification whatsoever. What does 'stacked' mean? And
> you're not justifying why you'd override the policy?

By stacked I just meant different types of workloads running on the same machine.
Lets say we have a single server whose global policy is set to madvise.
You can have a container on that server running some database workload that best
works with madvise.
You can have another container on that same server running some AI workload that would
benefit from having VM_HUGEPAGE set on all new VMAs. We can use prctl
PR_THP_POLICY_DEFAULT_HUGE to get VM_HUGEPAGE set by default on all new VMAs for that
container.

> 
> This series has no actual justificaiton here at all? You really need to provide one.
> 

There was a discussion on the usecases in
https://lore.kernel.org/all/13b68fa0-8755-43d8-8504-d181c2d46134@gmail.com/

I tried (and I guess failed :)) to summarize the justification from that thread.

I will try and rephrase it here.

In hyperscalers, we have a single THP policy for the entire fleet.
We have different types of workloads (e.g. AI/compute/databases/etc)
running on a single server (this is what I meant by 'stacked').
Some of these workloads will benefit from always getting THP at fault (or collapsed
by khugepaged), some of them will benefit by only getting them at madvise.

This series is useful for 2 usecases:

1) global system policy = madvise, while we want some workloads to get THPs
at fault and by khugepaged :- some processes (e.g. AI workloads) benefits from getting
THPs at fault (and collapsed by khugepaged). Other workloads like databases will incur
regression (either a performance regression or they are completely memory bound and
even a very slight increase in memory will cause them to OOM). So what these patches
will do is allow setting prctl(PR_THP_POLICY_DEFAULT_HUGE) on the AI workloads,
(This is how workloads are deployed in our (Meta's/Facebook) fleet at this moment).

2) global system policy = always, while we want some workloads to get THPs
only on madvise basis :- Same reason as 1). What these patches
will do is allow setting prctl(PR_THP_POLICY_DEFAULT_NOHUGE) on the database
workloads.
(We hope this is us (Meta) in the near future, if a majority of workloads show that they
benefit from always, we flip the default host setting to "always" across the fleet and
workloads that regress can opt-out and be "madvise".
New services developed will then be tested with always by default. "always" is also the
default defconfig option upstream, so I would imagine this is faced by others as well.)

Hope this makes the justification for the patches clearer :)

>>
>> v1->v2:
> 
> Where was the v1? Is it [0]?
> 
> This seems like a massive change compared to that series?
> 
> You've renamed it and not referenced the old series, please make sure you link
> it or somehow let somebody see what this is against, because it makes review
> difficult.
> 

Yes its the patch you linked below. Sorry should have linked it in this series.
Its a big change, but it was basically incorporating all feedback from David,
while trying to achieve a similar goal. Will link it in future series.

> [0]: https://lore.kernel.org/linux-mm/20250507141132.2773275-1-usamaarif642@gmail.com/
> 
>> - change from modifying the THP decision making for the process, to modifying
>>   VMA flags only. This prevents further complicating the logic used to
>>   determine THP order (Thanks David!)
>> - change from using a prctl per policy change to just using PR_SET_THP_POLICY
>>   and arg2 to set the policy. (Zi Yan)
>> - Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
>> - Add selftests and documentation.
>>
>> Usama Arif (6):
>>   prctl: introduce PR_THP_POLICY_DEFAULT_HUGE for the process
>>   prctl: introduce PR_THP_POLICY_DEFAULT_NOHUGE for the process
>>   prctl: introduce PR_THP_POLICY_SYSTEM for the process
>>   selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_NOHUGE
>>   selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
>>   docs: transhuge: document process level THP controls
>>
>>  Documentation/admin-guide/mm/transhuge.rst    |  40 +++
>>  include/linux/huge_mm.h                       |   4 +
>>  include/linux/mm_types.h                      |  14 +
>>  include/uapi/linux/prctl.h                    |   6 +
>>  kernel/fork.c                                 |   1 +
>>  kernel/sys.c                                  |  35 +++
>>  mm/huge_memory.c                              |  56 ++++
>>  mm/vma.c                                      |   2 +
>>  tools/include/uapi/linux/prctl.h              |   6 +
>>  .../trace/beauty/include/uapi/linux/prctl.h   |   6 +
>>  tools/testing/selftests/prctl/Makefile        |   2 +-
>>  tools/testing/selftests/prctl/thp_policy.c    | 286 ++++++++++++++++++
>>  12 files changed, 457 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/testing/selftests/prctl/thp_policy.c
>>
>> --
>> 2.47.1
>>


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ