Message-ID: <20250519223307.3601786-1-usamaarif642@gmail.com>
Date: Mon, 19 May 2025 23:29:52 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Andrew Morton <akpm@...ux-foundation.org>,
david@...hat.com,
linux-mm@...ck.org
Cc: hannes@...xchg.org,
shakeel.butt@...ux.dev,
riel@...riel.com,
ziy@...dia.com,
laoar.shao@...il.com,
baolin.wang@...ux.alibaba.com,
lorenzo.stoakes@...cle.com,
Liam.Howlett@...cle.com,
npache@...hat.com,
ryan.roberts@....com,
vbabka@...e.cz,
jannh@...gle.com,
Arnd Bergmann <arnd@...db.de>,
linux-kernel@...r.kernel.org,
linux-doc@...r.kernel.org,
kernel-team@...a.com,
Usama Arif <usamaarif642@...il.com>
Subject: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY
This series allows changing the THP policy of a process according to the
value set in arg2. Every policy is inherited across fork+exec:
- PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
for the default VMA flags. It will also iterate through every VMA in the
process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
This effectively allows setting MADV_HUGEPAGE on the entire process.
In an environment where different types of workloads are run on the
same machine, this will allow workloads that benefit from always having
hugepages to do so, without regressing those that don't.
- PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
for the default VMA flags. It will also iterate through every VMA in the
process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
This effectively allows setting MADV_NOHUGEPAGE on the entire process.
In an environment where different types of workloads are run on the
same machine, this will allow workloads that benefit from having
hugepages on an madvise basis only to do so, without regressing those
that benefit from always having hugepages.
- PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
VM_NOHUGEPAGE in the default VMA flags of the process.
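The effect of the three policies on the default VMA flags can be modelled in
userspace as below. This is only a sketch of the description above, not the
code from the patches: the sketch_* names, the enum and the bit positions are
all made up for illustration (the kernel defines the real VM_* values), and
the per-VMA hugepage_madvise walk is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only; the real VM_* flags live in the kernel. */
#define SKETCH_VM_HUGEPAGE   (UINT64_C(1) << 0)
#define SKETCH_VM_NOHUGEPAGE (UINT64_C(1) << 1)

/* Hypothetical stand-ins for the arg2 values named in this cover letter. */
enum sketch_thp_policy {
	SKETCH_DEFAULT_MADV_HUGEPAGE,
	SKETCH_DEFAULT_MADV_NOHUGEPAGE,
	SKETCH_THP_POLICY_SYSTEM,
};

/* Return the new default flags for a process after applying a policy. */
static uint64_t sketch_apply_policy(uint64_t def_flags,
				    enum sketch_thp_policy policy)
{
	const uint64_t both = SKETCH_VM_HUGEPAGE | SKETCH_VM_NOHUGEPAGE;

	/* Every policy starts by clearing both bits ... */
	def_flags &= ~both;

	/* ... and then sets at most one of them. */
	switch (policy) {
	case SKETCH_DEFAULT_MADV_HUGEPAGE:
		def_flags |= SKETCH_VM_HUGEPAGE;
		break;
	case SKETCH_DEFAULT_MADV_NOHUGEPAGE:
		def_flags |= SKETCH_VM_NOHUGEPAGE;
		break;
	case SKETCH_THP_POLICY_SYSTEM:
		/* Fall back to the system-wide setting: leave both clear. */
		break;
	}

	/* The two bits are mutually exclusive in this model. */
	assert((def_flags & both) != both);
	return def_flags;
}
```

Unrelated bits in def_flags pass through untouched, which is the property the
v3 changelog relies on when it says the flags are set in the way done by
madvise().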
In hyperscalers, we have a single THP policy for the entire fleet.
We have different types of workloads (e.g. AI/compute/databases/etc)
running on a single server.
Some of these workloads will benefit from always getting THP at fault
(or collapsed by khugepaged), some of them will benefit by only getting
them at madvise.
This series is useful for two use cases:
1) global system policy = madvise, while we want some workloads to get THPs
at fault and by khugepaged: some processes (e.g. AI workloads) benefit
from getting THPs at fault (and collapsed by khugepaged). Other workloads
like databases will regress (either in performance, or because they are
completely memory bound and even a very slight increase in memory use
will cause them to OOM). These patches allow setting
prctl(PR_DEFAULT_MADV_HUGEPAGE) on the AI workloads only. (This is how
workloads are deployed in our (Meta's/Facebook) fleet at this moment.)
2) global system policy = always, while we want some workloads to get THPs
only on an madvise basis: same reasoning as 1). These patches allow
setting prctl(PR_DEFAULT_MADV_NOHUGEPAGE) on the database workloads.
(We hope this is us (Meta) in the near future: if a majority of
workloads show that they benefit from always, we flip the default host
setting to "always" across the fleet, and workloads that regress can opt
out and be "madvise". New services will then be tested with always by
default. "always" is also the default defconfig option upstream, so I would
imagine others face this as well.)
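Since the policy is inherited across fork+exec, a deployment could set it
from a small launcher before exec'ing the workload. The following is a rough
sketch only: sketch_set_thp_policy() and sketch_launch() are hypothetical
names, passing 0 for arg3..arg5 is an assumption, and the prctl constants are
taken from the installed uapi headers, so the call is compiled out on systems
whose headers do not carry this series.

```c
#include <errno.h>
#include <sys/prctl.h>
#include <unistd.h>

/*
 * Ask for MADV_HUGEPAGE behaviour on the whole process.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int sketch_set_thp_policy(void)
{
#if defined(PR_SET_THP_POLICY) && defined(PR_DEFAULT_MADV_HUGEPAGE)
	/* Assumption: the unused prctl arguments are passed as 0. */
	return prctl(PR_SET_THP_POLICY, PR_DEFAULT_MADV_HUGEPAGE, 0, 0, 0);
#else
	/* Headers predate this series: report that the call is unavailable. */
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Set the policy, then replace this process with the workload.
 * Only returns (with -1) if setting the policy or the exec failed.
 */
static int sketch_launch(char *const argv[])
{
	if (sketch_set_thp_policy() != 0)
		return -1;
	execvp(argv[0], argv);
	return -1;
}
```

A wrapper main() would call sketch_launch(&argv[1]) and report errno on
failure. The same shape applies to use case 2) with
PR_DEFAULT_MADV_NOHUGEPAGE as arg2.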
v2->v3: (Thanks Lorenzo for all the below feedback!)
v2: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
- no more flags2.
- no more MMF2_...
- renamed policy to PR_DEFAULT_MADV_(NO)HUGEPAGE
- mmap_write_lock_killable acquired in PR_GET_THP_POLICY
- mmap_write lock fixed in PR_SET_THP_POLICY
- mmap assert check in process_default_madv_hugepage
- check if hugepage_global_enabled is enabled in the call and account for s390
- set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in
the way done by madvise(). I believe VM merge will not be broken in
this way.
- process_default_madv_hugepage function that does for_each_vma and calls
hugepage_madvise.
v1->v2:
- change from modifying the THP decision making for the process, to modifying
VMA flags only. This prevents further complicating the logic used to
determine THP order (Thanks David!)
- change from using a prctl per policy change to just using PR_SET_THP_POLICY
and arg2 to set the policy. (Zi Yan)
- Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
- Add selftests and documentation.
Usama Arif (7):
mm: khugepaged: extract vm flag setting outside of hugepage_madvise
prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
prctl: introduce PR_THP_POLICY_SYSTEM for the process
selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
docs: transhuge: document process level THP controls
 Documentation/admin-guide/mm/transhuge.rst  |  42 +++
 include/linux/huge_mm.h                     |   2 +
 include/linux/mm.h                          |   2 +-
 include/linux/mm_types.h                    |   4 +-
 include/uapi/linux/prctl.h                  |   6 +
 kernel/sys.c                                |  53 ++++
 mm/huge_memory.c                            |  13 +
 mm/khugepaged.c                             |  26 +-
 tools/include/uapi/linux/prctl.h            |   6 +
 .../trace/beauty/include/uapi/linux/prctl.h |   6 +
 tools/testing/selftests/prctl/Makefile      |   2 +-
 tools/testing/selftests/prctl/thp_policy.c  | 286 ++++++++++++++++++
 12 files changed, 436 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/prctl/thp_policy.c
--
2.47.1