Message-ID: <20250519223307.3601786-1-usamaarif642@gmail.com>
Date: Mon, 19 May 2025 23:29:52 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Andrew Morton <akpm@...ux-foundation.org>,
	david@...hat.com,
	linux-mm@...ck.org
Cc: hannes@...xchg.org,
	shakeel.butt@...ux.dev,
	riel@...riel.com,
	ziy@...dia.com,
	laoar.shao@...il.com,
	baolin.wang@...ux.alibaba.com,
	lorenzo.stoakes@...cle.com,
	Liam.Howlett@...cle.com,
	npache@...hat.com,
	ryan.roberts@....com,
	vbabka@...e.cz,
	jannh@...gle.com,
	Arnd Bergmann <arnd@...db.de>,
	linux-kernel@...r.kernel.org,
	linux-doc@...r.kernel.org,
	kernel-team@...a.com,
	Usama Arif <usamaarif642@...il.com>
Subject: [PATCH v3 0/7] prctl: introduce PR_SET/GET_THP_POLICY

This series allows changing the THP policy of a process according to the
value set in arg2. Each policy is inherited across fork+exec:
- PR_DEFAULT_MADV_HUGEPAGE: This will set VM_HUGEPAGE and clear VM_NOHUGEPAGE
  for the default VMA flags. It will also iterate through every VMA in the
  process and call hugepage_madvise on it, with MADV_HUGEPAGE policy.
  This effectively allows setting MADV_HUGEPAGE on the entire process.
  In an environment where different types of workloads are run on the
  same machine, this will allow workloads that benefit from always having
  hugepages to do so, without regressing those that don't.
- PR_DEFAULT_MADV_NOHUGEPAGE: This will set VM_NOHUGEPAGE and clear VM_HUGEPAGE
  for the default VMA flags. It will also iterate through every VMA in the
  process and call hugepage_madvise on it, with MADV_NOHUGEPAGE policy.
  This effectively allows setting MADV_NOHUGEPAGE on the entire process.
  In an environment where different types of workloads are run on the
  same machine, this will allow workloads that benefit from having
  hugepages on an madvise basis only to do so, without regressing those
  that benefit from having hugepages always.
- PR_THP_POLICY_SYSTEM: This will reset (clear) both VM_HUGEPAGE and
  VM_NOHUGEPAGE in the default VMA flags of the process.
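
A minimal userspace sketch of how the three policies above might be
selected. PR_SET_THP_POLICY and the policy constants exist only in the
uapi header added by this series, so the code compiles to a stub that
fails with ENOSYS when they are missing. The helper name set_thp_policy,
the string labels, and passing 0 for arg3..arg5 are assumptions made for
illustration, not something the series specifies here.

```c
#include <errno.h>
#include <string.h>
#include <sys/prctl.h>

/*
 * Apply one of the proposed THP policies to the calling process.
 * "hugepage", "nohugepage" and "system" are hypothetical labels.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_thp_policy(const char *name)
{
#if defined(PR_SET_THP_POLICY) && defined(PR_DEFAULT_MADV_HUGEPAGE) && \
	defined(PR_DEFAULT_MADV_NOHUGEPAGE) && defined(PR_THP_POLICY_SYSTEM)
	unsigned long policy;

	if (strcmp(name, "hugepage") == 0)
		policy = PR_DEFAULT_MADV_HUGEPAGE;
	else if (strcmp(name, "nohugepage") == 0)
		policy = PR_DEFAULT_MADV_NOHUGEPAGE;
	else if (strcmp(name, "system") == 0)
		policy = PR_THP_POLICY_SYSTEM;
	else {
		errno = EINVAL;
		return -1;
	}
	/* Assumption: unused prctl arguments are passed as 0. */
	return prctl(PR_SET_THP_POLICY, policy, 0, 0, 0);
#else
	/* Headers without this series: the prctl is not available. */
	(void)name;
	errno = ENOSYS;
	return -1;
#endif
}
```

Since the policy is inherited across fork+exec, a small launcher could
call such a helper and then execvp() the real workload, leaving the
workload binary unmodified.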

In hyperscalers, we have a single THP policy for the entire fleet.
We have different types of workloads (e.g. AI/compute/databases/etc)
running on a single server.
Some of these workloads will benefit from always getting THP at fault
(or collapsed by khugepaged), some of them will benefit by only getting
them at madvise.

This series is useful for 2 use cases:
1) global system policy = madvise, while we want some workloads to get THPs
at fault and by khugepaged :- some processes (e.g. AI workloads) benefit
from getting THPs at fault (and collapsed by khugepaged). Other workloads
like databases would regress if given the same treatment (either a
performance regression, or they are completely memory bound and even a very
slight increase in memory will cause them to OOM). These patches allow
setting prctl(PR_DEFAULT_MADV_HUGEPAGE) on the AI workloads. (This is how
workloads are deployed in our (Meta's/Facebook) fleet at this moment.)

2) global system policy = always, while we want some workloads to get THPs
only on madvise basis :- Same reason as 1). These patches allow setting
prctl(PR_DEFAULT_MADV_NOHUGEPAGE) on the database workloads. (We hope this
is us (Meta) in the near future: if a majority of workloads show that they
benefit from always, we flip the default host setting to "always" across
the fleet, and workloads that regress can opt out and be "madvise". New
services developed will then be tested with always by default. "always" is
also the default defconfig option upstream, so I would imagine others face
this as well.)
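
When rolling out either use case, one way to see which VMAs carry the
advice is the VmFlags line in /proc/<pid>/smaps, where the existing procfs
codes "hg" and "nh" mark MADV_HUGEPAGE and MADV_NOHUGEPAGE advice. This is
a sketch using that existing interface, not something added by this
series; the helper name count_thp_advice is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Count the VMAs of the calling process whose VmFlags line in
 * /proc/self/smaps carries "hg" or "nh".
 * Returns 0 on success, -1 if smaps cannot be opened.
 */
int count_thp_advice(unsigned long *hg, unsigned long *nh)
{
	char line[512];
	FILE *f = fopen("/proc/self/smaps", "r");

	*hg = *nh = 0;
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "VmFlags:", 8) != 0)
			continue;
		/* Flags are two-letter codes, each preceded by a space. */
		if (strstr(line + 8, " hg"))
			(*hg)++;
		if (strstr(line + 8, " nh"))
			(*nh)++;
	}
	fclose(f);
	return 0;
}
```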

v2->v3: (Thanks Lorenzo for all the below feedback!)
v2: https://lore.kernel.org/all/20250515133519.2779639-1-usamaarif642@gmail.com/
- no more flags2.
- no more MMF2_...
- renamed policy to PR_DEFAULT_MADV_(NO)HUGEPAGE
- mmap_write_lock_killable acquired in PR_GET_THP_POLICY
- mmap_write lock fixed in PR_SET_THP_POLICY
- mmap assert check in process_default_madv_hugepage
- check if hugepage_global_enabled is enabled in the call and account for s390
- set mm->def_flags VM_HUGEPAGE and VM_NOHUGEPAGE according to the policy in
  the way done by madvise(). I believe VM merge will not be broken in
  this way.
- process_default_madv_hugepage function that does for_each_vma and calls
  hugepage_madvise.

v1->v2:
- change from modifying the THP decision making for the process, to modifying
  VMA flags only. This prevents further complicating the logic used to
  determine THP order (Thanks David!)
- change from using a prctl per policy change to just using PR_SET_THP_POLICY
  and arg2 to set the policy. (Zi Yan)
- Introduce PR_THP_POLICY_DEFAULT_NOHUGE and PR_THP_POLICY_DEFAULT_SYSTEM
- Add selftests and documentation.
 
Usama Arif (7):
  mm: khugepaged: extract vm flag setting outside of hugepage_madvise
  prctl: introduce PR_DEFAULT_MADV_HUGEPAGE for the process
  prctl: introduce PR_DEFAULT_MADV_NOHUGEPAGE for the process
  prctl: introduce PR_THP_POLICY_SYSTEM for the process
  selftests: prctl: introduce tests for PR_DEFAULT_MADV_NOHUGEPAGE
  selftests: prctl: introduce tests for PR_THP_POLICY_DEFAULT_HUGE
  docs: transhuge: document process level THP controls

 Documentation/admin-guide/mm/transhuge.rst    |  42 +++
 include/linux/huge_mm.h                       |   2 +
 include/linux/mm.h                            |   2 +-
 include/linux/mm_types.h                      |   4 +-
 include/uapi/linux/prctl.h                    |   6 +
 kernel/sys.c                                  |  53 ++++
 mm/huge_memory.c                              |  13 +
 mm/khugepaged.c                               |  26 +-
 tools/include/uapi/linux/prctl.h              |   6 +
 .../trace/beauty/include/uapi/linux/prctl.h   |   6 +
 tools/testing/selftests/prctl/Makefile        |   2 +-
 tools/testing/selftests/prctl/thp_policy.c    | 286 ++++++++++++++++++
 12 files changed, 436 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/prctl/thp_policy.c

-- 
2.47.1

