Message-ID: <3b5d929f-ec2f-4444-825f-81e71f804033@redhat.com>
Date: Thu, 8 May 2025 13:06:53 +0200
From: David Hildenbrand <david@...hat.com>
To: Usama Arif <usamaarif642@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org
Cc: hannes@...xchg.org, shakeel.butt@...ux.dev, riel@...riel.com,
ziy@...dia.com, baolin.wang@...ux.alibaba.com, lorenzo.stoakes@...cle.com,
Liam.Howlett@...cle.com, npache@...hat.com, ryan.roberts@....com,
linux-kernel@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCH 0/1] prctl: allow overriding system THP policy to always
On 07.05.25 16:00, Usama Arif wrote:
> Allowing a per-process override of the global THP policy lets workloads
> that have been shown to benefit from hugepages use them, without regressing
> workloads that wouldn't benefit. This will allow such types of
> workloads to be run/stacked on the same machine.
>
> It also helps in rolling out hugepages in hyperscaler configurations
> for workloads that benefit from them, where a single THP policy is
> likely to be used across the entire fleet, and prctl will help override it.
>
> An advantage of doing it via prctl vs creating a cgroup specific
> option (like /sys/fs/cgroup/test/memory.transparent_hugepage.enabled) is
> that this will work even when there are no cgroups present, and my
> understanding is that there is a strong preference for cgroup controls being
> hierarchical, which usually means them having a numerical value.
>
>
> The output and code of test program is below:
>
> [root@vm4 vmuser]# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
> [root@vm4 vmuser]# echo inherit > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> [root@vm4 vmuser]# ./a.out
> Default THP setting:
> THP is not set to 'always'.
> PR_SET_THP_ALWAYS = 1
> THP is set to 'always'.
> PR_SET_THP_ALWAYS = 0
> THP is not set to 'always'.
Some quick feedback:

(1) The "always" in PR_SET_THP_ALWAYS does not look future-proof. Why
wouldn't someone want to force-disable THPs for a process (-> "never")
or set it to some other new mode (-> "defer", which is currently on the list)?

(2) In your example, is the toggle specific to 2M THP or is it the global
toggle? That is unclear, and it makes this interface suboptimal as well.

(3) I'm a bit concerned about the interaction with per-VMA settings (the one
we already have, and the order-specific ones that people were discussing).
It's going to be a mess if we have global, per-process, per-VMA and then
some other policies (eBPF? whatever else?) on top.

The low-hanging fruit would be a per-process toggle that only controls
the existing per-VMA toggle, for example with the following semantics:

(1) All new (applicable) VMAs start with VM_HUGEPAGE.
(2) All existing (applicable) VMAs that are *not* VM_NOHUGEPAGE become
VM_HUGEPAGE.

We did something similar with PR_SET_MEMORY_MERGE.
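
To make the two rules above concrete, here is a small userspace model of
the suggested semantics. It is only a sketch, not kernel code: the
model_* names and the MODEL_VM_* bits are made up for illustration, the
"applicable" field stands in for whatever eligibility check the kernel
would really apply, and leaving existing VMAs untouched when the toggle
is turned off again is an assumption (that case is not specified above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in flag bits for this model only; NOT the kernel's real values. */
#define MODEL_VM_HUGEPAGE   (1u << 0)
#define MODEL_VM_NOHUGEPAGE (1u << 1)

/* Hypothetical, simplified VMA: just flags plus an eligibility marker. */
struct model_vma {
	unsigned int flags;
	bool applicable;	/* assumption: result of some eligibility check */
};

/* Hypothetical, simplified mm: the per-process toggle plus its VMAs. */
struct model_mm {
	bool thp_default;	/* the per-process toggle being discussed */
	struct model_vma *vmas;
	size_t nr_vmas;
};

/*
 * Flip the per-process toggle. On enable, rule (2) applies: every existing
 * applicable VMA that is not VM_NOHUGEPAGE gains VM_HUGEPAGE.
 * Assumption: disabling only affects future VMAs and leaves existing
 * flags alone.
 */
static void model_set_thp_default(struct model_mm *mm, bool enable)
{
	assert(mm != NULL);
	mm->thp_default = enable;
	if (!enable)
		return;

	for (size_t i = 0; i < mm->nr_vmas; i++) {
		struct model_vma *vma = &mm->vmas[i];

		if (vma->applicable && !(vma->flags & MODEL_VM_NOHUGEPAGE))
			vma->flags |= MODEL_VM_HUGEPAGE;
	}
}

/* Rule (1): new applicable VMAs start with VM_HUGEPAGE while the toggle is on. */
static unsigned int model_new_vma_flags(const struct model_mm *mm, bool applicable)
{
	assert(mm != NULL);
	return (mm->thp_default && applicable) ? MODEL_VM_HUGEPAGE : 0u;
}
```

In this model an explicit per-VMA VM_NOHUGEPAGE always wins over the
per-process default, so the toggle never adds a new policy layer; it only
changes the starting value of the per-VMA flag.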
--
Cheers,
David / dhildenb