Message-ID: <6d8832bb-b5a7-4cd9-b92c-c93f2c1fe182@gmail.com>
Date: Wed, 2 Jul 2025 15:15:01 +0100
From: Usama Arif <usamaarif642@...il.com>
To: David Hildenbrand <david@...hat.com>, linux-mm@...ck.org,
 Andrew Morton <akpm@...ux-foundation.org>
Cc: Shakeel Butt <shakeel.butt@...ux.dev>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
 Arnd Bergmann <arnd@...db.de>, Christian Brauner <brauner@...nel.org>,
 SeongJae Park <sj@...nel.org>, Mike Rapoport <rppt@...nel.org>,
 Johannes Weiner <hannes@...xchg.org>, Barry Song <21cnbao@...il.com>,
 linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-api@...r.kernel.org, Pedro Falcato <pfalcato@...e.de>,
 Matthew Wilcox <willy@...radead.org>,
 Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
Subject: Re: [DISCUSSION] proposed mctl() API

> As I replied to Matthew in [1], it would be amazing if it were not needed, but that's not
> how it works in the medium term, and I don't think it will work even in the long term.
> I will paste my answer from [1] below as well:
> 
> Suppose we have 2 workloads on the same server: for example, one is a database where THPs
> just don't do well, but the other is an AI workload where THPs do really well. How
> will the kernel detect that the database workload is performing worse
> and the AI one isn't?
> 
> I added the THP shrinker to try and do this automatically, and it does
> really help, but unfortunately it is not a complete solution.
> There are severely memory-bound workloads where even a tiny increase
> in memory usage will lead to an OOM. And if you colocate the container running
> that workload with one that benefits from THPs, we unfortunately
> can't just rely on the system doing the right thing.
> 
> It would be awesome if THPs were truly transparent and required no
> input, but unfortunately I don't think there is a solution
> for this with just kernel monitoring.
> 
> This is just a big hint from the user. If the global system policy is madvise
> and the workload owner has done their own benchmarks and sees benefits
> with always, they set DEFAULT_MADV_HUGEPAGE for the process to opt in to "always".
> If the global system policy is always and the workload owner has done their own
> benchmarks and sees worse results with always, they set DEFAULT_MADV_NOHUGEPAGE for
> the process to opt in to "madvise".
> 
> [1] https://lore.kernel.org/all/162c14e6-0b16-4698-bd76-735037ea0d73@gmail.com/
> 
> 
> I haven't seen activity on this thread over the past week, but I was hoping
> we could reach a consensus on which approach to use, prctl or mctl.
> If it's mctl and you don't plan to work on it yourself, please let me know
> if you would like me to work on it instead. This is a valid, big real-world
> use case that is a real blocker for deploying THPs for workloads on servers.
> 

Hi!

Just wanted to check if anyone has any thoughts on this?

I think we are all in agreement on the long-term goal: have THP just work
and be enabled by default. From our perspective, we (Meta) have spent a significant
amount of time and effort over the last 18 months making changes/optimizations
towards that goal, so that workloads can transparently and reliably get hugepages.
This includes the THP shrinker [1], changes to the allocator and reclaim/compaction code
to reduce fragmentation [2], and reducing type mixing [3].
We want to continue spending more time and resources trying to achieve this.

But in the current state, not being able to selectively enable THP "always" for certain
workloads is a significant blocker to rolling it out at hyperscaler scale, and
judging from the attempts made by others, I believe it's a problem others are facing as well.
Our long-term aim is to have transparent_hugepage/enabled set to always across the fleet.
But for that we need the ability to enable it selectively for workloads that
show a benefit, solve the problems that come up in production when it is enabled,
and understand why the workloads that regress do so. This cannot be done with just
transparent_hugepage/enabled for hyperscalers, which run vastly different types of
containerized workloads on the same machine.

There have been multiple efforts from different people to address similar
problems (including via cgroup [4] and BPF [5]). IMHO, it's quite clear that, unfortunately,
just having a system-wide setting for THP is not enough at the moment or in the near future,
especially when running workloads with completely different characteristics on the same server.


In terms of how to do this, IMHO, the approach itself is not controversial.
After the great feedback from Lorenzo on the prctl series, the
approach would be for userspace to make a call that just does for_each_vma over the process,
madvises the VMAs, and changes mm->def_flags for the process. We are not making changes
to the page-fault path (thp_vma_allowable_orders needs no code change, which is awesome).
In terms of what the call is going to be, there has been a lot of debate (and my preference
for prctl is clear), but I am OK with either prctl or mctl, as it should not change
the actual implementation. If there is consensus, I would love to send an RFC showing what the
prctl or mctl solution would look like.
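
To make that shape concrete, here is a very rough sketch of what the kernel-side helper
might look like. hugepage_madvise(), for_each_vma() and mm->def_flags are existing
interfaces; the helper itself, its name, the behavior argument and the locking details are
only illustrative, not the actual proposal:

static int thp_set_process_default(struct mm_struct *mm, int behavior)
{
        struct vm_area_struct *vma;
        VMA_ITERATOR(vmi, mm, 0);
        int err = 0;

        if (mmap_write_lock_killable(mm))
                return -EINTR;

        /* New mappings created after this call inherit the policy. */
        if (behavior == MADV_HUGEPAGE)
                mm->def_flags |= VM_HUGEPAGE;
        else
                mm->def_flags |= VM_NOHUGEPAGE;

        /* Retrofit the policy onto every existing mapping. */
        for_each_vma(vmi, vma) {
                unsigned long new_flags = vma->vm_flags;

                err = hugepage_madvise(vma, &new_flags, behavior);
                if (err)
                        break;
                vm_flags_reset(vma, new_flags);
        }

        mmap_write_unlock(mm);
        return err;
}

Inheritance across fork()/exec(), interaction with the existing PR_SET_THP_DISABLE, and how
to handle a VMA that rejects the advice are all glossed over here; the point is only that
the page-fault path stays untouched.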


[1] https://lore.kernel.org/all/20240830100438.3623486-1-usamaarif642@gmail.com/
[2] https://lore.kernel.org/all/20250313210647.1314586-1-hannes@cmpxchg.org/
[3] https://lore.kernel.org/all/20240320180429.678181-1-hannes@cmpxchg.org/
[4] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/
[5] https://lore.kernel.org/all/20250520060504.20251-1-laoar.shao@gmail.com/

Thanks,
Usama
