linux-kernel - Re: [DISCUSSION] proposed mctl() API

Message-ID: <8e0882d6-2c1b-4097-a7da-471c77a759a7@gmail.com>
Date: Tue, 10 Jun 2025 18:02:50 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 David Hildenbrand <david@...hat.com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Shakeel Butt <shakeel.butt@...ux.dev>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
 Arnd Bergmann <arnd@...db.de>, Christian Brauner <brauner@...nel.org>,
 SeongJae Park <sj@...nel.org>, Mike Rapoport <rppt@...nel.org>,
 Johannes Weiner <hannes@...xchg.org>, Barry Song <21cnbao@...il.com>,
 linux-mm@...ck.org, linux-arch@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
 Pedro Falcato <pfalcato@...e.de>
Subject: Re: [DISCUSSION] proposed mctl() API

On 10/06/2025 17:26, Matthew Wilcox wrote:
> On Tue, Jun 10, 2025 at 05:00:47PM +0100, Usama Arif wrote:
>> On 10/06/2025 16:46, Matthew Wilcox wrote:
>>> On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote:
>>>> If we have 2 workloads on the same server, e.g. one is a database where THPs
>>>> just don't do well, but the other is AI where THPs do really well, how
>>>> will the kernel monitor that the database workload is performing worse
>>>> and the AI one isn't?
>>>
>>> It can monitor the allocation/access patterns and see who's getting
>>> the benefit.  The two workloads are in competition for memory, and
>>> we can tell which pages are hot and which cold.
>>>
>>> And I don't believe it's a binary anyway.  I bet there are some
>>> allocations where the database benefits from having THPs (I mean, I know
>>> a database which invented the entire hugetlbfs subsystem so it could
>>> use PMD entries and avoid one layer of TLB misses!)
>>>
>>
>> Sure, but this is just an example. Workload owners are not going to spend time
>> trying to see how each allocation works and, if it's hot, put it in hugetlbfs.
> 
> No, they're not.  It should be automatic.  There are many deficiencies
> in the kernel; this is one of them.
> 
>> Of course hugetlbfs has its own drawbacks of reserving pages.
> 
> Drawback or advantage?  It's a feature.  You're being very strange about
> this.  First you want to reserve THPs for some workloads only, then when
> given a way to do that you complain that ... you have to reserve hugetlb
> pages.  You can't possibly mean both of these things sincerely.
> 

Let me try and explain my view better:

hugetlb requires two things: reserving hugepages and passing MAP_HUGETLB at mmap time, i.e.
it is not "transparent". (I know the meaning of transparent even in THP is a bit messed up :))
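
To make the "not transparent" part concrete, here is a minimal sketch of what an
application has to do for hugetlb. The helper name map_hugetlb() is hypothetical, and
the sketch assumes hugepages were already reserved by the administrator (for example
through /proc/sys/vm/nr_hugepages); without a reservation the mmap() call is expected
to fail, typically with ENOMEM.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous memory backed by
 * hugetlb pages. The application must ask for this explicitly with
 * MAP_HUGETLB; nothing happens "transparently".
 *
 * Returns the mapping, or NULL on failure with errno set by mmap()
 * (typically ENOMEM when no hugepages have been reserved).
 */
static void *map_hugetlb(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

A caller would use something like map_hugetlb(2UL << 20), assuming a 2 MiB huge page
size, and needs a fallback path for the case where the reservation is missing.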

There are some workload owners that will happily test (and have the resources to do
so) to see what the best point to use hugetlb is. They can go into their code, change
the mmap calls, and make the necessary (disruptive) changes to workload orchestration so
that hugetlb pages are reserved. This is a small minority.

An extremely large majority of workload owners will not be willing to do this (and don't
have the resources to do so either).
For them, we have THPs to do it "transparently". If you just give them a knob to switch
THP=always on/off for *just their workload* without affecting others on the same server,
they will be happy to try it, and other workloads running on the same server
in controlled cgroups won't care and won't be affected, i.e.:

- if the machine policy (/sys/kernel/mm/transparent_hugepage/enabled) is madvise, workloads can
  opt in to getting THPs by just having this call (the PR_DEFAULT_MADV_HUGEPAGE version) in systemd.

- if the machine policy is always, and they don't benefit, they can opt out of getting THPs
  by having this call (the PR_DEFAULT_MADV_NOHUGEPAGE version) in systemd *without* disrupting
  the other workloads running on the same server that do benefit.
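
As an illustration of the two cases above, a launcher could first check which machine
policy is active. The sysfs file marks the active value with square brackets (for
example "always [madvise] never"). The sketch below only reads and parses that file;
the function names active_policy() and read_thp_policy() are hypothetical, and it does
not use the proposed PR_DEFAULT_MADV_* flags, which are still under discussion.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (active) token of a sysfs-style policy string,
 * e.g. "always [madvise] never", into `out`.
 * Returns 0 on success, -1 if no bracketed token fits in `out`.
 */
static int active_policy(const char *s, char *out, size_t outlen)
{
    const char *lb = strchr(s, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    size_t n;

    if (!rb)
        return -1;
    n = (size_t)(rb - lb - 1);
    if (n + 1 > outlen)
        return -1;
    memcpy(out, lb + 1, n);
    out[n] = '\0';
    return 0;
}

/* Read the machine-wide THP policy from sysfs. Returns 0 or -1. */
static int read_thp_policy(char *out, size_t outlen)
{
    char buf[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return active_policy(buf, out, outlen);
}
```

With the policy string in hand, the launcher knows whether an opt-in (policy is
madvise) or an opt-out (policy is always) is the relevant knob for its workload.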

Doing the above is very simple. This is how KSM is done as well. It doesn't require any changes
to mmap, i.e. it is "transparent" (after the prctl/mctl call :)), and doesn't require reserving anything
for hugetlb before the application starts.
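
For reference, a sketch of the per-process prctl() pattern that exists today, which the
proposal would mirror. It uses only flags already in mainline: PR_SET_MEMORY_MERGE (the
KSM opt-in, Linux 6.4+) and PR_SET_THP_DISABLE (a per-process THP opt-out). The helper
names are hypothetical, and the proposed PR_DEFAULT_MADV_HUGEPAGE / PR_DEFAULT_MADV_NOHUGEPAGE
flags are deliberately not used since their final form is not settled. The fallback
defines are only there for older headers.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallback values for older userspace headers. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif

/*
 * Opt the calling process in to KSM, with no changes to its mmap() calls.
 * Returns 0 on success or a negative errno (e.g. -EINVAL on kernels
 * without this prctl or without KSM support).
 */
static int enable_ksm_for_process(void)
{
    if (prctl(PR_SET_MEMORY_MERGE, 1UL, 0UL, 0UL, 0UL) == -1)
        return -errno;
    return 0;
}

/*
 * Opt the calling process out of THP. Returns 0 or a negative errno.
 */
static int disable_thp_for_process(void)
{
    if (prctl(PR_SET_THP_DISABLE, 1UL, 0UL, 0UL, 0UL) == -1)
        return -errno;
    return 0;
}
```

A service manager or wrapper would make one of these calls before exec'ing the
workload, so the application itself needs no source changes.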