Message-ID: <8e0882d6-2c1b-4097-a7da-471c77a759a7@gmail.com>
Date: Tue, 10 Jun 2025 18:02:50 +0100
From: Usama Arif <usamaarif642@...il.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>,
 David Hildenbrand <david@...hat.com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Shakeel Butt <shakeel.butt@...ux.dev>,
 "Liam R . Howlett" <Liam.Howlett@...cle.com>,
 Vlastimil Babka <vbabka@...e.cz>, Jann Horn <jannh@...gle.com>,
 Arnd Bergmann <arnd@...db.de>, Christian Brauner <brauner@...nel.org>,
 SeongJae Park <sj@...nel.org>, Mike Rapoport <rppt@...nel.org>,
 Johannes Weiner <hannes@...xchg.org>, Barry Song <21cnbao@...il.com>,
 linux-mm@...ck.org, linux-arch@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
 Pedro Falcato <pfalcato@...e.de>
Subject: Re: [DISCUSSION] proposed mctl() API



On 10/06/2025 17:26, Matthew Wilcox wrote:
> On Tue, Jun 10, 2025 at 05:00:47PM +0100, Usama Arif wrote:
>> On 10/06/2025 16:46, Matthew Wilcox wrote:
>>> On Tue, Jun 10, 2025 at 04:30:43PM +0100, Usama Arif wrote:
>>>> If we have 2 workloads on the same server, e.g. one is a database where THPs
>>>> just don't do well, but the other one is AI where THPs do really well, how
>>>> will the kernel monitor that the database workload is performing worse
>>>> and the AI one isn't?
>>>
>>> It can monitor the allocation/access patterns and see who's getting
>>> the benefit.  The two workloads are in competition for memory, and
>>> we can tell which pages are hot and which cold.
>>>
>>> And I don't believe it's a binary anyway.  I bet there are some
>>> allocations where the database benefits from having THPs (I mean, I know
>>> a database which invented the entire hugetlbfs subsystem so it could
>>> use PMD entries and avoid one layer of TLB misses!)
>>>
>>
>> Sure, but this is just an example. Workload owners are not going to spend time
>> trying to see how each allocation behaves and, if it's hot, put it in hugetlbfs.
> 
> No, they're not.  It should be automatic.  There are many deficiencies
> in the kernel; this is one of them.
> 
>> Of course hugetlbfs has its own drawback of having to reserve pages.
> 
> Drawback or advantage?  It's a feature.  You're being very strange about
> this.  First you want to reserve THPs for some workloads only, then when
> given a way to do that you complain that ... you have to reserve hugetlb
> pages.  You can't possibly mean both of these things sincerely.
> 

Let me try and explain my view better:

hugetlb requires two things: reserving hugepages, and passing MAP_HUGETLB at mmap time, i.e. it is
not "transparent". (I know the meaning of "transparent" is a bit messed up even in THP :))
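
To make the difference concrete, here is a minimal sketch of the two mapping paths.
The helper names map_hugetlb() and map_thp_hint() are made up for illustration; the
flags (MAP_HUGETLB, MADV_HUGEPAGE) are the existing ones. It assumes a 2MB huge page
size, as on x86-64 by default.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * hugetlb path (hypothetical helper name): the application has to ask
 * for MAP_HUGETLB explicitly, and the call fails (MAP_FAILED) unless
 * huge pages were reserved beforehand, e.g. via
 * /proc/sys/vm/nr_hugepages. len should be a multiple of the huge
 * page size.
 */
void *map_hugetlb(size_t len)
{
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
}

/*
 * THP path (hypothetical helper name): a normal anonymous mapping plus
 * a per-range hint. Nothing is reserved. The madvise() result is
 * ignored on purpose, because the hint is best effort and the mapping
 * stays usable with base pages either way.
 */
void *map_thp_hint(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p != MAP_FAILED)
        (void)madvise(p, len, MADV_HUGEPAGE);
    return p;
}
```

Both variants still need a change at the mmap call site, which is the part most
workload owners will not do.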

There are some workload owners who will happily test (and have the resources to do
so) to find the best places to use hugetlb. They can go into their code, change the
mmap calls, and make the necessary (disruptive) changes to workload orchestration so that
hugetlb pages are reserved. This is a small minority.

An extremely large majority of workload owners will not be willing to do this (and don't
have the resources to do so either).
For them, we have THPs to do it "transparently". If you just give them a knob to switch
THP=always on/off for *just their workload* without affecting others on the same server,
they will be happy to try it, and other workloads running on the same server
in controlled cgroups won't care and won't be affected, i.e.:

- if the machine policy (/sys/kernel/mm/transparent_hugepage/enabled) is madvise, workloads can
  opt in to getting THPs just by having this call (the PR_DEFAULT_MADV_HUGEPAGE version) in systemd.

- if the machine policy is always and they don't benefit, they can opt out of getting THPs
  by having this call (the PR_DEFAULT_MADV_NOHUGEPAGE version) in systemd *without* disrupting
  the other workloads running on the same server that do benefit.
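
As a rough sketch of the two cases above: the sysfs file marks the active policy with
brackets (e.g. "always [madvise] never"), so a launcher could read it and decide which
per-workload override is relevant. All function and enum names below are hypothetical,
and the mapping in thp_override_for() is only my reading of the two bullets. It does
not cover the "never" case.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) mode out of text such as
 * "always [madvise] never". Returns 0 on success, -1 if no bracketed
 * token is found or it does not fit in out.
 */
int thp_selected_mode(const char *text, char *out, size_t outlen)
{
    const char *open = strchr(text, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    size_t len;

    if (!open || !close || outlen == 0)
        return -1;
    len = (size_t)(close - open - 1);
    if (len >= outlen)
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the machine-wide policy from sysfs. Returns 0 or -1. */
int thp_read_machine_policy(char *out, size_t outlen)
{
    char buf[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");

    if (!f)
        return -1;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return thp_selected_mode(buf, out, outlen);
}

/* Hypothetical: which per-workload override makes sense for a policy. */
enum thp_override {
    THP_OVERRIDE_NONE,
    THP_OVERRIDE_OPT_IN,   /* policy is madvise: workload may opt in  */
    THP_OVERRIDE_OPT_OUT   /* policy is always:  workload may opt out */
};

enum thp_override thp_override_for(const char *policy)
{
    if (strcmp(policy, "madvise") == 0)
        return THP_OVERRIDE_OPT_IN;
    if (strcmp(policy, "always") == 0)
        return THP_OVERRIDE_OPT_OUT;
    return THP_OVERRIDE_NONE;
}
```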

Doing the above is very simple. This is how KSM is done as well. It doesn't require any changes
to mmap, i.e. it is "transparent" (after the prctl/mctl call :)), and doesn't require reserving anything
for hugetlb before the application starts.
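
For comparison, a sketch of what the process-wide calls look like with prctls that
already exist: PR_SET_MEMORY_MERGE (the KSM one, Linux 6.4+) and PR_SET_THP_DISABLE
(a blunt "no THP at all" switch, not the proposed PR_DEFAULT_MADV_* behaviour). The
PR_DEFAULT_MADV_HUGEPAGE/PR_DEFAULT_MADV_NOHUGEPAGE flags are only a proposal, so they
are deliberately not used here. The wrapper names are made up.

```c
#include <sys/prctl.h>

/* Older kernel headers may not have the KSM prctl yet (added in 6.4). */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif

/*
 * Hypothetical wrapper: opt the calling process into KSM. One call,
 * no changes to any mmap site. Returns 0 on success, -1 on failure
 * (e.g. kernel too old or built without KSM).
 */
int ksm_enable_for_process(void)
{
    return prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0);
}

/*
 * Hypothetical wrapper: disable THP for the calling process. Per
 * prctl(2) the setting is inherited over fork() and preserved across
 * execve(), so a launcher can set it before exec'ing the workload.
 */
int thp_disable_for_process(void)
{
    return prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
}
```

A service manager would make one such call in the child between fork() and execve(),
and the workload binary itself stays untouched.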


