linux-kernel - Re: [RFC 0/2] mm: introduce THP deferred setting

Message-ID: <18647b68-0c7c-47bc-9b9e-9cf46ce86761@gmail.com>
Date: Wed, 28 Aug 2024 11:44:54 +0100
From: Usama Arif <usamaarif642@...il.com>
To: "Kirill A . Shutemov" <kirill.shutemov@...ux.intel.com>,
 Rik van Riel <riel@...riel.com>, Nico Pache <npache@...hat.com>
Cc: Johannes Weiner <hannes@...xchg.org>, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, linux-doc@...r.kernel.org,
 Andrew Morton <akpm@...ux-foundation.org>,
 David Hildenbrand <david@...hat.com>, Matthew Wilcox <willy@...radead.org>,
 Barry Song <baohua@...nel.org>, Ryan Roberts <ryan.roberts@....com>,
 Baolin Wang <baolin.wang@...ux.alibaba.com>, Lance Yang
 <ioworker0@...il.com>, Peter Xu <peterx@...hat.com>,
 Rafael Aquini <aquini@...hat.com>, Andrea Arcangeli <aarcange@...hat.com>,
 Jonathan Corbet <corbet@....net>, Zi Yan <ziy@...dia.com>
Subject: Re: [RFC 0/2] mm: introduce THP deferred setting

On 28/08/2024 02:17, Kirill A . Shutemov wrote:
> On Tue, Aug 27, 2024 at 09:18:58PM -0400, Rik van Riel wrote:
>> On Tue, 2024-08-27 at 13:09 +0200, Johannes Weiner wrote:
>>>
>>> I agree with this. The defer mode is an improvement over the upstream
>>> status quo, no doubt. However, both defer mode and the shrinker solve
>>> the issue of memory waste under pressure, while the shrinker permits
>>> more desirable behavior when memory is abundant.
>>>
>>> So my take is that the shrinker is the way to go, and I don't see a
>>> bonafide usecase for defer mode that the shrinker couldn't cover.
>>>
>>>
>> I would like to take one step back, and think about what some real
>> world workloads might want as a tunable for THP.
>>
>> Workload owners are going to have a real problem trying to figure
>> out what the best value of max_ptes_none should be for their
>> workloads.
>>
Yes, I agree. For both solutions, max_ptes_none needs to be adjusted,
which would require experimenting with different values, something
workload owners might not do or want to do. But as Kirill said, the
information on the number of zero pages in THPs isn't available. A
possible solution might be to randomly sample a number of THPs at
certain time intervals, but I don't think it's a good idea to use that
as a representation of the entire system.
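
To make the sampling concern concrete, here is a rough userspace sketch
in Python. It is only an illustration, not a kernel interface: the
function name, the synthetic per-THP zero-page counts, and the 512
subpages per THP (2MB THP with 4KB base pages) are all assumptions.
It extrapolates total zero-page bytes from a random sample of THPs, and
with a skewed population a small sample can land far from the true
value.

```python
import random

SUBPAGES_PER_THP = 512   # assumption: 2 MiB THP built from 4 KiB base pages
BASE_PAGE_BYTES = 4096


def estimate_zero_bytes(zero_counts, sample_size, rng):
    """Extrapolate total zero-filled bytes from a random sample of THPs.

    zero_counts: hypothetical list, one entry per THP, giving the number
                 of zero-filled subpages in that THP.
    """
    if not zero_counts:
        return 0.0
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    sample_size = min(sample_size, len(zero_counts))
    sample = rng.sample(zero_counts, sample_size)
    mean_zero = sum(sample) / sample_size
    # Scale the sample mean up to the whole population of THPs.
    return mean_zero * len(zero_counts) * BASE_PAGE_BYTES


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic, skewed population: most THPs fully used, a few nearly empty.
    population = [0] * 9900 + [SUBPAGES_PER_THP - 1] * 100
    true_bytes = sum(population) * BASE_PAGE_BYTES
    for k in (10, 100, 1000):
        est = estimate_zero_bytes(population, k, rng)
        print(f"sample={k:5d} estimate={est:.0f} true={true_bytes}")
```

How far the estimate is off depends on how the zero pages are
distributed across THPs, which is exactly the information that isn't
available up front.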

It's OK from my side to have both solutions in the kernel, as they don't
interfere with each other. THP=defer makes sense to have as well if there
are real-world workloads or benchmarks that show page fault latency is a
problem due to THP=always, as Nico mentioned in his reply [1].

[1] https://lore.kernel.org/all/CAA1CXcCyRd+qfszM4GGvKqW=95AV9v8LG5oihByEBGLtW4tD4g@mail.gmail.com/

>> However, giving workload owners the ability to say "this workload
>> should not waste more than 1GB of memory on zero pages inside THPs",
>> or 500MB, or 4GB or whatever, would then allow the kernel to
>> automatically adjust the max_ptes_none threshold.
> 
> The problem is that we don't have and cannot have the info on zero pages
> inside THPs readily available. It requires memory scanning which is
> prohibitively expensive if we want the info to be somewhat up-to-date.
> 
> We don't have enough input from HW on the access pattern. It would be nice
> to decouple A/D bit (or maybe just A) from page table structure and get
> higher resolution on the access pattern for THPs.
> 
> I tried to talk to HW folk, but it went nowhere. Maybe if there would be a
> customer demand... Just saying...
>