linux-kernel - Re: [RFC PATCH V1 09/13] mm: Add heuristic to calculate target node


Message-ID: <2965932d-bde8-4610-8946-c575794c0991@amd.com>
Date: Mon, 24 Mar 2025 21:47:21 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: Jonathan Cameron <Jonathan.Cameron@...wei.com>
Cc: AneeshKumar.KizhakeVeetil@....com, Hasan.Maruf@....com,
 Michael.Day@....com, akpm@...ux-foundation.org, bharata@....com,
 dave.hansen@...el.com, david@...hat.com, dongjoo.linux.dev@...il.com,
 feng.tang@...el.com, gourry@...rry.net, hannes@...xchg.org,
 honggyu.kim@...com, hughd@...gle.com, jhubbard@...dia.com,
 jon.grimm@....com, k.shutemov@...il.com, kbusch@...a.com,
 kmanaouil.dev@...il.com, leesuyeon0506@...il.com, leillc@...gle.com,
 liam.howlett@...cle.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
 mgorman@...hsingularity.net, mingo@...hat.com, nadav.amit@...il.com,
 nphamcs@...il.com, peterz@...radead.org, riel@...riel.com,
 rientjes@...gle.com, rppt@...nel.org, santosh.shukla@....com,
 shivankg@....com, shy828301@...il.com, sj@...nel.org, vbabka@...e.cz,
 weixugc@...gle.com, willy@...radead.org, ying.huang@...ux.alibaba.com,
 ziy@...dia.com, dave@...olabs.net, Hillf Danton <hdanton@...a.com>
Subject: Re: [RFC PATCH V1 09/13] mm: Add heuristic to calculate target node

+Hillf

On 3/21/2025 11:12 PM, Jonathan Cameron wrote:
> On Wed, 19 Mar 2025 19:30:24 +0000
> Raghavendra K T <raghavendra.kt@....com> wrote:
> 
>> One of the key challenges in PTE A bit based scanning is to find right
>> target node to promote to.
> 
> I have the same problem with the CXL hotpage monitor so very keen to
> see solutions to this (though this particular one doesn't work for
> me unless A bit scanning is happening as well).
>

This is what I have in mind for how the final solution could look:

A migrate list, plus the mm or target node(s), is passed from various sources
to a common migration thread for async migration.

source:
case1)
  kmmscand --> (migrate_list (type: folio/PFN), mminfo/migrate node)
           --> (kmmmigrated/kpromoted) (unified migration thread)

case2)
  IBS/CHMU --> (migrate_list (type : PFN), NULL) --> (kmmmigrated/kpromoted)

For case 2, the issue I see is that we are not able to associate any task or mm
with the PFN. If we can get that association, we should be able to use the
heuristic.

For case 2, applying Hillf's suggestion of reverse demotion target +
next faster tier with highest free page availability should help IMHO.
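
A rough userspace sketch of how the common migration thread might resolve a
target for the two cases, written in Python as a model rather than kernel code.
All names here (MigrateRequest, fallback_target, resolve_target, the
demotion_targets and free_pages maps) are hypothetical, and the fallback is
only one possible reading of the "reverse demotion target + highest free
pages" idea:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MigrateRequest:
    """One entry on the migrate list (hypothetical layout)."""
    pfn: int
    src_node: int
    # Case 1 (scanner knows the mm): target chosen by the heuristic.
    # Case 2 (IBS/CHMU, PFN only): left as None.
    target_node: Optional[int] = None


def fallback_target(src_node: int,
                    demotion_targets: Dict[int, List[int]],
                    free_pages: Dict[int, int]) -> Optional[int]:
    """Pick a faster node that demotes to src_node, preferring most free pages."""
    # "Reverse demotion target": faster nodes whose demotion list has src_node.
    candidates = [node for node, targets in demotion_targets.items()
                  if src_node in targets]
    if not candidates:
        return None
    return max(candidates, key=lambda node: free_pages.get(node, 0))


def resolve_target(req: MigrateRequest,
                   demotion_targets: Dict[int, List[int]],
                   free_pages: Dict[int, int]) -> Optional[int]:
    """Use the supplied target when present, otherwise fall back."""
    if req.target_node is not None:
        return req.target_node
    return fallback_target(req.src_node, demotion_targets, free_pages)
```

With node0 and node1 both demoting to node2, a PFN-only request from node2
would go to whichever of node0/node1 has more free pages under this model.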

>>
>> Here is a simple heuristic based approach:
>>     While scanning pages of any mm we also scan toptier pages that belong
>> to that mm. We get an insight on the distribution of pages that potentially
>> belonging to particular toptier node and also its recent access.
>>
>> Current logic walks all the toptier node, and picks the one with highest
>> accesses.
> 
> Maybe talk through why this heuristic works?  What is the intuition behind it?
> 
> I can see that on basis of first touch allocation, we should get a reasonable
> number of pages in the node where that CPU doing initialization is.
> 

The rationale: if a workload is already running and has some part of its
working set in a toptier node, consolidate it in that toptier node.
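
A minimal sketch of that selection step, as a userspace Python model and not
the code from the patch. The function name and the per-node access-count map
are assumptions made for illustration:

```python
from typing import Dict, Iterable, Optional


def pick_target_node(access_counts: Dict[int, int],
                     toptier_nodes: Iterable[int]) -> Optional[int]:
    """Return the toptier node with the highest recent access count.

    access_counts maps node id -> accesses seen for this mm while scanning.
    Nodes outside toptier_nodes are ignored. Returns None when no toptier
    node has any recorded access.
    """
    best_node = None
    best_count = 0
    for node in toptier_nodes:
        count = access_counts.get(node, 0)
        if count > best_count:
            best_node, best_count = node, count
    return best_node
```

For example, with counts {0: 10, 1: 250, 2: 900} and toptier nodes [0, 1],
the model picks node 1; node 2 is skipped because it is not toptier.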

For example:

Bharata has a benchmark, cbench-split (I will share the abench and cbench-split
source), with which I can run 25:75, 50:50, etc. allocation splits across CXL
and toptier memory.
After allocation, the workload touches all the pages to make them hot.

node0 (128GB) toptier
node1 (128GB) toptier
node2 (128GB) slowtier

I have run the workload with memory footprints of 8GB, 32GB and 128GB, with
a 50:50 split between one toptier node and one slowtier node.

Observation:

Memory   Base time (s)   Patched time (s)   %improvement
   8GB           53.29              46.47          12.79
  32GB          213.86             184.22          13.85
 128GB          862.66             703.26          18.47
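
The %improvement column appears consistent with (base - patched) / base * 100,
truncated (not rounded) to two decimals. That formula is an inference from the
numbers, not something stated here; the helper name below is hypothetical:

```python
import math


def pct_improvement(base_s: float, patched_s: float) -> float:
    """Percent reduction in run time, truncated to two decimal places."""
    gain = (base_s - patched_s) / base_s * 100.0
    return math.floor(gain * 100) / 100


for base, patched in [(53.29, 46.47), (213.86, 184.22), (862.66, 703.26)]:
    print(base, patched, pct_improvement(base, patched))
```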

I could see the workload consolidating on one node, with a decent gain of
more than 10%. Importantly, if the workload has its working set on node1,
the target_node chosen for all the CXL pages is node1.

(The same thing happens when the workload is spread between node0:node2;
target_node = 0.)

However, going forward we need to devise a more complex mechanism that
proactively takes care of free page availability etc.

> Is this relying on some other mechanism to ensure that the pages being touched
> are local to the CPUs touching them?

Unfortunately, this is where there is no control/visibility; accesses could be
from both local and remote CPUs. This is where we will have to rely on NUMAB1
to take care of last-mile toptier balancing (both CPU/memory).

- Raghu
[...]