Message-ID: <d0aedf71-047b-4ee4-9175-a67708a389de@amd.com>
Date: Fri, 21 Mar 2025 00:41:20 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: AneeshKumar.KizhakeVeetil@....com, Hasan.Maruf@....com,
Michael.Day@....com, akpm@...ux-foundation.org, bharata@....com,
dave.hansen@...el.com, david@...hat.com, dongjoo.linux.dev@...il.com,
feng.tang@...el.com, gourry@...rry.net, hannes@...xchg.org,
honggyu.kim@...com, hughd@...gle.com, jhubbard@...dia.com,
jon.grimm@....com, k.shutemov@...il.com, kbusch@...a.com,
kmanaouil.dev@...il.com, leesuyeon0506@...il.com, leillc@...gle.com,
liam.howlett@...cle.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
mgorman@...hsingularity.net, mingo@...hat.com, nadav.amit@...il.com,
nphamcs@...il.com, peterz@...radead.org, riel@...riel.com,
rientjes@...gle.com, rppt@...nel.org, santosh.shukla@....com,
shivankg@....com, shy828301@...il.com, sj@...nel.org, vbabka@...e.cz,
weixugc@...gle.com, willy@...radead.org, ying.huang@...ux.alibaba.com,
ziy@...dia.com, Jonathan.Cameron@...wei.com, alok.rathore@...sung.com
Subject: Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit
On 3/20/2025 2:21 PM, Raghavendra K T wrote:
> On 3/20/2025 4:30 AM, Davidlohr Bueso wrote:
>> On Wed, 19 Mar 2025, Raghavendra K T wrote:
>>
>>> Introduction:
>>> =============
>>> In the current hot page promotion scheme, all the activities, including
>>> process address space scanning, NUMA hint fault handling, and page
>>> migration, are performed in the process context, i.e., the scanning
>>> overhead is borne by applications.
>>>
>>> This is the RFC V1 patch series for (slow tier) CXL page promotion.
>>> The approach in this patchset helps address the issue by adding PTE
>>> Accessed (A) bit scanning.
>>>
>>> Scanning is done by a global kernel thread which routinely scans all
>>> the processes' address spaces and checks for accesses by reading the
>>> PTE A bit.
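
As a rough illustration (not the patchset's actual code), one scan pass
over an mm could look like the following minimal sketch, assuming the
generic page-table walker; scan_stats, scan_pte_entry and scan_mm are
hypothetical names:

  #include <linux/mm.h>
  #include <linux/pagewalk.h>

  struct scan_stats {
          unsigned long accessed;    /* PTEs found with the A bit set */
  };

  static int scan_pte_entry(pte_t *pte, unsigned long addr,
                            unsigned long next, struct mm_walk *walk)
  {
          struct scan_stats *stats = walk->private;

          /* Read and clear the A bit so the next pass sees fresh accesses. */
          if (ptep_test_and_clear_young(walk->vma, addr, pte))
                  stats->accessed++;

          return 0;
  }

  static const struct mm_walk_ops scan_ops = {
          .pte_entry = scan_pte_entry,
  };

  static void scan_mm(struct mm_struct *mm, struct scan_stats *stats)
  {
          mmap_read_lock(mm);
          walk_page_range(mm, 0, TASK_SIZE, &scan_ops, stats);
          mmap_read_unlock(mm);
  }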
>>>
>>> A separate migration thread migrates/promotes the pages to the
>>> toptier node based on a simple heuristic that uses the per-mm toptier
>>> scan/access information.
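
A hypothetical shape for such a heuristic (the names and the percentage
threshold are placeholders, not taken from the series):

  /* Promote when a page was seen accessed in enough recent scans. */
  static bool should_promote(unsigned long accesses, unsigned long scans,
                             unsigned long pct_threshold)
  {
          return scans && (accesses * 100 / scans) >= pct_threshold;
  }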
>>>
>>> Additionally, based on the feedback for RFC V0 [4], a prctl knob with
>>> a scalar value is provided to control per-task scanning.
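
From userspace, such a knob might be exercised roughly as below;
PR_SET_MEMSCAN and its number are hypothetical placeholders, since the
actual prctl name and values are defined by the series:

  #include <stdio.h>
  #include <sys/prctl.h>

  #define PR_SET_MEMSCAN 72       /* hypothetical prctl option number */

  int main(void)
  {
          /* Scalar value: e.g. 0 might disable scanning for this task,
           * larger values might scale scan aggressiveness. */
          if (prctl(PR_SET_MEMSCAN, 1, 0, 0, 0))
                  perror("prctl");
          return 0;
  }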
>>>
>>> Initial results show promising numbers on a microbenchmark. Numbers
>>> with real benchmarks, along with findings and tunings, will follow soon.
>>>
>>> Experiment:
>>> ============
>>> Abench microbenchmark:
>>> - Allocates 8GB/16GB/32GB/64GB of memory on the CXL node.
>>> - 64 threads are created, and each thread randomly accesses pages at
>>> 4K granularity.
>>> - 512 iterations with a delay of 1 us between two successive
>>> iterations.
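
For illustration, the per-thread access pattern described above could be
approximated by a loop like this sketch (not abench itself):

  #include <stdlib.h>
  #include <unistd.h>

  #define PAGE_SZ 4096UL

  /* Each of the 64 threads touches random 4K pages of the shared,
   * CXL-backed buffer, pausing 1 us between iterations. */
  static void access_pages(char *buf, size_t npages, int iters)
  {
          for (int i = 0; i < iters; i++) {
                  for (size_t j = 0; j < npages; j++)
                          buf[((size_t)rand() % npages) * PAGE_SZ]++;
                  usleep(1);      /* 1 us between iterations */
          }
  }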
>>>
>>> SUT: AMD EPYC, 512 CPUs, 2 NUMA nodes, 256GB memory.
>>>
>>> 3 runs, command: abench -m 2 -d 1 -i 512 -s <size>
>>>
>>> The benchmark measures how much time is taken to complete the task;
>>> lower is better. The expectation is that CXL node memory is migrated
>>> as fast as possible.
>>>
>>> Base case: 6.14-rc6 w/ numab mode = 2 (hot page promotion is
>>> enabled).
>>> Patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled);
>>> we expect the daemon to do page promotion.
>>>
>>> Result:
>>> ========
>>>        base NUMAB2             patched NUMAB1
>>>        time in sec (%stdev)    time in sec (%stdev)    %gain
>>> 8GB    134.33 ( 0.19 )         120.52 ( 0.21 )         10.28
>>> 16GB   292.24 ( 0.60 )         275.97 ( 0.18 )          5.56
>>> 32GB   585.06 ( 0.24 )         546.49 ( 0.35 )          6.59
>>> 64GB  1278.98 ( 0.27 )        1205.20 ( 2.29 )          5.76
>>>
>>> Base case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
>>> Patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled;
>>> the daemon does page promotion).
>>>        base NUMAB1             patched NUMAB1
>>>        time in sec (%stdev)    time in sec (%stdev)    %gain
>>> 8GB    186.71 ( 0.99 )         120.52 ( 0.21 )         35.45
>>> 16GB   376.09 ( 0.46 )         275.97 ( 0.18 )         26.62
>>> 32GB   744.37 ( 0.71 )         546.49 ( 0.35 )         26.58
>>> 64GB  1534.49 ( 0.09 )        1205.20 ( 2.29 )         21.45
>>
>> Very promising, but a few things. A fairer comparison would be
>> vs kpromoted using the PROT_NONE of NUMAB2 - essentially disregarding
>> the asynchronous migration, and effectively measuring synchronous
>> vs asynchronous scanning overhead and implied semantics. That is,
>> save the extra kthread and only have a per-NUMA-node migrator, which
>> is the common denominator for all these sources of hotness.
>
>
> Yes, I agree that a fair comparison would be:
> 1) kmmscand generating data on pages to be promoted, working with
> kpromoted migrating them asynchronously,
> VS
> 2) NUMAB2 generating data on pages to be migrated, integrated with
> kpromoted.
>
> As Bharata already mentioned, we tried integrating kpromoted with the
> kmmscand-generated migration list, but kmmscand generates a huge amount
> of scanned page data, which needs to be organized better so that
> kpromoted can handle the migration effectively.
>
> We have not tried (2) yet; I will get back on its feasibility (and
> also with numbers when both are ready).
>
>>
>> Similarly, while I don't see any users disabling NUMAB1 _and_ enabling
>> this sort of thing, it would be useful to have data on no numa balancing
>> at all. If nothing else, that would measure the effects of the dest
>> node heuristics.
>
> Last time I checked with the patch, the numbers with NUMAB=0 and
> NUMAB=1 did not differ much in the 8GB case, because most of the
> migration was handled by kmmscand. That is because, before NUMAB=1
> learns and tries to migrate, kmmscand has already migrated the pages.
>
> But a longer-running workload with more memory may show a bigger
> difference. I will come back with those numbers.
         base NUMAB=2            Patched NUMAB=0
         time in sec (%stdev)    time in sec (%stdev)
=====================================================
 8GB:    134.33 ( 0.19)          119.88 ( 0.25)
16GB:    292.24 ( 0.60)          325.06 (11.11)
32GB:    585.06 ( 0.24)          546.15 ( 0.50)
64GB:   1278.98 ( 0.27)         1221.41 ( 1.54)
We can see that the numbers have not changed much between NUMAB=1 and
NUMAB=0 in the patched case.

PS: For 16GB there was a bad case where rare contention happened on the
lock for the same mm; we can see that from the stdev. This should be
taken care of in the next version.
[...]