Message-ID: <d0aedf71-047b-4ee4-9175-a67708a389de@amd.com>
Date: Fri, 21 Mar 2025 00:41:20 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: AneeshKumar.KizhakeVeetil@....com, Hasan.Maruf@....com,
Michael.Day@....com, akpm@...ux-foundation.org, bharata@....com,
dave.hansen@...el.com, david@...hat.com, dongjoo.linux.dev@...il.com,
feng.tang@...el.com, gourry@...rry.net, hannes@...xchg.org,
honggyu.kim@...com, hughd@...gle.com, jhubbard@...dia.com,
jon.grimm@....com, k.shutemov@...il.com, kbusch@...a.com,
kmanaouil.dev@...il.com, leesuyeon0506@...il.com, leillc@...gle.com,
liam.howlett@...cle.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
mgorman@...hsingularity.net, mingo@...hat.com, nadav.amit@...il.com,
nphamcs@...il.com, peterz@...radead.org, riel@...riel.com,
rientjes@...gle.com, rppt@...nel.org, santosh.shukla@....com,
shivankg@....com, shy828301@...il.com, sj@...nel.org, vbabka@...e.cz,
weixugc@...gle.com, willy@...radead.org, ying.huang@...ux.alibaba.com,
ziy@...dia.com, Jonathan.Cameron@...wei.com, alok.rathore@...sung.com
Subject: Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit
On 3/20/2025 2:21 PM, Raghavendra K T wrote:
> On 3/20/2025 4:30 AM, Davidlohr Bueso wrote:
>> On Wed, 19 Mar 2025, Raghavendra K T wrote:
>>
>>> Introduction:
>>> =============
>>> In the current hot page promotion scheme, all the activities, including
>>> process address space scanning, NUMA hint fault handling, and page
>>> migration, are performed in the process context, i.e., the scanning
>>> overhead is borne by applications.
>>>
>>> This is the RFC V1 patch series for (slow tier) CXL page promotion.
>>> The approach in this patchset helps address the issue by adding PTE
>>> Accessed (A) bit scanning.
>>>
>>> Scanning is done by a global kernel thread which routinely scans all
>>> the processes' address spaces and checks for accesses by reading the
>>> PTE A bit.
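
As a rough illustration (not the patchset's actual code), one scan pass
over an mm could look like the following minimal sketch, assuming the
generic page-table walker; scan_stats, scan_pte_entry and scan_mm are
hypothetical names:

  #include <linux/mm.h>
  #include <linux/pagewalk.h>

  struct scan_stats {
          unsigned long accessed;    /* PTEs found with the A bit set */
  };

  static int scan_pte_entry(pte_t *pte, unsigned long addr,
                            unsigned long next, struct mm_walk *walk)
  {
          struct scan_stats *stats = walk->private;

          /* Read and clear the A bit so the next pass sees fresh accesses. */
          if (ptep_test_and_clear_young(walk->vma, addr, pte))
                  stats->accessed++;

          return 0;
  }

  static const struct mm_walk_ops scan_ops = {
          .pte_entry = scan_pte_entry,
  };

  static void scan_mm(struct mm_struct *mm, struct scan_stats *stats)
  {
          mmap_read_lock(mm);
          walk_page_range(mm, 0, TASK_SIZE, &scan_ops, stats);
          mmap_read_unlock(mm);
  }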
>>>
>>> A separate migration thread migrates/promotes the pages to the
>>> toptier node based on a simple heuristic that uses the per-mm toptier
>>> scan/access information.
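
A hypothetical shape for such a heuristic (the names and the percentage
threshold are placeholders, not taken from the series):

  /* Promote when a page was seen accessed in enough recent scans. */
  static bool should_promote(unsigned long accesses, unsigned long scans,
                             unsigned long pct_threshold)
  {
          return scans && (accesses * 100 / scans) >= pct_threshold;
  }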
>>>
>>> Additionally, based on the feedback for RFC V0 [4], a prctl knob with
>>> a scalar value is provided to control per-task scanning.
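
From userspace, such a knob might be exercised roughly as below;
PR_SET_MEMSCAN and its number are hypothetical placeholders, since the
actual prctl name and values are defined by the series:

  #include <stdio.h>
  #include <sys/prctl.h>

  #define PR_SET_MEMSCAN 72       /* hypothetical prctl option number */

  int main(void)
  {
          /* Scalar value: e.g. 0 might disable scanning for this task,
           * larger values might scale scan aggressiveness. */
          if (prctl(PR_SET_MEMSCAN, 1, 0, 0, 0))
                  perror("prctl");
          return 0;
  }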
>>>
>>> Initial results show promising numbers on a microbenchmark. Numbers
>>> with real benchmarks, along with findings and tunings, will follow soon.
>>>
>>> Experiment:
>>> ============
>>> Abench microbenchmark:
>>> - Allocates 8GB/16GB/32GB/64GB of memory on the CXL node.
>>> - 64 threads are created, and each thread randomly accesses pages at
>>> 4K granularity.
>>> - 512 iterations with a delay of 1 us between two successive
>>> iterations.
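
For illustration, the per-thread access pattern described above could be
approximated by a loop like this sketch (not abench itself):

  #include <stdlib.h>
  #include <unistd.h>

  #define PAGE_SZ 4096UL

  /* Each of the 64 threads touches random 4K pages of the shared,
   * CXL-backed buffer, pausing 1 us between iterations. */
  static void access_pages(char *buf, size_t npages, int iters)
  {
          for (int i = 0; i < iters; i++) {
                  for (size_t j = 0; j < npages; j++)
                          buf[((size_t)rand() % npages) * PAGE_SZ]++;
                  usleep(1);      /* 1 us between iterations */
          }
  }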
>>>
>>> SUT: AMD EPYC, 512 CPUs, 2 NUMA nodes, 256GB memory.
>>>
>>> 3 runs, command: abench -m 2 -d 1 -i 512 -s <size>
>>>
>>> The benchmark measures how much time is taken to complete the task;
>>> lower is better. The expectation is that CXL node memory is migrated
>>> as fast as possible.
>>>
>>> Base case: 6.14-rc6 w/ numab mode = 2 (hot page promotion is
>>> enabled).
>>> Patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled);
>>> we expect the daemon to do page promotion.
>>>
>>> Result:
>>> ========
>>>        base NUMAB2             patched NUMAB1
>>>        time in sec (%stdev)    time in sec (%stdev)    %gain
>>> 8GB    134.33 ( 0.19 )         120.52 ( 0.21 )         10.28
>>> 16GB   292.24 ( 0.60 )         275.97 ( 0.18 )          5.56
>>> 32GB   585.06 ( 0.24 )         546.49 ( 0.35 )          6.59
>>> 64GB  1278.98 ( 0.27 )        1205.20 ( 2.29 )          5.76
>>>
>>> Base case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
>>> Patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled;
>>> the daemon does page promotion).
>>>        base NUMAB1             patched NUMAB1
>>>        time in sec (%stdev)    time in sec (%stdev)    %gain
>>> 8GB    186.71 ( 0.99 )         120.52 ( 0.21 )         35.45
>>> 16GB   376.09 ( 0.46 )         275.97 ( 0.18 )         26.62
>>> 32GB   744.37 ( 0.71 )         546.49 ( 0.35 )         26.58
>>> 64GB  1534.49 ( 0.09 )        1205.20 ( 2.29 )         21.45
>>
>> Very promising, but a few things. A fairer comparison would be
>> vs kpromoted using the PROT_NONE of NUMAB2 - essentially disregarding
>> the asynchronous migration, and effectively measuring synchronous
>> vs asynchronous scanning overhead and implied semantics. That is,
>> save the extra kthread and only have a per-NUMA-node migrator, which
>> is the common denominator for all these sources of hotness.
>
>
> Yes, I agree that a fair comparison would be:
> 1) kmmscand generating data on pages to be promoted, working with
> kpromoted migrating them asynchronously,
> VS
> 2) NUMAB2 generating data on pages to be migrated, integrated with
> kpromoted.
>
> As Bharata already mentioned, we tried integrating kpromoted with the
> kmmscand-generated migration list, but kmmscand generates a huge amount
> of scanned page data, which needs to be organized better so that
> kpromoted can handle the migration effectively.
>
> We have not tried (2) yet; I will get back on its feasibility (and
> also with numbers when both are ready).
>
>>
>> Similarly, while I don't see any users disabling NUMAB1 _and_ enabling
>> this sort of thing, it would be useful to have data on no numa balancing
>> at all. If nothing else, that would measure the effects of the dest
>> node heuristics.
>
> Last time I checked with the patch, the numbers with NUMAB=0 and
> NUMAB=1 did not differ much in the 8GB case, because most of the
> migration was handled by kmmscand. That is because, before NUMAB=1
> learns and tries to migrate, kmmscand has already migrated the pages.
>
> But a longer-running workload with more memory may show a bigger
> difference. I will come back with those numbers.
         base NUMAB=2            Patched NUMAB=0
         time in sec (%stdev)    time in sec (%stdev)
=====================================================
 8GB:    134.33 ( 0.19)          119.88 ( 0.25)
16GB:    292.24 ( 0.60)          325.06 (11.11)
32GB:    585.06 ( 0.24)          546.15 ( 0.50)
64GB:   1278.98 ( 0.27)         1221.41 ( 1.54)
We can see that the numbers have not changed much between NUMAB=1 and
NUMAB=0 in the patched case.

PS: For 16GB there was a bad case where rare contention happened on the
lock for the same mm; we can see that from the stdev. This should be
taken care of in the next version.
[...]