Message-ID: <ff53d70a-7d59-4f0d-aad0-03628f9d8b67@amd.com>
Date: Tue, 25 Mar 2025 12:06:39 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: AneeshKumar.KizhakeVeetil@....com, Hasan.Maruf@....com,
Michael.Day@....com, akpm@...ux-foundation.org, bharata@....com,
dave.hansen@...el.com, david@...hat.com, dongjoo.linux.dev@...il.com,
feng.tang@...el.com, gourry@...rry.net, hannes@...xchg.org,
honggyu.kim@...com, hughd@...gle.com, jhubbard@...dia.com,
jon.grimm@....com, k.shutemov@...il.com, kbusch@...a.com,
kmanaouil.dev@...il.com, leesuyeon0506@...il.com, leillc@...gle.com,
liam.howlett@...cle.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
mgorman@...hsingularity.net, mingo@...hat.com, nadav.amit@...il.com,
nphamcs@...il.com, peterz@...radead.org, riel@...riel.com,
rientjes@...gle.com, rppt@...nel.org, santosh.shukla@....com,
shivankg@....com, shy828301@...il.com, sj@...nel.org, vbabka@...e.cz,
weixugc@...gle.com, willy@...radead.org, ying.huang@...ux.alibaba.com,
ziy@...dia.com, Jonathan.Cameron@...wei.com, alok.rathore@...sung.com,
kinseyho@...gle.com, yuanchu@...gle.com
Subject: Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A
bit
+kinseyho and yuanchu
On 3/22/2025 2:05 AM, Davidlohr Bueso wrote:
> On Fri, 21 Mar 2025, Raghavendra K T wrote:
>
>>> But a longer running/ more memory workload may make more difference.
>>> I will comeback with that number.
>>
>> base NUMAB=2 Patched NUMAB=0
>> time in sec time in sec
>> ===================================================
>> 8G: 134.33 (0.19) 119.88 ( 0.25)
>> 16G: 292.24 (0.60) 325.06 (11.11)
>> 32G: 585.06 (0.24) 546.15 ( 0.50)
>> 64G: 1278.98 (0.27) 1221.41 ( 1.54)
>>
>> We can see that the numbers have not changed much between NUMAB=1 and
>> NUMAB=0 in the patched case.
>
> Thanks. Since this might vary across workloads, another important metric
> here is numa hit/misses statistics.
Hello David, sorry for getting back to you late.
Yes, I did collect some of the other stats along with this (posting for
8GB only). I did not see much difference in total numa_hit, but there are
differences in numa_local etc. (not pasted here).
#grep -A2 completed abench_cxl_6.14.0-rc6-kmmscand+_8G.log
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log
abench_cxl_6.14.0-rc6-kmmscand+_8G.log:Benchmark completed in
120292376.0 us, Total thread execution time 7490922681.0 us
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_hit 6376927
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-kmmscand+_8G.log:Benchmark completed in
119583939.0 us, Total thread execution time 7461705291.0 us
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_hit 6373409
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-kmmscand+_8G.log:Benchmark completed in
119784117.0 us, Total thread execution time 7482710944.0 us
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_hit 6378384
abench_cxl_6.14.0-rc6-kmmscand+_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log:Benchmark completed in
134481344.0 us, Total thread execution time 8409840511.0 us
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_hit 6303300
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log:Benchmark completed in
133967260.0 us, Total thread execution time 8352886349.0 us
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_hit 6304063
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_miss 0
--
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log:Benchmark completed in
134554911.0 us, Total thread execution time 8444951713.0 us
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_hit 6302506
abench_cxl_6.14.0-rc6-cxlfix+_numab2_8G.log-numa_miss 0
>
> fyi I have also been trying this series to get some numbers as well, but
> noticed overnight things went south (so no chance before LSFMM):
>
This issue looks to be different. Could you please let me know how to
reproduce it?
I had tested with perf bench numa mem and did not find anything.
The issue I know of currently is:

kmmscand:
    for_each_mm
        for_each_vma
            scan_vma and build accessed_folio_list
            add to migration_list()   // does not check for duplicates

kmmmigrated:
    for_each_folio in migration_list
        migrate_misplaced_folio()

There is also cleanup_migration_list() in mm teardown.
The migration_list is protected by a single lock, and kmmscand is too
aggressive and can potentially bombard the migration_list (though a
practical workload may generate fewer pages). That results in a
non-fatal softlockup, which will be fixed with mmslot as I noted
elsewhere.
But now the main challenge to solve in kmmscand is that it generates:

t1 -> migration_list1 (of recently accessed folios)
t2 -> migration_list2

How do I get the intersection of migration_list1 and migration_list2,
so that instead of migrating on first access we promote only the hotter
pages?
I had a few solutions in mind (on which I wanted to get opinions /
suggestions from experts during LSFMM):

1. Reuse DAMON VA scanning, with the scanning parameters controlled by
kmmscand (current heuristics).
2. Can we use LRU information to filter the access list (LRU active /
folio is in the (n-1) generation?)
(I do see Kinseyho just posted an LRU-based approach.)
3. Can we split the address range into 2MB chunks to monitor? PMD-level
access monitoring.
4. Any possible ways of using Bloom filters for list1 and list2.
- Raghu
[snip...]