Message-Id: <20241210185357.81214-1-sj@kernel.org>
Date: Tue, 10 Dec 2024 10:53:57 -0800
From: SeongJae Park <sj@...nel.org>
To: Raghavendra K T <raghavendra.kt@....com>
Cc: SeongJae Park <sj@...nel.org>,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
gourry@...rry.net,
nehagholkar@...a.com,
abhishekd@...a.com,
david@...hat.com,
ying.huang@...el.com,
nphamcs@...il.com,
akpm@...ux-foundation.org,
hannes@...xchg.org,
feng.tang@...el.com,
kbusch@...a.com,
bharata@....com,
Hasan.Maruf@....com,
willy@...radead.org,
kirill.shutemov@...ux.intel.com,
mgorman@...hsingularity.net,
vbabka@...e.cz,
hughd@...gle.com,
rientjes@...gle.com,
shy828301@...il.com,
Liam.Howlett@...cle.com,
peterz@...radead.org,
mingo@...hat.com
Subject: Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
Hello Raghavendra,
Thank you for posting this nice patch series. I gave you some feedback
offline, and I'm adding it here again for transparency, since I'm grateful for
this public discussion.
On Sun, 1 Dec 2024 15:38:08 +0000 Raghavendra K T <raghavendra.kt@....com> wrote:
> Introduction:
> =============
> This patchset is an outcome of an ongoing collaboration between AMD and Meta.
> Meta wanted to explore an alternative page promotion technique as they
> observe high latency spikes in their workloads that access CXL memory.
>
> In the current hot page promotion, all the activities including the
> process address space scanning, NUMA hint fault handling and page
> migration is performed in the process context. i.e., scanning overhead is
> borne by applications.
Yet another approach is using DAMON. DAMON does access monitoring, and further
allows users to request access pattern-driven system operations under the name
of DAMOS (Data Access Monitoring-based Operation Schemes). Using it, users can
request DAMON to find hot pages and promote them, while finding cold pages and
demoting them. SK hynix has built their CXL-based memory capacity expansion
solution in this way
(https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion). We collaboratively
developed new DAMON features for that, and those have all been in the mainline
since Linux v6.11.
I also proposed an idea for advancing it using DAMOS auto-tuning on more
general (>2 tiers) setups
(https://lore.kernel.org/20231112195602.61525-1-sj@...nel.org). I haven't had
time to further implement and test the idea so far, though.
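As a rough illustration of the DAMOS-based promotion mentioned above, here is a
minimal Python sketch that lists the sysfs writes for one physical address
space scheme using the migrate_hot action. It only builds (path, value) pairs
and does not touch the system. The file names are intended to follow the DAMON
sysfs interface described in Documentation/admin-guide/mm/damon/usage.rst, but
please treat them as assumptions and verify them against your kernel. The
helper name promotion_scheme_writes(), the address range, the threshold and
the target node are all hypothetical.

```python
from typing import List, Tuple

# Assumed root of the DAMON sysfs interface; verify on your kernel.
DAMON_ROOT = "/sys/kernel/mm/damon/admin/kdamonds"
ULONG_MAX = str(2**64 - 1)  # used as "no upper bound" on a 64-bit kernel


def promotion_scheme_writes(slow_start: int, slow_end: int, target_nid: int,
                            min_nr_accesses: int = 1) -> List[Tuple[str, str]]:
    """Return (path, value) pairs, in write order, for one migrate_hot scheme.

    slow_start/slow_end: physical address range of the slow tier (hypothetical).
    target_nid: NUMA node to promote hot regions to.
    """
    kd = f"{DAMON_ROOT}/0"
    ctx = f"{kd}/contexts/0"
    region = f"{ctx}/targets/0/regions/0"
    scheme = f"{ctx}/schemes/0"
    pattern = f"{scheme}/access_pattern"
    return [
        (f"{DAMON_ROOT}/nr_kdamonds", "1"),
        (f"{kd}/contexts/nr_contexts", "1"),
        (f"{ctx}/operations", "paddr"),          # physical address space
        (f"{ctx}/targets/nr_targets", "1"),
        (f"{ctx}/targets/0/regions/nr_regions", "1"),
        (f"{region}/start", str(slow_start)),
        (f"{region}/end", str(slow_end)),
        (f"{ctx}/schemes/nr_schemes", "1"),
        (f"{scheme}/action", "migrate_hot"),
        (f"{scheme}/target_nid", str(target_nid)),
        # Match regions of any size and age that show enough accesses.
        (f"{pattern}/sz/min", "4096"),
        (f"{pattern}/sz/max", ULONG_MAX),
        (f"{pattern}/nr_accesses/min", str(min_nr_accesses)),
        (f"{pattern}/nr_accesses/max", ULONG_MAX),
        (f"{pattern}/age/min", "0"),
        (f"{pattern}/age/max", ULONG_MAX),
        (f"{kd}/state", "on"),                   # start the kdamond last
    ]
```

Writing each value to its path in this order (as root) would be the actual
configuration step. Quotas, filters and watermarks are omitted to keep the
sketch short.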
>
> This is an early RFC patch series to do (slow tier) CXL page promotion.
> The approach in this patchset assists/addresses the issue by adding PTE
> Accessed bit scanning.
>
> Scanning is done by a global kernel thread which routinely scans all
> the processes' address spaces and checks for accesses by reading the
> PTE A bit. It then migrates/promotes the pages to the toptier node
> (node 0 in the current approach).
>
> Thus, the approach pushes overhead of scanning, NUMA hint faults and
> migrations off from process context.
DAMON also uses the PTE A bit as its major source of access information. And
DAMON does both access monitoring and promotion/demotion in a global kernel
thread, namely kdamond. Hence the DAMON-based approach would also offload the
overheads from process context. So I feel your approach is somewhat similar to
the DAMON-based one, and we might have a chance to avoid unnecessary
duplication.
[...]
>
> Limitations:
> ===========
> PTE A bit scanning approach lacks information about exact destination
> node to migrate to.
This is the same for the DAMON-based approach, since DAMON also uses the PTE A
bit as the major source of the information. We aim to extend DAMON to be aware
of the access source CPU, and use it for solving this problem, though.
Utilizing page faults or AMD IBS-like h/w features is among the ideas on the
table.
>
> Notes/Observations on design/Implementations/Alternatives/TODOs...
> ================================
> 1. Fine-tuning scan throttling
DAMON allows users to set the upper limit of the monitoring overhead, using
the max_nr_regions parameter. It then provides best-effort accuracy within
that limit. We also have ongoing projects for making it more accurate and
easier to tune.
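To make the bound concrete, here is a minimal Python sketch of the arithmetic.
It assumes DAMON's region-based sampling, in which at most one access check is
done per region per sampling interval, so max_nr_regions caps the number of
checks. The function name and the example values are hypothetical.

```python
def max_checks_per_second(max_nr_regions: int, sample_us: int) -> float:
    """Upper bound on access checks per second.

    Assumes at most one check per region per sampling interval, so the
    bound does not depend on the size of the monitored memory.
    """
    if max_nr_regions <= 0 or sample_us <= 0:
        raise ValueError("max_nr_regions and sample_us must be positive")
    return max_nr_regions * 1_000_000 / sample_us


# Example values only: 1000 regions sampled every 5 ms.
bound = max_checks_per_second(max_nr_regions=1000, sample_us=5000)
```

Under this assumption, halving max_nr_regions or doubling the sampling
interval halves the worst-case scanning work, at the cost of coarser results.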
>
> 2. Use migrate_balanced_pgdat() to balance toptier node before migration
> OR Use migrate_misplaced_folio_prepare() directly.
> But it may need some optimizations (for e.g., invoke occasionally so
> that overhead is not there for every migration).
>
> 3. Explore if a separate PAGE_EXT flag is needed instead of reusing
> PAGE_IDLE flag (cons: complicates PTE A bit handling in the system),
> But practically does not look good idea.
>
> 4. Use timestamp information-based migration (Similar to numab mode=2).
> instead of migrating immediately when PTE A bit set.
> (cons:
> - It will not be accurate since it is done outside of process
> context.
> - Performance benefit may be lost.)
DAMON provides a sort of time-based aggregated monitoring results. And DAMOS
provides prioritization of pages based on the access temperature. Hence, the
DAMON-based approach can also be used for a similar purpose (promoting not
every accessed page but pages that are more frequently used over a longer
time).
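The idea of promoting only regions that stay hot for a while can be sketched
as below. This is a simplified Python illustration, not DAMON's actual scoring
or data structure: the Region class only borrows the nr_accesses and age
notions, and the ordering rule, thresholds and byte quota are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    start: int        # start address of the region
    end: int          # end address of the region (exclusive)
    nr_accesses: int  # accesses observed in the last aggregation interval
    age: int          # aggregation intervals the pattern has persisted

    @property
    def size(self) -> int:
        return self.end - self.start


def pick_promotion_candidates(regions: List[Region], min_nr_accesses: int,
                              min_age: int, quota_bytes: int) -> List[Region]:
    """Pick hot, long-lived regions, hottest first, within a byte quota."""
    hot = [r for r in regions
           if r.nr_accesses >= min_nr_accesses and r.age >= min_age]
    # Hypothetical priority: access frequency first, then persistence.
    hot.sort(key=lambda r: (r.nr_accesses, r.age), reverse=True)
    picked: List[Region] = []
    used = 0
    for r in hot:
        if used + r.size > quota_bytes:
            continue  # does not fit in the remaining quota, try smaller ones
        picked.append(r)
        used += r.size
    return picked
```

A region that was accessed only just now (age 0) is skipped even if its
nr_accesses is high, which is the difference from migrating immediately when
the PTE A bit is found set.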
>
> 5. Explore if we need to use PFN information + hash list instead of
> simple migration list. Here scanning is directly done with PFN belonging
> to CXL node.
DAMON supports physical address space monitoring, and maintains the access
monitoring results in its own data structure called damon_region. So I think
a similar benefit can be achieved using DAMON?
[...]
> 8. Using DAMON APIs OR Reusing part of DAMON which already tracks range of
> physical addresses accessed.
My biased humble opinion is that it would be very nice to explore this
opportunity, since I see some similarities and opportunities to solve some of
the challenges of your approach in an easier way. Even if it turns out that
DAMON cannot be used for your use case, failing earlier is a good thing, I'd
say :)
>
> 9. Gregory has nicely mentioned some details/ideas on different approaches in
> [1] : development notes, in the context of promoting unmapped page cache folios.
DAMON supports monitoring accesses to unmapped page cache folios, so hopefully
DAMON-based approaches can also solve this issue.
>
> 10. SJ had pointed about concerns about kernel-thread based approaches as in
> kstaled [2]. So current patchset has tried to address the issue with simple
> algorithms to reduce CPU overhead. Migration throttling, Running the daemon
> in NICE priority, Parallelizing migration with scanning could help further.
>
> 11. Toptier pages scanned can be used to assist current NUMAB by providing information
> on hot VMAs.
>
> Credits
> =======
> Thanks to Bharata, Joannes, Gregory, SJ, Chris for their valuable comments and
> support.
I also learned many things from the great discussions, thank you :)
[...]
>
> Links:
> [1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@gourry.net/
> [2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@google.com/#r
> [3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@hirez.programming.kicks-ass.net/
>
> I might have CCed more people or less people than needed
> unintentionally.
Thanks,
SJ
[...]