Message-ID: <71ac5779-d535-4b0f-bf8d-7a60bf6a6ecf@nvidia.com>
Date: Wed, 17 Sep 2025 13:20:49 +1000
From: Balbir Singh <balbirs@...dia.com>
To: Wei Xu <weixugc@...gle.com>, David Rientjes <rientjes@...gle.com>,
Bharata B Rao <bharata@....com>
Cc: Gregory Price <gourry@...rry.net>, Matthew Wilcox <willy@...radead.org>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org,
Jonathan.Cameron@...wei.com, dave.hansen@...el.com, hannes@...xchg.org,
mgorman@...hsingularity.net, mingo@...hat.com, peterz@...radead.org,
raghavendra.kt@....com, riel@...riel.com, sj@...nel.org,
ying.huang@...ux.alibaba.com, ziy@...dia.com, dave@...olabs.net,
nifan.cxl@...il.com, xuezhengchu@...wei.com, yiannis@...corp.com,
akpm@...ux-foundation.org, david@...hat.com, byungchul@...com,
kinseyho@...gle.com, joshua.hahnjy@...il.com, yuanchu@...gle.com,
alok.rathore@...sung.com
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion
infrastructure

On 9/17/25 10:30, Wei Xu wrote:
> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@...gle.com> wrote:
>>
>> On Wed, 10 Sep 2025, Gregory Price wrote:
>>
>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
>>>>> This patchset introduces a new subsystem for hot page tracking
>>>>> and promotion (pghot) that consolidates memory access information
>>>>> from various sources and enables centralized promotion of hot
>>>>> pages across memory tiers.
>>>>
>>>> Just to be clear, I continue to believe this is a terrible idea and we
>>>> should not do this. If systems will be built with CXL (and given the
>>>> horrendous performance, I cannot see why they would be), the kernel
>>>> should not be migrating memory around like this.
>>>
>>> I've been considering this problem from the opposite approach since LSFMM.
>>>
>>> Rather than decide how to move stuff around, what if instead we just
>>> decide never to put certain classes of memory on CXL? Right now, so
>>> long as CXL is in the page allocator, it's the wild west - any page can
>>> end up anywhere.
>>>
>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
>>> workloads to show local CXL expansion is valuable and performant enough
>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE
>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
>>> CXL, but allows any given user-driven page allocation (including page
>>> cache, file, and anon mappings) to land there.
>>>
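To make that mechanism concrete, here is a minimal sketch (not from this
series) of why a ZONE_MOVABLE-only node never sees kernel allocations,
assuming current gfp_zone() semantics: GFP_KERNEL never carries
__GFP_MOVABLE, while typical anon/page cache data is allocated with
GFP_HIGHUSER_MOVABLE.

#include <linux/bug.h>
#include <linux/gfp.h>

/* Sketch only: zone selection for kernel vs. user data allocations. */
static void movable_only_illustration(void)
{
        /* Kernel/slab metadata: no __GFP_MOVABLE, so never ZONE_MOVABLE. */
        WARN_ON(gfp_zone(GFP_KERNEL) == ZONE_MOVABLE);

        /* User data: __GFP_MOVABLE set, eligible for the movable-only node. */
        WARN_ON(gfp_zone(GFP_HIGHUSER_MOVABLE) != ZONE_MOVABLE);
}
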
>>
>> This is similar to our use case, although direct allocation can be
>> controlled by cpusets or mempolicies as needed, depending on the memory
>> access latency required for the workload. Nothing new there, though: it's
>> the same argument as for NUMA in general, and the abstraction of these far
>> memory nodes as separate NUMA nodes makes this very straightforward.
>>
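For what it's worth, a userspace sketch of what that control can look like,
assuming node 0 is the local DRAM node (the node number and helper name are
made up for illustration):

#include <numaif.h>
#include <stddef.h>

/*
 * Pin a latency-sensitive buffer to the local DRAM node so it can
 * never be placed on the far-memory/CXL node.
 */
static long bind_to_local_dram(void *buf, size_t len)
{
        unsigned long nodemask = 1UL << 0;      /* allow node 0 only */

        return mbind(buf, len, MPOL_BIND, &nodemask,
                     sizeof(nodemask) * 8, MPOL_MF_MOVE);
}
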
>>> I'm hoping to share some of this data in the coming months.
>>>
>>> I've yet to see any strong indication that a complex hotness/movement
>>> system is warranted (yet) - but that may simply be because we have
>>> local cards with no switching involved. So far LRU-based promotion and
>>> demotion has been sufficient.
>>>
>>
>> To me, this is a key point. As we've discussed in meetings, we're in the
>> early days here. The CHMU does provide a lot of flexibility, both to
>> create very good and very bad hotness trackers. But I think the key point
>> is that we have multiple sources of hotness information depending on the
>> platform and some of these sources only make sense for the kernel (or a
>> BPF offload) to maintain as the source of truth. Some of these sources
>> will be clear-on-read, so only one entity can serve as the source of
>> truth of page hotness.
>>
>> I've been pretty focused on the promotion story here rather than demotion
>> because of how responsive it needs to be. Harvesting the page table
>> accessed bits or waiting on a sliding window through NUMA Balancing (even
>> NUMAB=2) is not as responsive as needed for very fast promotion to top
>> tier memory, hence things like the CHMU (or PEBS or IBS etc).
>>
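(For reference, "NUMAB=2" is the NUMA_BALANCING_MEMORY_TIERING mode of the
numa_balancing sysctl; a trivial userspace sketch of enabling it:)

#include <stdio.h>

/* Enable NUMA balancing in memory-tiering mode (NUMAB=2). */
static int enable_numab_tiering(void)
{
        FILE *f = fopen("/proc/sys/kernel/numa_balancing", "w");

        if (!f)
                return -1;
        fputs("2\n", f);
        return fclose(f);
}
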
>> A few things that I think we need to discuss and align on:
>>
>> - the kernel as the source of truth for all memory hotness information,
>> which can then be abstracted and used for multiple downstream purposes,
>> memory tiering only being one of them
>>
>> - the long-term plan for NUMAB=2 and memory tiering support in the kernel
>> in general: are we planning on supporting this through NUMA hint faults
>> forever, despite their drawbacks (too slow, too much overhead for KVM)?
>>
>> - the role of the kernel vs userspace in driving the memory migration;
>> lots of discussion on hardware assists that can be leveraged for memory
>> migration but today the balancing is driven in process context. The
>> kthread as the driver of migration is an argument that has yet to be
>> fully sold, but it is where a number of companies are currently looking
>>
>> There's also some feature support possible with these CXL memory
>> expansion devices that have started to pop up in labs, which can also
>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
>> chime in as well.
>>
>> This topic seems due for an alignment session as well, so I will look to
>> get that scheduled in the coming weeks if people are up for it.
>
> Our experience is that workloads in hyperscale data centers such as
> Google often have significant cold memory. Offloading this to CXL memory
> devices, backed by cheaper, lower-performance media (e.g. DRAM with
> hardware compression), can be a practical approach to reduce overall
> TCO. Page promotion and demotion are then critical for such a tiered
> memory system.
>
> A kernel thread to drive hot page collection and promotion seems
> logical, especially since hot page data from new sources (e.g. CHMU)
> are collected outside the process execution context and in the form of
> physical addresses.
>
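Roughly, a consumer of such physical-address hotness records has to do
something like the sketch below; promote_to_node() is a hypothetical
placeholder, not the helper used by this series:

#include <linux/mm.h>
#include <linux/pfn.h>

/* Hypothetical promotion helper; not an existing kernel API. */
void promote_to_node(struct folio *folio, int target_nid);

static void promote_hot_paddr(u64 paddr, int target_nid)
{
        unsigned long pfn = PHYS_PFN(paddr);
        struct folio *folio;

        if (!pfn_valid(pfn))
                return;

        folio = pfn_folio(pfn);
        if (folio_nid(folio) == target_nid)
                return;

        /* The folio may be racing with a free; take a reference first. */
        if (!folio_try_get(folio))
                return;

        promote_to_node(folio, target_nid);     /* hypothetical helper */
        folio_put(folio);
}
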
> I do agree that we need to balance the complexity and benefits of any
> new data structures for hotness tracking.

I think there is a mismatch between the tiering structure and the patches.
If you look at the example in the memory tiering code:
/*
 * ...
 * Example 3:
 *
 * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
 *
 * node distances:
 * node    0    1    2
 *    0   10   20   30
 *    1   20   10   40
 *    2   30   40   10
 *
 * memory_tiers0 = 1
 * memory_tiers1 = 0
 * memory_tiers2 = 2
 *..
 */
The topmost tier need not be DRAM; patch 3 states:
"
[..]
* kpromoted is a kernel thread that runs on each toptier node and
* promotes pages from max_heap.
"

Also, there is no data in the cover letter to indicate which workloads benefit
from migration to the top tier and by how much.
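
Coming back to the toptier point, a rough sketch (assuming the existing
node_is_toptier() helper; not code from the series) of which nodes a
per-toptier-node kthread would be instantiated on. With Example 3 above the
topmost tier is the HBM node, i.e. "toptier" is not synonymous with DRAM:

#include <linux/memory-tiers.h>
#include <linux/nodemask.h>
#include <linux/printk.h>

/* List the nodes a per-toptier-node kthread would run on. */
static void dump_toptier_nodes(void)
{
        int nid;

        for_each_online_node(nid)
                if (node_is_toptier(nid))
                        pr_info("toptier node: %d\n", nid);
}
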
Balbir