Message-ID: <332e842b-a3c9-41f0-af5c-c147661d7997@amd.com>
Date: Wed, 17 Sep 2025 09:45:06 +0530
From: Bharata B Rao <bharata@....com>
To: Balbir Singh <balbirs@...dia.com>, Wei Xu <weixugc@...gle.com>, "David
Rientjes" <rientjes@...gle.com>
CC: Gregory Price <gourry@...rry.net>, Matthew Wilcox <willy@...radead.org>,
<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
<Jonathan.Cameron@...wei.com>, <dave.hansen@...el.com>, <hannes@...xchg.org>,
<mgorman@...hsingularity.net>, <mingo@...hat.com>, <peterz@...radead.org>,
<raghavendra.kt@....com>, <riel@...riel.com>, <sj@...nel.org>,
<ying.huang@...ux.alibaba.com>, <ziy@...dia.com>, <dave@...olabs.net>,
<nifan.cxl@...il.com>, <xuezhengchu@...wei.com>, <yiannis@...corp.com>,
<akpm@...ux-foundation.org>, <david@...hat.com>, <byungchul@...com>,
<kinseyho@...gle.com>, <joshua.hahnjy@...il.com>, <yuanchu@...gle.com>,
<alok.rathore@...sung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion
infrastructure
On 17-Sep-25 8:50 AM, Balbir Singh wrote:
> On 9/17/25 10:30, Wei Xu wrote:
>> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@...gle.com> wrote:
>>>
>>> On Wed, 10 Sep 2025, Gregory Price wrote:
>>>
>>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
>>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
>>>>>> This patchset introduces a new subsystem for hot page tracking
>>>>>> and promotion (pghot) that consolidates memory access information
>>>>>> from various sources and enables centralized promotion of hot
>>>>>> pages across memory tiers.
>>>>>
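[ For context, the series consolidates per-PFN access reports from such
sources into hotness records and promotes the hottest pages. A rough
sketch of the kind of record involved; the struct and field names here
are illustrative, not the patchset's actual layout:

	struct pghot_hotness {
		unsigned long pfn;	/* page reported by a hotness source */
		int nid;		/* lower-tier node it currently lives on */
		unsigned int freq;	/* accumulated access count from all sources */
		unsigned long last_update;	/* jiffies of the most recent report */
	};
]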
>>>>> Just to be clear, I continue to believe this is a terrible idea and we
>>>>> should not do this. If systems will be built with CXL (and given the
>>>>> horrendous performance, I cannot see why they would be), the kernel
>>>>> should not be migrating memory around like this.
>>>>
>>>> I've been considering this problem from the opposite direction since LSFMM.
>>>>
>>>> Rather than decide how to move stuff around, what if instead we just
>>>> decide never to put certain classes of memory on CXL? Right now, so
>>>> long as CXL is in the page allocator, it's the wild west - any page can
>>>> end up anywhere.
>>>>
>>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
>>>> workloads to show that local CXL expansion is valuable and performant
>>>> enough to be worth deploying - but the key piece for me is that
>>>> ZONE_MOVABLE disallows GFP_KERNEL. For example, this keeps SLAB
>>>> metadata out of CXL, but allows any given user-driven page allocation
>>>> (including page cache, file, and anon mappings) to land there.
>>>>
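To illustrate the mechanism being relied on above: allocations can land in
ZONE_MOVABLE only when they carry __GFP_MOVABLE, which GFP_KERNEL does not,
while typical user pages do. A sketch, not code from any patchset:

	static void sketch(void)
	{
		void *meta;
		struct page *page;

		/*
		 * GFP_KERNEL lacks __GFP_MOVABLE, so gfp_zone() never
		 * selects ZONE_MOVABLE for it - SLAB/kernel metadata
		 * stays off a movable-only CXL node.
		 */
		meta = kmalloc(64, GFP_KERNEL);

		/*
		 * User-driven allocations (page cache, anon) use
		 * GFP_HIGHUSER_MOVABLE and so may land there.
		 */
		page = alloc_page(GFP_HIGHUSER_MOVABLE);
	}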
>>>
>>> This is similar to our use case, although the direct allocation can be
>>> controlled by cpusets or mempolicies as needed, depending on the memory
>>> access latency required for the workload. Nothing new there, though; it's
>>> the same argument as NUMA in general, and the abstraction of these far
>>> memory nodes as separate NUMA nodes makes this very straightforward.
>>>
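As a concrete example of that kind of control, a latency-sensitive task can
pin its allocations to the near node with a bind policy. A minimal userspace
sketch; the node numbering is an assumption about the topology:

	#include <numaif.h>
	#include <stdio.h>

	static void bind_to_near_node(void)
	{
		/* Assume node 0 is local DRAM and node 1 the far CXL node. */
		unsigned long nodemask = 1UL << 0;

		/* Restrict this task's future allocations to node 0 only. */
		if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8))
			perror("set_mempolicy");
	}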
>>>> I'm hoping to share some of this data in the coming months.
>>>>
>>>> I've yet to see any strong indication that a complex hotness/movement
>>>> system is warranted - but that may simply be because we have
>>>> local cards with no switching involved. So far, LRU-based promotion and
>>>> demotion has been sufficient.
>>>>
>>>
>>> To me, this is a key point. As we've discussed in meetings, we're in the
>>> early days here. The CHMU does provide a lot of flexibility, both to
>>> create very good and very bad hotness trackers. But I think the key point
>>> is that we have multiple sources of hotness information depending on the
>>> platform, and some of these sources only make sense for the kernel (or a
>>> BPF offload) to maintain as the source of truth. Some of these sources
>>> will be clear-on-read, so only one entity can act as the source of truth
>>> for page hotness.
>>>
>>> I've been pretty focused on the promotion story here rather than demotion
>>> because of how responsive it needs to be. Harvesting the page table
>>> accessed bits or waiting on a sliding window through NUMA Balancing (even
>>> NUMAB=2, the memory-tiering mode of the numa_balancing sysctl) is not as
>>> responsive as needed for very fast promotion to top-tier memory, hence
>>> things like the CHMU (or PEBS or IBS, etc.).
>>>
>>> A few things that I think we need to discuss and align on:
>>>
>>> - the kernel as the source of truth for all memory hotness information,
>>> which can then be abstracted and used for multiple downstream purposes,
>>> memory tiering only being one of them
>>>
>>> - the long-term plan for NUMAB=2 and memory tiering support in the kernel
>>> in general: are we planning on supporting this through NUMA hint faults
>>> forever, despite their drawbacks (too slow, too much overhead for KVM)?
>>>
>>> - the role of the kernel vs userspace in driving the memory migration;
>>> lots of discussion on hardware assists that can be leveraged for memory
>>> migration, but today the balancing is driven in process context. The
>>> kthread as the driver of migration is not yet a settled argument, but
>>> it is where a number of companies are currently looking
>>>
>>> There's also some feature support possible with these CXL memory
>>> expansion devices that have started to pop up in labs, which can
>>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
>>> chime in as well.
>>>
>>> This topic seems due for an alignment session as well, so I will look to
>>> get that scheduled in the coming weeks if people are up for it.
>>
>> Our experience is that workloads in hyperscale data centers such as
>> Google's often have significant cold memory. Offloading this to CXL memory
>> devices, backed by cheaper, lower-performance media (e.g. DRAM with
>> hardware compression), can be a practical approach to reduce overall
>> TCO. Page promotion and demotion are then critical for such a tiered
>> memory system.
>>
>> A kernel thread to drive hot page collection and promotion seems
>> logical, especially since hot page data from new sources (e.g. the CHMU)
>> is collected outside the process execution context and arrives in the
>> form of physical addresses.
>>
>> I do agree that we need to balance the complexity and benefits of any
>> new data structures for hotness tracking.
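Since sources like the CHMU report physical addresses, the promotion path has
to resolve them back to folios before migrating anything. Roughly along these
lines; queue_for_promotion() is an assumed helper, not the patchset's code:

	/* Resolve a reported physical address back to a folio. */
	unsigned long pfn = pa >> PAGE_SHIFT;
	struct folio *folio = pfn_folio(pfn);

	/* Only LRU folios not already on the top tier are candidates. */
	if (folio_test_lru(folio) && !node_is_toptier(folio_nid(folio)))
		queue_for_promotion(folio);	/* assumed helper */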
>
>
> I think there is a mismatch between the tiering structure and
> the patches. If you look at the example in the memory tiering code:
>
> /*
>  * ...
>  * Example 3:
>  *
>  * Node 0 is CPU + DRAM nodes, Node 1 is HBM node, node 2 is PMEM node.
>  *
>  * node distances:
>  * node   0    1    2
>  *    0  10   20   30
>  *    1  20   10   40
>  *    2  30   40   10
>  *
>  * memory_tiers0 = 1
>  * memory_tiers1 = 0
>  * memory_tiers2 = 2
>  *..
>  */
>
> The topmost tier need not be DRAM; patch 3 states:
>
> "
> [..]
> * kpromoted is a kernel thread that runs on each toptier node and
> * promotes pages from max_heap.
That comment is not accurate; I will reword it in the next version.
Currently I am using kthread_create_on_node() to create one kernel thread
for each toptier node. I haven't tried this patchset with HBM, but it should
end up creating a kthread for the HBM node too.
However, unlike for regular DRAM nodes, the kthread for an HBM node can't be
bound to any CPU.
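For reference, the per-node threads are created along these lines. This is a
minimal sketch; apart from kthread_create_on_node() itself, the names and the
loop body are illustrative rather than the actual patch:

	static int kpromoted_fn(void *data)
	{
		int nid = (long)data;

		while (!kthread_should_stop()) {
			/* Pick reported hot pages and migrate them to @nid. */
			schedule_timeout_interruptible(HZ);
		}
		return 0;
	}

	static int kpromoted_start(int nid)
	{
		struct task_struct *t;

		t = kthread_create_on_node(kpromoted_fn, (void *)(long)nid,
					   nid, "kpromoted%d", nid);
		if (IS_ERR(t))
			return PTR_ERR(t);
		/*
		 * A DRAM node's thread can be bound to that node's CPUs;
		 * a CPU-less node (HBM or CXL) has none, so its thread
		 * stays unbound.
		 */
		wake_up_process(t);
		return 0;
	}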
>
> Also, there is no data in the cover letter to indicate which workloads
> benefit from migration to top-tier and by how much.
I have been trying to get the tracking infrastructure in place and am hoping
to get some review on that. I will start including numbers from the next
iteration.
Regards,
Bharata.