Message-Id: <5A7E0646-0324-4463-8D93-A1105C715EB3@gmail.com>
Date: Thu, 25 Sep 2025 16:03:46 +0200
From: Yiannis Nikolakopoulos <yiannis.nikolakop@...il.com>
To: Jonathan Cameron <jonathan.cameron@...wei.com>
Cc: Wei Xu <weixugc@...gle.com>,
David Rientjes <rientjes@...gle.com>,
Gregory Price <gourry@...rry.net>,
Matthew Wilcox <willy@...radead.org>,
Bharata B Rao <bharata@....com>,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
dave.hansen@...el.com,
hannes@...xchg.org,
mgorman@...hsingularity.net,
mingo@...hat.com,
peterz@...radead.org,
raghavendra.kt@....com,
riel@...riel.com,
sj@...nel.org,
ying.huang@...ux.alibaba.com,
ziy@...dia.com,
dave@...olabs.net,
nifan.cxl@...il.com,
xuezhengchu@...wei.com,
akpm@...ux-foundation.org,
david@...hat.com,
byungchul@...com,
kinseyho@...gle.com,
joshua.hahnjy@...il.com,
yuanchu@...gle.com,
balbirs@...dia.com,
alok.rathore@...sung.com,
yiannis@...corp.com,
Adam Manzanares <a.manzanares@...sung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion
infrastructure
> On 17 Sep 2025, at 18:49, Jonathan Cameron <jonathan.cameron@...wei.com> wrote:
>
> On Tue, 16 Sep 2025 17:30:46 -0700
> Wei Xu <weixugc@...gle.com> wrote:
>
>> On Tue, Sep 16, 2025 at 12:45 PM David Rientjes <rientjes@...gle.com> wrote:
>>>
>>> On Wed, 10 Sep 2025, Gregory Price wrote:
>>>
>>>> On Wed, Sep 10, 2025 at 04:39:16PM +0100, Matthew Wilcox wrote:
>>>>> On Wed, Sep 10, 2025 at 08:16:45PM +0530, Bharata B Rao wrote:
>>>>>> This patchset introduces a new subsystem for hot page tracking
>>>>>> and promotion (pghot) that consolidates memory access information
>>>>>> from various sources and enables centralized promotion of hot
>>>>>> pages across memory tiers.
>>>>>
>>>>> Just to be clear, I continue to believe this is a terrible idea and we
>>>>> should not do this. If systems will be built with CXL (and given the
>>>>> horrendous performance, I cannot see why they would be), the kernel
>>>>> should not be migrating memory around like this.
>>>>
>>>> I've been considering this problem from the opposite approach since LSFMM.
>>>>
>>>> Rather than decide how to move stuff around, what if instead we just
>>>> decide not to ever put certain classes of memory on CXL. Right now, so
>>>> long as CXL is in the page allocator, it's the wild west - any page can
>>>> end up anywhere.
>>>>
>>>> I have enough data now from ZONE_MOVABLE-only CXL deployments on real
>>>> workloads to show local CXL expansion is valuable and performant enough
>>>> to be worth deploying - but the key piece for me is that ZONE_MOVABLE
>>>> disallows GFP_KERNEL. For example: this keeps SLAB meta-data out of
>>>> CXL, but allows any given user-driven page allocation (including page
>>>> cache, file, and anon mappings) to land there.
>>>>
>>>
[snip]
>>> There's also some feature support that is possible with these CXL memory
>>> expansion devices that have started to pop up in labs that can also
>>> drastically reduce overall TCO. Perhaps Wei Xu, cc'd, will be able to
>>> chime in as well.
>>>
>>> This topic seems due for an alignment session as well, so will look to get
>>> that scheduled in the coming weeks if people are up for it.
>>
>> Our experience is that workloads in hyper-scale data centers such as
>> Google often have significant cold memory. Offloading this to CXL memory
>> devices, backed by cheaper, lower-performance media (e.g. DRAM with
>> hardware compression), can be a practical approach to reduce overall
>> TCO. Page promotion and demotion are then critical for such a tiered
>> memory system.
>
> For the hardware compression devices, how are you dealing with capacity
> variation / overcommit?
I understand that this is indeed one of the key questions from the upstream
kernel’s perspective.
So, I am jumping in to answer with respect to what we do at ZeroPoint;
obviously I cannot speak for other solutions/deployments. However, our HW
interface follows existing open specifications from OCP [1], so what I am
describing below is more widely applicable.
At a very high level, the way our HW works is that the DPA is indeed
overcommitted. There is then a control plane over CXL.io (PCIe) which
exposes the real remaining capacity, as well as MSI-X interrupts that
raise warnings when the capacity crosses certain configurable thresholds.
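To make that a bit more concrete, here is a rough sketch of how a
control-plane driver could read the remaining capacity and program the
warning thresholds. The register names below are made up for illustration;
they are not the actual OCP layout:

struct zp_ctrl_regs {
	u64 media_capacity_total;	/* advertised (overcommitted) DPA size */
	u64 media_capacity_free;	/* real remaining backing capacity */
	u64 warn_threshold;		/* warn IRQ when free drops below this */
	u64 crit_threshold;		/* critical IRQ when free drops below this */
};

static u64 zp_remaining_capacity(struct zp_ctrl_regs __iomem *regs)
{
	return readq(&regs->media_capacity_free);
}

static void zp_set_thresholds(struct zp_ctrl_regs __iomem *regs,
			      u64 warn, u64 crit)
{
	writeq(warn, &regs->warn_threshold);
	writeq(crit, &regs->crit_threshold);
}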
Last year I presented this interface at LSF/MM [2]. Based on the feedback I
got there, we have an early prototype that acts as the *last* memory tier
before reclaim (a kind of "compressed tier in lieu of discard", as
suggested to me by Dan).
What is different from standard tiering is that the control plane is
checked on demotion to make sure there is still capacity left. If not, the
demotion fails. While this seems stable so far, a missing piece is to
ensure that this tier is mainly written to by demotions and not by
arbitrary kernel allocations (at least as a starting point). I want to
explore how mempolicies can help there, or something along the lines of
what Gregory described.
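Roughly, the demotion-time check in the prototype looks something like the
following (illustrative only; none of these helpers exist upstream, and
the RFC will propose the actual abstraction):

static bool zp_tier_can_accept(struct zp_ctrl_regs __iomem *regs,
			       unsigned long nr_pages)
{
	/* refuse the demotion if the real backing media cannot hold it */
	return zp_remaining_capacity(regs) >= (u64)nr_pages << PAGE_SHIFT;
}

	/* in the demotion path, before migrating pages to this tier */
	if (!zp_tier_can_accept(regs, nr_to_demote))
		return -ENOSPC;	/* demotion fails, pages stay where they are */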
This early prototype still needs quite some work to find the right
abstractions. Hopefully, I will be able to post an RFC in the near future
(a couple of months).
> Whilst there have been some discussions on that, without a backing store
> of flash or similar it seems challenging to use compressed memory in a
> tiering system (so as 'normalish' memory) unless you don't mind
> occasionally and unexpectedly running out of memory (in nasty async ways
> as dirty cache lines get written back).
There are several things that can be done on the device side; for now, I
think the kernel should stay unaware of these. But with what I described
above, the goal is to configure the capacity thresholds so that we can
absorb the occasional dirty cache lines that are written back.
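As a purely hypothetical sizing example (the estimate helper below is made
up), the idea is to leave enough headroom below the critical threshold to
cover dirty data that may still be written back after demotions have been
cut off:

	u64 dirty_headroom = worst_case_dirty_bytes();	/* hypothetical estimate */

	zp_set_thresholds(regs,
			  2 * dirty_headroom,	/* warn: start backing off demotions */
			  dirty_headroom);	/* critical: refuse new demotions */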
>
> Or do you mean zswap type use with a hardware offload of the actual
> compression?
I would categorize this as a completely different discussion (and product
line for us).
[1] https://www.opencompute.org/documents/hyperscale-tiered-memory-expander-specification-for-compute-express-link-cxl-1-pdf
[2] https://www.youtube.com/watch?v=tXWEbaJmZ_s
Thanks,
Yiannis
PS: Sending from a personal email address to avoid issues with
confidentiality footers of the corporate domain.