linux-kernel - Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAOi6=wTXSTgTB+KUpn+LOUGvXg4UeEz-DN0mh-LjChn3g8YiHA@mail.gmail.com>
Date: Thu, 16 Oct 2025 13:48:00 +0200
From: Yiannis Nikolakopoulos <yiannis.nikolakop@...il.com>
To: Gregory Price <gourry@...rry.net>
Cc: Jonathan Cameron <jonathan.cameron@...wei.com>, Wei Xu <weixugc@...gle.com>, 
	David Rientjes <rientjes@...gle.com>, Matthew Wilcox <willy@...radead.org>, 
	Bharata B Rao <bharata@....com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	dave.hansen@...el.com, hannes@...xchg.org, mgorman@...hsingularity.net, 
	mingo@...hat.com, peterz@...radead.org, raghavendra.kt@....com, 
	riel@...riel.com, sj@...nel.org, ying.huang@...ux.alibaba.com, ziy@...dia.com, 
	dave@...olabs.net, nifan.cxl@...il.com, xuezhengchu@...wei.com, 
	akpm@...ux-foundation.org, david@...hat.com, byungchul@...com, 
	kinseyho@...gle.com, joshua.hahnjy@...il.com, yuanchu@...gle.com, 
	balbirs@...dia.com, alok.rathore@...sung.com, yiannis@...corp.com, 
	Adam Manzanares <a.manzanares@...sung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure

Hi Gregory,

Thanks for all the feedback. I am finally getting some time to come
back to this.

On Thu, Sep 25, 2025 at 4:41 PM Gregory Price <gourry@...rry.net> wrote:
>
> On Thu, Sep 25, 2025 at 04:03:46PM +0200, Yiannis Nikolakopoulos wrote:
> > >
> > > For the hardware compression devices how are you dealing with capacity variation
> > > / overcommit?
> ...
> > What is different from standard tiering is that the control plane is
> > checked on demotion to make sure there is still capacity left. If not, the
> > demotion fails. While this seems stable so far, a missing piece is to
> > ensure that this tier is mainly written by demotions and not arbitrary kernel
> > allocations (at least as a starting point). I want to explore how mempolicies
> > can help there, or something of the sort that Gregory described.
> >
>
> Writing back the description as i understand it:
>
> 1) The intent is to only have this memory allocable via demotion
>    (i.e. no fault or direct allocation from userland possible)
Yes that is what looks to me like the "safe" way to begin with. In
theory you could have userland apps/middleware that is aware of this
memory and its quirks and are ok to use it but I guess we can leave
that for later and it feels like it could be provided by a separate
driver.
>
> 2) The intent is to still have this memory accessible directly (DMA),
>    while compressed, not trigger a fault/promotion on access
>    (i.e. no zswap faults)
Correct. One of the big advantages of CXL.mem is the cache-line access
granularity and our customers don't want to lose that.
>
> 3) The intent is to have an external monitoring software handle
>    outrunning run-away decompression/hotness by promoting that data.
External is not strictly necessary. E.g. it could be an additional
source of input to the kpromote/kmigrate solution.
>
> So basically we want a zswap-like interface for allocation, but to
If by "zswap-like interface" you mean something that can reject the
demote (or store according to the zswap semantics) then yes.
I just want to be careful when comparing with zswap.
> retain the `struct page` in page tables such that no faults are incurred
> on access.  Then if the page becomes hot, depend on some kind of HMU
> tiering system to get it off the device.
Correct.
>
> I think we all understand there's some bear we have to outrun to deal
> with problem #3 - and many of us are skeptical that the bear won't catch
> up with our pants down.  Let's ignore this for the moment.
Agreed.
>
> If such a device's memory is added to the default page allocator, then
> the question becomes one of *isolation* - such that the kernel will
> provide some "Capital-G Guarantee" that certain NUMA nodes will NEVER
> be used except under very explicit scenarios.
>
> There are only 3 mechanisms with which to restrict this (presently):
>
> 1) ZONE membership (to disallow GFP_KERNEL)
> 2) cgroups->cpusets->mems_allowed
> 3) task/vma mempolicy
> (obvious #4: Don't put it in the default page allocator)
>
> cpusets and mempolicy are not sufficient to provide full isolation
> - cgroups have the opposite hierarchical relationship than desired.
>   The parent cgroup will lock out all children cgroups from using nodes
>   not present in the parent mems_allowed. e.g. if you lock out access
>   from the root cgroup, no cgroup on the entire system is eligible to
>   allocate the memory.  If you don't lock out the root cgroup - any root
>   cgroup task is eligible.  This isn't tractible.
>
> - task/vma mempolicy gets ignored in many cases and is closer to a
>   suggestion than enforcible.  It's also subject to rebinding as a
>   task's cgroups.cpuset.mems_allowed changes.
>
> I haven't read up enough on ZONE_DEVICE to understand the implications
> of membership there, but have you explored this as an option?  I don't
> see the work i'm doing intersecting well with your efforts - except
> maybe on the vmscan.c work around allocation on demotion.
Thanks for the very helpful breakdown. Your take on #2 & #3 seems
reasonable. About #1, I've skimmed through the rest of the thread and
I'll continue addressing your responses there.

Yiannis
>
> The work i'm doing is more aligned with - hey, filesystems are a global
> resource, why are we using cgroup/task/vma policies to dictate whether a
> filesystem's cache is eligible to land in remote nodes? i.e. drawing
> better boundaries and controls around what can land in some set of
> remote nodes "by default".  You're looking for *strong isolation*
> controls, which implies a different kind of allocator interface.
>
> ~Gregory