Message-ID: <CAOi6=wTXSTgTB+KUpn+LOUGvXg4UeEz-DN0mh-LjChn3g8YiHA@mail.gmail.com>
Date: Thu, 16 Oct 2025 13:48:00 +0200
From: Yiannis Nikolakopoulos <yiannis.nikolakop@...il.com>
To: Gregory Price <gourry@...rry.net>
Cc: Jonathan Cameron <jonathan.cameron@...wei.com>, Wei Xu <weixugc@...gle.com>,
David Rientjes <rientjes@...gle.com>, Matthew Wilcox <willy@...radead.org>,
Bharata B Rao <bharata@....com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
dave.hansen@...el.com, hannes@...xchg.org, mgorman@...hsingularity.net,
mingo@...hat.com, peterz@...radead.org, raghavendra.kt@....com,
riel@...riel.com, sj@...nel.org, ying.huang@...ux.alibaba.com, ziy@...dia.com,
dave@...olabs.net, nifan.cxl@...il.com, xuezhengchu@...wei.com,
akpm@...ux-foundation.org, david@...hat.com, byungchul@...com,
kinseyho@...gle.com, joshua.hahnjy@...il.com, yuanchu@...gle.com,
balbirs@...dia.com, alok.rathore@...sung.com, yiannis@...corp.com,
Adam Manzanares <a.manzanares@...sung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
Hi Gregory,
Thanks for all the feedback. I am finally getting some time to come
back to this.
On Thu, Sep 25, 2025 at 4:41 PM Gregory Price <gourry@...rry.net> wrote:
>
> On Thu, Sep 25, 2025 at 04:03:46PM +0200, Yiannis Nikolakopoulos wrote:
> > >
> > > For the hardware compression devices how are you dealing with capacity variation
> > > / overcommit?
> ...
> > What is different from standard tiering is that the control plane is
> > checked on demotion to make sure there is still capacity left. If not, the
> > demotion fails. While this seems stable so far, a missing piece is to
> > ensure that this tier is mainly written by demotions and not arbitrary kernel
> > allocations (at least as a starting point). I want to explore how mempolicies
> > can help there, or something of the sort that Gregory described.
> >
>
> Writing back the description as I understand it:
>
> 1) The intent is to only have this memory allocable via demotion
> (i.e. no fault or direct allocation from userland possible)
Yes, that looks to me like the "safe" way to begin with. In theory
you could have userland apps/middleware that are aware of this memory
and its quirks and are OK to use it, but I guess we can leave that
for later; it feels like something a separate driver could provide.
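To make the "aware application" case concrete, here is a minimal
userspace sketch using the existing mempolicy syscalls. The node
number (2) is purely an assumption for illustration, error handling
is trimmed, and you'd link with -lnuma:

#include <numaif.h>
#include <sys/mman.h>

static void *map_on_compressed_node(size_t len)
{
	/* Node 2 stands in for the compressed-memory node. */
	unsigned long nodemask = 1UL << 2;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	/* MPOL_BIND restricts this mapping's pages to the compressed
	 * node - an explicit opt-in to the device's quirks. */
	mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
	return p;
}

An app calling this is knowingly accepting the capacity/latency
trade-offs of the device, which is why I think it belongs in a
separate driver rather than the default path.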
>
> 2) The intent is to still have this memory accessible directly (DMA),
> while compressed, not trigger a fault/promotion on access
> (i.e. no zswap faults)
Correct. One of the big advantages of CXL.mem is the cache-line
access granularity, and our customers don't want to lose that.
>
> 3) The intent is to have an external monitoring software handle
> outrunning run-away decompression/hotness by promoting that data.
An external monitor is not strictly necessary. E.g. the device's
hotness data could be an additional source of input to the
kpromote/kmigrate solution.
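To illustrate what I mean by an in-kernel input source, a rough
sketch follows. All names here (cxl_dev_pop_hot_pfn(),
kpromote_queue(), struct cxl_hotness_dev) are made up for the sketch;
the series' actual interface may well differ:

/* Driver-side feed of device-measured hotness into the promotion
 * machinery, instead of an external monitoring daemon. */
static void report_device_hotness(struct cxl_hotness_dev *dev)
{
	unsigned long pfn;
	unsigned int freq;

	/* Drain the device's access counters... */
	while (cxl_dev_pop_hot_pfn(dev, &pfn, &freq))
		/* ...and hand hot PFNs to the promotion daemon. */
		kpromote_queue(pfn, freq);
}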
>
> So basically we want a zswap-like interface for allocation, but to
If by "zswap-like interface" you mean something that can reject the
demote (or store according to the zswap semantics) then yes.
I just want to be careful when comparing with zswap.
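Concretely, I'm thinking of something along these lines in the
demotion path. This is a sketch, not the series as posted;
compressed_node_has_capacity() is a stand-in for the device's
control-plane check:

/* Migration-target callback in the style of the demotion path:
 * returning NULL makes migrate_pages() skip the folio, i.e. the
 * demotion is rejected instead of overcommitting the device. */
static struct folio *demote_alloc(struct folio *src, unsigned long private)
{
	int nid = (int)private;		/* compressed-memory node */

	if (!compressed_node_has_capacity(nid))
		return NULL;

	return __folio_alloc_node(GFP_NOWAIT | __GFP_THISNODE |
				  __GFP_NOWARN,
				  folio_order(src), nid);
}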
> retain the `struct page` in page tables such that no faults are incurred
> on access. Then if the page becomes hot, depend on some kind of HMU
> tiering system to get it off the device.
Correct.
>
> I think we all understand there's some bear we have to outrun to deal
> with problem #3 - and many of us are skeptical that the bear won't catch
> up with our pants down. Let's ignore this for the moment.
Agreed.
>
> If such a device's memory is added to the default page allocator, then
> the question becomes one of *isolation* - such that the kernel will
> provide some "Capital-G Guarantee" that certain NUMA nodes will NEVER
> be used except under very explicit scenarios.
>
> There are only 3 mechanisms with which to restrict this (presently):
>
> 1) ZONE membership (to disallow GFP_KERNEL)
> 2) cgroups->cpusets->mems_allowed
> 3) task/vma mempolicy
> (obvious #4: Don't put it in the default page allocator)
>
> cpusets and mempolicy are not sufficient to provide full isolation
> - cgroups have the opposite of the desired hierarchical relationship.
> The parent cgroup locks all child cgroups out of nodes not present
> in the parent's mems_allowed. E.g. if you lock out access from the
> root cgroup, no cgroup on the entire system is eligible to allocate
> the memory. If you don't lock out the root cgroup, any root cgroup
> task is eligible. This isn't tractable.
>
> - task/vma mempolicy gets ignored in many cases and is closer to a
> suggestion than an enforceable policy. It's also subject to rebinding
> when a task's cgroups.cpuset.mems_allowed changes.
>
> I haven't read up enough on ZONE_DEVICE to understand the implications
> of membership there, but have you explored this as an option? I don't
> see the work I'm doing intersecting well with your efforts - except
> maybe on the vmscan.c work around allocation on demotion.
Thanks for the very helpful breakdown. Your take on #2 & #3 seems
reasonable. About #1, I've skimmed through the rest of the thread and
I'll continue addressing your responses there.
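One note on #1 in the meantime: the closest existing precedent I can
see is how gfp_zone() keeps GFP_KERNEL out of ZONE_MOVABLE. A
simplified sketch of that gating (the real gfp_zone() in
include/linux/gfp.h is table-driven):

/* Simplified: an allocation can only fall into ZONE_MOVABLE if the
 * caller passed __GFP_MOVABLE, which GFP_KERNEL does not include.
 * A compressed-memory zone could conceivably gate on a flag the
 * same way. */
static inline bool zone_allows_alloc(gfp_t gfp)
{
	return gfp & __GFP_MOVABLE;
}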
Yiannis
>
> The work I'm doing is more aligned with - hey, filesystems are a global
> resource, why are we using cgroup/task/vma policies to dictate whether a
> filesystem's cache is eligible to land in remote nodes? i.e. drawing
> better boundaries and controls around what can land in some set of
> remote nodes "by default". You're looking for *strong isolation*
> controls, which implies a different kind of allocator interface.
>
> ~Gregory