Message-ID: <aNVUj0s30rrXEh4C@gourry-fedora-PF4VCD3F>
Date: Thu, 25 Sep 2025 10:41:19 -0400
From: Gregory Price <gourry@...rry.net>
To: Yiannis Nikolakopoulos <yiannis.nikolakop@...il.com>
Cc: Jonathan Cameron <jonathan.cameron@...wei.com>,
Wei Xu <weixugc@...gle.com>, David Rientjes <rientjes@...gle.com>,
Matthew Wilcox <willy@...radead.org>,
Bharata B Rao <bharata@....com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, dave.hansen@...el.com, hannes@...xchg.org,
mgorman@...hsingularity.net, mingo@...hat.com, peterz@...radead.org,
raghavendra.kt@....com, riel@...riel.com, sj@...nel.org,
ying.huang@...ux.alibaba.com, ziy@...dia.com, dave@...olabs.net,
nifan.cxl@...il.com, xuezhengchu@...wei.com,
akpm@...ux-foundation.org, david@...hat.com, byungchul@...com,
kinseyho@...gle.com, joshua.hahnjy@...il.com, yuanchu@...gle.com,
balbirs@...dia.com, alok.rathore@...sung.com, yiannis@...corp.com,
Adam Manzanares <a.manzanares@...sung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion
infrastructure
On Thu, Sep 25, 2025 at 04:03:46PM +0200, Yiannis Nikolakopoulos wrote:
> >
> > For the hardware compression devices how are you dealing with capacity variation
> > / overcommit?
...
> What is different from standard tiering is that the control plane is
> checked on demotion to make sure there is still capacity left. If not, the
> demotion fails. While this seems stable so far, a missing piece is to
> ensure that this tier is mainly written by demotions and not arbitrary kernel
> allocations (at least as a starting point). I want to explore how mempolicies
> can help there, or something of the sort that Gregory described.
>
Writing back the description as I understand it:
1) The intent is to have this memory allocable only via demotion
(i.e. no fault or direct allocation from userland possible)
2) The intent is to keep this memory directly accessible (DMA) while
compressed, without triggering a fault/promotion on access
(i.e. no zswap faults)
3) The intent is to have external monitoring software handle run-away
decompression/hotness by promoting that data off the device.
So basically we want a zswap-like interface for allocation, but one that
retains the `struct page` and its page table mappings such that no
faults are incurred on access. Then, if the page becomes hot, we depend
on some kind of HMU tiering system to get it off the device.
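To make that concrete, here is a rough sketch of where the capacity
check could sit in the demotion path. device_has_capacity() is a
made-up placeholder for whatever the device's control plane exposes;
the rest just mirrors the existing demotion allocation in mm/vmscan.c:

/*
 * Hypothetical sketch, not from the posted series.
 * device_has_capacity() stands in for the compressed-memory device's
 * control-plane query.
 */
#include <linux/migrate.h>
#include "internal.h"	/* struct migration_target_control */

static struct folio *alloc_demote_folio_checked(struct folio *src,
						unsigned long private)
{
	struct migration_target_control *mtc = (void *)private;

	/*
	 * Out of post-compression capacity? Fail the allocation so the
	 * demotion fails and the folio stays on its current node.
	 */
	if (!device_has_capacity(mtc->nid))
		return NULL;

	/* Otherwise allocate on the demotion target node as usual. */
	return alloc_migration_target(src, private);
}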
I think we all understand there's some bear we have to outrun to deal
with problem #3 - and many of us are skeptical that the bear won't catch
us with our pants down. Let's ignore this for the moment.
If such a device's memory is added to the default page allocator, then
the question becomes one of *isolation* - such that the kernel will
provide some "Capital-G Guarantee" that certain NUMA nodes will NEVER
be used except under very explicit scenarios.
There are only 3 mechanisms with which to restrict this (presently):
1) ZONE membership (to disallow GFP_KERNEL; sketched below)
2) cgroups->cpusets->mems_allowed
3) task/vma mempolicy
(obvious #4: Don't put it in the default page allocator)
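For #1, the familiar example is onlining a node's memory blocks as
ZONE_MOVABLE so that plain GFP_KERNEL allocations can never be
satisfied from it. A minimal userspace sketch (the memory block number
is arbitrary; real tooling walks all of the node's blocks):

/*
 * Sketch: push a hot-plugged memory block into ZONE_MOVABLE via sysfs,
 * which keeps unmovable GFP_KERNEL allocations off of it.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/devices/system/memory/memory42/state";
	const char *cmd = "online_movable";
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, cmd, strlen(cmd)) < 0)
		perror(path);
	if (fd >= 0)
		close(fd);
	return 0;
}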
cpusets and mempolicy are not sufficient to provide full isolation:
- cgroups have the opposite hierarchical relationship to the one desired.
The parent cgroup locks all child cgroups out of any nodes not present
in the parent's mems_allowed. e.g. if you lock out access from the root
cgroup, no cgroup on the entire system is eligible to allocate the
memory. If you don't lock out the root cgroup, any root cgroup task is
eligible. This isn't tractable.
- task/vma mempolicy gets ignored in many cases and is closer to a
suggestion than something enforceable. It's also subject to rebinding
whenever a task's cpuset mems_allowed changes (sketched below).
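To make the mempolicy point concrete, a small userspace sketch (uses
libnuma's numaif.h, linked with -lnuma; the node number is arbitrary):

/*
 * Sketch: even a "strict" MPOL_BIND is a per-task/per-VMA policy.
 * Much of the kernel-side allocation done on a task's behalf doesn't
 * consult it, and without MPOL_F_STATIC_NODES the nodemask gets
 * remapped when the task's cpuset mems_allowed changes underneath it.
 */
#include <numaif.h>	/* set_mempolicy(), MPOL_BIND, MPOL_F_STATIC_NODES */
#include <stdio.h>

int main(void)
{
	unsigned long nodemask = 1UL << 0;	/* node 0 only (example) */

	/* MPOL_F_STATIC_NODES: don't remap the mask on cpuset rebinding. */
	if (set_mempolicy(MPOL_BIND | MPOL_F_STATIC_NODES,
			  &nodemask, sizeof(nodemask) * 8 + 1))
		perror("set_mempolicy");

	/*
	 * Allocations by *this task* now avoid other nodes, but this is a
	 * per-task restriction, not system-wide isolation of a node.
	 */
	return 0;
}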
I haven't read up enough on ZONE_DEVICE to understand the implications
of membership there, but have you explored this as an option? I don't
see the work I'm doing intersecting well with your efforts - except
maybe on the vmscan.c work around allocation on demotion.
The work I'm doing is more aligned with - hey, filesystems are a global
resource, why are we using cgroup/task/vma policies to dictate whether a
filesystem's cache is eligible to land in remote nodes? I.e. drawing
better boundaries and controls around what can land in some set of
remote nodes "by default". You're looking for *strong isolation*
controls, which implies a different kind of allocator interface.
~Gregory