Message-ID: <aNzWwz5OYLOjwjLv@gourry-fedora-PF4VCD3F>
Date: Wed, 1 Oct 2025 03:22:43 -0400
From: Gregory Price <gourry@...rry.net>
To: Jonathan Cameron <jonathan.cameron@...wei.com>
Cc: Yiannis Nikolakopoulos <yiannis.nikolakop@...il.com>,
Wei Xu <weixugc@...gle.com>, David Rientjes <rientjes@...gle.com>,
Matthew Wilcox <willy@...radead.org>,
Bharata B Rao <bharata@....com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, dave.hansen@...el.com, hannes@...xchg.org,
mgorman@...hsingularity.net, mingo@...hat.com, peterz@...radead.org,
raghavendra.kt@....com, riel@...riel.com, sj@...nel.org,
ying.huang@...ux.alibaba.com, ziy@...dia.com, dave@...olabs.net,
nifan.cxl@...il.com, xuezhengchu@...wei.com,
akpm@...ux-foundation.org, david@...hat.com, byungchul@...com,
kinseyho@...gle.com, joshua.hahnjy@...il.com, yuanchu@...gle.com,
balbirs@...dia.com, alok.rathore@...sung.com, yiannis@...corp.com,
Adam Manzanares <a.manzanares@...sung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion
infrastructure
On Thu, Sep 25, 2025 at 03:02:16PM -0400, Gregory Price wrote:
> On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote:
> > On Thu, 25 Sep 2025 12:06:28 -0400
> > Gregory Price <gourry@...rry.net> wrote:
> >
> > > It feels much more natural to put this as a zswap/zram backend.
> > >
> > Agreed. I currently see two paths that are generic (ish).
> >
> > 1. zswap route - faulting as you describe on writes.
>
> aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub
>
> The interposition point for zswap/zram is the PTE present bit being
> hacked off to generate access faults.
>
I went digging around a bit.

Not only this, but the PTE is used to store the swap entry ID, so you
can't just use a swap backend and keep the mapping. It's just not a
compatible abstraction - so as a zswap-backend this is DOA.

Even if you could figure out a way to re-use the abstraction and just
take a hard-fault to fault it back in as read-only, you lose the swap
entry on fault. That just gets nasty trying to reconcile the
differences between this interface and swap at that point.
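
To make the conflict concrete, here is a toy userspace model in C. It is
only a sketch: the `toy_*` names, bit positions and payload shift are all
made up for illustration and are not the real x86 or arm64 PTE layout.
It shows that a non-present entry reuses the address bits for the swap
entry (so the mapping is gone), whereas clearing only the write bit
leaves the mapping intact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy layout (assumption, not a real architecture):
 * bit 0 = present, bit 1 = writable, bits 12+ = payload. */
#define TOY_PTE_PRESENT   0x1ULL
#define TOY_PTE_WRITE     0x2ULL
#define TOY_PAYLOAD_SHIFT 12

typedef uint64_t toy_pte_t;

/* Present entry: the payload field holds the page frame number. */
static toy_pte_t toy_mk_present(uint64_t pfn, bool writable)
{
    toy_pte_t pte = (pfn << TOY_PAYLOAD_SHIFT) | TOY_PTE_PRESENT;

    if (writable)
        pte |= TOY_PTE_WRITE;
    return pte;
}

/* Swap entry: present bit clear, the same field now holds the swap ID. */
static toy_pte_t toy_mk_swap(uint64_t swap_entry)
{
    return swap_entry << TOY_PAYLOAD_SHIFT;
}

/* Write-protect: only the write bit changes, the mapping survives. */
static toy_pte_t toy_wrprotect(toy_pte_t pte)
{
    return pte & ~TOY_PTE_WRITE;
}

static bool toy_present(toy_pte_t pte)
{
    return (pte & TOY_PTE_PRESENT) != 0;
}

static bool toy_writable(toy_pte_t pte)
{
    return (pte & TOY_PTE_WRITE) != 0;
}

static uint64_t toy_payload(toy_pte_t pte)
{
    return pte >> TOY_PAYLOAD_SHIFT;
}
```

In this model a single entry can hold either a PFN or a swap ID, never
both, which is the incompatibility described above.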

So here's a fun proposal. I'm not sure how NUMA nodes for devices get
determined, but roughly:

1. Carve out an explicit proximity domain (NUMA node) for the compressed
   region via SRAT.
   https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html

2. Make sure this proximity domain (NUMA node) has separate data in the
   HMAT so it can be an explicit demotion target for higher tiers
   https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html

3. Create a node-to-zone-allocator registration and retrieval function

       device_folio_alloc = nid_to_alloc(nid)

4. Create a DAX extension that registers the above allocator interface

5. in `alloc_migration_target()` mm/migrate.c
   Since nid is not a valid buddy-allocator target, everything here
   will fail. So we can simply append the following to the bottom

       device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
       if (device_folio_alloc)
               folio = device_folio_alloc(...)
       return folio;

6. in `struct migration_target_control` add a new .no_writable value
   - This will say the new mapping replacements should have the
     writable bit chopped off.

7. On write-fault, extend mm/memory.c:do_numa_page to detect this
   and simply promote the page to allow writes. Write faults will
   be expensive, but you'll have pretty strong guarantees around
   not unexpectedly running out of space.

   You can then loosen the .no_writable restriction with settings if
   you have high confidence that your system will not outrun your
   ability to promote/evict/whatever if device memory becomes hot.
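
Steps 3 through 7 could be mocked up in userspace roughly like this. This
is a sketch under loud assumptions, not kernel code: every `toy_*` name is
hypothetical, node 0 stands in for the only buddy-managed node, and
"writable" is tracked on the folio for brevity even though in the kernel
it would be a property of the mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NODES 8
#define TOY_POOL_SIZE 4

struct toy_folio {
    int nid;
    bool writable;  /* really a per-mapping property; simplified here */
};

typedef struct toy_folio *(*toy_folio_alloc_fn)(int nid);

/* Step 3: per-node allocator registry (hypothetical interface). */
static toy_folio_alloc_fn toy_node_alloc[TOY_MAX_NODES];

static int toy_register_node_alloc(int nid, toy_folio_alloc_fn fn)
{
    if (nid < 0 || nid >= TOY_MAX_NODES || toy_node_alloc[nid])
        return -1;
    toy_node_alloc[nid] = fn;
    return 0;
}

static toy_folio_alloc_fn toy_nid_to_alloc(int nid)
{
    if (nid < 0 || nid >= TOY_MAX_NODES)
        return NULL;
    return toy_node_alloc[nid];
}

/* Tiny static pool standing in for real page allocation. */
static struct toy_folio toy_pool[TOY_POOL_SIZE];
static size_t toy_pool_used;

static struct toy_folio *toy_pool_get(int nid)
{
    struct toy_folio *folio;

    if (toy_pool_used >= TOY_POOL_SIZE)
        return NULL;
    folio = &toy_pool[toy_pool_used++];
    folio->nid = nid;
    folio->writable = true;
    return folio;
}

/* Buddy stand-in: fails for any node other than 0. */
static struct toy_folio *toy_buddy_alloc(int nid)
{
    return nid == 0 ? toy_pool_get(nid) : NULL;
}

/* Step 4: what a device (DAX) driver would register. */
static struct toy_folio *toy_device_alloc(int nid)
{
    return toy_pool_get(nid);
}

/* Step 6: migration control with the proposed no_writable flag. */
struct toy_mtc {
    int nid;
    bool no_writable;
};

/* Step 5: try the buddy path first, then the registered allocator. */
static struct toy_folio *toy_alloc_migration_target(const struct toy_mtc *mtc)
{
    struct toy_folio *folio = toy_buddy_alloc(mtc->nid);

    if (!folio) {
        toy_folio_alloc_fn fn = toy_nid_to_alloc(mtc->nid);

        if (fn)
            folio = fn(mtc->nid);
    }
    if (folio)
        folio->writable = !mtc->no_writable;
    return folio;
}

/* Step 7: a write fault on a read-only folio promotes it to node 0. */
static struct toy_folio *toy_write_fault(struct toy_folio *folio)
{
    struct toy_mtc promote = { .nid = 0, .no_writable = false };

    if (folio->writable)
        return folio;
    return toy_alloc_migration_target(&promote);
}
```

The point of the fallback ordering is that existing callers are
untouched: the registered allocator is consulted only after the normal
path has already failed for that node.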

The only thing I don't know off hand is how shared pages will work in
this setup. For VMAs with a mapping that exist at demotion time, this
all works wonderfully - less so if the mapping doesn't exist or a new
VMA is created after a demotion has occurred.

I don't know what will happen there.

I think this would also sate the desire for a "separate CXL allocator"
for integration into other paths as well.

~Gregory