linux-kernel - Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure


Message-ID: <aNzWwz5OYLOjwjLv@gourry-fedora-PF4VCD3F>
Date: Wed, 1 Oct 2025 03:22:43 -0400
From: Gregory Price <gourry@...rry.net>
To: Jonathan Cameron <jonathan.cameron@...wei.com>
Cc: Yiannis Nikolakopoulos <yiannis.nikolakop@...il.com>,
	Wei Xu <weixugc@...gle.com>, David Rientjes <rientjes@...gle.com>,
	Matthew Wilcox <willy@...radead.org>,
	Bharata B Rao <bharata@....com>, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, dave.hansen@...el.com, hannes@...xchg.org,
	mgorman@...hsingularity.net, mingo@...hat.com, peterz@...radead.org,
	raghavendra.kt@....com, riel@...riel.com, sj@...nel.org,
	ying.huang@...ux.alibaba.com, ziy@...dia.com, dave@...olabs.net,
	nifan.cxl@...il.com, xuezhengchu@...wei.com,
	akpm@...ux-foundation.org, david@...hat.com, byungchul@...com,
	kinseyho@...gle.com, joshua.hahnjy@...il.com, yuanchu@...gle.com,
	balbirs@...dia.com, alok.rathore@...sung.com, yiannis@...corp.com,
	Adam Manzanares <a.manzanares@...sung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion
 infrastructure

On Thu, Sep 25, 2025 at 03:02:16PM -0400, Gregory Price wrote:
> On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote:
> > On Thu, 25 Sep 2025 12:06:28 -0400
> > Gregory Price <gourry@...rry.net> wrote:
> > 
> > > It feels much more natural to put this as a zswap/zram backend.
> > > 
> > Agreed.  I currently see two paths that are generic (ish).
> > 
> > 1. zswap route - faulting as you describe on writes.
> 
> aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub
> 
> The interposition point for zswap/zram is the PTE present bit being 
> hacked off to generate access faults.
> 

I went digging around a bit.

Not only this, but the PTE is used to store the swap entry ID, so you
can't just use a swap backend and keep the mapping. It's just not a
compatible abstraction - so as a zswap-backend this is DOA.

Even if you could figure out a way to re-use the abstraction and just
take a hard-fault to fault it back in as read-only, you lose the swap
entry on fault.  That just gets nasty trying to reconcile the
differences between this interface and swap at that point.
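To make the conflict concrete, here is a minimal userspace sketch, assuming
a made-up 64-bit PTE layout.  The bit positions and every `toy_*` name are
hypothetical and do not match any real architecture or kernel helper.  It
only shows that a non-present entry reuses the bits that would otherwise
hold the PFN, so one PTE cannot hold a swap entry and a live mapping at
the same time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE layout (hypothetical, not a real architecture's encoding). */
#define TOY_PTE_PRESENT   (1ULL << 0)
#define TOY_PTE_WRITE     (1ULL << 1)
#define TOY_PTE_PFN_SHIFT 12

typedef uint64_t toy_pte_t;

/* Present PTE: the upper bits carry the page frame number. */
static toy_pte_t toy_mk_present(uint64_t pfn, bool writable)
{
    toy_pte_t pte = (pfn << TOY_PTE_PFN_SHIFT) | TOY_PTE_PRESENT;

    if (writable)
        pte |= TOY_PTE_WRITE;
    return pte;
}

/* Non-present PTE: the same bits are reused to hold a swap entry ID. */
static toy_pte_t toy_mk_swap(uint64_t swap_id)
{
    return swap_id << 1; /* bit 0 (present) stays clear */
}

static bool toy_present(toy_pte_t pte)
{
    return (pte & TOY_PTE_PRESENT) != 0;
}

/* Only meaningful for present entries. */
static uint64_t toy_pfn(toy_pte_t pte)
{
    assert(toy_present(pte));
    return pte >> TOY_PTE_PFN_SHIFT;
}

/* Only meaningful for non-present entries. */
static uint64_t toy_swap_id(toy_pte_t pte)
{
    assert(!toy_present(pte));
    return pte >> 1;
}
```

A read-only mapping in this model is `toy_mk_present(pfn, false)`: the
entry stays present and only the write bit is cleared, which is the
behaviour the proposal below relies on instead of a swap entry.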

So here's a fun proposal.  I'm not sure how NUMA nodes for devices
get determined, but roughly:

1. Carve out an explicit proximity domain (NUMA node) for the compressed
   region via SRAT.
   https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html

2. Make sure this proximity domain (NUMA node) has separate data in the
   HMAT so it can be an explicit demotion target for higher tiers
   https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html

3. Create a node-to-zone-allocator registration and retrieval function
   device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC)

4. Create a DAX extension that registers the above allocator interface

5. In `alloc_migration_target()` (mm/migrate.c):
   Since nid is not a valid buddy-allocator target, everything here
   will fail.  So we can simply append the following to the bottom:

   device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
   if (device_folio_alloc)
       folio = device_folio_alloc(...);
   return folio;

6. In `struct migration_target_control`, add a new .no_writable field.
   - This says the replacement mappings should have the writable bit
     chopped off.

7. On write-fault, extend mm/memory.c:do_numa_page to detect this
   and simply promote the page to allow writes.  Write faults will
   be expensive, but you'll have pretty strong guarantees around
   not unexpectedly running out of space.

   You can then loosen the .no_writable restriction with settings if
   you have high confidence that your ability to promote/evict/whatever
   will keep up if device memory becomes hot.
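Steps 3 and 5-7 can be sketched as a small userspace model.  This is not
kernel code: every `toy_*` name is hypothetical, the fixed-size pools
stand in for the buddy allocator and the device, and the `writable` flag
on the folio stands in for the write bit that would really live in each
mapping's PTE.  It assumes node 0 is the only buddy-backed node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NODES 8
#define TOY_POOL_SIZE 4

struct toy_folio { int nid; bool writable; };
typedef struct toy_folio *(*toy_folio_alloc_fn)(int nid);

/* Step 3: per-node allocator registry (hypothetical interface). */
static toy_folio_alloc_fn toy_node_alloc[TOY_MAX_NODES];

static int toy_register_node_alloc(int nid, toy_folio_alloc_fn fn)
{
    if (nid < 0 || nid >= TOY_MAX_NODES || toy_node_alloc[nid])
        return -1; /* bad node, or an allocator is already registered */
    toy_node_alloc[nid] = fn;
    return 0;
}

static toy_folio_alloc_fn toy_nid_to_alloc(int nid)
{
    return (nid < 0 || nid >= TOY_MAX_NODES) ? NULL : toy_node_alloc[nid];
}

/* Stand-in for the buddy allocator: only node 0 is a valid target. */
static struct toy_folio toy_buddy_pool[TOY_POOL_SIZE];
static size_t toy_buddy_used;

static struct toy_folio *toy_buddy_alloc(int nid)
{
    if (nid != 0 || toy_buddy_used == TOY_POOL_SIZE)
        return NULL;
    toy_buddy_pool[toy_buddy_used] = (struct toy_folio){ nid, true };
    return &toy_buddy_pool[toy_buddy_used++];
}

/* Stand-in for the allocator a device driver would register (step 4). */
static struct toy_folio toy_dev_pool[TOY_POOL_SIZE];
static size_t toy_dev_used;

static struct toy_folio *toy_device_folio_alloc(int nid)
{
    if (toy_dev_used == TOY_POOL_SIZE)
        return NULL;
    toy_dev_pool[toy_dev_used] = (struct toy_folio){ nid, true };
    return &toy_dev_pool[toy_dev_used++];
}

/* Step 6: migration control with the proposed no_writable flag. */
struct toy_mtc { int nid; bool no_writable; };

/* Step 5: try the buddy path first, then the registered allocator. */
static struct toy_folio *toy_alloc_migration_target(const struct toy_mtc *mtc)
{
    struct toy_folio *folio = toy_buddy_alloc(mtc->nid);

    if (!folio) {
        toy_folio_alloc_fn fn = toy_nid_to_alloc(mtc->nid);

        if (fn)
            folio = fn(mtc->nid);
    }
    if (folio && mtc->no_writable)
        folio->writable = false; /* new mapping loses its write bit */
    return folio;
}

/* Step 7: a write fault on a read-only folio promotes it to node 0. */
static struct toy_folio *toy_write_fault(struct toy_folio *folio)
{
    struct toy_mtc promote = { .nid = 0, .no_writable = false };

    assert(folio != NULL);
    return folio->writable ? folio : toy_alloc_migration_target(&promote);
}
```

In this model a demotion would be `toy_alloc_migration_target()` with
`.nid` set to the device node and `.no_writable = true`, after the driver
has called `toy_register_node_alloc()`.  A later `toy_write_fault()` on
the returned folio yields a writable folio on node 0.  Copying the data
and freeing the device folio are left out.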

The only thing I don't know offhand is how shared pages will work in
this setup.  For VMAs with a mapping that exist at demotion time, this
all works wonderfully - less so if the mapping doesn't exist or a new
VMA is created after a demotion has occurred.

I don't know what will happen there.

I think this would also sate the desire for a "separate CXL allocator"
for integration into other paths as well.

~Gregory