Message-ID: <CAOi6=wTsY=EWt=yQ_7QJONsJpTM_3HKp0c42FKaJ8iJ2q8-n+w@mail.gmail.com>
Date: Fri, 17 Oct 2025 11:53:31 +0200
From: Yiannis Nikolakopoulos <yiannis.nikolakop@...il.com>
To: Gregory Price <gourry@...rry.net>
Cc: Jonathan Cameron <jonathan.cameron@...wei.com>, Wei Xu <weixugc@...gle.com>,
David Rientjes <rientjes@...gle.com>, Matthew Wilcox <willy@...radead.org>,
Bharata B Rao <bharata@....com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
dave.hansen@...el.com, hannes@...xchg.org, mgorman@...hsingularity.net,
mingo@...hat.com, peterz@...radead.org, raghavendra.kt@....com,
riel@...riel.com, sj@...nel.org, ying.huang@...ux.alibaba.com, ziy@...dia.com,
dave@...olabs.net, nifan.cxl@...il.com, xuezhengchu@...wei.com,
akpm@...ux-foundation.org, david@...hat.com, byungchul@...com,
kinseyho@...gle.com, joshua.hahnjy@...il.com, yuanchu@...gle.com,
balbirs@...dia.com, alok.rathore@...sung.com, yiannis@...corp.com,
Adam Manzanares <a.manzanares@...sung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
On Wed, Oct 1, 2025 at 9:22 AM Gregory Price <gourry@...rry.net> wrote:
>
> On Thu, Sep 25, 2025 at 03:02:16PM -0400, Gregory Price wrote:
> > On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote:
> > > On Thu, 25 Sep 2025 12:06:28 -0400
> > > Gregory Price <gourry@...rry.net> wrote:
> > >
> > > > It feels much more natural to put this as a zswap/zram backend.
> > > >
> > > Agreed. I currently see two paths that are generic (ish).
> > >
> > > 1. zswap route - faulting as you describe on writes.
> >
> > aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub
> >
> > The interposition point for zswap/zram is the PTE present bit being
> > hacked off to generate access faults.
> >
>
> I went digging around a bit.
>
> Not only this, but the PTE is used to store the swap entry ID, so you
> can't just use a swap backend and keep the mapping. It's just not a
> compatible abstraction - so as a zswap-backend this is DOA.
>
> Even if you could figure out a way to re-use the abstraction and just
> take a hard-fault to fault it back in as read-only, you lose the swap
> entry on fault. That just gets nasty trying to reconcile the
> differences between this interface and swap at that point.
>
> So here's a fun proposal. I'm not sure of how NUMA nodes for devices
> get determined -
>
> 1. Carve out an explicit proximity domain (NUMA node) for the compressed
> region via SRAT.
> https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
>
> 2. Make sure this proximity domain (NUMA node) has separate data in the
> HMAT so it can be an explicit demotion target for higher tiers
> https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html
This makes sense. I've done a dirty hardcoding trick in my prototype
so that my node is always the last demotion target. I'll have a look
at how to make this right.
>
> 3. Create a node-to-zone-allocator registration and retrieval function
> device_folio_alloc = nid_to_alloc(nid)
>
> 4. Create a DAX extension that registers the above allocator interface
>
> 5. in `alloc_migration_target()` mm/migrate.c
> Since nid is not a valid buddy-allocator target, everything here
> will fail. So we can simply append the following to the bottom
>
> device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
> if (device_folio_alloc)
> folio = device_folio_alloc(...)
> return folio;
In my current prototype alloc_migration_target() was working (naively).
Steps 3, 4 and 5 seem like an interesting thing to try after all this
discussion.
>
> 6. in `struct migration_target_control` add a new .no_writable value
> - This will say the new mapping replacements should have the
> writable bit chopped off.
>
> 7. On write-fault, extend mm/memory.c:do_numa_page to detect this
> and simply promote the page to allow writes. Write faults will
> be expensive, but you'll have pretty strong guarantees around
> not unexpectedly running out of space.
>
> You can then loosen the .no_writable restriction with settings if
> you have high confidence that your system will outrun your ability
> to promote/evict/whatever if device memory becomes hot.
That looks modular enough to let me test both the writable and
no_writable variants and compare them.
>
> The only thing I don't know off hand is how shared pages will work in
> this setup. For VMAs with a mapping that exist at demotion time, this
> all works wonderfully - less so if the mapping doesn't exist or a new
> VMA is created after a demotion has occurred.
I'll keep that in mind.
>
> I don't know what will happen there.
>
> I think this would also sate the desire for a "separate CXL allocator"
> for integration into other paths as well.
>
> ~Gregory
Thanks a lot for all the discussion and the input. I can move my
prototype in this direction and will get back with what I've learned,
and an RFC if it makes sense. Please keep me in the loop on any
related discussions.
Best,
/Yiannis