Message-ID: <CAOi6=wTsY=EWt=yQ_7QJONsJpTM_3HKp0c42FKaJ8iJ2q8-n+w@mail.gmail.com>
Date: Fri, 17 Oct 2025 11:53:31 +0200
From: Yiannis Nikolakopoulos <yiannis.nikolakop@...il.com>
To: Gregory Price <gourry@...rry.net>
Cc: Jonathan Cameron <jonathan.cameron@...wei.com>, Wei Xu <weixugc@...gle.com>,
David Rientjes <rientjes@...gle.com>, Matthew Wilcox <willy@...radead.org>,
Bharata B Rao <bharata@....com>, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
dave.hansen@...el.com, hannes@...xchg.org, mgorman@...hsingularity.net,
mingo@...hat.com, peterz@...radead.org, raghavendra.kt@....com,
riel@...riel.com, sj@...nel.org, ying.huang@...ux.alibaba.com, ziy@...dia.com,
dave@...olabs.net, nifan.cxl@...il.com, xuezhengchu@...wei.com,
akpm@...ux-foundation.org, david@...hat.com, byungchul@...com,
kinseyho@...gle.com, joshua.hahnjy@...il.com, yuanchu@...gle.com,
balbirs@...dia.com, alok.rathore@...sung.com, yiannis@...corp.com,
Adam Manzanares <a.manzanares@...sung.com>
Subject: Re: [RFC PATCH v2 0/8] mm: Hot page tracking and promotion infrastructure
On Wed, Oct 1, 2025 at 9:22 AM Gregory Price <gourry@...rry.net> wrote:
>
> On Thu, Sep 25, 2025 at 03:02:16PM -0400, Gregory Price wrote:
> > On Thu, Sep 25, 2025 at 06:23:08PM +0100, Jonathan Cameron wrote:
> > > On Thu, 25 Sep 2025 12:06:28 -0400
> > > Gregory Price <gourry@...rry.net> wrote:
> > >
> > > > It feels much more natural to put this as a zswap/zram backend.
> > > >
> > > Agreed. I currently see two paths that are generic (ish).
> > >
> > > 1. zswap route - faulting as you describe on writes.
> >
> > aaaaaaaaaaaaaaaaaaaaaaah but therein lies the rub
> >
> > The interposition point for zswap/zram is the PTE present bit being
> > hacked off to generate access faults.
> >
>
> I went digging around a bit.
>
> Not only this, but the PTE is used to store the swap entry ID, so you
> can't just use a swap backend and keep the mapping. It's just not a
> compatible abstraction - so as a zswap-backend this is DOA.
>
> Even if you could figure out a way to re-use the abstraction and just
> take a hard-fault to fault it back in as read-only, you lose the swap
> entry on fault. That just gets nasty trying to reconcile the
> differences between this interface and swap at that point.
>
> So here's a fun proposal. I'm not sure of how NUMA nodes for devices
> get determined -
>
> 1. Carve out an explicit proximity domain (NUMA node) for the compressed
> region via SRAT.
> https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html
>
> 2. Make sure this proximity domain (NUMA node) has separate data in the
> HMAT so it can be an explicit demotion target for higher tiers
> https://docs.kernel.org/driver-api/cxl/platform/acpi/hmat.html
This makes sense. I've done a dirty hardcoding trick in my prototype
so that my node is always the last demotion target. I'll have a look
at how to make this right.
>
> 3. Create a node-to-zone-allocator registration and retrieval function
> device_folio_alloc = nid_to_alloc(nid)
>
> 4. Create a DAX extension that registers the above allocator interface
>
> 5. in `alloc_migration_target()` mm/migrate.c
> Since nid is not a valid buddy-allocator target, everything here
> will fail. So we can simply append the following to the bottom
>
> device_folio_alloc = nid_to_alloc(nid, DEVICE_FOLIO_ALLOC);
> if (device_folio_alloc)
> folio = device_folio_alloc(...)
> return folio;
In my current prototype alloc_migration_target() was working (naively).
Steps 3, 4 and 5 seem like an interesting thing to try after all this
discussion.
>
> 6. in `struct migration_target_control` add a new .no_writable value
> - This will say the new mapping replacements should have the
> writable bit chopped off.
>
> 7. On write-fault, extend mm/memory.c:do_numa_page to detect this
> and simply promote the page to allow writes. Write faults will
> be expensive, but you'll have pretty strong guarantees around
> not unexpectedly running out of space.
>
> You can then loosen the .no_writable restriction with settings if
> you have high confidence that your system will outrun your ability
> to promote/evict/whatever if device memory becomes hot.
That looks modular enough to let me test both the writable and
no_writable variants and compare them.
>
> The only thing I don't know off hand is how shared pages will work in
> this setup. For VMAs with a mapping that exist at demotion time, this
> all works wonderfully - less so if the mapping doesn't exist or a new
> VMA is created after a demotion has occurred.
I'll keep that in mind.
>
> I don't know what will happen there.
>
> I think this would also sate the desire for a "separate CXL allocator"
> for integration into other paths as well.
>
> ~Gregory
Thanks a lot for all the discussion and the input. I can move my
prototype in this direction and will get back with what I've learned,
and an RFC if it makes sense. Please keep me in the loop on any
related discussions.
Best,
/Yiannis