Message-ID: <ti434m4sbveft6jw4zqrzzis47ycjupgiw5csj2wxmcac74xva@xyvx3ebqoqhu>
Date: Thu, 27 Nov 2025 16:03:24 +1100
From: Alistair Popple <apopple@...dia.com>
To: Gregory Price <gourry@...rry.net>
Cc: linux-mm@...ck.org, kernel-team@...a.com, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev, linux-fsdevel@...r.kernel.org,
cgroups@...r.kernel.org, dave@...olabs.net, jonathan.cameron@...wei.com,
dave.jiang@...el.com, alison.schofield@...el.com, vishal.l.verma@...el.com,
ira.weiny@...el.com, dan.j.williams@...el.com, longman@...hat.com,
akpm@...ux-foundation.org, david@...hat.com, lorenzo.stoakes@...cle.com,
Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com,
mhocko@...e.com, osalvador@...e.de, ziy@...dia.com, matthew.brost@...el.com,
joshua.hahnjy@...il.com, rakie.kim@...com, byungchul@...com, ying.huang@...ux.alibaba.com,
mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, tj@...nel.org,
hannes@...xchg.org, mkoutny@...e.com, kees@...nel.org, muchun.song@...ux.dev,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, rientjes@...gle.com, jackmanb@...gle.com,
cl@...two.org, harry.yoo@...cle.com, axelrasmussen@...gle.com,
yuanchu@...gle.com, weixugc@...gle.com, zhengqi.arch@...edance.com,
yosry.ahmed@...ux.dev, nphamcs@...il.com, chengming.zhou@...ux.dev,
fabio.m.de.francesco@...ux.intel.com, rrichter@....com, ming.li@...omail.com, usamaarif642@...il.com,
brauner@...nel.org, oleg@...hat.com, namcao@...utronix.de, escape@...ux.alibaba.com,
dongjoo.seo1@...sung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
On 2025-11-25 at 02:28 +1100, Gregory Price <gourry@...rry.net> wrote...
> On Mon, Nov 24, 2025 at 10:09:37AM +1100, Alistair Popple wrote:
> > On 2025-11-22 at 08:07 +1100, Gregory Price <gourry@...rry.net> wrote...
> > > On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> > > >
> >
> > There are multiple types here (DEVICE_PRIVATE and DEVICE_COHERENT). The former
> > is mostly irrelevant for this discussion but I'm including the descriptions here
> > for completeness.
> >
>
> I appreciate you taking the time here. I'll maybe try to look at
> updating the docs as this evolves.
I believe the DEVICE_PRIVATE bit is documented here:
https://www.kernel.org/doc/Documentation/vm/hmm.rst - if there is anything
there that you think needs improvement I'd be happy to take a look or review
changes. I'm not sure that it was updated for DEVICE_COHERENT though.
> > > But I could imagine an (overly simplistic) pattern with SPM Nodes:
> > >
> > > fd = open("/dev/gpu_mem", ...)
> > > buf = mmap(fd, ...)
> > > buf[0]
> > > 1) driver takes the fault
> > > 2) driver calls alloc_page(..., gpu_node, GFP_SPM_NODE)
> > > 3) driver manages any special page table masks
> > > Like marking pages RO/RW to manage ownership.
> >
> > Of course as an aside this needs to match the CPU PTEs logic (this is what
> > hmm_range_fault() is primarily used for).
> >
>
> This is actually the most interesting part of series for me. I'm using
> a compressed memory device as a stand-in for a memory type that requires
> special page table entries (RO) to avoid compression ratios tanking
> (resulting, eventually, in an MCE as there's no way to slow things down).
>
> You can somewhat "Get there from here" through device coherent
> ZONE_DEVICE, but you still don't have access to basic services like
> compaction and reclaim - which you absolutely do want for such a memory
> type (for the same reasons we groom zswap and zram).
>
> I wonder if we can even re-use the hmm interfaces for SPM nodes to make
> managing special page table policies easier as well. That seems
> promising.
It might depend on what exactly you're looking to do - HMM is really two parts:
one for mirroring page tables and another for allowing special non-present PTEs
to be set up to map a dummy ZONE_DEVICE struct page that notifies a driver when
the CPU attempts access.
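For the mirroring part, the general shape is the snapshot/retry loop described
in Documentation/mm/hmm.rst. Very roughly, and paraphrasing from memory (so
treat names like mirror_range() and the locking comments as placeholders rather
than anything from a real driver):

  #include <linux/hmm.h>
  #include <linux/mm.h>
  #include <linux/mmu_notifier.h>
  #include <linux/sched/mm.h>

  static int mirror_range(struct mmu_interval_notifier *notifier,
                          struct mm_struct *mm, unsigned long start,
                          unsigned long end, unsigned long *pfns)
  {
          struct hmm_range range = {
                  .notifier = notifier,
                  .start = start,
                  .end = end,
                  .hmm_pfns = pfns,
                  .default_flags = HMM_PFN_REQ_FAULT,
          };
          int ret;

          if (!mmget_not_zero(mm))
                  return -EFAULT;
  again:
          range.notifier_seq = mmu_interval_read_begin(notifier);
          mmap_read_lock(mm);
          ret = hmm_range_fault(&range);
          mmap_read_unlock(mm);
          if (ret) {
                  if (ret == -EBUSY)
                          goto again;
                  goto out;
          }

          /*
           * A real driver takes its device page-table lock here; if the
           * sequence changed an invalidation raced with us, the snapshot
           * in pfns[] may be stale, so drop the lock and start again.
           */
          if (mmu_interval_read_retry(notifier, range.notifier_seq))
                  goto again;

          /* Program the device page tables from range.hmm_pfns[] while
           * still holding that lock, then drop it. */
  out:
          mmput(mm);
          return ret;
  }

The second (DEVICE_PRIVATE) part is different again - migrate_vma_*() plus the
pgmap->ops->migrate_to_ram() callback - but as above it's mostly irrelevant for
coherent memory.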
> I said this during LSFMM: Without isolation, "memory policy" is really
> just a suggestion. What we're describing here is all predicated on
> isolation work, and all of a sudden much clearer examples of managing
> memory on NUMA boundaries start to make a little more sense.
I very much agree with the views of memory policy that you shared in one of the
other threads. I don't think it is adequate for providing isolation, and I agree
that the isolation (and degree of isolation) is the interesting bit of the work
here, at least for now.
>
> > > 4) driver sends the gpu the (mapping_id, pfn, index) information
> > > so that gpu can map the region in its page tables.
> >
> > On coherent systems this often just uses HW address translation services
> > (ATS), although I think the specific implementation of how page-tables are
> > mirrored/shared is orthogonal to this.
> >
>
> Yeah this part is completely foreign to me, I just presume there's some
> way to tell the GPU how to reconstruct the virtually contiguous setup.
> That mechanism would be entirely reusable here (I assume).
>
> > This is roughly how things work with DEVICE_PRIVATE/COHERENT memory today,
> > except in the case of DEVICE_PRIVATE in step (5) above. In that case the page is
> > mapped as a non-present special swap entry that triggers a driver callback due
> > to the lack of cache coherence.
> >
>
> Btw, just an aside, Lorenzo is moving to rename these entries to
> softleaf (software-leaf) entries. I think you'll find it welcome.
> https://lore.kernel.org/linux-mm/c879383aac77d96a03e4d38f7daba893cd35fc76.1762812360.git.lorenzo.stoakes@oracle.com/
>
> > > Driver doesn't have to do much in the way of allocation management.
> > >
> > > This is probably less compelling since you don't want general purpose
> > > services like reclaim, migration, compaction, tiering - etc.
> >
> > On at least some of our systems I'm told we do want this, hence my interest
> > here. Currently we have systems not using DEVICE_COHERENT and instead just
> > onlining everything as normal system managed memory in order to get reclaim
> > and tiering. Of course then people complain that it's managed as normal system
> > memory and non-GPU related things (ie. page-cache) end up in what's viewed as
> > special purpose memory.
> >
>
> Ok, so now this gets interesting then. I don't understand how this
> makes sense (not saying it doesn't, I simply don't understand).
>
> I would presume that under no circumstance do you want device memory to
> just suddenly disappear without some coordination from the driver.
>
> Whether it's compaction or reclaim, you have some thread that's going to
> migrate a virtual mapping from HPA(A) to HPA(B) and HPA(B) may or may not
> even map to the same memory device.
>
> That thread may not even be called in the context of a thread which
> accesses GPU memory (although, I think we could enforce that on top
> of SPM nodes, but devil is in the details).
>
> Maybe that "all magically works" because of the ATS described above?
Pretty much - both ATS and hmm_range_fault() are, conceptually at least, just
methods of sharing/mirroring the CPU page table to a device. So in your example
above, if a thread were to migrate a mapping from one page to another, this
"black magic" would keep everything in sync. E.g. for hmm_range_fault() the
driver gets an mmu_notifier callback saying the virtual mapping no longer
points to HPA(A). If it needs to find the new mapping to HPA(B) it can look it
up using hmm_range_fault() and program its page tables with the new mapping.
At a sufficiently high level, ATS is just a HW-implemented equivalent of this.
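In case it helps make the notifier side concrete, the invalidate callback ends
up looking roughly like the below (again just a sketch along the lines of what
hmm.rst and the existing users do - struct my_device and its lock are stand-ins
for whatever state a real driver keeps):

  #include <linux/container_of.h>
  #include <linux/mmu_notifier.h>
  #include <linux/mutex.h>

  struct my_device {
          struct mmu_interval_notifier notifier;
          struct mutex lock;      /* serialises device page-table updates */
  };

  static bool my_invalidate(struct mmu_interval_notifier *mni,
                            const struct mmu_notifier_range *range,
                            unsigned long cur_seq)
  {
          struct my_device *dev = container_of(mni, struct my_device, notifier);

          if (mmu_notifier_range_blockable(range))
                  mutex_lock(&dev->lock);
          else if (!mutex_trylock(&dev->lock))
                  return false;

          /*
           * Bump the sequence so any concurrent hmm_range_fault() caller
           * sees it via mmu_interval_read_retry() and retries.
           */
          mmu_interval_set_seq(mni, cur_seq);

          /*
           * Zap the device PTEs covering [range->start, range->end). The
           * next device fault re-runs hmm_range_fault() and picks up the
           * new mapping, i.e. HPA(B) in your example.
           */

          mutex_unlock(&dev->lock);
          return true;
  }

  static const struct mmu_interval_notifier_ops my_notifier_ops = {
          .invalidate = my_invalidate,
  };

That gets registered per VA range with mmu_interval_notifier_insert(), and it
is this callback that tells the driver the virtual mapping no longer points at
HPA(A).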
> I suppose this assumes you have some kind of unified memory view between
> host and device memory? Are there docs here you can point me at that
> might explain this wizardry? (Sincerely, this is fascinating)
Right - it's all predicated on the host and device sharing the same view of the
virtual address space. I'm not aware of any good docs on this, but I will be at
LPC so would be happy to have a discussion there.
> > > The value is clearly that you get to manage GPU memory like any other
> > > memory, but without worry that other parts of the system will touch it.
> > >
> > > I'm much more focused on the "I have memory that is otherwise general
> > > purpose, and wants services like reclaim and compaction, but I want
> > > strong controls over how things can land there in the first place".
> >
> > So maybe there is some overlap here - what I have is memory that we want managed
> > much like normal memory but with strong controls over what it can be used for
> > (ie. just for tasks utilising the processing element on the accelerator).
> >
>
> I think it might be great if we could discuss this a bit more in-depth,
> as I've already been considering very mild refactors to reclaim to
> enable a driver to engage it with an SPM node as the only shrink target.
Absolutely! Looking forward to an in-person discussion.
- Alistair
> This all becomes much more complicated due to per-memcg LRUs and such.
>
> All that said, I'm focused on the isolation / allocation pieces first.
> If that can't be agreed upon, the rest isn't worth exploring.
>
> I do have a mild extension to mempolicy that allows mbind() to hit an
> SPM node as an example as well. I'll discuss this in the response to
> David's thread, as he had some related questions about the GFP flag.
>
> ~Gregory
>