Message-ID: <aRxMP_wDRxJIhIiB@gourry-fedora-PF4VCD3F>
Date: Tue, 18 Nov 2025 04:36:47 -0600
From: Gregory Price <gourry@...rry.net>
To: Alistair Popple <apopple@...dia.com>
Cc: linux-mm@...ck.org, kernel-team@...a.com, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev,
linux-fsdevel@...r.kernel.org, cgroups@...r.kernel.org,
dave@...olabs.net, jonathan.cameron@...wei.com,
dave.jiang@...el.com, alison.schofield@...el.com,
vishal.l.verma@...el.com, ira.weiny@...el.com,
dan.j.williams@...el.com, longman@...hat.com,
akpm@...ux-foundation.org, david@...hat.com,
lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, vbabka@...e.cz,
rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
osalvador@...e.de, ziy@...dia.com, matthew.brost@...el.com,
joshua.hahnjy@...il.com, rakie.kim@...com, byungchul@...com,
ying.huang@...ux.alibaba.com, mingo@...hat.com,
peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
vschneid@...hat.com, tj@...nel.org, hannes@...xchg.org,
mkoutny@...e.com, kees@...nel.org, muchun.song@...ux.dev,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev,
rientjes@...gle.com, jackmanb@...gle.com, cl@...two.org,
harry.yoo@...cle.com, axelrasmussen@...gle.com, yuanchu@...gle.com,
weixugc@...gle.com, zhengqi.arch@...edance.com,
yosry.ahmed@...ux.dev, nphamcs@...il.com, chengming.zhou@...ux.dev,
fabio.m.de.francesco@...ux.intel.com, rrichter@....com,
ming.li@...omail.com, usamaarif642@...il.com, brauner@...nel.org,
oleg@...hat.com, namcao@...utronix.de, escape@...ux.alibaba.com,
dongjoo.seo1@...sung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
On Tue, Nov 18, 2025 at 06:02:02PM +1100, Alistair Popple wrote:
> On 2025-11-13 at 06:29 +1100, Gregory Price <gourry@...rry.net> wrote...
> > - Why? (In short: shunting to DAX is a failed pattern for users)
> > - Other designs I considered (mempolicy, cpusets, zone_device)
>
> I'm interested in the contrast with zone_device, and in particular why
> device_coherent memory doesn't end up being a good fit for this.
>
I did consider zone_device briefly, but if you want sparse allocation
you end up essentially re-implementing some form of buddy allocator.
That seemed less than ideal, to say the least.
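
For a concrete sense of what that looks like today, here's a minimal
sketch of the kind of allocator a ZONE_DEVICE driver ends up
hand-rolling over its region (built on lib/genalloc; the spm_* names
are invented for illustration):

    #include <linux/genalloc.h>

    /* invented example: carve up a device range with a gen_pool */
    static struct gen_pool *spm_pool;

    static int spm_pool_init(phys_addr_t base, size_t size, int nid)
    {
            spm_pool = gen_pool_create(PAGE_SHIFT, nid);
            if (!spm_pool)
                    return -ENOMEM;
            /* no buddy, no watermarks, no fallback lists -
             * the driver owns the whole lifecycle */
            return gen_pool_add(spm_pool, (unsigned long)base, size, nid);
    }
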
Additionally, pgmap use precludes these pages from participating in
LRU/reclaim, and some devices may very well be compatible with those patterns.
(I think compression will be, but it still needs work)
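
To make the LRU point concrete - hand-waving here, this is a paraphrase
and not literal mm/ code - pgmap-backed folios never go on the LRU at
all, so there is nothing for reclaim to scan:

    /* paraphrase: ZONE_DEVICE folios are invisible to reclaim */
    static bool spm_folio_reclaimable(struct folio *folio)
    {
            if (folio_is_zone_device(folio))
                    return false;   /* no LRU, so no reclaim or demotion */
            return folio_test_lru(folio);
    }
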
> > - Why mempolicy.c and cpusets as-is are insufficient
> > - SPM types seeking this form of interface (Accelerator, Compression)
>
> I'm sure you can guess my interest is in GPUs which also have memory some people
> consider should only be used for specific purposes :-) Currently our coherent
> GPUs online this as a normal NUMA node, for which we have also generally
> found mempolicy, cpusets, etc. inadequate as well, so it will be interesting to
> hear what shortcomings you have been running into (I'm less familiar with the
> Compression cases you talk about here though).
>
The TL;DR:
cpusets, as designed, doesn't really allow the concept of "nothing can
access XYZ node except specific things", because this would involve
removing a node from the root cpuset.mems - and a restriction like that
can never be loosened further down the hierarchy.
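
(The root's mems is pinned to every node with memory at boot -
paraphrasing cpuset_init_smp() in kernel/cgroup/cpuset.c:)

    /* children may only narrow this set - none can ever widen it */
    top_cpuset.mems_allowed = node_states[N_MEMORY];
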
mempolicy is more of a suggestion and can be completely overridden. It
is entirely ignored by things like demotion/reclaim/etc.
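
Paraphrasing the demotion path in mm/vmscan.c to show what I mean: the
target node comes straight from the memory-tier table, with no
mempolicy anywhere in sight:

    /* simplified from demote_folio_list() */
    int target_nid = next_demotion_node(pgdat->node_id);

    if (target_nid == NUMA_NO_NODE)
            return 0;       /* nowhere to demote to */
    /* ... then migrate_pages() to target_nid, mempolicy never consulted */
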
I plan to discuss a bit of the specifics at LPC, but a lot of this stems
from the zone-iteration logic in page_alloc.c and the rather... ermm...
"complex" nature of how mempolicy and cpusets interact with each other.
I may add some additional notes on this thread prior to LPC given that
time may be too short to get into the nasty bits in the session.
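
As a teaser, the shape of the fastpath in question, simplified from
get_page_from_freelist() in mm/page_alloc.c:

    /*
     * The mempolicy nodemask only narrows the walk, and cpusets are
     * checked zone-by-zone - neither layer can express "node N is
     * invisible to everyone except X".
     */
    for_each_zone_zonelist_nodemask(zone, z, zonelist, highest_zoneidx,
                                    nodemask) {
            if (cpusets_enabled() &&
                !__cpuset_zone_allowed(zone, gfp_mask))
                    continue;
            /* ... watermark checks, then rmqueue() ... */
    }
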
> > - Platform extensions that would be nice to see (SPM-only Bits)
> >
> > Open Questions
> > - Single SPM nodemask, or multiple based on features?
> > - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> > - Allocate extra "possible" NUMA nodes for flexibility?
>
> I guess this might make hotplug easier? Particularly in cases where FW hasn't
> created the nodes.
>
In cases where you need to reach back to the device for some signal, you
likely need to have the driver for that device manage the alloc/free
patterns - so this may (or may not) generalize to 1-device-per-node.
In the scenario where you want some flexibility in managing regions,
this may require multiple nodes per device. Maybe one device provides
multiple types of memory - you'd want those on separate nodes.
This doesn't seem like something you need to solve right away, just
something for folks to consider.
> > - Should SPM Nodes be zone-restricted? (MOVABLE only?)
>
> For device-based memory I think so - otherwise you can never guarantee devices
> can be removed or drivers (if required to access the memory) can be unbound as
> you can't migrate things off the memory.
>
Zones in this scenario are a bit of a square-peg/round-hole problem.
Forcing everything into ZONE_MOVABLE means you can't do page pinning or things
like 1GB gigantic pages. But the device driver should be capable of
managing hotplug anyway, so what's the point of ZONE_MOVABLE? :shrug:
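
(The pinning restriction, paraphrased from folio_is_longterm_pinnable()
in include/linux/mm.h:)

    /* anything in ZONE_MOVABLE is refused for long-term pinning */
    if (folio_is_zone_movable(folio))
            return false;
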
> > The ZSwap example demonstrates this with the `mt_spm_nodemask`. This
> > hack treats all spm nodes as-if they are compressed memory nodes, and
> > we bypass the software compression logic in zswap in favor of simply
> > copying memory directly to the allocated page. In a real design
>
> So in your example (I get it's a hack) is the main advantage that you can use
> all the same memory allocation policies (eg. cgroups) when needing to allocate
> the pages? Given this is ZSwap I guess these pages would never be mapped
> directly into user-space but would anything in the design prevent that?
This is, in fact, the long-term intent. As long as the device can manage
inline decompression with reasonable latencies, there's no reason you
shouldn't be able to leave the pages mapped read-only in user-space.
The driver would be responsible for migrating on write-fault, similar to
a NUMA Hint Fault on the existing transparent page placement system.
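
Roughly what that driver path might look like - entirely hypothetical,
every spm_* name here is invented, loosely modeled on the hint-fault
flow:

    /* invented sketch: write fault against a read-only SPM mapping */
    static vm_fault_t spm_handle_write_fault(struct vm_fault *vmf,
                                             struct folio *folio)
    {
            int target_nid = numa_node_id(); /* nearest sysram node */

            if (!folio_isolate_lru(folio))
                    return VM_FAULT_RETRY;
            /* hypothetical helper: migrate back to sysram, letting the
             * device decompress on the way out */
            spm_migrate_folio_to_node(folio, target_nid);
            return VM_FAULT_RETRY;  /* refault maps the new copy writable */
    }
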
> For example could a driver say allocate SPM memory and then explicitly
> migrate an existing page to it?
You might even extend migrate_pages with a new flag that simply drops
the writable bit from the page table mapping and abstracts that entire
complexity out of the driver :]
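
Something like the below at the point where migration entries are
restored (flag name invented, purely illustrative):

    #define MIGRATE_MAP_READONLY    (1UL << 7)      /* hypothetical */

    /* in the rmap walk that re-establishes PTEs after migration: */
    pte = mk_pte(new_page, vma->vm_page_prot);
    if (flags & MIGRATE_MAP_READONLY)
            pte = pte_wrprotect(pte);       /* next write refaults */
    set_pte_at(vma->vm_mm, addr, ptep, pte);
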
~Gregory