Message-ID: <48078454-f441-4699-9c50-db93783f00fd@nvidia.com>
Date: Wed, 26 Nov 2025 14:23:23 +1100
From: Balbir Singh <balbirs@...dia.com>
To: Gregory Price <gourry@...rry.net>, linux-mm@...ck.org
Cc: kernel-team@...a.com, linux-cxl@...r.kernel.org,
linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev,
linux-fsdevel@...r.kernel.org, cgroups@...r.kernel.org, dave@...olabs.net,
jonathan.cameron@...wei.com, dave.jiang@...el.com,
alison.schofield@...el.com, vishal.l.verma@...el.com, ira.weiny@...el.com,
dan.j.williams@...el.com, longman@...hat.com, akpm@...ux-foundation.org,
david@...hat.com, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com,
vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
osalvador@...e.de, ziy@...dia.com, matthew.brost@...el.com,
joshua.hahnjy@...il.com, rakie.kim@...com, byungchul@...com,
ying.huang@...ux.alibaba.com, apopple@...dia.com, mingo@...hat.com,
peterz@...radead.org, juri.lelli@...hat.com, vincent.guittot@...aro.org,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, tj@...nel.org, hannes@...xchg.org,
mkoutny@...e.com, kees@...nel.org, muchun.song@...ux.dev,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, rientjes@...gle.com,
jackmanb@...gle.com, cl@...two.org, harry.yoo@...cle.com,
axelrasmussen@...gle.com, yuanchu@...gle.com, weixugc@...gle.com,
zhengqi.arch@...edance.com, yosry.ahmed@...ux.dev, nphamcs@...il.com,
chengming.zhou@...ux.dev, fabio.m.de.francesco@...ux.intel.com,
rrichter@....com, ming.li@...omail.com, usamaarif642@...il.com,
brauner@...nel.org, oleg@...hat.com, namcao@...utronix.de,
escape@...ux.alibaba.com, dongjoo.seo1@...sung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
On 11/13/25 06:29, Gregory Price wrote:
> This is a code RFC for discussion related to
>
> "Mempolicy is dead, long live memory policy!"
> https://lpc.events/event/19/contributions/2143/
>
:)
I am trying to read through your series, but in the past I tried
https://lwn.net/Articles/720380/
> base-commit: 24172e0d79900908cf5ebf366600616d29c9b417
> (version notes at end)
>
> At LSF 2026, I plan to discuss:
> - Why? (In short: shunting to DAX is a failed pattern for users)
> - Other designs I considered (mempolicy, cpusets, zone_device)
> - Why mempolicy.c and cpusets as-is are insufficient
> - SPM types seeking this form of interface (Accelerator, Compression)
> - Platform extensions that would be nice to see (SPM-only Bits)
>
> Open Questions
> - Single SPM nodemask, or multiple based on features?
> - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> - Allocate extra "possible" NUMA nodes for flexibility?
> - Should SPM Nodes be zone-restricted? (MOVABLE only?)
> - How to handle things like reclaim and compaction on these nodes?
>
>
> With this set, we aim to enable allocation of "special purpose memory"
> with the page allocator (mm/page_alloc.c) without exposing the same
> memory as "System RAM". Unless a non-userland component explicitly requests
> it with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated.
>
> This isolation mechanism is a requirement for memory policies which
> depend on certain sets of memory never being used outside special
> interfaces (such as a specific mm/component or driver).
>
> We present an example of using this mechanism within ZSWAP, as-if
> a "compressed memory node" was present. How to describe the features
> of memory present on nodes is left up to comment here and at LPC '26.
>
> Userspace-driven allocations are restricted by the sysram_nodes mask;
> nothing in userspace can explicitly request memory from SPM nodes.
>
> Instead, the intent is to create new components which understand memory
> features and register those nodes with those components. This abstracts
> the hardware complexity away from userland while also not requiring new
> memory innovations to carry entirely new allocators.
>
> The ZSwap example demonstrates this with the `mt_spm_nodemask`. This
> hack treats all spm nodes as-if they are compressed memory nodes, and
> we bypass the software compression logic in zswap in favor of simply
> copying memory directly to the allocated page. In a real design
>
> There are 4 major changes in this set:
>
> 1) Introducing mt_sysram_nodelist in mm/memory-tiers.c which denotes
> the set of nodes which are eligible for use as normal system ram
>
> Some existing users now pass mt_sysram_nodelist into the page
> allocator instead of NULL, but passing a NULL pointer in will simply
> have it replaced by mt_sysram_nodelist anyway. Should a fully NULL
> pointer still make it to the page allocator, without GFP_SPM_NODE
> SPM node zones will simply be skipped.
>
> mt_sysram_nodelist is always guaranteed to contain the N_MEMORY nodes
> present during __init, but if empty the use of mt_sysram_nodes()
> will return a NULL to preserve current behavior.
>
>
> 2) The addition of `cpuset.mems.sysram` which restricts allocations to
> `mt_sysram_nodes` unless GFP_SPM_NODE is used.
>
> SPM Nodes are still allowed in cpuset.mems.allowed and effective.
>
> This is done to allow separate control over sysram and SPM node sets
> by cgroups while maintaining the existing hierarchical rules.
>
> current cpuset configuration
> cpuset.mems_allowed
> |.mems_effective < (mems_allowed ∩ parent.mems_effective)
> |->tasks.mems_allowed < cpuset.mems_effective
>
> new cpuset configuration
> cpuset.mems_allowed
> |.mems_effective < (mems_allowed ∩ parent.mems_effective)
> |.sysram_nodes < (mems_effective ∩ default_sys_nodemask)
> |->task.sysram_nodes < cpuset.sysram_nodes
>
> This means mems_allowed still restricts all node usage in any given
> task context, which is the existing behavior.
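The mask derivation in the "new cpuset configuration" diagram can be sketched as a small user-space model. This is only an illustration of the intersections as described in the cover letter, not code from the series: `nodemask_model_t`, `struct cpuset_model` and `cpuset_model_update()` are hypothetical names, nodes are represented as bits in a 64-bit word, and the real cpuset rules (for example, inheriting from the parent when mems_allowed is empty) are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: bit N set means NUMA node N is in the mask. */
typedef uint64_t nodemask_model_t;

/* Hypothetical stand-in for the per-cpuset masks named in the diagram. */
struct cpuset_model {
	nodemask_model_t mems_allowed;   /* configured by the admin */
	nodemask_model_t mems_effective; /* mems_allowed ∩ parent.mems_effective */
	nodemask_model_t sysram_nodes;   /* mems_effective ∩ default sysram mask */
};

/*
 * Recompute the derived masks of a child cpuset.
 * default_sysram models the global set of SysRAM nodes
 * (default_sys_nodemask in the diagram).
 */
static void cpuset_model_update(struct cpuset_model *cs,
				const struct cpuset_model *parent,
				nodemask_model_t default_sysram)
{
	assert(cs != NULL && parent != NULL);

	/* A child can never see nodes its parent cannot see. */
	cs->mems_effective = cs->mems_allowed & parent->mems_effective;

	/* SysRAM nodes are the effective nodes that are not SPM. */
	cs->sysram_nodes = cs->mems_effective & default_sysram;
}
```

In this model, a child with mems_allowed covering nodes 1-2 under a parent with nodes 0-3, where only nodes 0-1 are SysRAM, ends up with mems_effective = {1,2} and sysram_nodes = {1}. Node 2 stays visible in mems_effective but is reachable only through the SPM path.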
>
> 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
> capacity being added should mark the node as an SPM Node.
>
> A node is either SysRAM or SPM - never both. Attempting to add
> incompatible memory to a node results in hotplug failure.
>
> DAX and CXL are made aware of the bit and have `spm_node` bits added
> to their relevant subsystems.
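The "either SysRAM or SPM, never both" rule reads like a per-node latch: the first memory added decides the node type and later additions must match. A minimal user-space sketch of that reading follows. All names here (`model_node_types`, `model_add_memory()`, the enum values) are hypothetical, and the `-EINVAL` return for a mismatch is an assumption; the cover letter only says that hotplug fails.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_MAX_NODES 8

/* Hypothetical node states; a node starts out with no type. */
enum model_node_type {
	MODEL_NODE_UNSET = 0,
	MODEL_NODE_SYSRAM,
	MODEL_NODE_SPM,
};

/* Static storage is zero-initialised, so every node starts UNSET. */
static enum model_node_type model_node_types[MODEL_MAX_NODES];

/*
 * Model of adding memory to node nid. is_spm stands in for the
 * MHP_SPM_NODE flag. The first add latches the node type; a later add
 * of the other type is rejected (error code assumed, not from the series).
 */
static int model_add_memory(int nid, int is_spm)
{
	enum model_node_type want = is_spm ? MODEL_NODE_SPM : MODEL_NODE_SYSRAM;

	assert(nid >= 0 && nid < MODEL_MAX_NODES);

	if (model_node_types[nid] == MODEL_NODE_UNSET) {
		model_node_types[nid] = want;
		return 0;
	}

	return model_node_types[nid] == want ? 0 : -EINVAL;
}
```

With this model, adding SPM memory to an empty node succeeds, adding more SPM memory to it succeeds, and adding SysRAM to the same node fails.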
>
> 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
> from the provided node or nodemask. It changes the behavior of
> the cpuset mems_allowed and mt_node_allowed() checks.
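One way to read the combined effect of the sysram mask and GFP_SPM_NODE on the allocator's node check is sketched below as a user-space model. It assumes that mems_effective always bounds the allowed nodes, and that without the flag a node must additionally be in sysram_nodes. `MODEL_GFP_SPM_NODE` and `model_node_allowed()` are hypothetical names and the flag value is arbitrary; the actual checks in the series live in the cpuset and mt_node_allowed() paths and may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit standing in for GFP_SPM_NODE. */
#define MODEL_GFP_SPM_NODE 0x1u

/*
 * Decide whether an allocation with the given flags may use node nid.
 * Masks use one bit per node (bit N set means node N is present).
 */
static bool model_node_allowed(unsigned int nid, unsigned int gfp,
			       uint64_t mems_effective, uint64_t sysram_nodes)
{
	uint64_t bit;

	assert(nid < 64);
	bit = UINT64_C(1) << nid;

	/* mems_effective restricts all node usage, SPM or not. */
	if (!(mems_effective & bit))
		return false;

	/* Callers that opt in may use any node they are otherwise allowed. */
	if (gfp & MODEL_GFP_SPM_NODE)
		return true;

	/* Everyone else is confined to SysRAM nodes. */
	return (sysram_nodes & bit) != 0;
}
```

For example, with mems_effective = {1,2} and sysram_nodes = {1}, a plain allocation may use node 1 only, while an allocation carrying the SPM flag may also use node 2. Node 3 is rejected either way because it is outside mems_effective.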
>
> v1->v2:
> - naming improvements
> default_node -> sysram_node
> protected -> spm (Specific Purpose Memory)
> - add missing constify patch
> - add patch to update callers of __cpuset_zone_allowed
> - add additional logic to the mm sysram_nodes patch
> - fix bot build issues (ifdef config builds)
> - fix out-of-tree driver build issues (function renames)
> - change compressed_nodelist to spm_nodelist
> - add latch mechanism for sysram/spm nodes (Dan Williams)
> this drops some extra memory-hotplug logic which is nice
> v1: https://lore.kernel.org/linux-mm/20251107224956.477056-1-gourry@gourry.net/
>
> Gregory Price (11):
> mm: constify oom_control, scan_control, and alloc_context nodemask
> mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
> gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
> memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
> mm: restrict slub, oom, compaction, and page_alloc to sysram by
> default
> mm,cpusets: rename task->mems_allowed to task->sysram_nodes
> cpuset: introduce cpuset.mems.sysram
> mm/memory_hotplug: add MHP_SPM_NODE flag
> drivers/dax: add spm_node bit to dev_dax
> drivers/cxl: add spm_node bit to cxl region
> [HACK] mm/zswap: compressed ram integration example
>
> drivers/cxl/core/region.c | 30 ++++++
> drivers/cxl/cxl.h | 2 +
> drivers/dax/bus.c | 39 ++++++++
> drivers/dax/bus.h | 1 +
> drivers/dax/cxl.c | 1 +
> drivers/dax/dax-private.h | 1 +
> drivers/dax/kmem.c | 2 +
> fs/proc/array.c | 2 +-
> include/linux/cpuset.h | 62 +++++++------
> include/linux/gfp_types.h | 5 +
> include/linux/memory-tiers.h | 47 ++++++++++
> include/linux/memory_hotplug.h | 10 ++
> include/linux/mempolicy.h | 2 +-
> include/linux/mm.h | 4 +-
> include/linux/mmzone.h | 6 +-
> include/linux/oom.h | 2 +-
> include/linux/sched.h | 6 +-
> include/linux/swap.h | 2 +-
> init/init_task.c | 2 +-
> kernel/cgroup/cpuset-internal.h | 8 ++
> kernel/cgroup/cpuset-v1.c | 7 ++
> kernel/cgroup/cpuset.c | 158 ++++++++++++++++++++------------
> kernel/fork.c | 2 +-
> kernel/sched/fair.c | 4 +-
> mm/compaction.c | 10 +-
> mm/hugetlb.c | 8 +-
> mm/internal.h | 2 +-
> mm/memcontrol.c | 3 +-
> mm/memory-tiers.c | 66 ++++++++++++-
> mm/memory_hotplug.c | 7 ++
> mm/mempolicy.c | 34 +++----
> mm/migrate.c | 4 +-
> mm/mmzone.c | 5 +-
> mm/oom_kill.c | 11 ++-
> mm/page_alloc.c | 57 +++++++-----
> mm/show_mem.c | 11 ++-
> mm/slub.c | 15 ++-
> mm/vmscan.c | 6 +-
> mm/zswap.c | 66 ++++++++++++-
> 39 files changed, 532 insertions(+), 178 deletions(-)
>
Balbir