Message-ID: <6604d787-1744-4acf-80c0-e428fee1677e@nvidia.com>
Date: Mon, 12 Jan 2026 22:12:23 +1100
From: Balbir Singh <balbirs@...dia.com>
To: Gregory Price <gourry@...rry.net>, linux-mm@...ck.org,
cgroups@...r.kernel.org, linux-cxl@...r.kernel.org
Cc: linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-fsdevel@...r.kernel.org, kernel-team@...a.com, longman@...hat.com,
tj@...nel.org, hannes@...xchg.org, mkoutny@...e.com, corbet@....net,
gregkh@...uxfoundation.org, rafael@...nel.org, dakr@...nel.org,
dave@...olabs.net, jonathan.cameron@...wei.com, dave.jiang@...el.com,
alison.schofield@...el.com, vishal.l.verma@...el.com, ira.weiny@...el.com,
dan.j.williams@...el.com, akpm@...ux-foundation.org, vbabka@...e.cz,
surenb@...gle.com, mhocko@...e.com, jackmanb@...gle.com, ziy@...dia.com,
david@...nel.org, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com,
rppt@...nel.org, axelrasmussen@...gle.com, yuanchu@...gle.com,
weixugc@...gle.com, yury.norov@...il.com, linux@...musvillemoes.dk,
rientjes@...gle.com, shakeel.butt@...ux.dev, chrisl@...nel.org,
kasong@...cent.com, shikemeng@...weicloud.com, nphamcs@...il.com,
bhe@...hat.com, baohua@...nel.org, yosry.ahmed@...ux.dev,
chengming.zhou@...ux.dev, roman.gushchin@...ux.dev, muchun.song@...ux.dev,
osalvador@...e.de, matthew.brost@...el.com, joshua.hahnjy@...il.com,
rakie.kim@...com, byungchul@...com, ying.huang@...ux.alibaba.com,
apopple@...dia.com, cl@...two.org, harry.yoo@...cle.com,
zhengqi.arch@...edance.com
Subject: Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for
device-managed memory

On 1/9/26 06:37, Gregory Price wrote:
> This series introduces N_PRIVATE, a new node state for memory nodes
> whose memory is not intended for general system consumption. Today,
> device drivers (CXL, accelerators, etc.) hotplug their memory to access
> mm/ services like page allocation and reclaim, but this exposes general
> workloads to memory with different characteristics and reliability
> guarantees than system RAM.
>
> N_PRIVATE provides isolation by default while enabling explicit access
> via __GFP_THISNODE for subsystems that understand how to manage these
> specialized memory regions.
>
I assume each class of N_PRIVATE is a separate set of NUMA nodes; could
these be real or virtual memory nodes?
> Motivation
> ==========
>
> Several emerging memory technologies require kernel memory management
> services but should not be used for general allocations:
>
> - CXL Compressed RAM (CRAM): Hardware-compressed memory where the
> effective capacity depends on data compressibility. Uncontrolled
> use risks capacity exhaustion when compression ratios degrade.
>
> - Accelerator Memory: GPU/TPU-attached memory optimized for specific
> access patterns and not intended for general allocation.
>
> - Tiered Memory: Memory intended only as a demotion target, not for
> initial allocations.
>
> Currently, these devices either avoid hotplugging entirely (losing mm/
> services) or hotplug as regular N_MEMORY (risking reliability issues).
> N_PRIVATE solves this by creating an isolated node class.
>
> Design
> ======
>
> The series introduces:
>
> 1. N_PRIVATE node state (mutually exclusive with N_MEMORY)
We should call it N_PRIVATE_MEMORY
> 2. private_memtype enum for policy-based access control
> 3. cpuset.mems.sysram for user-visible isolation
> 4. Integration points for subsystems (zswap demonstrated)
> 5. A cxl private_region example to demonstrate full plumbing
>
> Private Memory Types (private_memtype)
> ======================================
>
> The private_memtype enum defines policy bits that control how different
> kernel subsystems may access private nodes:
>
> enum private_memtype {
> NODE_MEM_NOTYPE, /* No type assigned (invalid state) */
> NODE_MEM_ZSWAP, /* Swap compression target */
> NODE_MEM_COMPRESSED, /* General compressed RAM */
> NODE_MEM_ACCELERATOR, /* Accelerator-attached memory */
> NODE_MEM_DEMOTE_ONLY, /* Memory-tier demotion target only */
> NODE_MAX_MEMTYPE,
> };
>
> These types serve as policy hints for subsystems:
>
Do these nodes have fallback(s)? Are they prone to OOM when memory is
exhausted in one class of N_PRIVATE node(s)?
What about page cache allocations from these nodes? Since default
allocations never use them, a filesystem would need to do additional
work to allocate on them, if there were ever a desire to use them. Would
memory migration work between N_PRIVATE and N_MEMORY using move_pages()?
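For instance, would something like this userspace snippet be able to
pull a page off a private node? A minimal sketch using the existing
move_pages(2) API from libnuma; the node numbers are invented:

  #include <numaif.h>   /* move_pages(2); link with -lnuma */
  #include <stdio.h>

  /* Hypothetical: ask the kernel to migrate one page, currently
   * resident on an N_PRIVATE node, back to sysram node 0. */
  int migrate_back_to_sysram(void *page_addr)
  {
          void *pages[1] = { page_addr };
          int nodes[1]   = { 0 };  /* target: a regular N_MEMORY node */
          int status[1];

          if (move_pages(0 /* self */, 1, pages, nodes, status,
                         MPOL_MF_MOVE) < 0) {
                  perror("move_pages");
                  return -1;
          }
          /* status[0] holds the resulting node or a negative errno */
          printf("page now on node %d\n", status[0]);
          return 0;
  }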
> NODE_MEM_ZSWAP
> --------------
> Nodes with this type are registered as zswap compression targets. When
> zswap compresses a page, it can allocate directly from ZSWAP-typed nodes
> using __GFP_THISNODE, bypassing software compression if the device
> provides hardware compression.
>
> Example flow:
> 1. CXL device creates private_region with type=zswap
> 2. Driver calls node_register_private() with NODE_MEM_ZSWAP
> 3. zswap_add_direct_node() registers the node as a compression target
> 4. On swap-out, zswap allocates from the private node
> 5. page_allocated() callback validates compression ratio headroom
> 6. page_freed() callback zeros pages to improve device compression
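
For concreteness, a minimal sketch of how I read steps 4-5 below;
zswap_alloc_direct() and private_node_page_allocated() are names I made
up, only alloc_pages_node() with __GFP_THISNODE is confirmed by the
cover letter:

  #include <linux/gfp.h>

  /* Sketch only: zswap-side allocation from a ZSWAP-typed private
   * node. The callback dispatch helper is an assumption based on the
   * flow described above, not the actual patch contents. */
  static struct page *zswap_alloc_direct(int private_nid)
  {
          struct page *page;

          /* Step 4: explicit opt-in, bypassing N_PRIVATE isolation */
          page = alloc_pages_node(private_nid,
                                  GFP_KERNEL | __GFP_THISNODE, 0);
          if (!page)
                  return NULL;

          /* Step 5: let the device driver veto the allocation if
           * compression headroom is low (stubbed in this series) */
          if (private_node_page_allocated(page)) {
                  __free_page(page); /* fires page_freed() -> zeroing */
                  return NULL;       /* caller falls back to sw zswap */
          }
          return page;
  }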
>
> Prototype Note:
> This patch set does not actually do compression ratio validation, as
> this requires an actual device to provide some kind of counter and/or
> interrupt to denote when allocations are safe. The callbacks are
> left as stubs with TODOs for device vendors to pick up the next step
> (we'll continue with a QEMU example if reception is positive).
>
> For now, this always succeeds because compressed capacity equals
> real capacity.
>
> NODE_MEM_COMPRESSED (CRAM)
> --------------------------
> For general compressed RAM devices. Unlike ZSWAP nodes, CRAM nodes
> could be exposed to subsystems that understand compression semantics:
>
> - vmscan: Could prefer demoting pages to CRAM nodes before swap
> - memory-tiering: Could place CRAM between DRAM and persistent memory
> - zram: Could use as backing store instead of or alongside zswap
>
> Such a component (mm/cram.c) would differ from zswap or zram by allowing
> the compressed pages to remain mapped Read-Only in the page table.
>
> NODE_MEM_ACCELERATOR
> --------------------
> For GPU/TPU/accelerator-attached memory. Policy implications:
>
> - Default allocations: Never (isolated from general page_alloc)
> - GPU drivers: Explicit allocation via __GFP_THISNODE
> - NUMA balancing: Excluded from automatic migration
> - Memory tiering: Not a demotion target
>
> Some GPU vendors want management of their memory via NUMA nodes, but
> don't want fallback or migration allocations to occur. This enables
> that pattern.
>
> mm/mempolicy.c could be used to allow for N_PRIVATE nodes of this type
> if the intent is per-vma access to accelerator memory (e.g. via mbind),
> but this is omitted from this series for now to limit userland
> exposure until first-class examples are provided.
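
If that mempolicy route is taken later, the userland pattern would
presumably be the existing mbind(2) one. A userspace sketch follows;
the accelerator node id is made up, and whether MPOL_BIND would accept
an N_PRIVATE node is exactly what the series leaves open:

  #include <numaif.h>    /* mbind(2); link with -lnuma */
  #include <sys/mman.h>

  /* Hypothetical: bind a VMA to accelerator node accel_nid, assumed
   * to be an N_PRIVATE node of type NODE_MEM_ACCELERATOR. */
  void *map_accel_memory(size_t len, int accel_nid)
  {
          unsigned long nodemask = 1UL << accel_nid;
          void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return NULL;
          /* With this series, policy validation would reject a
           * private node here; mempolicy is kept out of scope. */
          if (mbind(p, len, MPOL_BIND, &nodemask,
                    sizeof(nodemask) * 8, 0)) {
                  munmap(p, len);
                  return NULL;
          }
          return p;
  }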
>
> NODE_MEM_DEMOTE_ONLY
> --------------------
> For memory intended exclusively as a demotion target in memory tiering:
>
> - page_alloc: Never allocates initially (slab, page faults, etc.)
> - vmscan/reclaim: Valid demotion target during memory pressure
> - memory-tiering: Allow hotness monitoring/promotion for this region
>
> This enables "cold storage" tiers using slower/cheaper memory (CXL-
> attached DRAM, persistent memory in volatile mode) without the memory
> appearing in allocation fast paths.
>
> This also has the additional benefit of enforcing that memory
> placement on these nodes uses movable allocations only (with all the
> normal caveats around page pinning).
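
The demotion path would then be the one explicit consumer of such
nodes. A minimal sketch of what that allocation could look like; this
is my extrapolation from the description above, not code from the
series:

  #include <linux/gfp.h>

  /* Sketch: a DEMOTE_ONLY node is only reachable via an explicit,
   * movable, no-fallback allocation from the demotion path. */
  static struct page *demote_alloc_page(int demote_nid)
  {
          /* GFP_HIGHUSER_MOVABLE keeps the node movable-only;
           * __GFP_THISNODE prevents fallback to other nodes. */
          return alloc_pages_node(demote_nid,
                                  GFP_HIGHUSER_MOVABLE |
                                  __GFP_THISNODE | __GFP_NOWARN, 0);
  }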
>
> Subsystem Integration Points
> ============================
>
> The private_node_ops structure provides callbacks for integration:
>
> struct private_node_ops {
> struct list_head list;
> resource_size_t res_start;
> resource_size_t res_end;
> enum private_memtype memtype;
> int (*page_allocated)(struct page *page, void *data);
> void (*page_freed)(struct page *page, void *data);
> void *data;
> };
>
> page_allocated(): Called after allocation, returns 0 to accept or
> -ENOSPC/-ENODEV to reject (caller retries elsewhere). Enables:
> - Compression ratio enforcement for CRAM/zswap
> - Capacity tracking for accelerator memory
> - Rate limiting for demotion targets
>
> page_freed(): Called on free, enables:
> - Zeroing for compression ratio recovery
> - Capacity accounting updates
> - Device-specific cleanup
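
From the struct above, I'd expect the driver-side wiring to look
roughly like the following. The my_cram_* names are invented, and I'm
guessing at the registration step since the cover letter doesn't show
node_register_private()'s signature:

  #include <linux/highmem.h>

  static int my_cram_page_allocated(struct page *page, void *data)
  {
          struct my_cram_dev *dev = data;

          /* Reject the allocation if compression headroom is gone;
           * the caller is expected to retry on another node. */
          if (my_cram_headroom_low(dev))
                  return -ENOSPC;
          return 0;
  }

  static void my_cram_page_freed(struct page *page, void *data)
  {
          /* Zero the page so the device can recover its ratio */
          clear_highpage(page);
  }

  static struct private_node_ops my_cram_ops = {
          .memtype        = NODE_MEM_COMPRESSED,
          .page_allocated = my_cram_page_allocated,
          .page_freed     = my_cram_page_freed,
          /* res_start/res_end/data filled in at probe time, then
           * presumably handed to node_register_private() */
  };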
>
> Isolation Enforcement
> =====================
>
> The series modifies core allocators to respect N_PRIVATE isolation:
>
> - page_alloc: Constrains zone iteration to cpuset.mems.sysram
> - slub: Allocates only from N_MEMORY nodes
> - compaction: Skips N_PRIVATE nodes
> - mempolicy: Uses sysram_nodes for policy evaluation
>
> __GFP_THISNODE bypasses isolation, enabling explicit access:
>
> page = alloc_pages_node(private_nid, GFP_KERNEL | __GFP_THISNODE, 0);
>
> This pattern is used by zswap, and would be used by other subsystems
> that explicitly opt into private node access.
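
So on the allocator side the net effect presumably reduces to a check
like the one below. Illustrative only: the series implements this by
constraining zone iteration to cpuset.mems.sysram rather than a
per-zone test like this:

  #include <linux/mmzone.h>

  /* Illustrative check, not the literal patch: during zonelist
   * iteration, private nodes are skipped unless the caller named
   * the node explicitly with __GFP_THISNODE. */
  static bool zone_allowed(struct zone *zone, gfp_t gfp_mask)
  {
          int nid = zone_to_nid(zone);

          if (!node_state(nid, N_PRIVATE))
                  return true;    /* normal sysram node */
          return gfp_mask & __GFP_THISNODE;
  }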
>
> User-Visible Changes
> ====================
>
> cpuset gains cpuset.mems.sysram (read-only), which shows N_MEMORY nodes.
>
> ABI: /proc/<pid>/status Mems_allowed shows sysram nodes only.
>
> Drivers create private regions via sysfs:
> echo region0 > /sys/bus/cxl/.../create_private_region
> echo zswap > /sys/bus/cxl/.../region0/private_type
> echo 1 > /sys/bus/cxl/.../region0/commit
>
> Series Organization
> ===================
>
> Patch 1: numa,memory_hotplug: create N_PRIVATE (Private Nodes)
> Core infrastructure: N_PRIVATE node state, node_mark_private(),
> private_memtype enum, and private_node_ops registration.
>
> Patch 2: mm: constify oom_control, scan_control, and alloc_context
> nodemask
> Preparatory cleanup for enforcing that nodemasks don't change.
>
> Patch 3: mm: restrict slub, compaction, and page_alloc to sysram
> Enforce N_MEMORY-only allocation for general paths.
>
> Patch 4: cpuset: introduce cpuset.mems.sysram
> User-visible isolation via cpuset interface.
>
> Patch 5: Documentation/admin-guide/cgroups: update docs for mems_allowed
> Document the new behavior and sysram_nodes.
>
> Patch 6: drivers/cxl/core/region: add private_region
> CXL infrastructure for private regions.
>
> Patch 7: mm/zswap: compressed ram direct integration
> Zswap integration demonstrating direct hardware compression.
>
> Patch 8: drivers/cxl: add zswap private_region type
> Complete example: CXL region as zswap compression target.
>
> Future Work
> ===========
>
> This series provides the foundation. Planned follow-ups include:
>
> - CRAM integration with vmscan for smart demotion
> - ACCELERATOR type for GPU memory management
> - Memory-tiering integration with DEMOTE_ONLY nodes
>
> Testing
> =======
>
> All patches build cleanly. Tested with:
> - CXL QEMU emulation with private regions
> - Zswap stress tests with private compression targets
> - Cpuset verification of mems.sysram isolation
>
>
> Gregory Price (8):
> numa,memory_hotplug: create N_PRIVATE (Private Nodes)
> mm: constify oom_control, scan_control, and alloc_context nodemask
> mm: restrict slub, compaction, and page_alloc to sysram
> cpuset: introduce cpuset.mems.sysram
> Documentation/admin-guide/cgroups: update docs for mems_allowed
> drivers/cxl/core/region: add private_region
> mm/zswap: compressed ram direct integration
> drivers/cxl: add zswap private_region type
>
> .../admin-guide/cgroup-v1/cpusets.rst | 19 +-
> Documentation/admin-guide/cgroup-v2.rst | 26 ++-
> Documentation/filesystems/proc.rst | 2 +-
> drivers/base/node.c | 199 ++++++++++++++++++
> drivers/cxl/core/Makefile | 1 +
> drivers/cxl/core/core.h | 4 +
> drivers/cxl/core/port.c | 4 +
> drivers/cxl/core/private_region/Makefile | 12 ++
> .../cxl/core/private_region/private_region.c | 129 ++++++++++++
> .../cxl/core/private_region/private_region.h | 14 ++
> drivers/cxl/core/private_region/zswap.c | 127 +++++++++++
> drivers/cxl/core/region.c | 63 +++++-
> drivers/cxl/cxl.h | 22 ++
> include/linux/cpuset.h | 24 ++-
> include/linux/gfp.h | 6 +
> include/linux/mm.h | 4 +-
> include/linux/mmzone.h | 6 +-
> include/linux/node.h | 60 ++++++
> include/linux/nodemask.h | 1 +
> include/linux/oom.h | 2 +-
> include/linux/swap.h | 2 +-
> include/linux/zswap.h | 5 +
> kernel/cgroup/cpuset-internal.h | 8 +
> kernel/cgroup/cpuset-v1.c | 8 +
> kernel/cgroup/cpuset.c | 98 ++++++---
> mm/compaction.c | 6 +-
> mm/internal.h | 2 +-
> mm/memcontrol.c | 2 +-
> mm/memory_hotplug.c | 2 +-
> mm/mempolicy.c | 6 +-
> mm/migrate.c | 4 +-
> mm/mmzone.c | 5 +-
> mm/page_alloc.c | 31 +--
> mm/show_mem.c | 9 +-
> mm/slub.c | 8 +-
> mm/vmscan.c | 6 +-
> mm/zswap.c | 106 +++++++++-
> 37 files changed, 942 insertions(+), 91 deletions(-)
> create mode 100644 drivers/cxl/core/private_region/Makefile
> create mode 100644 drivers/cxl/core/private_region/private_region.c
> create mode 100644 drivers/cxl/core/private_region/private_region.h
> create mode 100644 drivers/cxl/core/private_region/zswap.c
> ---
> base-commit: 803dd4b1159cf9864be17aab8a17653e6ecbbbb6
>
Thanks,
Balbir