Message-ID: <6604d787-1744-4acf-80c0-e428fee1677e@nvidia.com>
Date: Mon, 12 Jan 2026 22:12:23 +1100
From: Balbir Singh <balbirs@...dia.com>
To: Gregory Price <gourry@...rry.net>, linux-mm@...ck.org,
 cgroups@...r.kernel.org, linux-cxl@...r.kernel.org
Cc: linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
 linux-fsdevel@...r.kernel.org, kernel-team@...a.com, longman@...hat.com,
 tj@...nel.org, hannes@...xchg.org, mkoutny@...e.com, corbet@....net,
 gregkh@...uxfoundation.org, rafael@...nel.org, dakr@...nel.org,
 dave@...olabs.net, jonathan.cameron@...wei.com, dave.jiang@...el.com,
 alison.schofield@...el.com, vishal.l.verma@...el.com, ira.weiny@...el.com,
 dan.j.williams@...el.com, akpm@...ux-foundation.org, vbabka@...e.cz,
 surenb@...gle.com, mhocko@...e.com, jackmanb@...gle.com, ziy@...dia.com,
 david@...nel.org, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com,
 rppt@...nel.org, axelrasmussen@...gle.com, yuanchu@...gle.com,
 weixugc@...gle.com, yury.norov@...il.com, linux@...musvillemoes.dk,
 rientjes@...gle.com, shakeel.butt@...ux.dev, chrisl@...nel.org,
 kasong@...cent.com, shikemeng@...weicloud.com, nphamcs@...il.com,
 bhe@...hat.com, baohua@...nel.org, yosry.ahmed@...ux.dev,
 chengming.zhou@...ux.dev, roman.gushchin@...ux.dev, muchun.song@...ux.dev,
 osalvador@...e.de, matthew.brost@...el.com, joshua.hahnjy@...il.com,
 rakie.kim@...com, byungchul@...com, ying.huang@...ux.alibaba.com,
 apopple@...dia.com, cl@...two.org, harry.yoo@...cle.com,
 zhengqi.arch@...edance.com
Subject: Re: [RFC PATCH v3 0/8] mm,numa: N_PRIVATE node isolation for
 device-managed memory

On 1/9/26 06:37, Gregory Price wrote:
> This series introduces N_PRIVATE, a new node state for memory nodes 
> whose memory is not intended for general system consumption.  Today,
> device drivers (CXL, accelerators, etc.) hotplug their memory to access
> mm/ services like page allocation and reclaim, but this exposes general
> workloads to memory with different characteristics and reliability
> guarantees than system RAM.
> 
> N_PRIVATE provides isolation by default while enabling explicit access
> via __GFP_THISNODE for subsystems that understand how to manage these
> specialized memory regions.
> 

I assume each class of N_PRIVATE is a separate set of NUMA nodes; could
these be either real or virtual memory nodes?

> Motivation
> ==========
> 
> Several emerging memory technologies require kernel memory management
> services but should not be used for general allocations:
> 
>   - CXL Compressed RAM (CRAM): Hardware-compressed memory where the
>     effective capacity depends on data compressibility.  Uncontrolled
>     use risks capacity exhaustion when compression ratios degrade.
> 
>   - Accelerator Memory: GPU/TPU-attached memory optimized for specific
>     access patterns that are not intended for general allocation.
> 
>   - Tiered Memory: Memory intended only as a demotion target, not for
>     initial allocations.
> 
> Currently, these devices either avoid hotplugging entirely (losing mm/
> services) or hotplug as regular N_MEMORY (risking reliability issues).
> N_PRIVATE solves this by creating an isolated node class.
> 
> Design
> ======
> 
> The series introduces:
> 
>   1. N_PRIVATE node state (mutually exclusive with N_MEMORY)

We should call it N_PRIVATE_MEMORY.

>   2. private_memtype enum for policy-based access control
>   3. cpuset.mems.sysram for user-visible isolation
>   4. Integration points for subsystems (zswap demonstrated)
>   5. A cxl private_region example to demonstrate full plumbing
> 
> Private Memory Types (private_memtype)
> ======================================
> 
> The private_memtype enum defines policy bits that control how different
> kernel subsystems may access private nodes:
> 
>   enum private_memtype {
>       NODE_MEM_NOTYPE,      /* No type assigned (invalid state) */
>       NODE_MEM_ZSWAP,       /* Swap compression target */
>       NODE_MEM_COMPRESSED,  /* General compressed RAM */
>       NODE_MEM_ACCELERATOR, /* Accelerator-attached memory */
>       NODE_MEM_DEMOTE_ONLY, /* Memory-tier demotion target only */
>       NODE_MAX_MEMTYPE,
>   };
> 
> These types serve as policy hints for subsystems:
> 

Do these nodes have fallback(s)? Are they prone to OOM when memory in one
class of N_PRIVATE node(s) is exhausted?


What about page cache allocation from these nodes? Since default allocations
never use them, a file system would need to do additional work to allocate
on them if there were ever a desire to use them. Would memory
migration between N_PRIVATE and N_MEMORY work using move_pages()?


> NODE_MEM_ZSWAP
> --------------
> Nodes with this type are registered as zswap compression targets.  When
> zswap compresses a page, it can allocate directly from ZSWAP-typed nodes
> using __GFP_THISNODE, bypassing software compression if the device
> provides hardware compression.
> 
> Example flow:
>   1. CXL device creates private_region with type=zswap
>   2. Driver calls node_register_private() with NODE_MEM_ZSWAP
>   3. zswap_add_direct_node() registers the node as a compression target
>   4. On swap-out, zswap allocates from the private node
>   5. page_allocated() callback validates compression ratio headroom
>   6. page_freed() callback zeros pages to improve device compression
> 
> Prototype Note:
>   This patch set does not actually do compression ratio validation, as
>   this requires an actual device to provide some kind of counter and/or
>   interrupt to denote when allocations are safe.  The callbacks are
>   left as stubs with TODOs for device vendors to pick up the next step
>   (we'll continue with a QEMU example if reception is positive).
> 
>   For now, this always succeeds because compressed=real capacity.
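So, if I follow the flow above, the swap-out fast path would look roughly
like the sketch below. The helper name private_node_page_allocated() and the
exact error handling are my guesses; only alloc_pages_node() with
__GFP_THISNODE and the page_allocated() callback are from the cover letter:

```c
/* Sketch of the zswap direct-allocation path as I understand it. */
static struct page *zswap_alloc_direct(int private_nid)
{
	struct page *page;

	/* __GFP_THISNODE is required to reach an N_PRIVATE node at all */
	page = alloc_pages_node(private_nid, GFP_KERNEL | __GFP_THISNODE, 0);
	if (!page)
		return NULL;

	/*
	 * page_allocated() may reject, e.g. when compression ratio
	 * headroom is low; the caller then retries elsewhere
	 * (presumably falling back to software compression).
	 */
	if (private_node_page_allocated(page) < 0) {
		__free_page(page);
		return NULL;
	}
	return page;
}
```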
> 
> NODE_MEM_COMPRESSED (CRAM)
> --------------------------
> For general compressed RAM devices.  Unlike ZSWAP nodes, CRAM nodes
> could be exposed to subsystems that understand compression semantics:
> 
>   - vmscan: Could prefer demoting pages to CRAM nodes before swap
>   - memory-tiering: Could place CRAM between DRAM and persistent memory
>   - zram: Could use as backing store instead of or alongside zswap
> 
> Such a component (mm/cram.c) would differ from zswap or zram by allowing
> the compressed pages to remain mapped Read-Only in the page table.
> 
> NODE_MEM_ACCELERATOR
> --------------------
> For GPU/TPU/accelerator-attached memory.  Policy implications:
> 
>   - Default allocations: Never (isolated from general page_alloc)
>   - GPU drivers: Explicit allocation via __GFP_THISNODE
>   - NUMA balancing: Excluded from automatic migration
>   - Memory tiering: Not a demotion target
> 
> Some GPU vendors want management of their memory via NUMA nodes, but
> don't want fallback or migration allocations to occur.  This enables
> that pattern.
> 
> mm/mempolicy.c could be used to allow for N_PRIVATE nodes of this type
> if the intent is per-vma access to accelerator memory (e.g. via mbind)
> but this is omitted from this series for now to limit userland
> exposure until first class examples are provided.
> 
> NODE_MEM_DEMOTE_ONLY
> --------------------
> For memory intended exclusively as a demotion target in memory tiering:
> 
>   - page_alloc: Never allocates initially (slab, page faults, etc.)
>   - vmscan/reclaim: Valid demotion target during memory pressure
>   - memory-tiering: Allow hotness monitoring/promotion for this region
> 
> This enables "cold storage" tiers using slower/cheaper memory (CXL-
> attached DRAM, persistent memory in volatile mode) without the memory
> appearing in allocation fast paths.
> 
> This also has the added benefit of enforcing that memory placement on
> these nodes is restricted to movable allocations (with all the normal
> caveats around page pinning).
> 
> Subsystem Integration Points
> ============================
> 
> The private_node_ops structure provides callbacks for integration:
> 
>   struct private_node_ops {
>       struct list_head list;
>       resource_size_t res_start;
>       resource_size_t res_end;
>       enum private_memtype memtype;
>       int (*page_allocated)(struct page *page, void *data);
>       void (*page_freed)(struct page *page, void *data);
>       void *data;
>   };
> 
> page_allocated(): Called after allocation, returns 0 to accept or
> -ENOSPC/-ENODEV to reject (caller retries elsewhere).  Enables:
>   - Compression ratio enforcement for CRAM/zswap
>   - Capacity tracking for accelerator memory
>   - Rate limiting for demotion targets
> 
> page_freed(): Called on free, enables:
>   - Zeroing for compression ratio recovery
>   - Capacity accounting updates
>   - Device-specific cleanup
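For a CRAM-style device, I imagine the registration would look something
like the below. The callback bodies, cram_headroom_low(), and the exact
node_register_private() signature are my invention based on the struct and
the flow described above:

```c
static int cram_page_allocated(struct page *page, void *data)
{
	struct cram_device *cram = data;

	/*
	 * Reject the allocation when compression headroom is exhausted;
	 * per the description, the caller retries elsewhere on -ENOSPC.
	 */
	if (cram_headroom_low(cram))
		return -ENOSPC;
	return 0;
}

static void cram_page_freed(struct page *page, void *data)
{
	/* Zero the page so the device can recover compression ratio */
	clear_highpage(page);
}

static struct private_node_ops cram_ops = {
	.memtype	= NODE_MEM_COMPRESSED,
	.page_allocated	= cram_page_allocated,
	.page_freed	= cram_page_freed,
};

/* in the driver probe path, after hotplugging the device memory: */
ret = node_register_private(nid, &cram_ops);
```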
> 
> Isolation Enforcement
> =====================
> 
> The series modifies core allocators to respect N_PRIVATE isolation:
> 
>   - page_alloc: Constrains zone iteration to cpuset.mems.sysram
>   - slub: Allocates only from N_MEMORY nodes
>   - compaction: Skips N_PRIVATE nodes
>   - mempolicy: Uses sysram_nodes for policy evaluation
> 
> __GFP_THISNODE bypasses isolation, enabling explicit access:
> 
>   page = alloc_pages_node(private_nid, GFP_KERNEL | __GFP_THISNODE, 0);
> 
> This pattern is used by zswap, and would be used by other subsystems
> that explicitly opt into private node access.
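Just to confirm my understanding of the semantics here, since it relates to
my fallback question above:

```c
/* Never returns memory from an N_PRIVATE node, even under pressure: */
page = alloc_pages(GFP_KERNEL, 0);

/*
 * Explicit opt-in: may return memory from private_nid, or NULL --
 * __GFP_THISNODE means there is no fallback to other nodes.
 */
page = alloc_pages_node(private_nid, GFP_KERNEL | __GFP_THISNODE, 0);
```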
> 
> User-Visible Changes
> ====================
> 
> cpuset gains cpuset.mems.sysram (read-only), shows N_MEMORY nodes.
> 
> ABI: /proc/<pid>/status Mems_allowed shows sysram nodes only.
> 
> Drivers create private regions via sysfs:
>   echo region0 > /sys/bus/cxl/.../create_private_region
>   echo zswap > /sys/bus/cxl/.../region0/private_type
>   echo 1 > /sys/bus/cxl/.../region0/commit
> 
> Series Organization
> ===================
> 
> Patch 1: numa,memory_hotplug: create N_PRIVATE (Private Nodes)
>   Core infrastructure: N_PRIVATE node state, node_mark_private(),
>   private_memtype enum, and private_node_ops registration.
> 
> Patch 2: mm: constify oom_control, scan_control, and alloc_context 
> nodemask
>   Preparatory cleanup for enforcing that nodemasks don't change.
> 
> Patch 3: mm: restrict slub, compaction, and page_alloc to sysram
>   Enforce N_MEMORY-only allocation for general paths.
> 
> Patch 4: cpuset: introduce cpuset.mems.sysram
>   User-visible isolation via cpuset interface.
> 
> Patch 5: Documentation/admin-guide/cgroups: update docs for mems_allowed
>   Document the new behavior and sysram_nodes.
> 
> Patch 6: drivers/cxl/core/region: add private_region
>   CXL infrastructure for private regions.
> 
> Patch 7: mm/zswap: compressed ram direct integration
>   Zswap integration demonstrating direct hardware compression.
> 
> Patch 8: drivers/cxl: add zswap private_region type
>   Complete example: CXL region as zswap compression target.
> 
> Future Work
> ===========
> 
> This series provides the foundation.  Planned follow-ups include:
> 
>   - CRAM integration with vmscan for smart demotion
>   - ACCELERATOR type for GPU memory management
>   - Memory-tiering integration with DEMOTE_ONLY nodes
> 
> Testing
> =======
> 
> All patches build cleanly.  Tested with:
>   - CXL QEMU emulation with private regions
>   - Zswap stress tests with private compression targets
>   - Cpuset verification of mems.sysram isolation
> 
> 
> Gregory Price (8):
>   numa,memory_hotplug: create N_PRIVATE (Private Nodes)
>   mm: constify oom_control, scan_control, and alloc_context nodemask
>   mm: restrict slub, compaction, and page_alloc to sysram
>   cpuset: introduce cpuset.mems.sysram
>   Documentation/admin-guide/cgroups: update docs for mems_allowed
>   drivers/cxl/core/region: add private_region
>   mm/zswap: compressed ram direct integration
>   drivers/cxl: add zswap private_region type
> 
>  .../admin-guide/cgroup-v1/cpusets.rst         |  19 +-
>  Documentation/admin-guide/cgroup-v2.rst       |  26 ++-
>  Documentation/filesystems/proc.rst            |   2 +-
>  drivers/base/node.c                           | 199 ++++++++++++++++++
>  drivers/cxl/core/Makefile                     |   1 +
>  drivers/cxl/core/core.h                       |   4 +
>  drivers/cxl/core/port.c                       |   4 +
>  drivers/cxl/core/private_region/Makefile      |  12 ++
>  .../cxl/core/private_region/private_region.c  | 129 ++++++++++++
>  .../cxl/core/private_region/private_region.h  |  14 ++
>  drivers/cxl/core/private_region/zswap.c       | 127 +++++++++++
>  drivers/cxl/core/region.c                     |  63 +++++-
>  drivers/cxl/cxl.h                             |  22 ++
>  include/linux/cpuset.h                        |  24 ++-
>  include/linux/gfp.h                           |   6 +
>  include/linux/mm.h                            |   4 +-
>  include/linux/mmzone.h                        |   6 +-
>  include/linux/node.h                          |  60 ++++++
>  include/linux/nodemask.h                      |   1 +
>  include/linux/oom.h                           |   2 +-
>  include/linux/swap.h                          |   2 +-
>  include/linux/zswap.h                         |   5 +
>  kernel/cgroup/cpuset-internal.h               |   8 +
>  kernel/cgroup/cpuset-v1.c                     |   8 +
>  kernel/cgroup/cpuset.c                        |  98 ++++++---
>  mm/compaction.c                               |   6 +-
>  mm/internal.h                                 |   2 +-
>  mm/memcontrol.c                               |   2 +-
>  mm/memory_hotplug.c                           |   2 +-
>  mm/mempolicy.c                                |   6 +-
>  mm/migrate.c                                  |   4 +-
>  mm/mmzone.c                                   |   5 +-
>  mm/page_alloc.c                               |  31 +--
>  mm/show_mem.c                                 |   9 +-
>  mm/slub.c                                     |   8 +-
>  mm/vmscan.c                                   |   6 +-
>  mm/zswap.c                                    | 106 +++++++++-
>  37 files changed, 942 insertions(+), 91 deletions(-)
>  create mode 100644 drivers/cxl/core/private_region/Makefile
>  create mode 100644 drivers/cxl/core/private_region/private_region.c
>  create mode 100644 drivers/cxl/core/private_region/private_region.h
>  create mode 100644 drivers/cxl/core/private_region/zswap.c
> ---
> base-commit: 803dd4b1159cf9864be17aab8a17653e6ecbbbb6
> 

Thanks,
Balbir

