Message-ID: <aktv2ivkrvtrox6nvcpxsnq6sagxnmj4yymelgkst6pazzpogo@aexnxfcklg75>
Date: Tue, 18 Nov 2025 18:02:02 +1100
From: Alistair Popple <apopple@...dia.com>
To: Gregory Price <gourry@...rry.net>
Cc: linux-mm@...ck.org, kernel-team@...a.com, linux-cxl@...r.kernel.org, 
	linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev, linux-fsdevel@...r.kernel.org, 
	cgroups@...r.kernel.org, dave@...olabs.net, jonathan.cameron@...wei.com, 
	dave.jiang@...el.com, alison.schofield@...el.com, vishal.l.verma@...el.com, 
	ira.weiny@...el.com, dan.j.williams@...el.com, longman@...hat.com, 
	akpm@...ux-foundation.org, david@...hat.com, lorenzo.stoakes@...cle.com, 
	Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, 
	mhocko@...e.com, osalvador@...e.de, ziy@...dia.com, matthew.brost@...el.com, 
	joshua.hahnjy@...il.com, rakie.kim@...com, byungchul@...com, ying.huang@...ux.alibaba.com, 
	mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com, 
	vincent.guittot@...aro.org, dietmar.eggemann@....com, rostedt@...dmis.org, 
	bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, tj@...nel.org, 
	hannes@...xchg.org, mkoutny@...e.com, kees@...nel.org, muchun.song@...ux.dev, 
	roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, rientjes@...gle.com, jackmanb@...gle.com, 
	cl@...two.org, harry.yoo@...cle.com, axelrasmussen@...gle.com, 
	yuanchu@...gle.com, weixugc@...gle.com, zhengqi.arch@...edance.com, 
	yosry.ahmed@...ux.dev, nphamcs@...il.com, chengming.zhou@...ux.dev, 
	fabio.m.de.francesco@...ux.intel.com, rrichter@....com, ming.li@...omail.com, usamaarif642@...il.com, 
	brauner@...nel.org, oleg@...hat.com, namcao@...utronix.de, escape@...ux.alibaba.com, 
	dongjoo.seo1@...sung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes

On 2025-11-13 at 06:29 +1100, Gregory Price <gourry@...rry.net> wrote...
> This is a code RFC for discussion related to
> 
> "Mempolicy is dead, long live memory policy!"
> https://lpc.events/event/19/contributions/2143/
> 
> base-commit: 24172e0d79900908cf5ebf366600616d29c9b417
> (version notes at end)
> 
> At LSF 2026, I plan to discuss:

Excellent! This all sounds quite interesting, to me at least, so I've added my
two cents here, but I'm looking forward to discussing it at LPC.

> - Why? (In short: shunting to DAX is a failed pattern for users)
> - Other designs I considered (mempolicy, cpusets, zone_device)

I'm interested in the contrast with zone_device, and in particular why
device_coherent memory doesn't end up being a good fit for this.

> - Why mempolicy.c and cpusets as-is are insufficient
> - SPM types seeking this form of interface (Accelerator, Compression)

I'm sure you can guess my interest is in GPUs, which also have memory that some
people consider should only be used for specific purposes :-) Currently our
coherent GPUs online this as a normal NUMA node, for which we have also
generally found mempolicy, cpusets, etc. inadequate, so it will be interesting
to hear what shortcomings you have been running into (I'm less familiar with
the Compression cases you talk about here though).

> - Platform extensions that would be nice to see (SPM-only Bits)
> 
> Open Questions
> - Single SPM nodemask, or multiple based on features?
> - Apply SPM/SysRAM bit on-boot only or at-hotplug?
> - Allocate extra "possible" NUMA nodes for flexibility?

I guess this might make hotplug easier? Particularly in cases where FW hasn't
created the nodes.

> - Should SPM Nodes be zone-restricted? (MOVABLE only?)

For device-based memory I think so - otherwise you can never guarantee devices
can be removed or drivers (if required to access the memory) can be unbound, as
you can't migrate things off the memory.

> - How to handle things like reclaim and compaction on these nodes.
> 
> 
> With this set, we aim to enable allocation of "special purpose memory"
> with the page allocator (mm/page_alloc.c) without exposing the same
> memory as "System RAM".  Unless a non-userland component requests it, and
> does so with the GFP_SPM_NODE flag, memory on these nodes cannot be allocated.
> 
> This isolation mechanism is a requirement for memory policies which
> depend on certain sets of memory never being used outside special
> interfaces (such as a specific mm/component or driver).
> 
> We present an example of using this mechanism within ZSWAP, as-if
> a "compressed memory node" was present.  How to describe the features
> of memory present on nodes is left up to comment here and at LPC '26.
> 
> Userspace-driven allocations are restricted by the sysram_nodes mask,
> nothing in userspace can explicitly request memory from SPM nodes.
> 
> Instead, the intent is to create new components which understand memory
> features and register those nodes with those components. This abstracts
> the hardware complexity away from userland while also not requiring new
> memory innovations to carry entirely new allocators.
> 
> The ZSwap example demonstrates this with the `mt_spm_nodemask`.  This
> hack treats all spm nodes as-if they are compressed memory nodes, and
> we bypass the software compression logic in zswap in favor of simply
> copying memory directly to the allocated page.  In a real design

So in your example (I get it's a hack) is the main advantage that you can use
all the same memory allocation policies (e.g. cgroups) when needing to allocate
the pages? Given this is ZSwap I guess these pages would never be mapped
directly into user-space, but would anything in the design prevent that? For
example, could a driver, say, allocate SPM memory and then explicitly migrate an
existing page to it?

> There are 4 major changes in this set:
> 
> 1) Introducing mt_sysram_nodelist in mm/memory-tiers.c which denotes
>    the set of nodes which are eligible for use as normal system ram
> 
>    Some existing users now pass mt_sysram_nodelist into the page
>    allocator instead of NULL, but passing a NULL pointer in will simply
>    have it replaced by mt_sysram_nodelist anyway.  Should a fully NULL
>    pointer still make it to the page allocator, without GFP_SPM_NODE
>    SPM node zones will simply be skipped.
> 
>    mt_sysram_nodelist is always guaranteed to contain the N_MEMORY nodes
>    present during __init, but if empty the use of mt_sysram_nodes()
>    will return a NULL to preserve current behavior.
> 
> 
> 2) The addition of `cpuset.mems.sysram` which restricts allocations to
>    `mt_sysram_nodes` unless GFP_SPM_NODE is used.
> 
>    SPM Nodes are still allowed in cpuset.mems.allowed and effective.
> 
>    This is done to allow separate control over sysram and SPM node sets
>    by cgroups while maintaining the existing hierarchical rules.
> 
>    current cpuset configuration
>    cpuset.mems_allowed
>     |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
>     |->tasks.mems_allowed    < cpuset.mems_effective
> 
>    new cpuset configuration
>    cpuset.mems_allowed
>     |.mems_effective         < (mems_allowed ∩ parent.mems_effective)
>     |.sysram_nodes           < (mems_effective ∩ default_sys_nodemask)
>     |->task.sysram_nodes     < cpuset.sysram_nodes
> 
>    This means mems_allowed still restricts all node usage in any given
>    task context, which is the existing behavior.
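
A minimal userspace sketch of the mask derivation described in (2), assuming
nodemasks can be modelled as 64-bit bitmaps. The names cpuset_model, node_bit
and cpuset_model_update are hypothetical and are not taken from the patch set;
only the two intersections come from the diagram above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit N set means node N is present. */
typedef uint64_t nodemask;

/* Hypothetical model of the per-cpuset masks named in the diagram. */
struct cpuset_model {
	nodemask mems_allowed;   /* configured by the user */
	nodemask mems_effective; /* mems_allowed ∩ parent.mems_effective */
	nodemask sysram_nodes;   /* mems_effective ∩ default SysRAM mask */
};

static nodemask node_bit(unsigned int nid)
{
	assert(nid < 64);
	return (nodemask)1 << nid;
}

/* Recompute the derived masks from the parent and the global SysRAM mask. */
static void cpuset_model_update(struct cpuset_model *cs,
				nodemask parent_effective,
				nodemask default_sysram)
{
	cs->mems_effective = cs->mems_allowed & parent_effective;
	cs->sysram_nodes = cs->mems_effective & default_sysram;
}
```

With mems_allowed = {0,2,3}, a parent effective mask of {0,1,2} and a default
SysRAM mask of {0,1}, this model yields mems_effective = {0,2} and
sysram_nodes = {0}: node 2 stays usable as an SPM node but is not visible to
default allocations.
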
> 
> 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
>    capacity being added should mark the node as an SPM Node. 
> 
>    A node is either SysRAM or SPM - never both.  Attempting to add
>    incompatible memory to a node results in hotplug failure.
> 
>    DAX and CXL are made aware of the bit and have `spm_node` bits added
>    to their relevant subsystems.
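
The "either SysRAM or SPM - never both" rule in (3) can be sketched as a
per-node type that is fixed by the first hotplug and checked on every later
one. This is a userspace model under stated assumptions: model_add_memory, the
MODEL_* names and the -EINVAL return value are hypothetical, and the real
memory_hotplug.c logic may differ.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical node classification; a node starts out unset. */
enum model_node_type { MODEL_NODE_UNSET, MODEL_NODE_SYSRAM, MODEL_NODE_SPM };

#define MODEL_MAX_NODES 8
static enum model_node_type model_node_types[MODEL_MAX_NODES];

/*
 * Model of adding capacity to a node. 'spm' plays the role of MHP_SPM_NODE.
 * The first add fixes the node type; a later add of the other type fails.
 * The error code is an assumption, not taken from the patch set.
 */
static int model_add_memory(unsigned int nid, int spm)
{
	enum model_node_type want = spm ? MODEL_NODE_SPM : MODEL_NODE_SYSRAM;

	assert(nid < MODEL_MAX_NODES);
	if (model_node_types[nid] == MODEL_NODE_UNSET) {
		model_node_types[nid] = want;
		return 0;
	}
	return model_node_types[nid] == want ? 0 : -EINVAL;
}
```

In this model, adding SPM capacity to a node twice succeeds, while adding
SysRAM capacity to the same node afterwards is rejected.
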
> 
> 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
>    from the provided node or nodemask.  It changes the behavior of
>    the cpuset mems_allowed and mt_node_allowed() checks.
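
One possible reading of (2) and (4) together, as a userspace sketch: without
the flag a node must be in the task's sysram_nodes, and with the flag the
check widens to the cpuset's mems_effective. The function model_node_allowed
and the MODEL_GFP_SPM_NODE value are hypothetical, and the exact semantics of
the real cpuset and mt_node_allowed() checks are an assumption here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: bit N set means node N is present. */
typedef uint64_t nodemask;

/* Hypothetical flag value; the real GFP_SPM_NODE bit is defined by the series. */
#define MODEL_GFP_SPM_NODE 0x1u

/*
 * Decide whether an allocation with flags 'gfp' may use node 'nid'.
 * Default requests are limited to sysram_nodes; requests carrying the
 * SPM flag are limited only by mems_effective.
 */
static bool model_node_allowed(unsigned int nid, unsigned int gfp,
			       nodemask mems_effective, nodemask sysram_nodes)
{
	nodemask allowed;

	assert(nid < 64);
	allowed = (gfp & MODEL_GFP_SPM_NODE) ? mems_effective : sysram_nodes;
	return (allowed & ((nodemask)1 << nid)) != 0;
}
```

With mems_effective = {0,2} and sysram_nodes = {0}, node 2 is rejected for a
plain request and accepted once the flag is set, while node 1 is rejected
either way because mems_effective still bounds everything.
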
> 
> v1->v2:
> - naming improvements
>     default_node -> sysram_node
>     protected    -> spm (Specific Purpose Memory)
> - add missing constify patch
> - add patch to update callers of __cpuset_zone_allowed
> - add additional logic to the mm sysram_nodes patch
> - fix bot build issues (ifdef config builds)
> - fix out-of-tree driver build issues (function renames)
> - change compressed_nodelist to spm_nodelist
> - add latch mechanism for sysram/spm nodes (Dan Williams)
>   this drops some extra memory-hotplug logic which is nice
> v1: https://lore.kernel.org/linux-mm/20251107224956.477056-1-gourry@gourry.net/
> 
> Gregory Price (11):
>   mm: constify oom_control, scan_control, and alloc_context nodemask
>   mm: change callers of __cpuset_zone_allowed to cpuset_zone_allowed
>   gfp: Add GFP_SPM_NODE for Specific Purpose Memory (SPM) allocations
>   memory-tiers: Introduce SysRAM and Specific Purpose Memory Nodes
>   mm: restrict slub, oom, compaction, and page_alloc to sysram by
>     default
>   mm,cpusets: rename task->mems_allowed to task->sysram_nodes
>   cpuset: introduce cpuset.mems.sysram
>   mm/memory_hotplug: add MHP_SPM_NODE flag
>   drivers/dax: add spm_node bit to dev_dax
>   drivers/cxl: add spm_node bit to cxl region
>   [HACK] mm/zswap: compressed ram integration example
> 
>  drivers/cxl/core/region.c       |  30 ++++++
>  drivers/cxl/cxl.h               |   2 +
>  drivers/dax/bus.c               |  39 ++++++++
>  drivers/dax/bus.h               |   1 +
>  drivers/dax/cxl.c               |   1 +
>  drivers/dax/dax-private.h       |   1 +
>  drivers/dax/kmem.c              |   2 +
>  fs/proc/array.c                 |   2 +-
>  include/linux/cpuset.h          |  62 +++++++------
>  include/linux/gfp_types.h       |   5 +
>  include/linux/memory-tiers.h    |  47 ++++++++++
>  include/linux/memory_hotplug.h  |  10 ++
>  include/linux/mempolicy.h       |   2 +-
>  include/linux/mm.h              |   4 +-
>  include/linux/mmzone.h          |   6 +-
>  include/linux/oom.h             |   2 +-
>  include/linux/sched.h           |   6 +-
>  include/linux/swap.h            |   2 +-
>  init/init_task.c                |   2 +-
>  kernel/cgroup/cpuset-internal.h |   8 ++
>  kernel/cgroup/cpuset-v1.c       |   7 ++
>  kernel/cgroup/cpuset.c          | 158 ++++++++++++++++++++------------
>  kernel/fork.c                   |   2 +-
>  kernel/sched/fair.c             |   4 +-
>  mm/compaction.c                 |  10 +-
>  mm/hugetlb.c                    |   8 +-
>  mm/internal.h                   |   2 +-
>  mm/memcontrol.c                 |   3 +-
>  mm/memory-tiers.c               |  66 ++++++++++++-
>  mm/memory_hotplug.c             |   7 ++
>  mm/mempolicy.c                  |  34 +++----
>  mm/migrate.c                    |   4 +-
>  mm/mmzone.c                     |   5 +-
>  mm/oom_kill.c                   |  11 ++-
>  mm/page_alloc.c                 |  57 +++++++-----
>  mm/show_mem.c                   |  11 ++-
>  mm/slub.c                       |  15 ++-
>  mm/vmscan.c                     |   6 +-
>  mm/zswap.c                      |  66 ++++++++++++-
>  39 files changed, 532 insertions(+), 178 deletions(-)
> 
> -- 
> 2.51.1
> 
