Message-ID: <aSSepu6NDqS8HHCa@gourry-fedora-PF4VCD3F>
Date: Mon, 24 Nov 2025 11:06:30 -0700
From: Gregory Price <gourry@...rry.net>
To: "David Hildenbrand (Red Hat)" <david@...nel.org>
Cc: linux-mm@...ck.org, kernel-team@...a.com, linux-cxl@...r.kernel.org,
	linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev,
	linux-fsdevel@...r.kernel.org, cgroups@...r.kernel.org,
	dave@...olabs.net, jonathan.cameron@...wei.com,
	dave.jiang@...el.com, alison.schofield@...el.com,
	vishal.l.verma@...el.com, ira.weiny@...el.com,
	dan.j.williams@...el.com, longman@...hat.com,
	akpm@...ux-foundation.org, lorenzo.stoakes@...cle.com,
	Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org,
	surenb@...gle.com, mhocko@...e.com, osalvador@...e.de,
	ziy@...dia.com, matthew.brost@...el.com, joshua.hahnjy@...il.com,
	rakie.kim@...com, byungchul@...com, ying.huang@...ux.alibaba.com,
	apopple@...dia.com, mingo@...hat.com, peterz@...radead.org,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
	mgorman@...e.de, vschneid@...hat.com, tj@...nel.org,
	hannes@...xchg.org, mkoutny@...e.com, kees@...nel.org,
	muchun.song@...ux.dev, roman.gushchin@...ux.dev,
	shakeel.butt@...ux.dev, rientjes@...gle.com, jackmanb@...gle.com,
	cl@...two.org, harry.yoo@...cle.com, axelrasmussen@...gle.com,
	yuanchu@...gle.com, weixugc@...gle.com, zhengqi.arch@...edance.com,
	yosry.ahmed@...ux.dev, nphamcs@...il.com, chengming.zhou@...ux.dev,
	fabio.m.de.francesco@...ux.intel.com, rrichter@....com,
	ming.li@...omail.com, usamaarif642@...il.com, brauner@...nel.org,
	oleg@...hat.com, namcao@...utronix.de, escape@...ux.alibaba.com,
	dongjoo.seo1@...sung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes

On Mon, Nov 24, 2025 at 10:19:37AM +0100, David Hildenbrand (Red Hat) wrote:
> [...]
> 

Apologies in advance for the wall of text; both of your questions really
do cut to the core of the series.  The first (SPM nodes) is basically a
plumbing problem I haven't had time to address pre-LPC, and the second
(GFP) is a design decision that is definitely up in the air.

So consider this a dump of everything I wouldn't have had time to cover
in the LPC session.

> > 3) Addition of MHP_SPM_NODE flag to instruct memory_hotplug.c that the
> >     capacity being added should mark the node as an SPM Node.
> 
> Sounds a bit like the wrong interface for configuring this. This smells like
> a per-node setting that should be configured before hotplugging any memory.
> 

Assuming you're specifically talking about the MHP portion of this.

I agree, and I think the plumbing ultimately goes through acpi and
kernel configs.  This was my shortest path to demonstrate a functional
prototype by LPC.

I think the most likely option is simply reserving additional NUMA nodes
for hotpluggable regions based on a Kconfig setting.

I think the real setup process should look as follows:

1. At __init time, Linux reserves additional SPM nodes based on some
   configuration (build? runtime? etc)

   Essentially create:  nodes[N_SPM]

2. At SPM setup time, a driver registers an "Abstract Type" with
   mm/memory-tiers.c, which maps SPM->Type.

   This gives the core some management callback infrastructure without
   polluting the core with device specific nonsense.

   This also gives the driver a chance to define things like SLIT
   distances for those nodes, which otherwise won't exist.

3. At hotplug time, memory_hotplug.c should only have to flip a bit
   in `mt_sysram_nodes` if NID is not in nodes[N_SPM].  That logic
   is still there to ensure the base filtering works as intended.


I haven't quite figured out how to plumb out nodes[N_SPM] as described
above, but I did figure out how to demonstrate roughly the same effect
through memory_hotplug.c - hopefully that much is clear.
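
Just to make step 2 a bit more concrete, here is the rough shape I have
in my head for the driver-side registration.  Every name below is made
up purely for illustration (none of it exists in this series):

/* Hypothetical "abstract type" a driver registers for an SPM node */
struct spm_type {
        const char *name;         /* e.g. "accel-coherent-mem" */
        int local_distance;       /* SLIT-style distance, since no SLIT
                                   * entry exists for a reserved SPM node */
        int (*online)(int nid);   /* management callbacks for the core */
        void (*offline)(int nid);
};

/* Hypothetical memory-tiers entry point mapping SPM node -> Type */
int mt_register_spm_type(int nid, struct spm_type *type);

The driver would grab one of the reserved nodes[N_SPM] entries at setup
time and register its type before adding any capacity, so the core never
needs device-specific knowledge.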

The problem with the above plan is whether it "makes sense" according
to the ACPI specs and friends.

This operates in "Ambiguity Land", which is uncomfortable.

======== How Linux ingests ACPI Tables to make NUMA nodes =======
For the sake of completeness:

NUMA nodes are "marked as possible" primarily via entries in the ACPI
SRAT (Static Resource Affinity Table).
https://docs.kernel.org/driver-api/cxl/platform/acpi/srat.html

        Subtable Type : 01 [Memory Affinity]
               Length : 28
     Proximity Domain : 00000001          <- NUMA Node 1

A proximity domain (PXM) is simply a logical grouping of components
according to the OSPM.  Linux takes PXMs and maps them to NUMA nodes.

In most cases NR_PXM == NR_NODES, but not always.  For example, if
the CXL Early Discovery Table (CEDT) describes a CXL memory region for
which there is no SRAT entry, Linux reserves a "fake PXM" id and
marks that ID as a "possible" NUMA node.

= drivers/acpi/numa/srat.c

int __init acpi_numa_init(void)
{
...
        /* fake_pxm is the next unused PXM value after SRAT parsing */
        for (i = 0, fake_pxm = -1; i < MAX_NUMNODES; i++) {
                if (node_to_pxm_map[i] > fake_pxm)
                        fake_pxm = node_to_pxm_map[i];
        }
        last_real_pxm = fake_pxm;
        fake_pxm++;
        acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
                              &fake_pxm);
...
}

static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
                                   void *arg, const unsigned long table_end)
{
...
        /* No SRAT description. Create a new node. */
        node = acpi_map_pxm_to_node(*fake_pxm);
...
        node_set(node, numa_nodes_parsed);   <- this is used to set N_POSSIBLE
}


Here's where we get into "Specification Ambiguity"

The ACPI spec does not limit (as far as I can see) a memory region from
being associated with multiple proximity domains (NUMA nodes).

Therefore, the OSPM could actually report it multiple times in the SRAT
in order to reserve multiple NUMA node possibilities for the same device.

A further extension to ACPI could be used to mark such Memory PXMs as
"Specific Purpose" - similar to the EFI_MEMORY_SP bit used to mark
memory regions as "Soft Reserved".

(This would probably break quite a lot of existing Linux code; a quick
 browse around gives you the sense that there's an assumption that a
 given page can only be affiliated with one possible NUMA node.)

But Linux could also utilize build or runtime settings to add additional
nodes which are reserved for SPM use - but are otherwise left out of
all the default maps.  This at least seems reasonable.

Note: the N_POSSIBLE node mask is set at __init time and is more or less
expected to never change.  It's probably preferable to work within this
restriction rather than try to change it; relaxing it invites many race
conditions.
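
As a rough sketch of what the reservation could look like (everything
here is hypothetical - the CONFIG_SPM_RESERVED_NODES knob and the
spm_reserved_nodes mask are made-up names, not part of this series),
something run late in NUMA init could do:

static nodemask_t spm_reserved_nodes;           /* the nodes[N_SPM] idea */

static void __init spm_reserve_possible_nodes(void)
{
        int i;

        for (i = 0; i < CONFIG_SPM_RESERVED_NODES; i++) {
                int nid = first_unset_node(node_possible_map);

                if (nid >= MAX_NUMNODES)
                        break;                          /* node ids exhausted */
                node_set(nid, node_possible_map);       /* mark N_POSSIBLE */
                node_set(nid, spm_reserved_nodes);      /* reserve for SPM */
        }
}

These nodes would then simply be left out of N_MEMORY, mt_sysram_nodes
and the cpuset masks until a driver claims one of them.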

<skippable wall>
================= Spec nonsense for reference ====================
(ACPI 6.5 Spec)
5.2.16.2 Memory Affinity Structure
The Memory Affinity structure provides the following topology information statically to the operating system:
• The association between a memory range and the proximity domain to which it belongs
• Information about whether the memory range can be hot-plugged.


5.2.19 Maximum System Characteristics Table (MSCT)
This section describes the format of the Maximum System Characteristic Table (MSCT), which provides OSPM with
information characteristics of a system’s maximum topology capabilities. If the system maximum topology is not
known up front at boot time, then this table is not present. OSPM will use information provided by the MSCT only
when the System Resource Affinity Table (SRAT) exists. The MSCT must contain all proximity and clock domains
defined in the SRAT.

-- field: Maximum Number of Proximity Domains
   Indicates the maximum number of Proximity Domains ever possible in the system.

   In theory the MSCT could report (MAX_NODES > (NR_NODES in SRAT)), and
   that delta could be used to indicate the presence of SPM nodes.

   This doesn't solve the SLIT PXM distance problem.


6.2.14 _PXM (Proximity)
This optional object is used to describe proximity domain associations within a machine. _PXM evaluates to an integer
that identifies a device as belonging to a Proximity Domain defined in the System Resource Affinity Table (SRAT).
OSPM assumes that two devices in the same proximity domain are tightly coupled.


17.2.1 System Resource Affinity Table Definition
The optional System Resource Affinity Table (SRAT) provides the boot time description of the processor and memory
ranges belonging to a system locality. OSPM will consume the SRAT only at boot time. For any devices not in the
SRAT, OSPM should use _PXM (Proximity) for them or their ancestors that are hot-added into the system after boot
up.

The SRAT describes the system locality that all processors and memory present in a system belong to at system boot.
This includes memory that can be hot-added (that is memory that can be added to the system while it is running,
without requiring a reboot). OSPM can use this information to optimize the performance of NUMA architecture
systems. For example, OSPM could utilize this information to optimize allocation of memory resources and the
scheduling of software threads.

=============================================================
</skippable wall>


So TL;DR: Yes, I agree, this should be configured at __init time, but
while we work on that plumbing, the memory_hotplug.c interface can be
used to unblock exploratory work (such as Alistair's GPU interests).

> > 4) Adding GFP_SPM_NODE - which allows page_alloc.c to request memory
> >     from the provided node or nodemask.  It changes the behavior of
> >     the cpuset mems_allowed and mt_node_allowed() checks.
> 
> I wonder why that is required. Couldn't we disallow allocation from one of
> these special nodes as default, and only allow it if someone explicitly
> passes in the node for allocation?
> 
> What's the problem with that?
> 

Simple answer:  We can choose how hard this guardrail is to break.

This initial attempt makes it "Hard":
   You cannot "accidentally" allocate SPM, the call must be explicit.

Removing the GFP would work, and make it "Easier" to access SPM memory.
(There would be other adjustments needed, but the idea is the same).


To do this, you would revert the cpuset changes so the checks always use
mems_allowed (instead of sysram_nodes).

This would allow a trivial 

   mbind(range, SPM_NODE_ID)

Which is great, but is also an incredible tripping hazard:

   numactl --interleave=all

and in kernel land:

   __alloc_pages_noprof(..., nodes[N_MEMORY])

These would now instantly be exposed to SPM node memory.


The first pass leverages the GFP flag to make all these tripping hazards
disappear.  You can pass a completely garbage nodemask into the page
allocator and still rest assured that you won't touch SPM nodes.
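
For contrast, reaching SPM memory under this model takes something like
the following - the caller has to name both the node and the flag
(__GFP_SPM_NODE is the flag from this series; the rest is just an
illustrative snippet):

        nodemask_t spm_mask = NODE_MASK_NONE;
        struct page *page;

        node_set(spm_nid, spm_mask);
        /* explicit opt-in: neither a default policy nor a loosened
         * allocation context can wander in here by accident */
        page = __alloc_pages(GFP_KERNEL | __GFP_SPM_NODE, 0,
                             spm_nid, &spm_mask);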


So TL;DR: "What do we want here?" (if anything at all)


For completeness, here are the page_alloc/cpuset/mempolicy interactions
which led me to a GFP flag as the "loosening mechanism" for the filter,
rather than allowing any nodemask to "just work".


Apologies again for the wall of text here, essentially dumping
~6 months of research and prototyping.

====================
There are basically 3 components which interact with each other:

   1) the page allocator nodemask / zone logic
   2) cpuset.mems_allowed
   3) mempolicy (task, vma)

   and now:

   4) GFP_SPM_NODE


=== 1) the page allocator nodemask and zone iteration logic

   - page allocator uses prepare_alloc_pages() to decide what
     alloc_context.nodemask will contain

   - nodemask can be NULL or a set of nodes.

   - the for_zone() iteration logic will iterate all zones if mask=NULL;
     otherwise, it skips zones on nodes not present in the mask

   - the value of alloc_context.nodemask may change;
     for example it may end up loosened in an interrupt context or
     when reclaim/compaction/fallbacks are invoked.


Some issues might be obvious:

   It would be bad, for example, for an interrupt to have its allocation
   context loosened to nodes[N_MEMORY] and end up allocating SPM memory

   Capturing all of these scenarios would be very difficult if not
   impossible.



The page allocator does an initial filtering of nodes if nodemask=NULL,
or it defers the filter operation to the allocation logic if a nodemask
is present (or we're in an interrupt context).

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
                int preferred_nid, nodemask_t *nodemask,
                struct alloc_context *ac, gfp_t *alloc_gfp,
                unsigned int *alloc_flags)
{
...
        ac->nodemask = nodemask;
        if (cpuset_enabled()) {
	...
                if (in_task() && !ac->nodemask)
                        ac->nodemask = &cpuset_current_mems_allowed;
			               ^^^^ current_task.mems_allowed
                else 
                        *alloc_flags |= ALLOC_CPUSET;
			^^^ apply cpuset check during allocation instead
        }
}


Note here: If cpuset is not enabled, we don't filter!
           Patch 05/11 uses mt_sysram_nodes to filter in that scenario.

In the actual allocation logic, we use this nodemask (or cpusets) to
filter out unwanted nodes.

static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                                                const struct alloc_context *ac)
{
        z = ac->preferred_zoneref;
        for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
                                        ac->nodemask) {
		^ if nodemask=NULL - iterates ALL zones in all nodes ^
	...
                if (cpuset_enabled() &&
                        (alloc_flags & ALLOC_CPUSET) &&
                        !__cpuset_zone_allowed(zone, gfp_mask))
                                continue;
 		^^^^^^^^  Skip zone if not in mems_allowed ^^^^^^^^^


Of course we could change the page allocator logic more explicitly
to support this kind of scenario.

For example:

   We might add alloc_spm_pages() which checks mems_allowed instead
   of sysram_nodes.

I tried this, and the code duplication and spaghetti it resulted in
was embarrassing.  It did work, but adding hundreds of lines to
page_alloc.c, with the risk of breaking something, led me to quickly
discard it.

It also just bluntly made using SPM memory worse - you just want to
call alloc_pages(nodemask) and be done with it.

This is what led me to focus on modifying cpuset.mems_allowed and
adding global filter logic when cpusets are disabled.



=== 2) cpuset.mems

   - cpuset.mems_allowed is the "primary filter" for most allocations

   - if cpusets is not enabled, basically all nodes are "allowed"

   - cpuset.mems_allowed is an *inherited value*

     child cgroups are restricted by the parent's mems_allowed
     cpuset.effective_mems is the actual nodemask filter.

cpuset.mems_allowed as-is cannot both restrict *AND* allow SPM nodes.

See the filtering functions above:

   If you remove an SPM node from root_cgroup.cpuset.mems_allowed
   to prevent all of its children from using it, you effectively prevent
   ANYTHING from using it:  the node is simply not allowed.

   Since all tasks operate from within the root context or its
   children, you can never "Allow" the node.

   If you don't remove the SPM node from the root cgroup, you aren't
   preventing tasks in the root cgroup from accessing the memory.


I chose to break mems_allowed into (mems_allowed, sysram_nodes) to:

  a) create simple nodemask=NULL default nodemask filters:
     mt_sysram_nodes, cpuset.sysram_nodes, task.sysram_nodes

  b) Leverage the existing cpuset filtering mechanism in
     mems_allowed() checks

  c) Simplify the non-cpuset filter mechanism to a 2-line change
     in page_alloc.c -- from Patch 04/11:

@@ -3753,6 +3754,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		if ((alloc_flags & ALLOC_CPUSET) &&
 		    !cpuset_zone_allowed(zone, gfp_mask))
 			continue;
+		else if (!mt_node_allowed(zone_to_nid(zone), gfp_mask))
+			continue;



page_alloc.c changes are much cleaner and easier to understand this way


=== 3) mempolicy

   - mempolicy allows you to change the task or vma node-policy, separate
     from (but restricted by) cpuset.mems

   - there are some policies like interleave which provide (ALL) options
     that create, basically, a nodemask=nodes[N_MEMORY] scenario.

   - This is entirely controllable via userspace.

   - There exists a lot of software out there which makes use of this
     interface via numactl and the mempolicy syscalls (set_mempolicy,
     mbind, etc.)

   - There is a global "default" mempolicy which is leveraged when
     task->mempolicy=NULL or vma->vm_policy=NULL.

     The default policy is essentially "Allocate from local node, but
     fallback to any possible node as-needed"
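
     (For reference, the global default policy in mm/mempolicy.c is just
      MPOL_LOCAL with no nodemask - quoting roughly from memory, so
      double check the source for the exact definition:)

static struct mempolicy default_policy = {
        .refcnt = ATOMIC_INIT(1),       /* never free it */
        .mode = MPOL_LOCAL,
};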


During my initial explorations I started by looking at whether a filter
function could be implemented via the global policy.

It should be somewhat obvious that this falls apart completely as soon
as you notice the page allocator actually filters using cpusets.

So mempolicy is dead as a candidate for any real isolation mechanism.
It is nothing more than a suggestion at best, and it is explicitly
ignored by things like reclaim.

   (cough: Mempolicy is dead, long live Memory Policy)

I was also very worried about introducing an SPM Node solution that
presented itself as an isolation mechanism... and then immediately
crashed and burned when deployed by anyone already using numactl.

I have since, however, been experimenting with how you might enable
mempolicy to include SPM nodes more explicitly (with the GFP flag).

(attached at the end, completely untested, just conceptual).


=== 4) GFP_SPM_NODE

Once the filtering functions are in place (sysram_nodes), we've hit
a point where absolutely nothing can actually touch those nodes at all.

So that was requirement #1... but of course we do actually want to
allocate this memory; that's the point.  But now we have a choice...

If a node is present in the nodemask, we can:

   1) filter it based on sysram_nodes
      a) cpuset.sysram, or
      b) mt_sysram_nodes

   or

   2) filter it based on mems_allowed
      a) cpuset.effective_mems, or
      b) nodes[N_MEMORY]


The first choice is "Hard Guardrails" - it requires both an explicit mask
AND the GFP flag to reach SPM memory.

The second choice is "Soft Guardrails" - more or less any nodemask is
allowed, and we trust the callers to be sane.


The cpuset filter functions already had a gfp argument, by the way:

bool cpuset_current_node_allowed(int node, gfp_t gfp_mask) {...}
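
The non-cpuset filter from Patches 04-05 then boils down to roughly this
shape (paraphrased, not the literal patch code):

static inline bool mt_node_allowed(int nid, gfp_t gfp_mask)
{
        /* normal SysRAM nodes are always fair game */
        if (node_isset(nid, mt_sysram_nodes))
                return true;
        /* SPM nodes require the explicit opt-in flag */
        return !!(gfp_mask & __GFP_SPM_NODE);
}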


I chose the former for the first pass due to the mempolicy section
above.  If someone has an idea of how to apply this filtering logic
WITHOUT the GFP flag, I absolutely welcome suggestions.

My only other idea was separate alloc_spm_pages() interfaces, and that
just felt bad.

~Gregory


---------------  mempolicy extension ----------

mempolicy: add MPOL_F_SPM_NODE

Add a way for mempolicies to access SPM nodes.

Require MPOL_F_STATIC_NODES to prevent the policy mask from being
remapped onto other nodes.

Note: This doesn't work as-is because mempolicies are restricted by
cpuset.sysram_nodes instead of cpuset.mems_allowed, so the nodemask
will be rejected.  This can be changed in the new/rebind mempolicy
interfaces.

Signed-off-by: Gregory Price

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 8fbbe613611a..c26aa8fb56d3 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -31,6 +31,7 @@ enum {
 #define MPOL_F_STATIC_NODES    (1 << 15)
 #define MPOL_F_RELATIVE_NODES  (1 << 14)
 #define MPOL_F_NUMA_BALANCING  (1 << 13) /* Optimize with NUMA balancing if possible */
+#define MPOL_F_SPM_NODE        (1 << 12) /* Nodemask contains SPM Nodes */

 /*
  * MPOL_MODE_FLAGS is the union of all possible optional mode flags passed to
diff --git a/mm/memory.c b/mm/memory.c
index b59ae7ce42eb..7097d7045954 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3459,8 +3459,14 @@ static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma)
 {
        struct file *vm_file = vma->vm_file;

-       if (vm_file)
-               return mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+       if (vm_file) {
+               gfp_t gfp;
+               gfp = mapping_gfp_mask(vm_file->f_mapping) | __GFP_FS | __GFP_IO;
+               if (vma->vm_policy)
+                       gfp |= (vma->vm_policy->flags & MPOL_F_SPM_NODE) ?
+                               __GFP_SPM_NODE : 0;
+               return gfp;
+       }

        /*
         * Special mappings (e.g. VDSO) do not have any file so fake
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e1e8a1f3e1a2..2b4d23983ef8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1652,6 +1652,8 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
                return -EINVAL;
        if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
                return -EINVAL;
+       if ((*flags & MPOL_F_SPM_NODE) && !(*flags & MPOL_F_STATIC_NODES))
+               return -EINVAL;
        if (*flags & MPOL_F_NUMA_BALANCING) {
                if (*mode == MPOL_BIND || *mode == MPOL_PREFERRED_MANY)
                        *flags |= (MPOL_F_MOF | MPOL_F_MORON);

