Message-ID: <icora3w7wfisv2vtdc5w3w4kum2wbwqx2fmnxrrjo4tp7hgvem@jmb35qkh5ylx>
Date: Thu, 27 Nov 2025 16:12:05 +1100
From: Alistair Popple <apopple@...dia.com>
To: Gregory Price <gourry@...rry.net>
Cc: Kiryl Shutsemau <kirill@...temov.name>, linux-mm@...ck.org,
kernel-team@...a.com, linux-cxl@...r.kernel.org, linux-kernel@...r.kernel.org,
nvdimm@...ts.linux.dev, linux-fsdevel@...r.kernel.org, cgroups@...r.kernel.org,
dave@...olabs.net, jonathan.cameron@...wei.com, dave.jiang@...el.com,
alison.schofield@...el.com, vishal.l.verma@...el.com, ira.weiny@...el.com,
dan.j.williams@...el.com, longman@...hat.com, akpm@...ux-foundation.org, david@...hat.com,
lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, vbabka@...e.cz, rppt@...nel.org,
surenb@...gle.com, mhocko@...e.com, osalvador@...e.de, ziy@...dia.com,
matthew.brost@...el.com, joshua.hahnjy@...il.com, rakie.kim@...com, byungchul@...com,
ying.huang@...ux.alibaba.com, mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
vincent.guittot@...aro.org, dietmar.eggemann@....com, rostedt@...dmis.org,
bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, tj@...nel.org,
hannes@...xchg.org, mkoutny@...e.com, kees@...nel.org, muchun.song@...ux.dev,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, rientjes@...gle.com, jackmanb@...gle.com,
cl@...two.org, harry.yoo@...cle.com, axelrasmussen@...gle.com,
yuanchu@...gle.com, weixugc@...gle.com, zhengqi.arch@...edance.com,
yosry.ahmed@...ux.dev, nphamcs@...il.com, chengming.zhou@...ux.dev,
fabio.m.de.francesco@...ux.intel.com, rrichter@....com, ming.li@...omail.com, usamaarif642@...il.com,
brauner@...nel.org, oleg@...hat.com, namcao@...utronix.de, escape@...ux.alibaba.com,
dongjoo.seo1@...sung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes
On 2025-11-26 at 02:05 +1100, Gregory Price <gourry@...rry.net> wrote...
> On Tue, Nov 25, 2025 at 02:09:39PM +0000, Kiryl Shutsemau wrote:
> > On Wed, Nov 12, 2025 at 02:29:16PM -0500, Gregory Price wrote:
> > > With this set, we aim to enable allocation of "special purpose memory"
> > > with the page allocator (mm/page_alloc.c) without exposing the same
> > > memory as "System RAM". Unless requested by a non-userland component,
> > > and done so with the GFP_SPM_NODE flag, memory on these nodes cannot
> > > be allocated.
> >
> > How special is "special purpose memory"? If the only difference is a
> > latency/bandwidth discrepancy compared to "System RAM", I don't believe
> > it deserves this designation.
> >
>
> That is not the only discrepancy, but it can certainly be one of them.
>
> I do think, at a certain latency/bandwidth level, memory becomes
> "Specific Purpose" - because the performance implications become so
> dramatic that you cannot allow just anything to land there.
>
> In my head, I've been thinking about this list
>
> 1) Plain old memory (<100ns)
> 2) Kinda slower, but basically still memory (100-300ns)
> 3) Slow Memory (>300ns, up to 2-3us loaded latencies)
> 4) Types 1-3, but with a special feature (Such as compression)
> 5) Coherent Accelerator Memory (various interconnects now exist)
> 6) Non-coherent Shared Memory and PMEM (FAMFS, Optane, etc)
>
> Originally I was considering [3,4], but with Alistair's comments I am
> also thinking about [5] since apparently some accelerators already
> toss their memory into the page allocator for management.
Thanks.
> Re: Slow memory --
>
> Think >500-700ns cache line fetches, or 1-2us loaded.
>
> It's still "Basically just memory", but the scenarios in which
> you can use it transparently shrink significantly. If you can
> control what and how things can land there with good policy,
> this can still be a boon compared to hitting I/O.
>
> But you still want things like reclaim and compaction to run
> on this memory, and you still want buddy-allocation of this memory.
>
> Re: Compression
>
> This is a class of memory device which presents "usable memory"
> but which carries stipulations around its use.
>
> The compressed case is the example I use in this set. There is an
> inline compression mechanism on the device. If the compression ratio
> drops too low, writes can get dropped, resulting in memory poison.
>
> We could solve this kind of problem by only allowing allocation via
> demotion and hacking off the Write bit in the PTE. This provides the
> interposition needed to fend off compression-ratio issues.
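For concreteness, here is how I read the Write-bit idea - a minimal
sketch, assuming a hypothetical helper (the name and call site are made
up; pte_wrprotect()/set_pte_at()/flush_tlb_page() are the existing
primitives, and page table locking is elided):

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <asm/tlbflush.h>

/*
 * Hypothetical sketch: after a page has been demoted to a compressed
 * SPM node, clear the Write bit in its PTE so that the first store
 * faults. The fault handler can then check the device's compression
 * ratio before re-enabling writes. Page table locking is elided.
 */
static void spm_wrprotect_pte(struct vm_area_struct *vma,
			      unsigned long addr, pte_t *ptep)
{
	pte_t pte = ptep_get(ptep);

	set_pte_at(vma->vm_mm, addr, ptep, pte_wrprotect(pte));
	flush_tlb_page(vma, addr);	/* make the next write fault */
}
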
>
> But... it's basically still "just memory" - you can even leave it
> mapped in the CPU page tables and allow userland to read unimpeded.
>
> In fact, we even want things like compaction and reclaim to run here.
> This cannot be done *unless* this memory is in the page allocator;
> managing it outside the allocator basically necessitates reimplementing
> all the core services the kernel provides.
>
> Re: Accelerators
>
> Alistair has described accelerators onlining their memory as NUMA
> nodes being an existing pattern (apparently not in-tree as far as I
> can see, though).
Yeah, sadly not yet :-( Hopefully "soon". Although onlining the memory doesn't
require much driver involvement, as the GPU memory all just appears in the ACPI
tables as a CPU-less memory node anyway (which is why it ended up being easy for
people to toss it into the page allocator).
> General consensus is "don't do this" - and it should be obvious
> why. Memory pressure can cause non-workload memory to spill to
> these NUMA nodes as fallback allocation targets.
Indeed, this is a common complaint when people have done this.
> But if we had a strong isolation mechanism, this could be supported.
> I'm not convinced this kind of memory actually needs core services
> like reclaim, so I will wait to see those arguments/data before I
> conclude whether the idea is sound.
Sounds reasonable. I don't have strong arguments either way at the moment, so
will see if we can gather some data.
>
>
> >
> > I am not in favor of the new GFP flag approach. To me, this indicates
> > that our infrastructure surrounding nodemasks is lacking. I believe we
> > would benefit more by improving it rather than simply adding a GFP flag
> > on top.
> >
>
> The core of this series is not the GFP flag; it is the splitting of
> (cpuset.mems_allowed) into (cpuset.mems_allowed, cpuset.sysram_nodes)
>
> That is the nodemask infrastructure improvement. The GFP flag is one
> mechanism for loosening the validation logic from limiting allocations
> to (sysram_nodes) to allowing all nodes present in (mems_allowed).
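If I'm reading the series right, the allocation-time check then ends up
looking roughly like the sketch below (the helper name and plumbing are
made up; GFP_SPM_NODE and sysram_nodes are the names from this set):

#include <linux/gfp.h>
#include <linux/nodemask.h>

/*
 * Rough sketch, not the series' actual code: userspace-reachable
 * allocations validate against the narrower sysram mask, while
 * GFP_SPM_NODE widens the check to everything in mems_allowed.
 */
static bool spm_node_allowed(int node, gfp_t gfp_mask,
			     const nodemask_t *mems_allowed,
			     const nodemask_t *sysram_nodes)
{
	if (gfp_mask & GFP_SPM_NODE)
		return node_isset(node, *mems_allowed);

	return node_isset(node, *sysram_nodes);
}
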
>
> > While I am not an expert in NUMA, it appears that the approach with
> > default and opt-in NUMA nodes could be generally useful. Like,
> > introduce a system-wide default NUMA nodemask that is a subset of all
> > possible nodes.
>
> This patch set does that (cpuset.sysram_nodes and mt_sysram_nodemask)
>
> > This way, users can request the "special" nodes by using
> > a wider mask than the default.
> >
>
> I describe in the response to David that this is possible, but creates
> extreme tripping hazards for a large swath of existing software.
>
> snippet
> '''
> Simple answer: We can choose how hard this guardrail is to break.
>
> This initial attempt makes it "Hard":
> You cannot "accidentally" allocate SPM, the call must be explicit.
>
> Removing the GFP would work, and make it "Easier" to access SPM memory.
>
> This would allow a trivial
>
> mbind(range, SPM_NODE_ID)
>
> Which is great, but is also an incredible tripping hazard:
>
> numactl --interleave --all
>
> and in kernel land:
>
> __alloc_pages_noprof(..., nodes[N_MEMORY])
>
> These will now instantly be subject to SPM node memory.
> '''
>
> There are many places that use these patterns already.
>
> But at the end of the day, it is preference: we can choose to do that.
>
> > cpusets should allow setting both default and possible masks in a
> > hierarchical manner, where a child's default/possible mask cannot be
> > wider than the parent's possible mask and its default is not wider
> > than its own possible mask.
> >
>
> This patch set implements exactly what you describe:
> sysram_nodes = default
> mems_allowed = possible
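And the hierarchical constraint Kiryl describes would then reduce to a
couple of subset checks, something like this (assumed semantics, helper
name made up; nodes_subset() is the existing nodemask primitive):

#include <linux/nodemask.h>

/*
 * Sketch of the invariant: a child's possible mask must fit within
 * the parent's possible mask, and each cpuset's default mask must
 * fit within its own possible mask.
 */
static bool spm_masks_valid(nodemask_t parent_possible,
			    nodemask_t possible, nodemask_t dfl)
{
	return nodes_subset(possible, parent_possible) &&
	       nodes_subset(dfl, possible);
}
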
>
> > > Userspace-driven allocations are restricted by the sysram_nodes mask;
> > > nothing in userspace can explicitly request memory from SPM nodes.
> > >
> > > Instead, the intent is to create new components which understand memory
> > > features and register those nodes with those components. This abstracts
> > > the hardware complexity away from userland while also not requiring new
> > > memory innovations to carry entirely new allocators.
> >
> > I don't see how this is a positive. It seems to be a negative side
> > effect of GFP being a leaky abstraction.
> >
>
> It's a matter of applying an isolation mechanism and then punching an
> explicit hole in it. As it is right now, GFP is "leaky" in that there
> are, basically, no walls. Reclaim even ignored cpuset controls until
> recently, and the page_alloc code even says to ignore cpuset when
> in an interrupt context.
>
> The core of the proposal here is to provide a strong isolation mechanism
> and then allow punching explicit holes in it. The GFP flag is one
> pattern, I'm open to others.
>
> ~Gregory