Message-ID: <36edd166-7e11-4d43-9839-42467d4399d1@nvidia.com>
Date: Wed, 3 Dec 2025 15:36:33 +1100
From: Balbir Singh <balbirs@...dia.com>
To: Gregory Price <gourry@...rry.net>
Cc: linux-mm@...ck.org, kernel-team@...a.com, linux-cxl@...r.kernel.org,
 linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev,
 linux-fsdevel@...r.kernel.org, cgroups@...r.kernel.org, dave@...olabs.net,
 jonathan.cameron@...wei.com, dave.jiang@...el.com,
 alison.schofield@...el.com, vishal.l.verma@...el.com, ira.weiny@...el.com,
 dan.j.williams@...el.com, longman@...hat.com, akpm@...ux-foundation.org,
 david@...hat.com, lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com,
 vbabka@...e.cz, rppt@...nel.org, surenb@...gle.com, mhocko@...e.com,
 osalvador@...e.de, ziy@...dia.com, matthew.brost@...el.com,
 joshua.hahnjy@...il.com, rakie.kim@...com, byungchul@...com,
 ying.huang@...ux.alibaba.com, apopple@...dia.com, mingo@...hat.com,
 peterz@...radead.org, juri.lelli@...hat.com, vincent.guittot@...aro.org,
 dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
 mgorman@...e.de, vschneid@...hat.com, tj@...nel.org, hannes@...xchg.org,
 mkoutny@...e.com, kees@...nel.org, muchun.song@...ux.dev,
 roman.gushchin@...ux.dev, shakeel.butt@...ux.dev, rientjes@...gle.com,
 jackmanb@...gle.com, cl@...two.org, harry.yoo@...cle.com,
 axelrasmussen@...gle.com, yuanchu@...gle.com, weixugc@...gle.com,
 zhengqi.arch@...edance.com, yosry.ahmed@...ux.dev, nphamcs@...il.com,
 chengming.zhou@...ux.dev, fabio.m.de.francesco@...ux.intel.com,
 rrichter@....com, ming.li@...omail.com, usamaarif642@...il.com,
 brauner@...nel.org, oleg@...hat.com, namcao@...utronix.de,
 escape@...ux.alibaba.com, dongjoo.seo1@...sung.com
Subject: Re: [RFC LPC2026 PATCH v2 00/11] Specific Purpose Memory NUMA Nodes

On 11/26/25 19:29, Gregory Price wrote:
> On Wed, Nov 26, 2025 at 02:23:23PM +1100, Balbir Singh wrote:
>> On 11/13/25 06:29, Gregory Price wrote:
>>> This is a code RFC for discussion related to
>>>
>>> "Mempolicy is dead, long live memory policy!"
>>> https://lpc.events/event/19/contributions/2143/
>>>
>>
>> :)
>>
>> I am trying to read through your series, but in the past I tried
>> https://lwn.net/Articles/720380/
>>
> 
> This is very interesting, I gave the whole RFC a read and it seems you
> were working from the same conclusion ~8 years ago - that NUMA just
> plainly "Feels like the correct abstraction".
> 
> First, thank you, the read-through here filled in some holes regarding
> HMM-CDM for me.  If you have developed any other recent opinions on the
> use of HMM-CDM vs NUMA-CDM, your experience is most welcome.
> 

Sorry for the delay in responding. I have not yet read through your series.

> 
> Some observations:
> 
> 1) You implemented what amounts to N_SPM_NODES 
> 
>    - I find it funny we separately came to the same conclusion. I had
>      not seen your series while researching this, that should be an
>      instructive history lesson for readers.
> 
>    - N_SPM_NODES probably dictates some kind of input from ACPI table
>      extension, drivers input (like my MHP flag), or kernel configs
>      (build/init) to make sense.
> 
>    - I discussed in my note to David that this is probably the right
>      way to go about doing it. I think N_MEMORY can still be set, if
>      a new global-default-node policy is created.
> 

I still think N_MEMORY as a flag should mean something different from
N_SPM_NODE_MEMORY, because their characteristics are different.

>    - cpuset/global sysram_nodes masks in this set are that policy.
> 
> 
> 2) You bring up the concept of NUMA node attributes
> 
>    - I have privately discussed this concept with MM folks, but had
>      not come around to formalize this.  It seems a natural extension.
> 
>    - I wasn't sure whether such a thing would end up in memory-tiers.c
>      or somehow abstracted otherwise.  We definitely do not want node
>      attributes to imply infinite N_XXXXX masks.

I have to think about this some more.

> 
> 
> 3) You attacked the problem from the zone iteration mechanism as the
>    primary allocation filter - while I used cpusets and basically
>    implemented a new in-kernel policy (sysram_nodes)
> 
>    - I chose not to take that route (omitting these nodes from N_MEMORY)
>      precisely because it would require making changes all over the
>      kernel for components that may want to use the memory which
>      leverage N_MEMORY for zone iteration.
> 
>    - Instead, I can see either per-component policies (reclaim->nodes)
>      or a global policy that covers all of those components (similar to
>      my sysram_nodes).  Drivers would then be responsible to register
>      their hotplugged memory nodes with those components accordingly.
> 

To me, node zonelists provide the right abstraction of where to allocate from
and how to fall back as needed. I'll read your patches to figure out how your
approach is different. I wanted the isolation at allocation time.
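
A toy userspace sketch of what "isolation at allocation time" via the
fallback list could look like. Everything here (toy_node, toy_build_zonelist,
toy_alloc_page, the allow_spm argument) is hypothetical and is not taken from
either series; the fallback order is simplified to ascending node id, whereas
real zonelists are ordered by distance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_NODES 4
#define TOY_NO_NODE   (-1)

/* Hypothetical per-node state; none of these names exist in the kernel. */
struct toy_node {
	bool spm;            /* specific purpose memory: isolated by default */
	unsigned free_pages; /* pages still available on this node */
};

/*
 * Build the fallback order for `preferred`: the preferred node first, then
 * the remaining nodes in ascending id order. SPM nodes are left out unless
 * the caller explicitly asked for them (allow_spm).
 */
static size_t toy_build_zonelist(const struct toy_node *nodes, int preferred,
				 bool allow_spm, int *out)
{
	size_t n = 0;

	assert(preferred >= 0 && preferred < TOY_MAX_NODES);

	if (allow_spm || !nodes[preferred].spm)
		out[n++] = preferred;

	for (int i = 0; i < TOY_MAX_NODES; i++) {
		if (i == preferred)
			continue;
		if (nodes[i].spm && !allow_spm)
			continue;
		out[n++] = i;
	}
	return n;
}

/* Walk the list and take one page from the first node that has one. */
static int toy_alloc_page(struct toy_node *nodes, int preferred,
			  bool allow_spm)
{
	int zl[TOY_MAX_NODES];
	size_t n = toy_build_zonelist(nodes, preferred, allow_spm, zl);

	for (size_t i = 0; i < n; i++) {
		if (nodes[zl[i]].free_pages > 0) {
			nodes[zl[i]].free_pages--;
			return zl[i];
		}
	}
	return TOY_NO_NODE;
}
```

In this model a default allocation never falls back onto an SPM node, even
when every ordinary node is exhausted; only a caller that opts in reaches it.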

>    - My mechanism requires a GFP flag to punch a hole in the isolation,
>      while yours depends on the fact that page_alloc uses N_MEMORY if
>      nodemask is not provided.  I can see an argument for going that
>      route instead of the sysram_nodes policy, but I also understand
>      why removing them from N_MEMORY causes issues (how do you opt these
>      nodes into core services like kswapd and such).
> 
>      Interesting discussions to be had.


Yes, we should look at the pros and cons. To be honest, I wouldn't be
opposed to having kswapd and reclaim look different for these nodes. It
would also mean that we'd need pagecache hooks if we want page cache on
these nodes. Everything else, including move_pages(), should just work.
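
To make the comparison concrete, here is a toy bitmask model of the two
opt-in mechanisms as described in the quoted text above. It is a sketch under
assumptions, not code from either series: toy_system, toy_allowed_a,
toy_allowed_b and TOY_GFP_SPM are hypothetical names, and the real allocator
paths are considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t toy_nodemask_t;

#define TOY_NODE(n)  ((toy_nodemask_t)1u << (n))
#define TOY_GFP_SPM  0x1u /* hypothetical flag, not a real GFP bit */

struct toy_system {
	toy_nodemask_t n_memory;     /* default "nodes with memory" mask */
	toy_nodemask_t sysram_nodes; /* nodes usable by default (scheme B) */
};

/* Test whether `node` is set in `mask`. */
static bool toy_node_allowed(toy_nodemask_t mask, unsigned node)
{
	assert(node < 32);
	return (mask & TOY_NODE(node)) != 0;
}

/*
 * Scheme A: SPM nodes are kept out of n_memory. A caller that passes no
 * nodemask gets n_memory and therefore never sees them; a caller that
 * passes an explicit nodemask gets exactly what it asked for.
 */
static toy_nodemask_t toy_allowed_a(const struct toy_system *s,
				    const toy_nodemask_t *nodemask)
{
	return nodemask ? *nodemask : s->n_memory;
}

/*
 * Scheme B: SPM nodes stay in n_memory, but every request is filtered by
 * sysram_nodes unless the caller sets the opt-in flag.
 */
static toy_nodemask_t toy_allowed_b(const struct toy_system *s,
				    const toy_nodemask_t *nodemask,
				    unsigned gfp)
{
	toy_nodemask_t req = nodemask ? *nodemask : s->n_memory;

	if (gfp & TOY_GFP_SPM)
		return req;
	return req & s->sysram_nodes;
}
```

With nodes 0 and 1 as ordinary RAM and node 2 as SPM, scheme A reaches node 2
through an explicit nodemask alone, while scheme B also needs the flag. The
cost of scheme A is the one raised above: anything that iterates n_memory,
such as reclaim, no longer sees node 2 unless it is taught to.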

> 
> 
> 4)   Many commenters tried pushing mempolicy as the place to do this.
>      We both independently came to the conclusion that 
> 
>    - mempolicy is at best an insufficient mechanism for isolation due
>      to the way the rest of the system is designed (cpusets, zones)
> 
>    - at worst, actually harmful because it leads kernel developers to
>      believe users view mempolicy APIs as reasonable. They don't.
>      In my experience it's viewed as:
>          - too complicated (SW doesn't want to know about HW)
>          - useless (it's not even respected by reclaim)
>          - actively harmful (it makes your code less portable)
>          - "The only thing we have"
> 
> Your RFC has the same concerns expressed that I have seen over past
> few years in Device-Memory development groups... except that the general
> consensus was (in 2017) that these devices were not commodity hardware
> the kernel needs a general abstraction (NUMA) to support.
> 
> "Push the complexity to userland" (mempolicy), and
> "Make the driver manage it." (hmm/zone_device)
> 
Yep

> Have been the prevailing opinions as a result.
> 
> From where I sit, this depends on the assumption that anyone using such
> systems is presumed to be sophisticated and empowered enough to accept
> that complexity.  This is just quite bluntly no longer the case.
> 
> GPUs, unified memory, and coherent interconnects have all become
> commodity hardware in the data center, and the "users" here are
> infrastructure-as-a-service folks that want these systems to be
> some definition of fungible.
> 

I also think the absence of better integration makes memory management harder.

Balbir
