Message-ID: <20250512082257.263-1-rakie.kim@sk.com>
Date: Mon, 12 May 2025 17:22:50 +0900
From: Rakie Kim <rakie.kim@...com>
To: Gregory Price <gourry@...rry.net>
Cc: joshua.hahnjy@...il.com,
akpm@...ux-foundation.org,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
linux-cxl@...r.kernel.org,
dan.j.williams@...el.com,
ying.huang@...ux.alibaba.com,
kernel_team@...ynix.com,
honggyu.kim@...com,
yunjeong.mun@...com,
Rakie Kim <rakie.kim@...com>
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in weighted interleave

On Fri, 9 May 2025 01:49:59 -0400 Gregory Price <gourry@...rry.net> wrote:
> On Fri, May 09, 2025 at 11:30:26AM +0900, Rakie Kim wrote:
> >
> > Scenario 1: Adapt weighting based on the task's execution node
> > A task prefers only the DRAM and locally attached CXL memory of the
> > socket on which it is running, in order to avoid cross-socket access and
> > optimize bandwidth.
> > - A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
> > - A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
> ... snip ...
> >
> > However, Scenario 1 does not depend on such information. Rather, it is
> > a locality-preserving optimization where we isolate memory access to
> > each socket's DRAM and CXL nodes. I believe this use case is implementable
> > today and worth considering independently from interconnect performance
> > awareness.
> >
>
> There's nothing to implement - all the controls exist:
>
> 1) --cpunodebind=0
> 2) --weighted-interleave=0,2
> 3) cpuset.mems
> 4) cpuset.cpus

Thank you again for your thoughtful response and the detailed suggestions.
As you pointed out, it is indeed possible to construct node-local memory
allocation behaviors using the existing interfaces such as --cpunodebind,
--weighted-interleave, cpuset.mems, and cpuset.cpus. I appreciate you
highlighting that path.

However, what I am proposing in Scenario 1 (Adapt weighting based on the
task's execution node) is slightly different in intent.

The idea is to allow tasks to dynamically prefer the DRAM and CXL nodes
attached to the socket on which they are executing, without requiring a
fixed execution node or manual nodemask configuration. For instance, if
a task is running on node0, it would prefer node0 and node2; if running
on node1, it would prefer node1 and node3.

This differs from the current model, which relies on statically binding
both the CPU and memory nodes. My proposal aims to express this behavior
as a policy-level abstraction that dynamically adapts based on execution
locality.

So rather than being a combination of manual configuration and execution
constraints, the intent is to incorporate locality-awareness into the
memory policy itself.
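
To make this concrete, here is a rough, purely illustrative sketch of how
the allocation-time node selection could work if the policy itself were
locality-aware. socket_local_nodes() is a hypothetical helper (it would
return the DRAM and CXL nodes attached to a given socket); everything
else only uses existing nodemask helpers:

static nodemask_t wi_effective_nodes(struct mempolicy *pol)
{
	/* Node of the CPU the task is executing on right now */
	int exec_nid = numa_node_id();
	/* Hypothetical: DRAM + CXL nodes attached to this socket */
	nodemask_t local = socket_local_nodes(exec_nid);
	nodemask_t effective;

	/* Interleave only over the policy nodes that are socket-local */
	nodes_and(effective, pol->nodes, local);

	/* If nothing is local, fall back to the full policy nodemask */
	if (nodes_empty(effective))
		effective = pol->nodes;

	return effective;
}

With node0/node2 attached to socket 0 and node1/node3 to socket 1, the
same policy would interleave over {0,2} while the task runs on node0 and
over {1,3} while it runs on node1, without any per-task nodemask
rebinding.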
>
> You might consider maybe something like "--local-tier" (akin to
> --localalloc) that sets an explicitly fallback set based on the local
> node. You'd end up doing something like
>
> current_nid = memtier_next_local_node(socket_nid, current_nid)
>
> Where this interface returns the preferred fallback ordering but doesn't
> allow cross-socket fallback.
>
> That might be useful, i suppose, in letting a user do:
>
> --cpunodebind=0 --weighted-interleave --local-tier
>
> without having to know anything about the local memory tier structure.

That said, I believe your suggestion for a "--local-tier" option is a
very good one. It could provide a concise, user-friendly way to activate
such locality-aware fallback behavior, even if the underlying mechanism
requires some policy extension.

In this regard, I fully agree that such an interface could greatly help
users express their intent without requiring them to understand the
details of the memory tier topology.
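
For reference, my reading of the memtier_next_local_node() helper you
sketched is roughly the following; this is only my assumption, and
socket_local_nodes() is again a hypothetical helper:

int memtier_next_local_node(int socket_nid, int current_nid)
{
	/* Hypothetical: DRAM + CXL nodes attached to socket_nid */
	nodemask_t local = socket_local_nodes(socket_nid);
	/*
	 * Next local node after current_nid; a real implementation
	 * would walk the memory tiers in preference order, while
	 * node-number order is used here only for illustration.
	 */
	int nid = next_node(current_nid, local);

	return nid < MAX_NUMNODES ? nid : NUMA_NO_NODE;
}

If that matches your intent, "--local-tier" could simply stop falling
back once this returns NUMA_NO_NODE, which would keep allocations from
crossing the socket boundary as you described.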
>
> > > At the same time we were discussing this, we were also discussing how to
> > > do external task-mempolicy modifications - which seemed significantly
> > > more useful, but ultimately more complex and without sufficient
> > > interested parties / users.
> >
> > I'd like to learn more about that thread. If you happen to have a pointer
> > to that discussion, it would be really helpful.
> >
>
> https://lore.kernel.org/all/20231122211200.31620-1-gregory.price@memverge.com/
> https://lore.kernel.org/all/ZV5zGROLefrsEcHJ@r13-u19.micron.com/
> https://lore.kernel.org/linux-mm/ZWYsth2CtC4Ilvoz@memverge.com/
> https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@bytedance.com/
> There are locking issues with these that aren't easy to fix.
>
> I think the bytedance method uses a task_work queueing to defer a
> mempolicy update to the task itself the next time it makes a kernel/user
> transition. That's probably the best overall approach i've seen.
>
> https://lore.kernel.org/linux-mm/ZWezcQk+BYEq%2FWiI@memverge.com/
> More notes gathered prior to implementing weighted interleave.

Thank you for sharing the earlier links to related discussions and
patches. They were very helpful, and I will review them carefully to
gather more ideas and refine my thoughts further.

I look forward to any further feedback you may have on this topic.

Best regards,
Rakie
>
> ~Gregory
>