Message-ID: <aB2Xh4jEqpSTuvsi@gourry-fedora-PF4VCD3F>
Date: Fri, 9 May 2025 01:49:59 -0400
From: Gregory Price <gourry@...rry.net>
To: Rakie Kim <rakie.kim@...com>
Cc: joshua.hahnjy@...il.com, akpm@...ux-foundation.org, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, linux-cxl@...r.kernel.org,
	dan.j.williams@...el.com, ying.huang@...ux.alibaba.com,
	kernel_team@...ynix.com, honggyu.kim@...com, yunjeong.mun@...com
Subject: Re: [RFC] Add per-socket weight support for multi-socket systems in
 weighted interleave

On Fri, May 09, 2025 at 11:30:26AM +0900, Rakie Kim wrote:
> 
> Scenario 1: Adapt weighting based on the task's execution node
> A task prefers only the DRAM and locally attached CXL memory of the
> socket on which it is running, in order to avoid cross-socket access and
> optimize bandwidth.
> - A task running on CPU0 (node0) would prefer DRAM0 (w=3) and CXL0 (w=1)
> - A task running on CPU1 (node1) would prefer DRAM1 (w=3) and CXL1 (w=1)
... snip ...
> 
> However, Scenario 1 does not depend on such information. Rather, it is
> a locality-preserving optimization where we isolate memory access to
> each socket's DRAM and CXL nodes. I believe this use case is implementable
> today and worth considering independently from interconnect performance
> awareness.
> 

There's nothing to implement - all the controls exist:

1) --cpunodebind=0
2) --weighted-interleave=0,2
3) cpuset.mems
4) cpuset.cpus
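
For example, Scenario 1 is reachable today with something like the
following (node numbering here is purely illustrative - assume node0/node2
are socket 0's DRAM/CXL and node1/node3 are socket 1's):

  numactl --cpunodebind=0 --weighted-interleave=0,2 <workload>
  numactl --cpunodebind=1 --weighted-interleave=1,3 <workload>

or the equivalent via cpuset.cpus / cpuset.mems in a cgroup.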

You might consider something like "--local-tier" (akin to
--localalloc) that sets an explicit fallback set based on the local
node.  You'd end up doing something like

current_nid = memtier_next_local_node(socket_nid, current_nid)

Where this interface returns the preferred fallback ordering but doesn't
allow cross-socket fallback.
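
Roughly, as a sketch (memtier_next_local_node() is hypothetical, and the
loop below is only meant to show the intended semantics):

	/* Walk socket-local tiers in fallback order; 'allowed' is the
	 * policy's nodemask.  Never fall back to another socket's nodes. */
	int nid = NUMA_NO_NODE;

	while ((nid = memtier_next_local_node(socket_nid, nid)) != NUMA_NO_NODE) {
		if (node_isset(nid, allowed))
			return nid;	/* first allowed node in local order */
	}
	return NUMA_NO_NODE;		/* local tiers exhausted */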

That might be useful, I suppose, in letting a user do:

--cpunodebind=0 --weighted-interleave --local-tier

without having to know anything about the local memory tier structure.

> > At the same time we were discussing this, we were also discussing how to
> > do external task-mempolicy modifications - which seemed significantly
> > more useful, but ultimately more complex and without sufficient
> > interested parties / users.
> 
> I'd like to learn more about that thread. If you happen to have a pointer
> to that discussion, it would be really helpful.
> 

https://lore.kernel.org/all/20231122211200.31620-1-gregory.price@memverge.com/
https://lore.kernel.org/all/ZV5zGROLefrsEcHJ@r13-u19.micron.com/
https://lore.kernel.org/linux-mm/ZWYsth2CtC4Ilvoz@memverge.com/
https://lore.kernel.org/linux-mm/20221010094842.4123037-1-hezhongkun.hzk@bytedance.com/
There are locking issues with these that aren't easy to fix.

I think the ByteDance method uses task_work queueing to defer a
mempolicy update to the task itself the next time it makes a kernel/user
transition.  That's probably the best overall approach I've seen.
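
From memory, the shape of it is something like the below (a sketch only,
not the actual ByteDance patch - the struct and function names are made
up, and real code would need more care around in-flight policy users):

	/* Hand the target task a task_work item so the mempolicy swap runs
	 * in that task's own context on its next return to userspace - no
	 * cross-task locking against task->mempolicy is needed. */
	struct mpol_update {
		struct callback_head	twork;
		struct mempolicy	*new;
	};

	static void mpol_update_fn(struct callback_head *head)
	{
		struct mpol_update *up = container_of(head, struct mpol_update, twork);
		struct mempolicy *old = current->mempolicy;

		/* Runs as 'current', so this is the normal single-task path. */
		current->mempolicy = up->new;
		mpol_put(old);
		kfree(up);
	}

	static int queue_mpol_update(struct task_struct *task, struct mempolicy *new)
	{
		struct mpol_update *up = kzalloc(sizeof(*up), GFP_KERNEL);

		if (!up)
			return -ENOMEM;
		up->new = new;
		init_task_work(&up->twork, mpol_update_fn);
		return task_work_add(task, &up->twork, TWA_RESUME);
	}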

https://lore.kernel.org/linux-mm/ZWezcQk+BYEq%2FWiI@memverge.com/
More notes gathered prior to implementing weighted interleave.

~Gregory
