Message-Id: <20231031003810.4532-1-gregory.price@memverge.com>
Date: Mon, 30 Oct 2023 20:38:06 -0400
From: Gregory Price <gourry.memverge@...il.com>
To: linux-kernel@...r.kernel.org
Cc: linux-cxl@...r.kernel.org, linux-mm@...ck.org,
ying.huang@...el.com, akpm@...ux-foundation.org,
aneesh.kumar@...ux.ibm.com, weixugc@...gle.com, apopple@...dia.com,
hannes@...xchg.org, tim.c.chen@...el.com, dave.hansen@...el.com,
mhocko@...nel.org, shy828301@...il.com, gregkh@...uxfoundation.org,
rafael@...nel.org, Gregory Price <gregory.price@...verge.com>
Subject: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
This patchset implements weighted interleave and adds a new sysfs
entry: /sys/devices/system/node/nodeN/accessM/il_weight.
The il_weight of a node is used by mempolicy to implement weighted
interleave when `numactl --interleave=...` is invoked. By default
il_weight for a node is always 1, which preserves the default round
robin interleave behavior.
Interleave weights may be set from 0-100, and denote the number of
pages that should be allocated from the node when interleaving
occurs.
For example, if a node's interleave weight is set to 5, 5 pages
will be allocated from that node before the next node is scheduled
for allocations.
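
The allocation order described above can be modeled in a few lines. This is
only an illustrative sketch, not the kernel implementation: the function name
`weighted_interleave` and the three-node setup are hypothetical, and weights
are assumed to be at least 1.

```python
from itertools import islice


def weighted_interleave(nodes, weights):
    """Yield the target node for each successive page allocation.

    Illustrative model only (not the kernel code): nodes are visited in
    order, and each node receives weights[node] consecutive pages before
    the next node is scheduled. Assumes every weight is >= 1.
    """
    while True:
        for node in nodes:
            for _ in range(weights[node]):
                yield node


# Hypothetical setup: node 0 has weight 5, nodes 1 and 2 keep the default of 1.
first_pages = list(islice(weighted_interleave([0, 1, 2], {0: 5, 1: 1, 2: 1}), 9))
print(first_pages)  # [0, 0, 0, 0, 0, 1, 2, 0, 0]
```

With every weight left at 1 the same generator degenerates to plain round
robin, which matches the default behavior described above.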
Additionally, "node accessors" (synonymous with CPU nodes) are used
to allow for accessor-relative weighting. The "accessor" for a task
is defined as the node the task is presently running on.
# Set node weight for node0 accessed by tasks on node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight
# Set node weight for node0 accessed by tasks on node1 to 3
echo 3 > /sys/devices/system/node/node0/access1/il_weight
In this way it becomes possible to set an interleaving strategy
that fits the available bandwidth for the devices available on
the system. An example system:
Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
In this setup, the effective weights for nodes 0-3 for a task
running on Node 0 may be [60, 20, 10, 10].
This spreads memory out across devices which all have different
latency and bandwidth attributes in a way that can maximize the
available resources.
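
One possible starting point for choosing weights is to make them proportional
to the bandwidth each node offers to the accessor. The sketch below is an
assumption for illustration, not part of this patchset: `bandwidth_weights` is
a hypothetical helper, and the bandwidth of Node 3 as seen from Node 0 is
assumed to be the listed 64GB/s.

```python
def bandwidth_weights(bandwidth_gbps, scale=100):
    """Return per-node weights proportional to bandwidth (hypothetical heuristic).

    bandwidth_gbps maps node id -> bandwidth seen by one accessor node.
    Each weight is clamped to a minimum of 1 so that no node is starved.
    """
    total = sum(bandwidth_gbps.values())
    return {
        node: max(1, round(scale * bw / total))
        for node, bw in bandwidth_gbps.items()
    }


# Bandwidth as seen by a task on Node 0 in the example system:
# local DRAM 400, cross-socket DRAM 200, CXL 64 each (Node 3 value assumed).
seen_from_node0 = {0: 400, 1: 200, 2: 64, 3: 64}
weights = bandwidth_weights(seen_from_node0)
print(weights)  # {0: 55, 1: 27, 2: 9, 3: 9}
```

A purely proportional split lands near, but not exactly on, the
[60, 20, 10, 10] example; weights remain a tuning knob for the administrator
rather than a derived value.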
~Gregory
(sorry for the repeat send, automation failure)
================================================================
Version Notes:
v3: move weights into node rather than memtiers
some additional fixes to node.c to support this
v1/v2: add weighted-interleave support to mempolicy
= v3 notes
This update effectively removes the connection between mempolicy
and memory-tiers by simply placing the interleave weights directly
in the node accessor information structure.
Node was recommended by Huang, Ying
Accessor was recommended by Ravi Shankar
== Move weights into node
Originally this work was done by placing weights in the memory tier.
In this patch set we changed the weights to live in the NUMA node
accessor structure, which allows for a more natural weighting scheme
and also supports source-node relative weighting.
Interleave weight is located in:
/sys/devices/system/node/nodeN/accessM/il_weight
and is set with a value between 1 and 100:
# Set node weight for node0 accessed by node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight
By default, il_weight is always set to 1, which mimics the default
interleave behavior (simple round-robin).
== Other Node fixes
Two other updates to node.c were required to support this:
1) The access list must be initialized prior to the node struct
pointer being registered in the node array
2) The accessors in the list must be registered regardless of
whether HMAT/HMEM information is reported. Presently this
results in 0-value information being present in the various
access subgroups.
== Weighted interleave
mm/mempolicy: modify interleave mempolicy to use node weights
The node subsystem implements interleave weighting for the purpose
of bandwidth optimization. Each node may have different weights in
relation to each compute node ("access node").
The mempolicy MPOL_INTERLEAVE utilizes the node weights to implement
weighted interleave. By default, since all nodes default to a weight
of 1, the original interleave behavior is retained.
Examples
Weight settings:
echo 4 > node0/access0/il_weight
echo 3 > node1/access0/il_weight
echo 2 > node1/access1/il_weight
echo 1 > node0/access1/il_weight
Results:
Task A:
  cpunode:           0
  nodemask:          [0,1]
  weights:           [4,3]
  allocation result: [0,0,0,0,1,1,1 repeat]
Task B:
  cpunode:           1
  nodemask:          [0,1]
  weights:           [1,2]
  allocation result: [0,1,1 repeat]
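
The accessor-relative lookup in the example above can be reproduced with a
small model. This is a sketch for illustration only, not the mempolicy code:
the `il_weight` table and `allocation_cycle` helper are hypothetical names
that mirror the nodeN/accessM/il_weight layout.

```python
# il_weight[node][accessor], mirroring nodeN/accessM/il_weight.
il_weight = {
    0: {0: 4, 1: 1},  # node0/access0 = 4, node0/access1 = 1
    1: {0: 3, 1: 2},  # node1/access0 = 3, node1/access1 = 2
}


def allocation_cycle(cpunode, nodemask):
    """Return one full interleave cycle for a task running on cpunode.

    For each node in the nodemask, the weight is looked up relative to the
    node the task is running on (its accessor), and that many consecutive
    pages are placed on the node.
    """
    cycle = []
    for node in nodemask:
        cycle.extend([node] * il_weight[node][cpunode])
    return cycle


task_a = allocation_cycle(cpunode=0, nodemask=[0, 1])
task_b = allocation_cycle(cpunode=1, nodemask=[0, 1])
print(task_a)  # [0, 0, 0, 0, 1, 1, 1]
print(task_b)  # [0, 1, 1]
```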
= original RFCs
Memory-tier based weights
By: Ravi Shankar
https://lore.kernel.org/all/20230927095002.10245-1-ravis.opensrc@micron.com/
Mempolicy multi-node weighting w/ set_mempolicy2:
By: Gregory Price
https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@memverge.com/
N:M weighting in mempolicy
By: Hasan Al Maruf
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/
Ying Huang's presentation at LPC 2022, 16th slide in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
Gregory Price (4):
base/node.c: initialize the accessor list before registering
node: add accessors to sysfs when nodes are created
node: add interleave weights to node accessor
mm/mempolicy: modify interleave mempolicy to use node weights
drivers/base/node.c | 120 ++++++++++++++++++++++++++++++++-
include/linux/mempolicy.h | 4 ++
include/linux/node.h | 17 +++++
mm/mempolicy.c | 138 +++++++++++++++++++++++++++++---------
4 files changed, 246 insertions(+), 33 deletions(-)
--
2.39.1