Message-ID: <3682b9cf-213c-497d-ab81-f70e1a785716@sk.com>
Date: Fri, 20 Dec 2024 17:25:28 +0900
From: Hyeonggon Yoo <hyeonggon.yoo@...com>
To: Joshua Hahn <joshua.hahnjy@...il.com>,
"gourry@...rry.net" <gourry@...rry.net>
Cc: kernel_team@...ynix.com, 42.hyeyoo@...il.com,
"rafael@...nel.org" <rafael@...nel.org>, "lenb@...nel.org"
<lenb@...nel.org>, "gregkh@...uxfoundation.org"
<gregkh@...uxfoundation.org>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
Honggyu Kim <honggyu.kim@...com>,
"ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
Rakie Kim <rakie.kim@...com>,
"dan.j.williams@...el.com" <dan.j.williams@...el.com>,
"Jonathan.Cameron@...wei.com" <Jonathan.Cameron@...wei.com>,
"dave.jiang@...el.com" <dave.jiang@...el.com>,
"horen.chuang@...ux.dev" <horen.chuang@...ux.dev>,
"hannes@...xchg.org" <hannes@...xchg.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"kernel-team@...a.com" <kernel-team@...a.com>
Subject: Re: [External Mail] [RFC PATCH v2] Weighted interleave auto-tuning
On 2024-12-20 4:18 AM, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across NUMA nodes according to
> user-set ratios.
>
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
>
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the pressure present in the other two nodes).
>
> This patch introduces an auto-configuration for the interleave weights
> that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> This balance is controlled by a value "weightiness", which defines the
> interleaving aggression. Higher values lead to less interleaving
> (255:1), while lower values lead to more interleaving (1:1).
>
> Large weightiness values generally lead to increased weight-bandwidth
> proportionality, but can lead to underutilized nodes (think worst-case
> scenario, which is 1:max_node_weight). Lower weightiness reduces the
> effects of underutilized nodes, but may lead to improperly loaded
> distributions.
s/max_node_weight/weightiness/
> This knob is exposed as a sysfs interface with a default value of 32.
> Weights are re-calculated once at boottime and then every time the knob
> is changed by the user, or when the ACPI table is updated.
>
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@...il.com>
> Signed-off-by: Gregory Price <gourry@...rry.net>
> Co-Developed-by: Gregory Price <gourry@...rry.net>
>
> ---
> Changelog
>
> v2:
> - Name of the interface is changed from v1: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
> interleave weight table is initialized with the defaults, if bandwidth
> information is available.
> - In addition, all sections that handle iw_table have been changed
> to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
> support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
> behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
>
> ...fs-kernel-mm-mempolicy-weighted-interleave | 36 ++++
> drivers/acpi/numa/hmat.c | 1 +
> drivers/base/node.c | 7 +
> include/linux/mempolicy.h | 4 +
> mm/mempolicy.c | 183 +++++++++++++++---
> 5 files changed, 209 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..edb2c1f4753f 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -23,3 +23,39 @@ Description: Weight configuration interface for nodeN
> Writing an empty string or `0` will reset the weight to the
> system default. The system default may be set by the kernel
> or drivers at boot or during hotplug events.
> +
> +What: /sys/kernel/mm/mempolicy/weighted_interleave/weightiness
> +Date: December 2024
> +Contact: Linux memory management mailing list <linux-mm@...ck.org>
> +Description: Weight limiting / scaling interface
> +
> + "Weightiness": a measure of interleave aggression between
> + memory nodes. Higher values lead to less interleaving (255:1),
> + while lower values lead to more interleaving (1:1).
It might be better to explain what low and high values of
weightiness imply, the way you described it in the changelog?
> + When this value is updated, all node weights are re-calculated
> + to reflect the new weightiness. These re-calculated values
> + overwrite all existing node weights, including those manually
> + set by writing to the nodeN files.
> +
> + Node weight re-calculation is performed by scaling down
> + bandwidth values reported in the ACPI HMAT to the range
> + [1, weightiness]. Note that re-calculation uses only the
> + weightiness parameter and bandwidth values, and ignores all
> + current node weights.
> +
> + Minimum weight: 1
> + Default value: 32
> + Maximum weight: 255
> +
> + Writing an empty string will set the value to be the default
> + (32). Writing a value outside the valid range will return
> + EINVAL and will not re-trigger a weight scaling.
> +
> + If there is no bandwidth data in the ACPI HMAT, then this file
> + will return ENODEV on an attempted write and perform no updates.
> + Furthermore, if there is no bandwidth information available,
> + all nodes' weights will default to 1.
> +
> + Setting max_node_weight to 1 is equivalent to unweighted
> + interleave.
s/max_node_weight/weightiness/
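
To make the documented behavior concrete, here is a minimal userspace
sketch of scaling bandwidths into [1, weightiness]. It is my own
illustration, not the patch's reduce_interleave_weights(); the names
scale_weights() and gcd_u64() are hypothetical, and it assumes that
bw[i] * weightiness does not overflow 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor (Euclid); gcd_u64(0, x) returns x. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper: map each node's bandwidth into
 * [1, weightiness] relative to the fastest node, then divide out
 * any factor common to all weights.
 */
static void scale_weights(const uint64_t *bw, uint8_t *iw, size_t n,
			  uint8_t weightiness)
{
	uint64_t max_bw = 0, g = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (bw[i] > max_bw)
			max_bw = bw[i];

	for (i = 0; i < n; i++) {
		/* No bandwidth data at all: fall back to weight 1. */
		uint64_t w = max_bw ? bw[i] * weightiness / max_bw : 1;

		if (w == 0)
			w = 1;	/* every node keeps at least weight 1 */
		iw[i] = (uint8_t)w;
		g = gcd_u64(g, w);
	}

	for (i = 0; i < n; i++)
		iw[i] /= (uint8_t)g;
}
```

With the 7639:1345:7 example from the changelog and weightiness 32,
this sketch yields 32:5:1; with weightiness 1 it yields 1:1:1, i.e.
unweighted interleave.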
> @@ -3397,6 +3471,54 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>
> static struct iw_node_attr **node_attrs;
>
> +static ssize_t weightiness_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return sysfs_emit(buf, "%d\n", weightiness);
> +}
> +
> +static ssize_t weightiness_store(struct kobject *kobj,
> + struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> + uint64_t *bw;
> + u8 *old_iw, *new_iw;
> + u8 new_weightiness;
> +
> + if (count == 0 || sysfs_streq(buf, ""))
> + new_weightiness = 32;
> + else if (kstrtou8(buf, 0, &new_weightiness) || new_weightiness == 0)
> + return -EINVAL;
> +
> + new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
> + if (!new_iw)
> + return -ENOMEM;
Could you please use kcalloc here, as in mempolicy_set_node_perf()?
Otherwise the patch looks fine to me. (I will review and test the
next revision.)

By the way, this might be out of scope, but let me ask for my own
learning.

We have a two-socket server where each socket has local DRAM and CXL
memory attached (four NUMA nodes in total). When accessing a remote
socket's memory (DRAM or CXL), the bandwidth is limited by the
interconnect's bandwidth.

On this server, weighted interleaving should ideally be configured
within a socket (e.g. local DRAM node + local CXL node), because
weighted interleaving does not account for the reduced bandwidth seen
when a node is accessed from a remote socket.

So, the question is: on systems with multiple sockets (and CXL memory
attached to each socket), do you always assume the admin must bind
tasks to a specific socket for optimal performance, or is there any
plan to mitigate this problem without binding tasks to a socket?
> +
> + mutex_lock(&iw_table_lock);
> + bw = node_bw_table;
> +
> + if (!bw) {
> + mutex_unlock(&iw_table_lock);
> + kfree(new_iw);
> + return -ENODEV;
> + }
> +
> + weightiness = new_weightiness;
> + old_iw = rcu_dereference_protected(iw_table,
> + lockdep_is_held(&iw_table_lock));
> +
> + reduce_interleave_weights(bw, new_iw);
> + rcu_assign_pointer(iw_table, new_iw);
> + mutex_unlock(&iw_table_lock);
> +
> + synchronize_rcu();
> + kfree(old_iw);
> +
> + return count;
> +}
> +
> +static struct kobj_attribute wi_attr =
> + __ATTR(weightiness, 0664, weightiness_show, weightiness_store);
> +
> static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
> struct kobject *parent)
> {
> @@ -3413,6 +3535,7 @@ static void sysfs_wi_release(struct kobject *wi_kobj)
>
> for (i = 0; i < nr_node_ids; i++)
> sysfs_wi_node_release(node_attrs[i], wi_kobj);
> +
> kobject_put(wi_kobj);
> }
>
> @@ -3454,6 +3577,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
> return 0;
> }
>
> +static struct attribute *wi_default_attrs[] = {
> + &wi_attr.attr,
> + NULL
> +};
> +
> +static const struct attribute_group wi_attr_group = {
> + .attrs = wi_default_attrs,
> +};
> +
> static int add_weighted_interleave_group(struct kobject *root_kobj)
> {
> struct kobject *wi_kobj;
> @@ -3470,6 +3602,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
> return err;
> }
>
> + err = sysfs_create_group(wi_kobj, &wi_attr_group);
> + if (err) {
> + pr_err("failed to add sysfs [weightiness]\n");
> + kobject_put(wi_kobj);
> + return err;
> + }
> +
> for_each_node_state(nid, N_POSSIBLE) {
> err = add_weight_node(nid, wi_kobj);
> if (err) {