linux-kernel - Re: [External Mail] [RFC PATCH v2] Weighted interleave auto-tuning

Message-ID: <3682b9cf-213c-497d-ab81-f70e1a785716@sk.com>
Date: Fri, 20 Dec 2024 17:25:28 +0900
From: Hyeonggon Yoo <hyeonggon.yoo@...com>
To: Joshua Hahn <joshua.hahnjy@...il.com>,
 "gourry@...rry.net" <gourry@...rry.net>
Cc: kernel_team@...ynix.com, 42.hyeyoo@...il.com,
 "rafael@...nel.org" <rafael@...nel.org>, "lenb@...nel.org"
 <lenb@...nel.org>, "gregkh@...uxfoundation.org"
 <gregkh@...uxfoundation.org>,
 "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
 Honggyu Kim <honggyu.kim@...com>,
 "ying.huang@...ux.alibaba.com" <ying.huang@...ux.alibaba.com>,
 Rakie Kim <rakie.kim@...com>,
 "dan.j.williams@...el.com" <dan.j.williams@...el.com>,
 "Jonathan.Cameron@...wei.com" <Jonathan.Cameron@...wei.com>,
 "dave.jiang@...el.com" <dave.jiang@...el.com>,
 "horen.chuang@...ux.dev" <horen.chuang@...ux.dev>,
 "hannes@...xchg.org" <hannes@...xchg.org>,
 "linux-mm@...ck.org" <linux-mm@...ck.org>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
 "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
 "kernel-team@...a.com" <kernel-team@...a.com>
Subject: Re: [External Mail] [RFC PATCH v2] Weighted interleave auto-tuning



On 2024-12-20 4:18 AM, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across NUMA nodes according to
> user-set ratios.
> 
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
> 
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the pressure present in the other two nodes).
> 
> This patch introduces an auto-configuration for the interleave weights
> that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> This balance is controlled by a value "weightiness", which defines the
> interleaving aggression. Higher values lead to less interleaving
> (255:1), while lower values lead to more interleaving (1:1).
> 
> Large weightiness values generally lead to increased weight-bandwidth
> proportionality, but can lead to underutilized nodes (think worst-case
> scenario, which is 1:max_node_weight). Lower weightiness reduces the
> effects of underutilized nodes, but may lead to improperly loaded
> distributions.

s/max_node_weight/weightiness/

> This knob is exposed as a sysfs interface with a default value of 32.
> Weights are re-calculated once at boottime and then every time the knob
> is changed by the user, or when the ACPI table is updated.
> 
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@...il.com>
> Signed-off-by: Gregory Price <gourry@...rry.net>
> Co-Developed-by: Gregory Price <gourry@...rry.net>
> 
> ---
> Changelog
> 
> v2:
> - Name of the interface is changed from v1: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
>    interleave weight table is initialized with the defaults, if bandwidth
>    information is available.
>    - In addition, all sections that handle iw_table have been changed
>      to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
>    support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
>    behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
> 
>   ...fs-kernel-mm-mempolicy-weighted-interleave |  36 ++++
>   drivers/acpi/numa/hmat.c                      |   1 +
>   drivers/base/node.c                           |   7 +
>   include/linux/mempolicy.h                     |   4 +
>   mm/mempolicy.c                                | 183 +++++++++++++++---
>   5 files changed, 209 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..edb2c1f4753f 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -23,3 +23,39 @@ Description:	Weight configuration interface for nodeN
>   		Writing an empty string or `0` will reset the weight to the
>   		system default. The system default may be set by the kernel
>   		or drivers at boot or during hotplug events.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/weightiness
> +Date:		December 2024
> +Contact:	Linux memory management mailing list <linux-mm@...ck.org>
> +Description:	Weight limiting / scaling interface
> +
> +		"Weightiness": a measure of interleave aggression between
> +		memory nodes. Higher values lead to less interleaving (255:1),
> +		while lower values lead to more interleaving (1:1).

It might be better to explain here what low and high values of
weightiness imply, the way you described them in the changelog.
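
To illustrate what low and high values would mean in practice, here is a
rough userspace sketch. It is not the patch's reduce_interleave_weights()
(which is not quoted above); scale_weights() and gcd_u64() are hypothetical
helpers that assume weights are scaled relative to the fastest node, clamped
to a minimum of 1, and then reduced by their common divisor. It also assumes
bandwidth values small enough that bw * weightiness does not overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Greatest common divisor; gcd_u64(0, x) == x. */
static uint64_t gcd_u64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper, NOT the patch's reduce_interleave_weights():
 * scale each node's bandwidth into [1, weightiness] relative to the
 * fastest node, then divide out the common factor.
 */
static void scale_weights(const uint64_t *bw, uint8_t *iw, size_t n,
			  uint8_t weightiness)
{
	uint64_t max_bw = 0, g = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (bw[i] > max_bw)
			max_bw = bw[i];

	for (i = 0; i < n; i++) {
		/* Without any bandwidth data, fall back to a weight of 1. */
		uint64_t w = max_bw ? bw[i] * weightiness / max_bw : 1;

		if (w < 1)
			w = 1;	/* every node keeps at least weight 1 */
		iw[i] = (uint8_t)w;
		g = gcd_u64(g, w);
	}

	/* g >= 1 whenever n > 0, so this never divides by zero. */
	for (i = 0; i < n; i++)
		iw[i] = (uint8_t)(iw[i] / g);
}
```

Under these assumptions, the 7639:1345:7 bandwidths from the changelog
become 32:5:1 with weightiness 32 and 1:1:1 with weightiness 1, which is
the kind of concrete contrast the documentation could spell out.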

> +		When this value is updated, all node weights are re-calculated
> +		to reflect the new weightiness. These re-calculated values
> +		overwrite all existing node weights, including those manually
> +		set by writing to the nodeN files.
> +
> +		Node weight re-calculation is performed by scaling down
> +		bandwidth values reported in the ACPI HMAT to the range
> +		[1, weightiness]. Note that re-calculation uses only the
> +		weightiness parameter and bandwidth values, and ignores all
> +		current node weights.
> +
> +		Minimum weight: 1
> +		Default value: 32
> +		Maximum weight: 255
> +
> +		Writing an empty string will set the value to be the default
> +		(32). Writing a value outside the valid range  will return
> +		EINVAL and will not re-trigger a weight scaling.
> +
> +		If there is no bandwidth data in the ACPI HMAT, then this file
> +		will return ENODEV on an attempted write and perform no updates.
> +		Furthermore, if there is no bandwidth information available,
> +		all nodes' weights will default to 1.
> +
> +		Setting max_node_weight to 1 is equivalent to unweighted
> +		interleave.

s/max_node_weight/weightiness/

> @@ -3397,6 +3471,54 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>   
>   static struct iw_node_attr **node_attrs;
>   
> +static ssize_t weightiness_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	return sysfs_emit(buf, "%d\n", weightiness);
> +}
> +
> +static ssize_t weightiness_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	uint64_t *bw;
> +	u8 *old_iw, *new_iw;
> +	u8 new_weightiness;
> +
> +	if (count == 0 || sysfs_streq(buf, ""))
> +		new_weightiness = 32;
> +	else if (kstrtou8(buf, 0, &new_weightiness) || new_weightiness == 0)
> +		return -EINVAL;
> +
> +	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
> +	if (!new_iw)
> +		return -ENOMEM;

Could you please use kcalloc() here, as in mempolicy_set_node_perf()?
Otherwise the patch looks fine to me. (I will review and test the next
revision.)

By the way, this might be out of scope, but let me ask for my own
learning.

We have a server with 2 sockets, each with local DRAM and CXL memory
attached (and thus 4 NUMA nodes). When accessing a remote socket's memory
(either CXL or not), the bandwidth is limited by the interconnect's
bandwidth.

On this server, weighted interleaving should ideally be configured
within a socket (e.g. local DRAM node + local CXL node), because
weighted interleaving does not account for the reduced bandwidth when
memory is accessed from a remote socket.

So, the question is: on systems with multiple sockets (and CXL memory
attached to each socket), do you always assume the admin must bind tasks
to a specific socket for optimal performance, or is there any plan to
mitigate this problem without binding tasks to a socket?
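
To make the concern concrete, here is a small userspace sketch with made-up
numbers. The node layout, the bandwidth values, link_bw and effective_bw()
are all hypothetical, and the model is deliberately simplified: it caps each
remote node at the interconnect bandwidth and ignores that remote nodes
share the same link.

```c
#include <stdint.h>

#define NR_NODES 4

/* Hypothetical topology: nodes 0/2 sit on socket 0, nodes 1/3 on socket 1. */
static const int node_socket[NR_NODES] = { 0, 1, 0, 1 };

/* Hypothetical local bandwidths: DRAM on nodes 0/1, CXL on nodes 2/3. */
static const uint64_t node_bw[NR_NODES] = { 300, 300, 60, 60 };

/* Hypothetical socket-to-socket interconnect bandwidth. */
static const uint64_t link_bw = 50;

/*
 * Bandwidth that a task running on @cpu_socket can expect from node @nid:
 * local nodes deliver their full bandwidth, remote nodes are capped by
 * the interconnect.
 */
static uint64_t effective_bw(int cpu_socket, int nid)
{
	uint64_t bw = node_bw[nid];

	if (node_socket[nid] != cpu_socket && bw > link_bw)
		bw = link_bw;
	return bw;
}
```

With these numbers, a task on socket 0 sees 300:50:60:50 across nodes 0..3,
while a task on socket 1 sees 50:300:50:60, so a single global weight table
cannot be proportional to bandwidth for both sockets at once.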

> +
> +	mutex_lock(&iw_table_lock);
> +	bw = node_bw_table;
> +
> +	if (!bw) {
> +		mutex_unlock(&iw_table_lock);
> +		kfree(new_iw);
> +		return -ENODEV;
> +	}
> +
> +	weightiness = new_weightiness;
> +	old_iw = rcu_dereference_protected(iw_table,
> +					   lockdep_is_held(&iw_table_lock));
> +
> +	reduce_interleave_weights(bw, new_iw);
> +	rcu_assign_pointer(iw_table, new_iw);
> +	mutex_unlock(&iw_table_lock);
> +
> +	synchronize_rcu();
> +	kfree(old_iw);
> +
> +	return count;
> +}
> +
> +static struct kobj_attribute wi_attr =
> +	__ATTR(weightiness, 0664, weightiness_show, weightiness_store);
> +
>   static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
>   				  struct kobject *parent)
>   {
> @@ -3413,6 +3535,7 @@ static void sysfs_wi_release(struct kobject *wi_kobj)
>   
>   	for (i = 0; i < nr_node_ids; i++)
>   		sysfs_wi_node_release(node_attrs[i], wi_kobj);
> +
>   	kobject_put(wi_kobj);
>   }
>   
> @@ -3454,6 +3577,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
>   	return 0;
>   }
>   
> +static struct attribute *wi_default_attrs[] = {
> +	&wi_attr.attr,
> +	NULL
> +};
> +
> +static const struct attribute_group wi_attr_group = {
> +	.attrs = wi_default_attrs,
> +};
> +
>   static int add_weighted_interleave_group(struct kobject *root_kobj)
>   {
>   	struct kobject *wi_kobj;
> @@ -3470,6 +3602,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>   		return err;
>   	}
>   
> +	err = sysfs_create_group(wi_kobj, &wi_attr_group);
> +	if (err) {
> +		pr_err("failed to add sysfs [weightiness]\n");
> +		kobject_put(wi_kobj);
> +		return err;
> +	}
> +
>   	for_each_node_state(nid, N_POSSIBLE) {
>   		err = add_weight_node(nid, wi_kobj);
>   		if (err) {