Message-ID: <e69e6f79-f05b-47d5-8f04-36908e357fff@redhat.com>
Date: Mon, 7 Apr 2025 12:47:23 +0200
From: David Hildenbrand <david@...hat.com>
To: Rakie Kim <rakie.kim@...com>
Cc: gourry@...rry.net, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
linux-cxl@...r.kernel.org, joshua.hahnjy@...il.com,
dan.j.williams@...el.com, ying.huang@...ux.alibaba.com,
Jonathan.Cameron@...wei.com, osalvador@...e.de, kernel_team@...ynix.com,
honggyu.kim@...com, yunjeong.mun@...com, akpm@...ux-foundation.org
Subject: Re: [PATCH v6 3/3] mm/mempolicy: Support memory hotplug in weighted
interleave
On 07.04.25 11:39, Rakie Kim wrote:
> On Fri, 4 Apr 2025 22:45:00 +0200 David Hildenbrand <david@...hat.com> wrote:
>> On 04.04.25 09:46, Rakie Kim wrote:
>>> The weighted interleave policy distributes page allocations across multiple
>>> NUMA nodes based on their performance weight, thereby improving memory
>>> bandwidth utilization. The weight values for each node are configured
>>> through sysfs.
>>>
>>> Previously, sysfs entries for configuring weighted interleave were created
>>> for all possible nodes (N_POSSIBLE) at initialization, including nodes that
>>> might not have memory. However, not all nodes in N_POSSIBLE are usable at
>>> runtime, as some may remain memoryless or offline.
>>> This led to sysfs entries being created for unusable nodes, causing
>>> potential misconfiguration issues.
>>>
>>> To address this issue, this patch modifies the sysfs creation logic to:
>>> 1) Limit sysfs entries to nodes that are online and have memory, avoiding
>>> the creation of sysfs entries for nodes that cannot be used.
>>> 2) Support memory hotplug by dynamically adding and removing sysfs entries
>>> based on whether a node transitions into or out of the N_MEMORY state.
>>>
>>> Additionally, the patch ensures that sysfs attributes are properly managed
>>> when nodes go offline, preventing stale or redundant entries from persisting
>>> in the system.
>>>
>>> By making these changes, the weighted interleave policy now manages its
>>> sysfs entries more efficiently, ensuring that only relevant nodes are
>>> considered for interleaving, and dynamically adapting to memory hotplug
>>> events.
>>>
>>> Signed-off-by: Rakie Kim <rakie.kim@...com>
>>> Signed-off-by: Honggyu Kim <honggyu.kim@...com>
>>> Signed-off-by: Yunjeong Mun <yunjeong.mun@...com>
>>> ---
>>> mm/mempolicy.c | 109 ++++++++++++++++++++++++++++++++++++++-----------
>>> 1 file changed, 86 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>> index 73a9405ff352..f25c2c7f8fcf 100644
>>> --- a/mm/mempolicy.c
>>> +++ b/mm/mempolicy.c
>>> @@ -113,6 +113,7 @@
>>> #include <asm/tlbflush.h>
>>> #include <asm/tlb.h>
>>> #include <linux/uaccess.h>
>>> +#include <linux/memory.h>
>>>
>>> #include "internal.h"
>>>
>>> @@ -3390,6 +3391,7 @@ struct iw_node_attr {
>>>
>>> struct sysfs_wi_group {
>>> struct kobject wi_kobj;
>>> + struct mutex kobj_lock;
>>> struct iw_node_attr *nattrs[];
>>> };
>>>
>>> @@ -3439,13 +3441,24 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>>>
>>> static void sysfs_wi_node_delete(int nid)
>>> {
>>> - if (!wi_group->nattrs[nid])
>>> + struct iw_node_attr *attr;
>>> +
>>> + if (nid < 0 || nid >= nr_node_ids)
>>> + return;
>>> +
>>> + mutex_lock(&wi_group->kobj_lock);
>>> + attr = wi_group->nattrs[nid];
>>> + if (!attr) {
>>> + mutex_unlock(&wi_group->kobj_lock);
>>> return;
>>> + }
>>> +
>>> + wi_group->nattrs[nid] = NULL;
>>> + mutex_unlock(&wi_group->kobj_lock);
>>>
>>> - sysfs_remove_file(&wi_group->wi_kobj,
>>> - &wi_group->nattrs[nid]->kobj_attr.attr);
>>> - kfree(wi_group->nattrs[nid]->kobj_attr.attr.name);
>>> - kfree(wi_group->nattrs[nid]);
>>> + sysfs_remove_file(&wi_group->wi_kobj, &attr->kobj_attr.attr);
>>> + kfree(attr->kobj_attr.attr.name);
>>> + kfree(attr);
>>> }
>>>
>>> static void sysfs_wi_release(struct kobject *wi_kobj)
>>> @@ -3464,35 +3477,80 @@ static const struct kobj_type wi_ktype = {
>>>
>>> static int sysfs_wi_node_add(int nid)
>>> {
>>> - struct iw_node_attr *node_attr;
>>> + int ret = 0;
>>> char *name;
>>> + struct iw_node_attr *new_attr = NULL;
>>>
>>> - node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
>>> - if (!node_attr)
>>> + if (nid < 0 || nid >= nr_node_ids) {
>>> + pr_err("Invalid node id: %d\n", nid);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + new_attr = kzalloc(sizeof(struct iw_node_attr), GFP_KERNEL);
>>> + if (!new_attr)
>>> return -ENOMEM;
>>>
>>> name = kasprintf(GFP_KERNEL, "node%d", nid);
>>> if (!name) {
>>> - kfree(node_attr);
>>> + kfree(new_attr);
>>> return -ENOMEM;
>>> }
>>>
>>> - sysfs_attr_init(&node_attr->kobj_attr.attr);
>>> - node_attr->kobj_attr.attr.name = name;
>>> - node_attr->kobj_attr.attr.mode = 0644;
>>> - node_attr->kobj_attr.show = node_show;
>>> - node_attr->kobj_attr.store = node_store;
>>> - node_attr->nid = nid;
>>> + mutex_lock(&wi_group->kobj_lock);
>>> + if (wi_group->nattrs[nid]) {
>>> + mutex_unlock(&wi_group->kobj_lock);
>>> + pr_info("Node [%d] already exists\n", nid);
>>> + kfree(new_attr);
>>> + kfree(name);
>>> + return 0;
>>> + }
>>> + wi_group->nattrs[nid] = new_attr;
>>>
>>> - if (sysfs_create_file(&wi_group->wi_kobj, &node_attr->kobj_attr.attr)) {
>>> - kfree(node_attr->kobj_attr.attr.name);
>>> - kfree(node_attr);
>>> - pr_err("failed to add attribute to weighted_interleave\n");
>>> - return -ENOMEM;
>>> + sysfs_attr_init(&wi_group->nattrs[nid]->kobj_attr.attr);
>>> + wi_group->nattrs[nid]->kobj_attr.attr.name = name;
>>> + wi_group->nattrs[nid]->kobj_attr.attr.mode = 0644;
>>> + wi_group->nattrs[nid]->kobj_attr.show = node_show;
>>> + wi_group->nattrs[nid]->kobj_attr.store = node_store;
>>> + wi_group->nattrs[nid]->nid = nid;
>>> +
>>> + ret = sysfs_create_file(&wi_group->wi_kobj,
>>> + &wi_group->nattrs[nid]->kobj_attr.attr);
>>> + if (ret) {
>>> + kfree(wi_group->nattrs[nid]->kobj_attr.attr.name);
>>> + kfree(wi_group->nattrs[nid]);
>>> + wi_group->nattrs[nid] = NULL;
>>> + pr_err("Failed to add attribute to weighted_interleave: %d\n", ret);
>>> }
>>> + mutex_unlock(&wi_group->kobj_lock);
>>>
>>> - wi_group->nattrs[nid] = node_attr;
>>> - return 0;
>>> + return ret;
>>> +}
>>> +
>>> +static int wi_node_notifier(struct notifier_block *nb,
>>> + unsigned long action, void *data)
>>> +{
>>> + int err;
>>> + struct memory_notify *arg = data;
>>> + int nid = arg->status_change_nid;
>>> +
>>> + if (nid < 0)
>>> + goto notifier_end;
>>> +
>>> +	switch (action) {
>>> + case MEM_ONLINE:
>>
>> MEM_ONLINE is too late, we cannot fail hotplug at that point.
>>
>> Would MEM_GOING_ONLINE / MEM_CANCEL_ONLINE be better?
>
> Hi David,
Hi,
>
> Thank you for raising these points. I would appreciate your clarification
> on the following:
>
> Issue1: I want to invoke sysfs_wi_node_add() after a node with memory
> has been fully transitioned to the online state. Does replacing
> MEM_ONLINE with MEM_GOING_ONLINE or MEM_CANCEL_ONLINE still ensure
> that the node is considered online and usable by that point?
After MEM_GOING_ONLINE succeeds, nothing can go wrong except that some
other notifier fails its own MEM_GOING_ONLINE handling.
That happens rarely -- only if some other allocation, such as the one
for KASAN, fails.
In MEM_CANCEL_ONLINE you have to undo what you did (remove the node again).
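
IOW, the flow during onlining is roughly (simplified sketch of what
mm/memory_hotplug.c does):

	online_pages()
	  -> memory_notify(MEM_GOING_ONLINE)  /* notifiers may veto here */
	       -> on failure: memory_notify(MEM_CANCEL_ONLINE)  /* undo */
	  -> pages get onlined
	  -> memory_notify(MEM_ONLINE)        /* result is not checked */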
>
>>
>>
>>> + err = sysfs_wi_node_add(nid);
>>> + if (err) {
>>> + pr_err("failed to add sysfs [node%d]\n", nid);
>>> + return NOTIFY_BAD;
>>
>> Note that NOTIFY_BAD includes NOTIFY_STOP_MASK. So you wouldn't call
>> other notifiers, but the overall memory onlining would succeed, which is
>> bad.
>>
>> If we don't care about the error (not prevent hotplug) we could only
>> pr_warn() and continue. Maybe this (unlikely) case is not a good reason
>> to stop memory from getting onlined. OTOH, it will barely ever trigger
>> in practice ...
>>
>
> Issue2: Regarding your note about NOTIFY_BAD: are you saying that
> if the notifier returns NOTIFY_BAD when sysfs_wi_node_add() fails,
> NOTIFY_STOP_MASK takes effect, preventing the remaining notifiers from
> running, while still allowing the memory hotplug operation to complete?
Yes.
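
The relevant definitions in include/linux/notifier.h:

	#define NOTIFY_DONE		0x0000		/* Don't care */
	#define NOTIFY_OK		0x0001		/* Suits me */
	#define NOTIFY_STOP_MASK	0x8000		/* Don't call further */
	#define NOTIFY_BAD		(NOTIFY_STOP_MASK|0x0002)
						/* Bad/Veto action */

So NOTIFY_BAD stops the remaining notifiers in the chain, but by the
time MEM_ONLINE is delivered online_pages() no longer checks the
result, so the onlining itself still succeeds.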
>
> If so, then I'm thinking of resolving both issues as follows:
> - For Issue1: I keep using MEM_ONLINE, assuming it is safe and
> sufficient to ensure the node is fully online.
> - For Issue2: I avoid returning NOTIFY_BAD from the notifier.
> Instead, I log the error using pr_err() and continue the operation.
>
> This would result in the following code:
>
> if (nid < 0)
> return NOTIFY_OK;
>
> switch (action) {
> case MEM_ONLINE: // Issue1: keeping this unchanged
> err = sysfs_wi_node_add(nid);
> if (err) {
> pr_err("failed to add sysfs [node%d]\n", nid);
> // Issue2: Do not return NOTIFY_BAD
> }
> break;
> case MEM_OFFLINE:
> sysfs_wi_node_delete(nid);
> break;
> }
>
> // Always return NOTIFY_OK
> return NOTIFY_OK;
>
> Please let me know if this approach is acceptable.
That would work. The alternative is failing during MEM_GOING_ONLINE if
the allocation failed, and deleting the node during MEM_CANCEL_ONLINE.
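
Roughly (untested sketch, reusing the helpers from this patch):

	static int wi_node_notifier(struct notifier_block *nb,
				    unsigned long action, void *data)
	{
		struct memory_notify *arg = data;
		int nid = arg->status_change_nid;
		int err;

		if (nid < 0)
			return NOTIFY_OK;

		switch (action) {
		case MEM_GOING_ONLINE:
			err = sysfs_wi_node_add(nid);
			if (err)
				/* veto; MEM_CANCEL_ONLINE will follow */
				return notifier_from_errno(err);
			break;
		case MEM_CANCEL_ONLINE:	/* some notifier failed, undo */
		case MEM_OFFLINE:
			/* a no-op if the entry was never created */
			sysfs_wi_node_delete(nid);
			break;
		}
		return NOTIFY_OK;
	}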
--
Cheers,
David / dhildenb