[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250407093926.450-1-rakie.kim@sk.com>
Date: Mon, 7 Apr 2025 18:39:19 +0900
From: Rakie Kim <rakie.kim@...com>
To: David Hildenbrand <david@...hat.com>
Cc: gourry@...rry.net,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
linux-cxl@...r.kernel.org,
joshua.hahnjy@...il.com,
dan.j.williams@...el.com,
ying.huang@...ux.alibaba.com,
Jonathan.Cameron@...wei.com,
osalvador@...e.de,
kernel_team@...ynix.com,
honggyu.kim@...com,
yunjeong.mun@...com,
Rakie Kim <rakie.kim@...com>,
akpm@...ux-foundation.org
Subject: Re: [PATCH v6 3/3] mm/mempolicy: Support memory hotplug in weighted interleave
On Fri, 4 Apr 2025 22:45:00 +0200 David Hildenbrand <david@...hat.com> wrote:
> On 04.04.25 09:46, Rakie Kim wrote:
> > The weighted interleave policy distributes page allocations across multiple
> > NUMA nodes based on their performance weight, thereby improving memory
> > bandwidth utilization. The weight values for each node are configured
> > through sysfs.
> >
> > Previously, sysfs entries for configuring weighted interleave were created
> > for all possible nodes (N_POSSIBLE) at initialization, including nodes that
> > might not have memory. However, not all nodes in N_POSSIBLE are usable at
> > runtime, as some may remain memoryless or offline.
> > This led to sysfs entries being created for unusable nodes, causing
> > potential misconfiguration issues.
> >
> > To address this issue, this patch modifies the sysfs creation logic to:
> > 1) Limit sysfs entries to nodes that are online and have memory, avoiding
> > the creation of sysfs entries for nodes that cannot be used.
> > 2) Support memory hotplug by dynamically adding and removing sysfs entries
> > based on whether a node transitions into or out of the N_MEMORY state.
> >
> > Additionally, the patch ensures that sysfs attributes are properly managed
> > when nodes go offline, preventing stale or redundant entries from persisting
> > in the system.
> >
> > By making these changes, the weighted interleave policy now manages its
> > sysfs entries more efficiently, ensuring that only relevant nodes are
> > considered for interleaving, and dynamically adapting to memory hotplug
> > events.
> >
> > Signed-off-by: Rakie Kim <rakie.kim@...com>
> > Signed-off-by: Honggyu Kim <honggyu.kim@...com>
> > Signed-off-by: Yunjeong Mun <yunjeong.mun@...com>
> > ---
> > mm/mempolicy.c | 109 ++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 86 insertions(+), 23 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 73a9405ff352..f25c2c7f8fcf 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -113,6 +113,7 @@
> > #include <asm/tlbflush.h>
> > #include <asm/tlb.h>
> > #include <linux/uaccess.h>
> > +#include <linux/memory.h>
> >
> > #include "internal.h"
> >
> > @@ -3390,6 +3391,7 @@ struct iw_node_attr {
> >
> > struct sysfs_wi_group {
> > struct kobject wi_kobj;
> > + struct mutex kobj_lock;
> > struct iw_node_attr *nattrs[];
> > };
> >
> > @@ -3439,13 +3441,24 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
> >
> > static void sysfs_wi_node_delete(int nid)
> > {
> > - if (!wi_group->nattrs[nid])
> > + struct iw_node_attr *attr;
> > +
> > + if (nid < 0 || nid >= nr_node_ids)
> > + return;
> > +
> > + mutex_lock(&wi_group->kobj_lock);
> > + attr = wi_group->nattrs[nid];
> > + if (!attr) {
> > + mutex_unlock(&wi_group->kobj_lock);
> > return;
> > + }
> > +
> > + wi_group->nattrs[nid] = NULL;
> > + mutex_unlock(&wi_group->kobj_lock);
> >
> > - sysfs_remove_file(&wi_group->wi_kobj,
> > - &wi_group->nattrs[nid]->kobj_attr.attr);
> > - kfree(wi_group->nattrs[nid]->kobj_attr.attr.name);
> > - kfree(wi_group->nattrs[nid]);
> > + sysfs_remove_file(&wi_group->wi_kobj, &attr->kobj_attr.attr);
> > + kfree(attr->kobj_attr.attr.name);
> > + kfree(attr);
> > }
> >
> > static void sysfs_wi_release(struct kobject *wi_kobj)
> > @@ -3464,35 +3477,80 @@ static const struct kobj_type wi_ktype = {
> >
> > static int sysfs_wi_node_add(int nid)
> > {
> > - struct iw_node_attr *node_attr;
> > + int ret = 0;
> > char *name;
> > + struct iw_node_attr *new_attr = NULL;
> >
> > - node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
> > - if (!node_attr)
> > + if (nid < 0 || nid >= nr_node_ids) {
> > + pr_err("Invalid node id: %d\n", nid);
> > + return -EINVAL;
> > + }
> > +
> > + new_attr = kzalloc(sizeof(struct iw_node_attr), GFP_KERNEL);
> > + if (!new_attr)
> > return -ENOMEM;
> >
> > name = kasprintf(GFP_KERNEL, "node%d", nid);
> > if (!name) {
> > - kfree(node_attr);
> > + kfree(new_attr);
> > return -ENOMEM;
> > }
> >
> > - sysfs_attr_init(&node_attr->kobj_attr.attr);
> > - node_attr->kobj_attr.attr.name = name;
> > - node_attr->kobj_attr.attr.mode = 0644;
> > - node_attr->kobj_attr.show = node_show;
> > - node_attr->kobj_attr.store = node_store;
> > - node_attr->nid = nid;
> > + mutex_lock(&wi_group->kobj_lock);
> > + if (wi_group->nattrs[nid]) {
> > + mutex_unlock(&wi_group->kobj_lock);
> > + pr_info("Node [%d] already exists\n", nid);
> > + kfree(new_attr);
> > + kfree(name);
> > + return 0;
> > + }
> > + wi_group->nattrs[nid] = new_attr;
> >
> > - if (sysfs_create_file(&wi_group->wi_kobj, &node_attr->kobj_attr.attr)) {
> > - kfree(node_attr->kobj_attr.attr.name);
> > - kfree(node_attr);
> > - pr_err("failed to add attribute to weighted_interleave\n");
> > - return -ENOMEM;
> > + sysfs_attr_init(&wi_group->nattrs[nid]->kobj_attr.attr);
> > + wi_group->nattrs[nid]->kobj_attr.attr.name = name;
> > + wi_group->nattrs[nid]->kobj_attr.attr.mode = 0644;
> > + wi_group->nattrs[nid]->kobj_attr.show = node_show;
> > + wi_group->nattrs[nid]->kobj_attr.store = node_store;
> > + wi_group->nattrs[nid]->nid = nid;
> > +
> > + ret = sysfs_create_file(&wi_group->wi_kobj,
> > + &wi_group->nattrs[nid]->kobj_attr.attr);
> > + if (ret) {
> > + kfree(wi_group->nattrs[nid]->kobj_attr.attr.name);
> > + kfree(wi_group->nattrs[nid]);
> > + wi_group->nattrs[nid] = NULL;
> > + pr_err("Failed to add attribute to weighted_interleave: %d\n", ret);
> > }
> > + mutex_unlock(&wi_group->kobj_lock);
> >
> > - wi_group->nattrs[nid] = node_attr;
> > - return 0;
> > + return ret;
> > +}
> > +
> > +static int wi_node_notifier(struct notifier_block *nb,
> > + unsigned long action, void *data)
> > +{
> > + int err;
> > + struct memory_notify *arg = data;
> > + int nid = arg->status_change_nid;
> > +
> > + if (nid < 0)
> > + goto notifier_end;
> > +
> > + switch(action) {
> > + case MEM_ONLINE:
>
> MEM_ONLINE is too late, we cannot fail hotplug at that point.
>
> Would MEM_GOING_ONLINE / MEM_CANCEL_ONLINE be better?
Hi David,
Thank you for raising these points. I would appreciate your clarification
on the following:
Issue1: I want to invoke sysfs_wi_node_add() after a node with memory
has been fully transitioned to the online state. Does replacing
MEM_ONLINE with MEM_GOING_ONLINE or MEM_CANCEL_ONLINE still ensure
that the node is considered online and usable by that point?
>
>
> > + err = sysfs_wi_node_add(nid);
> > + if (err) {
> > + pr_err("failed to add sysfs [node%d]\n", nid);
> > + return NOTIFY_BAD;
>
> Note that NOTIFY_BAD includes NOTIFY_STOP_MASK. So you wouldn't call
> other notifiers, but the overall memory onlining would succeed, which is
> bad.
>
> If we don't care about the error (not prevent hotplug) we could only
> pr_warn() and continue. Maybe this (unlikely) case is not a good reason
> to stop memory from getting onlined. OTOH, it will barely ever trigger
> in practice ...
>
Issue2: Regarding your note about NOTIFY_BAD ? are you saying that
if sysfs_wi_node_add() returns NOTIFY_BAD, it will trigger
NOTIFY_STOP_MASK, preventing other notifiers from running, while
still allowing the memory hotplug operation to complete?
If so, then I'm thinking of resolving both issues as follows:
- For Issue1: I keep using MEM_ONLINE, assuming it is safe and
sufficient to ensure the node is fully online.
- For Issue2: I avoid returning NOTIFY_BAD from the notifier.
Instead, I log the error using pr_err() and continue the operation.
This would result in the following code:
if (nid < 0)
return NOTIFY_OK;
switch (action) {
case MEM_ONLINE: // Issue1: keeping this unchanged
err = sysfs_wi_node_add(nid);
if (err) {
pr_err("failed to add sysfs [node%d]\n", nid);
// Issue2: Do not return NOTIFY_BAD
}
break;
case MEM_OFFLINE:
sysfs_wi_node_delete(nid);
break;
}
// Always return NOTIFY_OK
return NOTIFY_OK;
Please let me know if this approach is acceptable.
Rakie
> --
> Cheers,
>
> David / dhildenb
>
Powered by blists - more mailing lists