Message-ID: <20250324085433.998-1-rakie.kim@sk.com>
Date: Mon, 24 Mar 2025 17:54:27 +0900
From: Rakie Kim <rakie.kim@...com>
To: Rakie Kim <rakie.kim@...com>
Cc: akpm@...ux-foundation.org,
linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
linux-cxl@...r.kernel.org,
joshua.hahnjy@...il.com,
dan.j.williams@...el.com,
ying.huang@...ux.alibaba.com,
david@...hat.com,
Jonathan.Cameron@...wei.com,
kernel_team@...ynix.com,
honggyu.kim@...com,
yunjeong.mun@...com,
Gregory Price <gourry@...rry.net>
Subject: Re: [PATCH v3 3/3] mm/mempolicy: Support memory hotplug in weighted interleave
On Mon, 24 Mar 2025 17:48:39 +0900 Rakie Kim <rakie.kim@...com> wrote:
> On Fri, 21 Mar 2025 10:24:46 -0400 Gregory Price <gourry@...rry.net> wrote:
> > On Thu, Mar 20, 2025 at 01:17:48PM +0900, Rakie Kim wrote:
> > ... snip ...
> > > + mutex_lock(&sgrp->kobj_lock);
> > > + if (sgrp->nattrs[nid]) {
> > > + mutex_unlock(&sgrp->kobj_lock);
> > > + pr_info("Node [%d] already exists\n", nid);
> > > + kfree(new_attr);
> > > + kfree(name);
> > > + return 0;
> > > + }
> > >
> > > - if (sysfs_create_file(&sgrp->wi_kobj, &node_attr->kobj_attr.attr)) {
> > > - kfree(node_attr->kobj_attr.attr.name);
> > > - kfree(node_attr);
> > > - pr_err("failed to add attribute to weighted_interleave\n");
> > > - return -ENOMEM;
> > > + sgrp->nattrs[nid] = new_attr;
> > > + mutex_unlock(&sgrp->kobj_lock);
> > > +
> > > + sysfs_attr_init(&sgrp->nattrs[nid]->kobj_attr.attr);
> > > + sgrp->nattrs[nid]->kobj_attr.attr.name = name;
> > > + sgrp->nattrs[nid]->kobj_attr.attr.mode = 0644;
> > > + sgrp->nattrs[nid]->kobj_attr.show = node_show;
> > > + sgrp->nattrs[nid]->kobj_attr.store = node_store;
> > > + sgrp->nattrs[nid]->nid = nid;
> >
> > These accesses need to be inside the lock as well. Probably we can't
> > get here concurrently, but I can't say so definitively enough that I'm
> > comfortable blind-accessing it outside the lock.
>
> You're right, and I appreciate your point. It's not difficult to apply your
> suggestion, so I plan to update the code as follows:
>
> 	sgrp->nattrs[nid] = new_attr;
>
> 	sysfs_attr_init(&sgrp->nattrs[nid]->kobj_attr.attr);
> 	sgrp->nattrs[nid]->kobj_attr.attr.name = name;
> 	sgrp->nattrs[nid]->kobj_attr.attr.mode = 0644;
> 	sgrp->nattrs[nid]->kobj_attr.show = node_show;
> 	sgrp->nattrs[nid]->kobj_attr.store = node_store;
> 	sgrp->nattrs[nid]->nid = nid;
>
> 	ret = sysfs_create_file(&sgrp->wi_kobj,
> 				&sgrp->nattrs[nid]->kobj_attr.attr);
> 	if (ret) {
> 		mutex_unlock(&sgrp->kobj_lock);
> 		...
> 	}
> 	mutex_unlock(&sgrp->kobj_lock);
>
> >
> > > +static int wi_node_notifier(struct notifier_block *nb,
> > > + unsigned long action, void *data)
> > > +{
> > ... snip ...
> > > + case MEM_OFFLINE:
> > > + sysfs_wi_node_release(nid);
> >
> > I'm still not convinced this is correct. `offline_pages()` says this:
> >
> > /*
> > * {on,off}lining is constrained to full memory sections (or more
> > * precisely to memory blocks from the user space POV).
> > */
> >
> > And that is the function calling:
> > memory_notify(MEM_OFFLINE, &arg);
> >
> > David pointed out that this should be called when offlining each memory
> > block. This is not the same as simply doing `echo 0 > online`; you need
> > to remove the dax device associated with the memory.
> >
> > For example:
> >
> >            node1
> >           /     \
> >      dax0.0    dax1.0
> >        |          |
> >       mb1        mb2
> >
> >
> > With this code, if I `daxctl reconfigure-device devmem dax0.0` it will
> > remove the first memory block, causing the MEM_OFFLINE event to fire and
> > the node to be removed, despite the fact that dax1.0 is still present.
> >
> > This matters for systems with memory holes in CXL hotplug memory and
> > also for systems with Dynamic Capacity Devices surfacing capacity as
> > separate dax devices.
> >
> > ~Gregory
>
> If all memory blocks belonging to a node are offlined, the node will lose its
> `N_MEMORY` state before the notifier callback is invoked. This should help avoid
> the issue you mentioned.
> Please let me know your thoughts on this approach.
>
> Rakie
>
I'm sorry, the code was missing from my previous message.
I may not fully understand the scenario you described, but I think your concern
can be addressed by adding a simple check like the following:
	case MEM_OFFLINE:
		if (!node_state(nid, N_MEMORY))	/* <-- this point */
			sysfs_wi_node_release(nid);
If all memory blocks belonging to a node are offlined, the node will lose its
`N_MEMORY` state before the notifier callback is invoked. This should help avoid
the issue you mentioned.
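For context, here is a minimal sketch of how that check might sit inside the
notifier. This is assembled from the snippets in this thread and is not the
actual patch; the MEM_ONLINE case, the sysfs_wi_node_add() call, and the use
of status_change_nid are assumptions about the surrounding structure:

```c
/* Sketch only -- surrounding structure is assumed, not from the patch. */
static int wi_node_notifier(struct notifier_block *nb,
			    unsigned long action, void *data)
{
	struct memory_notify *arg = data;
	int nid = arg->status_change_nid;

	if (nid < 0)
		return NOTIFY_OK;

	switch (action) {
	case MEM_ONLINE:
		sysfs_wi_node_add(nid);
		break;
	case MEM_OFFLINE:
		/*
		 * Only drop the sysfs node once the whole node is
		 * memoryless, so offlining a single memory block (e.g.
		 * one of several dax devices backing the node) does not
		 * remove it prematurely.
		 */
		if (!node_state(nid, N_MEMORY))
			sysfs_wi_node_release(nid);
		break;
	}
	return NOTIFY_OK;
}
```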
Please let me know your thoughts on this approach.
Rakie.