linux-kernel - Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZYp6ZRLZQVtTHest@memverge.com>
Date: Tue, 26 Dec 2023 02:01:57 -0500
From: Gregory Price <gregory.price@...verge.com>
To: "Huang, Ying" <ying.huang@...el.com>
Cc: Gregory Price <gourry.memverge@...il.com>, linux-mm@...ck.org,
	linux-doc@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
	x86@...nel.org, akpm@...ux-foundation.org, arnd@...db.de,
	tglx@...utronix.de, luto@...nel.org, mingo@...hat.com, bp@...en8.de,
	dave.hansen@...ux.intel.com, hpa@...or.com, mhocko@...nel.org,
	tj@...nel.org, corbet@....net, rakie.kim@...com,
	hyeongtak.ji@...com, honggyu.kim@...com, vtavarespetr@...ron.com,
	peterz@...radead.org, jgroves@...ron.com, ravis.opensrc@...ron.com,
	sthanneeru@...ron.com, emirakhur@...ron.com, Hasan.Maruf@....com,
	seungjun.ha@...sung.com,
	Srinivasulu Thanneeru <sthanneeru.opensrc@...ron.com>
Subject: Re: [PATCH v5 02/11] mm/mempolicy: introduce
 MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@...il.com> writes:
> 
> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> > +{
> > +	nodemask_t nodemask = pol->nodes;
> > +	unsigned int target, weight_total = 0;
> > +	int nid;
> > +	unsigned char weights[MAX_NUMNODES];
> 
> MAX_NUMNODSE could be as large as 1024.  1KB stack space may be too
> large?
> 

I've been struggling with a good solution to this.  We need a local copy
of weights to prevent weights from changing out from under us during
allocation (which may take quite some time), but it seemed unwise to
to allocate 1KB heap in this particular path.

Is my concern unfounded?  If so, I can go ahead and add the allocation
code.

> > +	unsigned char weight;
> > +
> > +	barrier();
> 
> Memory barrier needs comments.
> 

Barrier is to stabilize nodemask on the stack, but yes i'll carry the
comment from interleave_nid into this barrier as well.

> > +
> > +	/* first ensure we have a valid nodemask */
> > +	nid = first_node(nodemask);
> > +	if (nid == MAX_NUMNODES)
> > +		return nid;
> 
> It appears that this isn't necessary, because we can check whether
> weight_total == 0 after the next loop.
> 

fair, will snip.

> > +
> > +	/* Then collect weights on stack and calculate totals */
> > +	for_each_node_mask(nid, nodemask) {
> > +		weight = iw_table[nid];
> > +		weight_total += weight;
> > +		weights[nid] = weight;
> > +	}
> > +
> > +	/* Finally, calculate the node offset based on totals */
> > +	target = (unsigned int)ilx % weight_total;
> 
> Why use type casting?
> 

Artifact of old prototypes, snipped.

> > +
> > +	/* Stabilize the nodemask on the stack */
> > +	barrier();
> 
> I don't think barrier() is needed to wait for memory operations for
> stack.  It's usually used for cross-processor memory order.
>

This is present in the old interleave code.  To the best of my
understanding, the concern is for mempolicy->nodemask rebinding that can
occur when cgroups.cpusets.mems_allowed changes.

so we can't iterate over (mempolicy->nodemask), we have to take a local
copy.

My *best* understanding of the barrier here is to prevent the compiler
from reordering operations such that it attempts to optimize out the
local copy (or do lazy-fetch).

It is present in the original interleave code, so I pulled it forward to
this, but I have not tested whether this is a bit paranoid or not.

from `interleave_nid`:

 /*
  * The barrier will stabilize the nodemask in a register or on
  * the stack so that it will stop changing under the code.
  *
  * Between first_node() and next_node(), pol->nodes could be changed
  * by other threads. So we put pol->nodes in a local stack.
  */
 barrier();

> > +		/* Otherwise we adjust nr_pages down, and continue from there */
> > +		rem_pages -= pol->wil.cur_weight;
> > +		pol->wil.cur_weight = 0;
> > +		prev_node = node;
> 
> If pol->wil.cur_weight == 0, prev_node will be used without being
> initialized below.
> 

pol->wil.cur_weight is not used below.

> > +	}
> > +
> > +	/* Now we can continue allocating as if from 0 instead of an offset */
> > +	rounds = rem_pages / weight_total;
> > +	delta = rem_pages % weight_total;
> > +	for (i = 0; i < nnodes; i++) {
> > +		node = next_node_in(prev_node, nodes);
> > +		weight = weights[node];
> > +		node_pages = weight * rounds;
> > +		if (delta) {
> > +			if (delta > weight) {
> > +				node_pages += weight;
> > +				delta -= weight;
> > +			} else {
> > +				node_pages += delta;
> > +				delta = 0;
> > +			}
> > +		}
> > +		/* We may not make it all the way around */
> > +		if (!node_pages)
> > +			break;
> > +		/* If an over-allocation would occur, floor it */
> > +		if (node_pages + total_allocated > nr_pages) {
> 
> Why is this possible?
> 

this may have been a paranoid artifact from an early prototype, will
snip and validate.

~Gregory