linux-kernel - Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <878r58dt31.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 02 Jan 2024 16:42:42 +0800
From: "Huang, Ying" <ying.huang@...el.com>
To: Gregory Price <gregory.price@...verge.com>
Cc: Gregory Price <gourry.memverge@...il.com>,  <linux-mm@...ck.org>,
  <linux-doc@...r.kernel.org>,  <linux-fsdevel@...r.kernel.org>,
  <linux-kernel@...r.kernel.org>,  <linux-api@...r.kernel.org>,
  <x86@...nel.org>,  <akpm@...ux-foundation.org>,  <arnd@...db.de>,
  <tglx@...utronix.de>,  <luto@...nel.org>,  <mingo@...hat.com>,
  <bp@...en8.de>,  <dave.hansen@...ux.intel.com>,  <hpa@...or.com>,
  <mhocko@...nel.org>,  <tj@...nel.org>,  <corbet@....net>,
  <rakie.kim@...com>,  <hyeongtak.ji@...com>,  <honggyu.kim@...com>,
  <vtavarespetr@...ron.com>,  <peterz@...radead.org>,
  <jgroves@...ron.com>,  <ravis.opensrc@...ron.com>,
  <sthanneeru@...ron.com>,  <emirakhur@...ron.com>,  <Hasan.Maruf@....com>,
  <seungjun.ha@...sung.com>,  Srinivasulu Thanneeru
 <sthanneeru.opensrc@...ron.com>
Subject: Re: [PATCH v5 02/11] mm/mempolicy: introduce
 MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

Gregory Price <gregory.price@...verge.com> writes:

> On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@...il.com> writes:
>> 
>> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> > +{
>> > +	nodemask_t nodemask = pol->nodes;
>> > +	unsigned int target, weight_total = 0;
>> > +	int nid;
>> > +	unsigned char weights[MAX_NUMNODES];
>> 
>> MAX_NUMNODSE could be as large as 1024.  1KB stack space may be too
>> large?
>> 
>
> I've been struggling with a good solution to this.  We need a local copy
> of weights to prevent weights from changing out from under us during
> allocation (which may take quite some time), but it seemed unwise to
> to allocate 1KB heap in this particular path.
>
> Is my concern unfounded?  If so, I can go ahead and add the allocation
> code.

Please take a look at NODEMASK_ALLOC().

>> > +	unsigned char weight;
>> > +
>> > +	barrier();
>> 
>> Memory barrier needs comments.
>> 
>
> Barrier is to stabilize nodemask on the stack, but yes i'll carry the
> comment from interleave_nid into this barrier as well.

Please see below.

>> > +
>> > +	/* first ensure we have a valid nodemask */
>> > +	nid = first_node(nodemask);
>> > +	if (nid == MAX_NUMNODES)
>> > +		return nid;
>> 
>> It appears that this isn't necessary, because we can check whether
>> weight_total == 0 after the next loop.
>> 
>
> fair, will snip.
>
>> > +
>> > +	/* Then collect weights on stack and calculate totals */
>> > +	for_each_node_mask(nid, nodemask) {
>> > +		weight = iw_table[nid];
>> > +		weight_total += weight;
>> > +		weights[nid] = weight;
>> > +	}
>> > +
>> > +	/* Finally, calculate the node offset based on totals */
>> > +	target = (unsigned int)ilx % weight_total;
>> 
>> Why use type casting?
>> 
>
> Artifact of old prototypes, snipped.
>
>> > +
>> > +	/* Stabilize the nodemask on the stack */
>> > +	barrier();
>> 
>> I don't think barrier() is needed to wait for memory operations for
>> stack.  It's usually used for cross-processor memory order.
>>
>
> This is present in the old interleave code.  To the best of my
> understanding, the concern is for mempolicy->nodemask rebinding that can
> occur when cgroups.cpusets.mems_allowed changes.
>
> so we can't iterate over (mempolicy->nodemask), we have to take a local
> copy.
>
> My *best* understanding of the barrier here is to prevent the compiler
> from reordering operations such that it attempts to optimize out the
> local copy (or do lazy-fetch).
>
> It is present in the original interleave code, so I pulled it forward to
> this, but I have not tested whether this is a bit paranoid or not.
>
> from `interleave_nid`:
>
>  /*
>   * The barrier will stabilize the nodemask in a register or on
>   * the stack so that it will stop changing under the code.
>   *
>   * Between first_node() and next_node(), pol->nodes could be changed
>   * by other threads. So we put pol->nodes in a local stack.
>   */
>  barrier();

Got it.  This is kind of READ_ONCE() for nodemask.  To avoid to add
comments all over the place.  Can we implement a wrapper for it?  For
example, memcpy_once().  __read_once_size() in
tools/include/linux/compiler.h can be used as reference.

Because node_weights[] may be changed simultaneously too.  We may need
to consider similar issue for it too.  But RCU seems more appropriate
for node_weights[].

>> > +		/* Otherwise we adjust nr_pages down, and continue from there */
>> > +		rem_pages -= pol->wil.cur_weight;
>> > +		pol->wil.cur_weight = 0;
>> > +		prev_node = node;
>> 
>> If pol->wil.cur_weight == 0, prev_node will be used without being
>> initialized below.
>> 
>
> pol->wil.cur_weight is not used below.
>
>> > +	}
>> > +
>> > +	/* Now we can continue allocating as if from 0 instead of an offset */
>> > +	rounds = rem_pages / weight_total;
>> > +	delta = rem_pages % weight_total;
>> > +	for (i = 0; i < nnodes; i++) {
>> > +		node = next_node_in(prev_node, nodes);
>> > +		weight = weights[node];
>> > +		node_pages = weight * rounds;
>> > +		if (delta) {
>> > +			if (delta > weight) {
>> > +				node_pages += weight;
>> > +				delta -= weight;
>> > +			} else {
>> > +				node_pages += delta;
>> > +				delta = 0;
>> > +			}
>> > +		}
>> > +		/* We may not make it all the way around */
>> > +		if (!node_pages)
>> > +			break;
>> > +		/* If an over-allocation would occur, floor it */
>> > +		if (node_pages + total_allocated > nr_pages) {
>> 
>> Why is this possible?
>> 
>
> this may have been a paranoid artifact from an early prototype, will
> snip and validate.

--
Best Regards,
Huang, Ying