Message-ID: <ZdgxaLSBznupVmJK@memverge.com>
Date: Fri, 23 Feb 2024 00:47:20 -0500
From: Gregory Price <gregory.price@...verge.com>
To: "Huang, Ying" <ying.huang@...el.com>
Cc: Gregory Price <gourry.memverge@...il.com>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, hannes@...xchg.org,
dan.j.williams@...el.com, dave.jiang@...el.com
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave
weights
On Thu, Feb 22, 2024 at 03:10:11PM +0800, Huang, Ying wrote:
> Hi, Gregory,
>
> Thanks a lot for working on this!
>
It's worth doing :]
> > + new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
> > + if (!new_bw)
> > + return -ENOMEM;
>
> We only use "node_bw_table" in this function with "default_iwt_lock"
> held. So, we don't need to copy-on-write? Just change in place?
>
I'd originally planned to add a sysfs entry for the data, which would
have added RCU to this, but I realized it's just duplicating the
node/accessX/initiator information, so I'll rip this out and just do
in-place changes.
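Roughly what I have in mind for the respin (just a sketch, assuming
node_bw_table is allocated once at init rather than swapped
copy-on-write):

	mutex_lock(&default_iwt_lock);
	/* no readers of node_bw_table outside this path; update directly */
	node_bw_table[node] = min(coords->read_bandwidth,
				  coords->write_bandwidth);
	/* ... then recompute default_iw_table from node_bw_table ... */
	mutex_unlock(&default_iwt_lock);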
> > + new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
> > + if (!new_iw) {
> > + kfree(new_bw);
> > + return -ENOMEM;
> > + }
> > +
> > + mutex_lock(&default_iwt_lock);
> > + old_bw = node_bw_table;
> > + old_iw = rcu_dereference_protected(default_iw_table,
> > + lockdep_is_held(&default_iwt_lock));
> > +
> > + if (old_bw)
> > + memcpy(new_bw, old_bw, nr_node_ids*sizeof(unsigned long));
> > + new_bw[node] = min(coords->read_bandwidth, coords->write_bandwidth);
>
> We need to compress two bandwidth data into one. The possible choice
> could be,
>
> - min(coords->read_bandwidth, coords->write_bandwidth), that is, your code
>
> - coords->read_bandwidth + coords->write_bandwidth
>
> I don't know which one is better. Do you have some use cases to help to
> determine which one is better?
More generally: are the read_bandwidth and write_bandwidth values even
worth trusting as-is? Should they be combined? Averaged? Should we take
the min, or the max? Should they be floored to some reasonably round
number? These are exactly the comments I'm hoping to garner :].
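To make the candidates concrete, something along these lines (just a
sketch against the variables in the quoted hunk):

	/* Candidate reductions of (read, write) into one per-node number: */
	new_bw[node] = min(coords->read_bandwidth, coords->write_bandwidth);

	/* or the total bidirectional bandwidth: */
	new_bw[node] = coords->read_bandwidth + coords->write_bandwidth;

	/* or floor to something rounder before it ever hits the weights: */
	new_bw[node] = rounddown(min(coords->read_bandwidth,
				     coords->write_bandwidth), 10000);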
I've also considered adding a weighted_interleave/read_write_ratio
sysfs entry that tells the system how to treat the incoming numbers.
This would require us to cache more information, obviously.
I have limited access to hardware, but here is one datum from an Intel
platform w/ a sample CXL memory expander.
# DRAM on node0
cat /sys/bus/node/devices/node0/access0/initiators/*bandwidth
262100 < read
176100 < write
Notice the ~86GB/s difference between read and write, and the trailing
100! That doesn't bode well for a clean reduction :[
# CXL 1.1 device on node2
cat /sys/bus/node/devices/node2/access0/initiators/*bandwidth
60000 < read
60000 < write
These are pretty uneven distributions, and we may want to consider
forcing the numbers to be a little more round - otherwise we're doomed
to just end up with whatever the ~/100 value is. Or we need to come up
with some reduction that gets us down to reasonably small interleave
values.
In this scenario, we end up with:
>>> 60000+176100
236100
>>> 60000/236100
0.25412960609911056
>>> 176100/236100
0.7458703939008895
That turns into 25:74 if you just round down, or 25:75 if you round to
the nearest integer.
The problem is that any heuristic you come up with for rounding out the
bandwidth values is bound to have degenerate results. What happens if I
add another CXL memory expander? What happens with 2DPC? etc.
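A quick throwaway userspace version of the same math (assuming the
min(read, write) reduction), just to show how poorly the raw numbers
squash:

/*
 * Userspace sketch of the weight math above, fed with the two example
 * nodes.  Build with: gcc -o iw iw.c
 */
#include <stdio.h>

static unsigned long gcd(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = a % b;
		a = b;
		b = t;
	}
	return a;
}

int main(void)
{
	unsigned long bw[] = { 176100, 60000 };	/* node0 DRAM, node2 CXL */
	unsigned long iw[2], ttl = 0, g;
	int i;

	for (i = 0; i < 2; i++)
		ttl += bw[i];

	/* floor to a percentage of total bandwidth, minimum weight of 1 */
	for (i = 0; i < 2; i++) {
		iw[i] = 100 * bw[i] / ttl;
		if (!iw[i])
			iw[i] = 1;
	}

	/* try to squash by GCD so the interleave interval stays small */
	g = gcd(iw[0], iw[1]);
	for (i = 0; i < 2; i++)
		iw[i] /= g;

	printf("weights: %lu:%lu\n", iw[0], iw[1]);	/* prints 74:25 */
	return 0;
}

With these inputs the GCD is 1, so the 74:25 split survives untouched -
exactly the degenerate case I'm worried about.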
I wanted to collect some thoughts on this. I'm not sure what the best
"General" approach would be, and we may need some more data from people
with access to more varied hardware.
Maybe worth submitting to LSF/MM for a quick discussion, but I think
we'll need some help figuring this one out.
> > +
> > + /* New recalculate the bandwidth distribution given the new info */
> > + for (i = 0; i < nr_node_ids; i++)
> > + ttl_bw += new_bw[i];
> > +
> > + /* If node is not set or has < 1% of total bw, use minimum value of 1 */
> > + for (i = 0; i < nr_node_ids; i++) {
> > + if (new_bw[i])
> > + new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
> > + else
> > + new_iw[i] = 1;
>
> If we lacks performance data for some node, it will use "1" as default
> weight. It doesn't look like the best solution for me. How about use
> the average available bandwidth to calculate the default weight? Or use
> memory bandwidth of node 0 if performance data is missing?
>
If we lack performance data for a node, it's one of 3 cases:
1) The device did not provide HMAT information
2) We did not integrate that driver into the system yet.
3) The node is not online yet (therefore the data hasn't been reported)
#2 and #3 are not issues; the only real issue is #1.
In this scenario, I'm not sure what to do. We must have a non-0 value
for that device (to avoid div-by-0), but setting an arbitrarily large
value also seems bad.
My thought was that if you are using weighted interleave, you're
probably already pretty confident that your environment is reasonably
sane - i.e. HMAT is providing the values.
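If we did want something smarter than a hard-coded 1, the
average-bandwidth idea you mention might look roughly like this
(untested sketch; note it hands every not-yet-reported node the
average, which may be too generous):

	unsigned long known_ttl = 0, known_cnt = 0, fallback_bw, ttl = 0;

	/* average of the nodes that did report bandwidth */
	for (i = 0; i < nr_node_ids; i++) {
		if (new_bw[i]) {
			known_ttl += new_bw[i];
			known_cnt++;
		}
	}
	fallback_bw = known_cnt ? known_ttl / known_cnt : 1;

	/* substitute the fallback, then convert bandwidth into weights */
	for (i = 0; i < nr_node_ids; i++)
		ttl += new_bw[i] ? new_bw[i] : fallback_bw;
	for (i = 0; i < nr_node_ids; i++) {
		unsigned long bw = new_bw[i] ? new_bw[i] : fallback_bw;

		new_iw[i] = max(100 * bw / ttl, 1UL);
	}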
> > + }
> > + /*
> > + * Now attempt to aggressively reduce the interleave weights by GCD
> > + * We want smaller interleave intervals to have a better distribution
> > + * of memory, even on smaller memory regions. If weights are divisible
> > + * by each other, we can do some quick math to aggresively squash them.
> > + */
> > +reduce:
> > + gcd_val = new_iw[i];
>
> "i" will be "nr_node_ids" in the first loop. Right?
>
Ah, good catch. This should be new_iw[node_being_updated]; will update.
> > - weight = table ? table[nid] : 1;
> > - weight = weight ? weight : 1;
> > + weight = table ? table[nid] : 0;
> > + weight = weight ? weight :
> > + (default_table ? default_table[nid] : 1);
>
> This becomes long. I think that we should define a help function for this.
Maybe? I didn't bother since it's not replicated elsewhere. It does
look similar to the helper function which fetches the node weight, but
that function itself calls rcu_read_lock() since it is called from the
bulk allocator.
I think some more thought could be put into this interaction; this was
just a first pass. It could certainly be improved.
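If you feel strongly about it, the helper might look something like
this (sketch; it never returns 0, so the caller can drop the extra
weight check):

	static u8 iw_node_weight(u8 *table, u8 *default_table, int nid)
	{
		if (table && table[nid])
			return table[nid];
		if (default_table && default_table[nid])
			return default_table[nid];
		return 1;
	}

The caller would still be responsible for dereferencing both tables
under RCU, so unlike the existing weight-fetch helper this stays
lock-free.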
Thanks for the feedback, will chew on it a bit. Let me know your
thoughts on the bandwidth examples above if you have any.
~Gregory