Date: Wed, 22 May 2024 14:24:12 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Tony Luck <tony.luck@...el.com>, Fenghua Yu <fenghua.yu@...el.com>, Maciej
 Wieczor-Retman <maciej.wieczor-retman@...el.com>, Peter Newman
	<peternewman@...gle.com>, James Morse <james.morse@....com>, Babu Moger
	<babu.moger@....com>, Drew Fustini <dfustini@...libre.com>, Dave Martin
	<Dave.Martin@....com>
CC: <x86@...nel.org>, <linux-kernel@...r.kernel.org>,
	<patches@...ts.linux.dev>
Subject: Re: [PATCH v18 14/17] x86/resctrl: Sum monitor data acrss Sub-NUMA
 (SNC) nodes when needed

Hi Tony,

shortlog: acrss -> across?

On 5/15/2024 3:23 PM, Tony Luck wrote:
> When the sumdomains fields is set in the rmid_read structure, walk
> the list of domains in this resource to find all that share an L3
> cache id (rr->display_id).

"L3 cache id" vs "monitor display scope"?

> 
> Adjust the RMID value based on which SNC domain is being accessed.

Thinking generously, this changelog contains two brief descriptions of code
snippets. There is no context or explanation of what the goal of this change
is nor why it chooses to implement things the way it does ... implementations
that definitely need some explanation.

> 
> Signed-off-by: Tony Luck <tony.luck@...el.com>
> ---
>  arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 20 +++++++++++---
>  arch/x86/kernel/cpu/resctrl/monitor.c     | 33 ++++++++++++++++++++++-
>  2 files changed, 48 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 3b9383612c35..7ab788d47ad3 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -575,15 +575,27 @@ int rdtgroup_mondata_show(struct seq_file *m, void *arg)
>  	resid = md.u.rid;
>  	domid = md.u.domid;
>  	evtid = md.u.evtid;
> -
> +	rr.sumdomains = md.u.sum;
>  	r = &rdt_resources_all[resid].r_resctrl;
> -	hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> -	if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
> +
> +	if (rr.sumdomains) {
> +		rr.display_id = domid;
> +		list_for_each_entry(d, &r->mon_domains, hdr.list) {
> +			if (d->display_id == domid)
> +				goto got_domain;
> +		}
>  		ret = -ENOENT;
>  		goto out;
> +	} else {
> +		hdr = rdt_find_domain(&r->mon_domains, domid, NULL);
> +		if (!hdr || WARN_ON_ONCE(hdr->type != RESCTRL_MON_DOMAIN)) {
> +			ret = -ENOENT;
> +			goto out;
> +		}
> +		d = container_of(hdr, struct rdt_mon_domain, hdr);
>  	}
> -	d = container_of(hdr, struct rdt_mon_domain, hdr);
>  
> +got_domain:
>  	mon_event_read(&rr, r, d, rdtgrp, evtid, false);

This looks more like "wedging things until it works". In the "sumdomains" case
it just picks the first matching domain without any explanation of why that
would be ok. The only reason mon_event_read() needs the domain is to
determine which CPUs it should read the counters from, and this code looks
written to take advantage of just that. It is a hack based on knowledge of
mon_event_read() internals: get the reader to run on a CPU that works for SNC,
where another quirk awaits to override the domain when the counter is
_actually_ read so that the correct architectural state is obtained.

I was expecting here to at least see some documentation/comments
to explain why the code behaves the way it does. Optimistically it can
be documented as an "optimization" for the sumdomains case that is only
used by SNC, where it is ok to read counters from any domain in the
"monitor display scope".

Apart from the documentation, I do not think this code should wedge itself
in like this.

To make it obvious what this code does I think the non-SNC case should
set rr.d and mon_event_read() should take a CPU mask as a parameter
instead of a struct rdt_domain. In the SNC case rr.d will then be NULL,
making the code flow easier to follow.
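Something along these lines (an untested sketch only, to show the shape of
the suggested interface; parameter types are carried over from the current
call site and may need adjusting):

```c
/*
 * Untested sketch: callers pass the CPU mask to read from instead of
 * the domain; rr.d stays NULL in the SNC "sumdomains" case.
 */
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
		    const struct cpumask *cpu_mask, struct rdtgroup *rdtgrp,
		    enum resctrl_event_id evtid, bool first);
```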

>  
>  	if (rr.err == -EIO)
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 0f66825a1ac9..668d2fdf58cd 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -546,7 +546,7 @@ static struct mbm_state *get_mbm_state(struct rdt_mon_domain *d, u32 closid,
>  	}
>  }
>  
> -static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> +static int ___mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
>  {
>  	struct mbm_state *m;
>  	u64 tval = 0;
> @@ -569,6 +569,37 @@ static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
>  	return 0;
>  }
>  
> +static u32 get_node_rmid(struct rdt_resource *r, struct rdt_mon_domain *d, u32 rmid)
> +{
> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
> +
> +	return rmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;

Is this missing an "if (snc_nodes_per_l3_cache > 1)" check?

I do not think this belongs in resctrl fs code, though. Should this algorithm
be forced on all architectures? It seems more appropriate for the
arch-specific code.

> +}
> +
> +static int __mon_event_count(u32 closid, u32 rmid, struct rmid_read *rr)
> +{
> +	struct rdt_mon_domain *d;
> +	u32 node_rmid;
> +	int ret = 0;
> +
> +	if (!rr->sumdomains) {
> +		node_rmid = get_node_rmid(rr->r, rr->d, rmid);
> +		return ___mon_event_count(closid, node_rmid, rr);
> +	}
> +

/*
 * rr->sumdomains is only set by SNC mode, where the event
 * counters of a monitoring domain can be read from a CPU belonging
 * to a different monitoring domain that shares the same monitoring
 * display domain. Optimize counter reads when the values need to
 * be summed by reading the counters of several domains from the
 * same CPU instead of sending an IPI to one CPU of each monitoring
 * domain.
 */

Please feel free to improve this, but please do help folks trying to
understand what this code does.

> +	list_for_each_entry(d, &rr->r->mon_domains, hdr.list) {
> +		if (d->display_id != rr->display_id)
> +			continue;
> +		rr->d = d;
> +		node_rmid = get_node_rmid(rr->r, d, rmid);
> +		ret = ___mon_event_count(closid, node_rmid, rr);
> +		if (ret)
> +			break;
> +	}
> +
> +	return ret;
> +}
> +
>  /*
>   * mbm_bw_count() - Update bw count from values previously read by
>   *		    __mon_event_count().

Reinette
