linux-kernel - Re: [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3

Message-ID: <d261c249-177f-44a5-9715-26ec647b1ae8@intel.com>
Date: Thu, 20 Jun 2024 14:19:46 -0700
From: Reinette Chatre <reinette.chatre@...el.com>
To: Tony Luck <tony.luck@...el.com>, Fenghua Yu <fenghua.yu@...el.com>, Maciej
 Wieczor-Retman <maciej.wieczor-retman@...el.com>, Peter Newman
	<peternewman@...gle.com>, James Morse <james.morse@....com>, Babu Moger
	<babu.moger@....com>, Drew Fustini <dfustini@...libre.com>, Dave Martin
	<Dave.Martin@....com>
CC: <x86@...nel.org>, <linux-kernel@...r.kernel.org>,
	<patches@...ts.linux.dev>
Subject: Re: [PATCH v20 06/18] x86/resctrl: Introduce snc_nodes_per_l3_cache

Hi Tony,

On 6/10/24 11:35 AM, Tony Luck wrote:
> Intel Sub-NUMA Cluster (SNC) is a feature that subdivides the CPU cores
> and memory controllers on a socket into two or more groups. These are
> presented to the operating system as NUMA nodes.
> 
> This may enable some workloads to have slightly lower latency to memory
> as the memory controller(s) in an SNC node are electrically closer to the
> CPU cores on that SNC node. This benefit may be offset by lower bandwidth
> since the memory accesses for each core can only be interleaved between
> the memory controllers on the same SNC node.
> 
> Resctrl monitoring on an Intel system depends upon attaching RMIDs to tasks
> to track L3 cache occupancy and memory bandwidth. There is an MSR that
> controls how the RMIDs are shared between SNC nodes.
> 
> The default mode divides them numerically. E.g. when there are two SNC
> nodes on a socket the lower number half of the RMIDs are given to the
> first node, the remainder to the second node. This would be difficult
> to use with the Linux resctrl interface as specific RMID values assigned
> to resctrl groups are not visible to users.
> 
> RMID sharing mode divides the physical RMIDs evenly between SNC nodes
> but uses a logical RMID in the IA32_PQR_ASSOC MSR. For example a system
> with 200 physical RMIDs (as enumerated by CPUID leaf 0xF) that has two
> SNC nodes per L3 cache instance would have 100 logical RMIDs available
> for Linux to use. A task running on SNC node 0 with RMID 5 would
> accumulate LLC occupancy and MBM bandwidth data in physical RMID 5.
> Another task using RMID 5, but running on SNC node 1 would accumulate
> data in physical RMID 105.
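
For readers following the arithmetic in the example above, here is a minimal userspace sketch of RMID sharing mode. It is not kernel code: `struct snc_model`, `logical_rmids()` and `phys_rmid()` are hypothetical names used only for illustration, and the node argument is assumed to be the index of the SNC node within its L3 cache instance.

```c
#include <assert.h>

/*
 * Userspace model of RMID sharing mode; not kernel code.
 * All names here are hypothetical and chosen for illustration.
 */
struct snc_model {
	unsigned int phys_rmids;   /* physical RMIDs enumerated by CPUID leaf 0xF */
	unsigned int nodes_per_l3; /* SNC nodes sharing one L3 cache instance */
};

/* Logical RMIDs available to software: physical RMIDs split evenly. */
static unsigned int logical_rmids(const struct snc_model *m)
{
	assert(m->nodes_per_l3 > 0);
	return m->phys_rmids / m->nodes_per_l3;
}

/* Physical RMID that accumulates data for @lrmid on SNC node index @node. */
static unsigned int phys_rmid(const struct snc_model *m, unsigned int node,
			      unsigned int lrmid)
{
	assert(node < m->nodes_per_l3);
	assert(lrmid < logical_rmids(m));
	return lrmid + node * logical_rmids(m);
}
```

With 200 physical RMIDs and two SNC nodes per L3 cache, this model yields 100 logical RMIDs, and logical RMID 5 maps to physical RMID 5 on node 0 and physical RMID 105 on node 1, matching the commit message.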
> 
> Even with this renumbering SNC mode requires several changes in resctrl
> behavior for correct operation.
> 
> Add a static global to arch/x86/kernel/cpu/resctrl/monitor.c to indicate
> how many SNC domains share an L3 cache instance.  Initialize this to
> "1". Runtime detection of SNC mode will adjust this value.
> 
> Update all places to take appropriate action when SNC mode is enabled:
> 1) The number of logical RMIDs per L3 cache available for use is the
>     number of physical RMIDs divided by the number of SNC nodes.
> 2) Likewise the "mon_scale" value must be divided by the number of SNC
>     nodes.
> 3) Add a function to convert from logical RMID values (assigned to
>     tasks and loaded into the IA32_PQR_ASSOC MSR on context switch)
>     to physical RMID values to load into IA32_QM_EVTSEL MSR when
>     reading counters on each SNC node.
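
Items 1) and 2) above amount to two integer divisions. The following is a rough userspace sketch under stated assumptions: `struct mon_config` and `scale_for_snc()` are hypothetical names, and the input values in the usage note are made up rather than taken from real hardware.

```c
#include <assert.h>

/* Hypothetical container for the two values the patch adjusts. */
struct mon_config {
	unsigned int num_rmid;  /* logical RMIDs per L3 cache */
	unsigned int mon_scale; /* scaling factor applied to counter values */
};

/*
 * Derive the per-SNC-node values from the enumerated ones.
 * @max_rmid:  highest physical RMID (so the count is max_rmid + 1)
 * @occ_scale: enumerated occupancy scale
 * @snc_nodes: SNC nodes per L3 cache; 1 means SNC is disabled
 */
static struct mon_config scale_for_snc(unsigned int max_rmid,
				       unsigned int occ_scale,
				       unsigned int snc_nodes)
{
	struct mon_config c;

	assert(snc_nodes > 0);
	c.num_rmid = (max_rmid + 1) / snc_nodes;
	c.mon_scale = occ_scale / snc_nodes;
	return c;
}
```

For example, with a made-up `max_rmid` of 199 and `occ_scale` of 65536, two SNC nodes give `num_rmid` 100 and `mon_scale` 32768, while one node leaves both values unchanged. Both divisions truncate, so any remainder is discarded.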
> 
> Signed-off-by: Tony Luck <tony.luck@...el.com>
> ---
>   arch/x86/kernel/cpu/resctrl/monitor.c | 56 ++++++++++++++++++++++++---
>   1 file changed, 50 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 89d7e6fcbaa1..f2fd35d294f2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -97,6 +97,8 @@ unsigned int resctrl_rmid_realloc_limit;
>   
>   #define CF(cf)	((unsigned long)(1048576 * (cf) + 0.5))
>   
> +static int snc_nodes_per_l3_cache = 1;
> +
>   /*
>    * The correction factor table is documented in Documentation/arch/x86/resctrl.rst.
>    * If rmid > rmid threshold, MBM total and local values should be multiplied
> @@ -185,7 +187,43 @@ static inline struct rmid_entry *__rmid_entry(u32 idx)
>   	return entry;
>   }
>   
> -static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
> +/*
> + * When Sub-NUMA Cluster (SNC) mode is not enabled (as indicated by
> + * "snc_nodes_per_l3_cache  == 1") no translation of the RMID value is

(nit: same unnecessary space as in code)

> + * needed. The physical RMID is the same as the logical RMID.
> + *
> + * On a platform with SNC mode enabled, Linux enables RMID sharing mode
> + * via MSR 0xCA0 (see the "RMID Sharing Mode" section in the "Intel
> + * Resource Director Technology Architecture Specification" for a full
> + * description of RMID sharing mode).
> + *
> + * In RMID sharing mode there are fewer "logical RMID" values available
> + * to accumulate data ("physical RMIDs" are divided evenly between SNC
> + * nodes that share an L3 cache). Linux creates an rdt_mon_domain for
> + * each SNC node.
> + *
> + * The value loaded into IA32_PQR_ASSOC is the "logical RMID".
> + *
> + * Data is collected independently on each SNC node and can be retrieved
> + * using the "physical RMID" value computed by this function and loaded
> + * into IA32_QM_EVTSEL. @cpu can be any CPU in the SNC node.
> + *
> + * The scope of the IA32_QM_EVTSEL and IA32_QM_CTR MSRs is at the L3
> + * cache.  So a "physical RMID" may be read from any CPU that shares
> + * the L3 cache with the desired SNC node, not just from a CPU in
> + * the specific SNC node.
> + */
> +static int logical_rmid_to_physical_rmid(int cpu, int lrmid)

It is not clear to me where we are in the discussion about the naming. If
the "logical" vs "physical" naming becomes an issue then perhaps "logical"
can just be dropped, resulting in just "rmid_to_phys_rmid()" (to match
__rmid_read_phys())? I'm OK with what you have here also.

> +{
> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> +
> +	if (snc_nodes_per_l3_cache  == 1)

(nit: extra space here)

> +		return lrmid;
> +
> +	return lrmid + (cpu_to_node(cpu) % snc_nodes_per_l3_cache) * r->num_rmid;
> +}
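
To make the modulo in the function above concrete, here is a small userspace model of the same arithmetic. It is a sketch only: `cpu_to_node()` is replaced by a plain node argument, `snc_index_in_l3()` and `model_phys_rmid()` are hypothetical names, and it assumes NUMA node IDs are numbered consecutively within each L3 cache (for example nodes 0,1 on the first L3 and nodes 2,3 on the second).

```c
#include <assert.h>

/*
 * Userspace model; cpu_to_node() is replaced by a plain node argument.
 * Assumption: NUMA node IDs are numbered consecutively within each L3
 * cache, e.g. nodes 0,1 on the first L3 and nodes 2,3 on the second.
 */
static int snc_index_in_l3(int numa_node, int snc_nodes_per_l3)
{
	assert(numa_node >= 0 && snc_nodes_per_l3 > 0);
	return numa_node % snc_nodes_per_l3;
}

/* Same arithmetic as the patch: offset by num_rmid per SNC node index. */
static int model_phys_rmid(int numa_node, int lrmid, int snc_nodes_per_l3,
			   int num_rmid)
{
	if (snc_nodes_per_l3 == 1)
		return lrmid;

	return lrmid + snc_index_in_l3(numa_node, snc_nodes_per_l3) * num_rmid;
}
```

Under that numbering assumption, nodes 0 and 2 both map to index 0 and nodes 1 and 3 both map to index 1, so the same logical RMID yields the same physical RMID offset on each socket.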
> +
> +static int __rmid_read_phys(u32 prmid, enum resctrl_event_id eventid, u64 *val)
>   {
>   	u64 msr_val;
>   
> @@ -197,7 +235,7 @@ static int __rmid_read(u32 rmid, enum resctrl_event_id eventid, u64 *val)
>   	 * IA32_QM_CTR.Error (bit 63) and IA32_QM_CTR.Unavailable (bit 62)
>   	 * are error bits.
>   	 */
> -	wrmsr(MSR_IA32_QM_EVTSEL, eventid, rmid);
> +	wrmsr(MSR_IA32_QM_EVTSEL, eventid, prmid);
>   	rdmsrl(MSR_IA32_QM_CTR, msr_val);
>   
>   	if (msr_val & RMID_VAL_ERROR)
> @@ -233,14 +271,17 @@ void resctrl_arch_reset_rmid(struct rdt_resource *r, struct rdt_mon_domain *d,
>   			     enum resctrl_event_id eventid)
>   {
>   	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
>   	struct arch_mbm_state *am;
> +	u32 prmid;
>   
>   	am = get_arch_mbm_state(hw_dom, rmid, eventid);
>   	if (am) {
>   		memset(am, 0, sizeof(*am));
>   
> +		prmid = logical_rmid_to_physical_rmid(cpu, rmid);
>   		/* Record any initial, non-zero count value. */
> -		__rmid_read(rmid, eventid, &am->prev_msr);
> +		__rmid_read_phys(prmid, eventid, &am->prev_msr);
>   	}
>   }
>   
> @@ -275,8 +316,10 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>   {
>   	struct rdt_hw_mon_domain *hw_dom = resctrl_to_arch_mon_dom(d);
>   	struct rdt_hw_resource *hw_res = resctrl_to_arch_res(r);
> +	int cpu = cpumask_any(&d->hdr.cpu_mask);
>   	struct arch_mbm_state *am;
>   	u64 msr_val, chunks;
> +	u32 prmid;
>   	int ret;
>   
>   	resctrl_arch_rmid_read_context_check();
> @@ -284,7 +327,8 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_mon_domain *d,
>   	if (!cpumask_test_cpu(smp_processor_id(), &d->hdr.cpu_mask))
>   		return -EINVAL;
>   
> -	ret = __rmid_read(rmid, eventid, &msr_val);
> +	prmid = logical_rmid_to_physical_rmid(cpu, rmid);
> +	ret = __rmid_read_phys(prmid, eventid, &msr_val);
>   	if (ret)
>   		return ret;
>   
> @@ -1022,8 +1066,8 @@ int __init rdt_get_mon_l3_config(struct rdt_resource *r)
>   	int ret;
>   
>   	resctrl_rmid_realloc_limit = boot_cpu_data.x86_cache_size * 1024;
> -	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale;
> -	r->num_rmid = boot_cpu_data.x86_cache_max_rmid + 1;
> +	hw_res->mon_scale = boot_cpu_data.x86_cache_occ_scale / snc_nodes_per_l3_cache;
> +	r->num_rmid = (boot_cpu_data.x86_cache_max_rmid + 1) / snc_nodes_per_l3_cache;
>   	hw_res->mbm_width = MBM_CNTR_WIDTH_BASE;
>   
>   	if (mbm_offset > 0 && mbm_offset <= MBM_CNTR_WIDTH_OFFSET_MAX)

Apart from the two spacing-related nits this looks good to me.

| Reviewed-by: Reinette Chatre <reinette.chatre@...el.com>

Reinette