Message-ID: <283b3726-1ecb-4331-9b81-36016ee02767@amd.com>
Date: Mon, 10 Feb 2025 11:27:06 -0600
From: "Moger, Babu" <babu.moger@....com>
To: Reinette Chatre <reinette.chatre@...el.com>, corbet@....net,
tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, tony.luck@...el.com, peternewman@...gle.com
Cc: x86@...nel.org, hpa@...or.com, paulmck@...nel.org,
akpm@...ux-foundation.org, thuth@...hat.com, rostedt@...dmis.org,
xiongwei.song@...driver.com, pawan.kumar.gupta@...ux.intel.com,
daniel.sneddon@...ux.intel.com, jpoimboe@...nel.org, perry.yuan@....com,
sandipan.das@....com, kai.huang@...el.com, xiaoyao.li@...el.com,
seanjc@...gle.com, xin3.li@...el.com, andrew.cooper3@...rix.com,
ebiggers@...gle.com, mario.limonciello@....com, james.morse@....com,
tan.shaopeng@...itsu.com, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, maciej.wieczor-retman@...el.com,
eranian@...gle.com
Subject: Re: [PATCH v11 17/23] x86/resctrl: Auto assign/unassign counters when
mbm_cntr_assign is enabled
Hi Reinette,
On 2/6/25 12:03, Reinette Chatre wrote:
> Hi Babu,
>
> On 1/22/25 12:20 PM, Babu Moger wrote:
>> Assign/unassign counters on resctrl group creation/deletion. Two counters
>> are required per group, one for MBM total event and one for MBM local
>> event.
>>
>> There are a limited number of counters available for assignment. If these
>> counters are exhausted, the kernel will display the error message: "Out of
>> MBM assignable counters". However, it is not necessary to fail the
>> creation of a group due to assignment failures. Users have the flexibility
>> to modify the assignments at a later time.
>>
>> Signed-off-by: Babu Moger <babu.moger@....com>
>> ---
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index b6d188d0f9b7..118b39fbb01e 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -1557,3 +1557,30 @@ int resctrl_unassign_cntr_event(struct rdt_resource *r, struct rdt_mon_domain *d
>>
>> return ret;
>> }
>> +
>> +void mbm_cntr_reset(struct rdt_resource *r)
>> +{
>> + u32 idx_limit = resctrl_arch_system_num_rmid_idx();
>> + struct rdt_mon_domain *dom;
>> +
>> + /*
>> + * Reset the domain counter configuration. Hardware counters
>> + * will reset after switching the monitor mode. So, reset the
>> + * architectural amd non-architectural state so that reading
>
> "amd" -> "and"
Sure.
>
>> + * of hardware counter is not considered as an overflow in the
>> + * next update.
>> + */
>> + if (is_mbm_enabled() && r->mon.mbm_cntr_assignable) {
>> + list_for_each_entry(dom, &r->mon_domains, hdr.list) {
>> + memset(dom->cntr_cfg, 0,
>> + sizeof(*dom->cntr_cfg) * r->mon.num_mbm_cntrs);
>> + if (is_mbm_total_enabled())
>> + memset(dom->mbm_total, 0,
>> + sizeof(struct mbm_state) * idx_limit);
>> + if (is_mbm_local_enabled())
>> + memset(dom->mbm_local, 0,
>> + sizeof(struct mbm_state) * idx_limit);
>> + resctrl_arch_reset_rmid_all(r, dom);
>> + }
>> + }
>> +}
>
> I looked back at the previous versions to better understand how this function
> came about and I do not think it actually solves the problem it aims to solve.
>
> rdtgroup_unassign_cntrs() can fail and when it does the counter is not free'd. That
> leaves a monitoring domain's array with an entry that points to a resource group
> that no longer exists (unless it is the default resource group) since
> rdtgroup_unassign_cntrs() does not check the return and proceeds to remove the
> resource group. mbm_cntr_reset() is called on umount of resctrl but
> rdtgroup_unassign_cntrs() is called on every group remove and those scenarios
> are not handled.
>
> To address this I believe that I need to go back on a previous request to have
> resctrl_arch_config_cntr() return an error code. AMD does not need this and
> it is difficult to predict what will work for MPAM. I originally wanted to be
> flexible here but this appears to be impractical. With a new requirement that
> resctrl_arch_config_cntr() always succeeds the counter will in turn always
> be free'd and not leave dangling pointers. I believe doing so eliminates
> the need for mbm_cntr_reset() as used in this patch. My apologies for the
> misdirection. We can re-evaluate these flows if MPAM needs anything different.
So, the new requirement is to free the counter even if the
resctrl_arch_config_cntr() call fails. That way, after
rdtgroup_unassign_cntrs() returns, the counter is freed and left in a
clean state, and we don't need to call mbm_cntr_reset() at the end to
clear all the entries.
Here is the call sequence:

rdtgroup_unassign_cntrs() -> resctrl_unassign_cntr_event() ->
  resctrl_free_config_cntr() -> resctrl_config_cntr() ->
  resctrl_arch_config_cntr()

So, the only change here is:
/*
 * Unassign and free the counter if assigned, else return success.
 */
static int resctrl_free_config_cntr(struct rdt_resource *r,
				    struct rdt_mon_domain *d,
				    struct rdtgroup *rdtgrp,
				    enum resctrl_event_id evtid)
{
	int cntr_id, ret = 0;

	cntr_id = mbm_cntr_get(r, d, rdtgrp, evtid);
	if (cntr_id < 0)
		return ret;

	/* Unassign and free the counter */
	ret = resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
				  rdtgrp->closid, cntr_id, false);
	mbm_cntr_free(d, cntr_id);

	return ret;
}
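
Going a step further: once resctrl_arch_config_cntr() is required to
always succeed, the error plumbing can go away entirely. A sketch,
assuming resctrl_config_cntr() is also changed to return void (that
signature change is not part of this patch):

	/* Sketch: assumes resctrl_arch_config_cntr() cannot fail. */
	static void resctrl_free_config_cntr(struct rdt_resource *r,
					     struct rdt_mon_domain *d,
					     struct rdtgroup *rdtgrp,
					     enum resctrl_event_id evtid)
	{
		int cntr_id;

		cntr_id = mbm_cntr_get(r, d, rdtgrp, evtid);
		if (cntr_id < 0)
			return;

		/* Clear the counter configuration, then free it. */
		resctrl_config_cntr(r, d, evtid, rdtgrp->mon.rmid,
				    rdtgrp->closid, cntr_id, false);
		mbm_cntr_free(d, cntr_id);
	}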
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index 2b86124c336b..f61f0cd032ef 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -2687,6 +2687,46 @@ static void schemata_list_destroy(void)
>> }
>> }
>>
>> +/*
>> + * Called when a new group is created. If "mbm_cntr_assign" mode is enabled,
>> + * counters are automatically assigned. Each group can accommodate two counters:
>> + * one for the total event and one for the local event. Assignments may fail
>> + * due to the limited number of counters. However, it is not necessary to fail
>> + * the group creation and thus no failure is returned. Users have the option
>> + * to modify the counter assignments after the group has been created.
>> + */
>> +static void rdtgroup_assign_cntrs(struct rdtgroup *rdtgrp)
>> +{
>> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>> + if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>> + return;
>> +
>> + if (is_mbm_total_enabled())
>> + resctrl_assign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
>> +
>> + if (is_mbm_local_enabled())
>> + resctrl_assign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
>> +}
>> +
>> +/*
>> + * Called when a group is deleted. Counters are unassigned if it was in
>> + * assigned state.
>> + */
>> +static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
>> +{
>> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
>> +
>> + if (!resctrl_arch_mbm_cntr_assign_enabled(r))
>> + return;
>> +
>
> It looks to me as though there are a couple of places (rmdir_all_sub(), rdt_kill_sb(),
> and rdtgroup_rmdir_ctrl()) where rdtgroup_unassign_cntrs() could be called on a system that
> does not support monitoring and/or only supports cache allocation monitoring.
>
> In these paths it is only the architecture's resctrl_arch_mbm_cntr_assign_enabled(r) that
> gates the resctrl flow. I think rdtgroup_unassign_cntrs() and to match rdtgroup_assign_cntrs()
> can do with at least a r->mon_capable check.
OK. I will add the following check:

	if (!r->mon_capable || !resctrl_arch_mbm_cntr_assign_enabled(r))
		return;
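
For reference, the helper with the check folded in would look like this
(same body as in the patch, only the first check changes; the matching
change also goes into rdtgroup_assign_cntrs()):

	static void rdtgroup_unassign_cntrs(struct rdtgroup *rdtgrp)
	{
		struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;

		if (!r->mon_capable || !resctrl_arch_mbm_cntr_assign_enabled(r))
			return;

		if (is_mbm_total_enabled())
			resctrl_unassign_cntr_event(r, NULL, rdtgrp,
						    QOS_L3_MBM_TOTAL_EVENT_ID);

		if (is_mbm_local_enabled())
			resctrl_unassign_cntr_event(r, NULL, rdtgrp,
						    QOS_L3_MBM_LOCAL_EVENT_ID);
	}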
>
>> + if (is_mbm_total_enabled())
>> + resctrl_unassign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_TOTAL_EVENT_ID);
>> +
>> + if (is_mbm_local_enabled())
>> + resctrl_unassign_cntr_event(r, NULL, rdtgrp, QOS_L3_MBM_LOCAL_EVENT_ID);
>> +}
>> +
>> static int rdt_get_tree(struct fs_context *fc)
>> {
>> struct rdt_fs_context *ctx = rdt_fc2context(fc);
>> @@ -2741,6 +2781,8 @@ static int rdt_get_tree(struct fs_context *fc)
>> if (ret < 0)
>> goto out_info;
>>
>> + rdtgroup_assign_cntrs(&rdtgroup_default);
>> +
>> ret = mkdir_mondata_all(rdtgroup_default.kn,
>> &rdtgroup_default, &kn_mondata);
>> if (ret < 0)
>> @@ -2779,8 +2821,10 @@ static int rdt_get_tree(struct fs_context *fc)
>> if (resctrl_arch_mon_capable())
>> kernfs_remove(kn_mondata);
>> out_mongrp:
>> - if (resctrl_arch_mon_capable())
>> + if (resctrl_arch_mon_capable()) {
>> + rdtgroup_unassign_cntrs(&rdtgroup_default);
>> kernfs_remove(kn_mongrp);
>> + }
>> out_info:
>> kernfs_remove(kn_info);
>> out_schemata_free:
>> @@ -2956,6 +3000,7 @@ static void free_all_child_rdtgrp(struct rdtgroup *rdtgrp)
>>
>> head = &rdtgrp->mon.crdtgrp_list;
>> list_for_each_entry_safe(sentry, stmp, head, mon.crdtgrp_list) {
>> + rdtgroup_unassign_cntrs(sentry);
>> free_rmid(sentry->closid, sentry->mon.rmid);
>> list_del(&sentry->mon.crdtgrp_list);
>>
>> @@ -2996,6 +3041,8 @@ static void rmdir_all_sub(void)
>> cpumask_or(&rdtgroup_default.cpu_mask,
>> &rdtgroup_default.cpu_mask, &rdtgrp->cpu_mask);
>>
>> + rdtgroup_unassign_cntrs(rdtgrp);
>> +
>> free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>
>> kernfs_remove(rdtgrp->kn);
>> @@ -3027,6 +3074,8 @@ static void rdt_kill_sb(struct super_block *sb)
>> for_each_alloc_capable_rdt_resource(r)
>> reset_all_ctrls(r);
>> rmdir_all_sub();
>> + rdtgroup_unassign_cntrs(&rdtgroup_default);
>> + mbm_cntr_reset(&rdt_resources_all[RDT_RESOURCE_L3].r_resctrl);
>> rdt_pseudo_lock_release();
>> rdtgroup_default.mode = RDT_MODE_SHAREABLE;
>> schemata_list_destroy();
>> @@ -3490,9 +3539,12 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
>> }
>> rdtgrp->mon.rmid = ret;
>>
>> + rdtgroup_assign_cntrs(rdtgrp);
>> +
>> ret = mkdir_mondata_all(rdtgrp->kn, rdtgrp, &rdtgrp->mon.mon_data_kn);
>> if (ret) {
>> rdt_last_cmd_puts("kernfs subdir error\n");
>> + rdtgroup_unassign_cntrs(rdtgrp);
>> free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>> return ret;
>> }
>> @@ -3502,8 +3554,10 @@ static int mkdir_rdt_prepare_rmid_alloc(struct rdtgroup *rdtgrp)
>>
>> static void mkdir_rdt_prepare_rmid_free(struct rdtgroup *rgrp)
>> {
>> - if (resctrl_arch_mon_capable())
>> + if (resctrl_arch_mon_capable()) {
>> + rdtgroup_unassign_cntrs(rgrp);
>> free_rmid(rgrp->closid, rgrp->mon.rmid);
>> + }
>> }
>>
>> static int mkdir_rdt_prepare(struct kernfs_node *parent_kn,
>> @@ -3764,6 +3818,9 @@ static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>> update_closid_rmid(tmpmask, NULL);
>>
>> rdtgrp->flags = RDT_DELETED;
>> +
>> + rdtgroup_unassign_cntrs(rdtgrp);
>> +
>> free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>>
>> /*
>> @@ -3810,6 +3867,8 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
>> cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
>> update_closid_rmid(tmpmask, NULL);
>>
>> + rdtgroup_unassign_cntrs(rdtgrp);
>> +
>> free_rmid(rdtgrp->closid, rdtgrp->mon.rmid);
>> closid_free(rdtgrp->closid);
>>
>
> Reinette
>
--
Thanks
Babu Moger