Message-ID: <alpine.DEB.2.10.1701171421150.15892@vshiva-Udesk>
Date: Tue, 17 Jan 2017 14:30:09 -0800 (PST)
From: Shivappa Vikas <vikas.shivappa@...el.com>
To: Thomas Gleixner <tglx@...utronix.de>
cc: Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
vikas.shivappa@...el.com, davidcc@...gle.com, eranian@...gle.com,
linux-kernel@...r.kernel.org, x86@...nel.org, hpa@...or.com,
mingo@...nel.org, peterz@...radead.org, ravi.v.shankar@...el.com,
tony.luck@...el.com, fenghua.yu@...el.com, andi.kleen@...el.com,
h.peter.anvin@...el.com
Subject: Re: [PATCH 07/12] x86/rdt,cqm: Scheduling support update
On Tue, 17 Jan 2017, Thomas Gleixner wrote:
> On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>> Introduce a scheduling hook finish_arch_pre_lock_switch which is
>> called just after the perf sched_in during context switch. This method
>> handles both cat and cqm sched in scenarios.
>
> Sure, we need yet another special hook. What's wrong with
> finish_arch_post_lock_switch()?
>
> And again. This wants to be a seperate patch to the core code with a proper
> justification for that hook. Dammit. Changelogs are supposed to explain WHY
> not WHAT. How often do I have to explain that again?
Will fix. Will split this into three patches: the sched hook (using
finish_arch_post_lock_switch), the perf sched_in changes, and the actual
write to the MSR.
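For the hook patch, the idea would be to hang off the existing
finish_arch_post_lock_switch() override that the scheduler core already
provides, rather than introducing a new hook. A minimal sketch (the
placement in mmu_context.h and the wiring are only illustrative here, not
the final patch):

/* arch/x86/include/asm/mmu_context.h (illustrative placement) */
#include <asm/intel_pqr_common.h>

/*
 * The scheduler core calls finish_arch_post_lock_switch() from
 * finish_task_switch() and defines it as an empty macro unless the
 * architecture overrides it, so no new core hook is required.
 */
#define finish_arch_post_lock_switch finish_arch_post_lock_switch
static inline void finish_arch_post_lock_switch(void)
{
	intel_rdt_sched_in();	/* stays a no-op unless RDT is enabled */
}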
>
>> The IA32_PQR_ASSOC MSR is used by cat(cache allocation) and cqm and this
>> patch integrates the two msr writes to one. The common sched_in patch
>> checks if the per cpu cache has a different RMID or CLOSid than the task
>> and does the MSR write.
>>
>> During sched_in the task uses the task RMID if the task is monitored or
>> else uses the task's cgroup rmid.
>
> And that's relevant for that patch because it explains the existing
> behaviour of the RMID, right?
>
> Darn, again you create a unreviewable hodgepodge of changes. The whole
> split of the RMID handling into a perf part and the actual RMID update can
> be done as a seperate patch before switching over to the combined
> RMID/COSID update mechanism.
>
>> +DEFINE_STATIC_KEY_FALSE(cqm_enable_key);
>> +
>> /*
>> * Groups of events that have the same target(s), one RMID per group.
>> */
>> @@ -108,7 +103,7 @@ struct sample {
>> * Likewise, an rmid value of -1 is used to indicate "no rmid currently
>> * assigned" and is used as part of the rotation code.
>> */
>> -static inline bool __rmid_valid(u32 rmid)
>> +bool __rmid_valid(u32 rmid)
>
> And once more this becomes global because there is no user outside of cqm.c.
>
>> {
>> if (!rmid || rmid > cqm_max_rmid)
>> return false;
>> @@ -161,7 +156,7 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid, int domain)
>> *
>> * We expect to be called with cache_mutex held.
>> */
>> -static u32 __get_rmid(int domain)
>> +u32 __get_rmid(int domain)
>
> Ditto.
Will fix the unnecessary globals. These should have been removed in this
version; they should have gone away with the removal of continuous
monitoring.
>
>> {
>> struct list_head *cqm_flist;
>> struct cqm_rmid_entry *entry;
>> @@ -368,6 +363,23 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
>> on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
>> }
>>
>> +#ifdef CONFIG_CGROUP_PERF
>> +struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
>> +{
>> + struct cgrp_cqm_info *ccinfo = NULL;
>> + struct perf_cgroup *pcgrp;
>> +
>> + pcgrp = perf_cgroup_from_task(tsk, NULL);
>> +
>> + if (!pcgrp)
>> + return NULL;
>> + else
>> + ccinfo = cgrp_to_cqm_info(pcgrp);
>> +
>> + return ccinfo;
>
> What the heck? Either you do:
>
> struct cgrp_cqm_info *ccinfo = NULL;
> struct perf_cgroup *pcgrp;
>
> pcgrp = perf_cgroup_from_task(tsk, NULL);
> if (pcgrp)
> ccinfo = cgrp_to_cqm_info(pcgrp);
>
> return ccinfo;
Will fix.
>
> or
>
> struct perf_cgroup *pcgrp;
>
> pcgrp = perf_cgroup_from_task(tsk, NULL);
> if (pcgrp)
> return cgrp_to_cqm_info(pcgrp);
> return NULL;
>
> But the above combination does not make any sense at all. Hack at it until
> it compiles and works by chance is not a really good engineering principle.
>
>> +}
>> +#endif
>> +
>> static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
>> {
>> if (rmid != NULL) {
>> @@ -713,26 +725,27 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
>> static void intel_cqm_event_start(struct perf_event *event, int mode)
>> {
>> struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> - u32 rmid;
>>
>> if (!(event->hw.cqm_state & PERF_HES_STOPPED))
>> return;
>>
>> event->hw.cqm_state &= ~PERF_HES_STOPPED;
>>
>> - alloc_needed_pkg_rmid(event->hw.cqm_rmid);
>> -
>> - rmid = event->hw.cqm_rmid[pkg_id];
>> - state->rmid = rmid;
>> - wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
>> + if (is_task_event(event)) {
>> + alloc_needed_pkg_rmid(event->hw.cqm_rmid);
>> + state->next_task_rmid = event->hw.cqm_rmid[pkg_id];
>
> Huch? When is this going to be evaluated? Assume the task is running on a
> CPU in NOHZ full mode in user space w/o ever going through schedule. How is
> that supposed to activate the event ever? Not, AFAICT.
>
>> + }
>> }
>>
>> static void intel_cqm_event_stop(struct perf_event *event, int mode)
>> {
>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> +
>> if (event->hw.cqm_state & PERF_HES_STOPPED)
>> return;
>>
>> event->hw.cqm_state |= PERF_HES_STOPPED;
>> + state->next_task_rmid = 0;
>
> Ditto.
>
>> }
>>
>> static int intel_cqm_event_add(struct perf_event *event, int mode)
>> @@ -1366,6 +1379,8 @@ static int __init intel_cqm_init(void)
>> if (mbm_enabled)
>> pr_info("Intel MBM enabled\n");
>>
>> + static_branch_enable(&cqm_enable_key);
>> +
>> +++ b/arch/x86/include/asm/intel_pqr_common.h
>> @@ -0,0 +1,38 @@
>> +#ifndef _ASM_X86_INTEL_PQR_COMMON_H
>> +#define _ASM_X86_INTEL_PQR_COMMON_H
>> +
>> +#ifdef CONFIG_INTEL_RDT
>> +
>> +#include <linux/jump_label.h>
>> +#include <linux/types.h>
>
> Missing new line to seperate include blocks
>
>> +#include <asm/percpu.h>
>> +#include <asm/msr.h>
>> +#include <asm/intel_rdt_common.h>
>> +
>> +void __intel_rdt_sched_in(void);
>> +
>> +/*
>> + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
>> + *
>> + * Following considerations are made so that this has minimal impact
>> + * on scheduler hot path:
>> + * - This will stay as no-op unless we are running on an Intel SKU
>> + * which supports resource control and we enable by mounting the
>> + * resctrl file system.
>> + * - Caches the per cpu CLOSid values and does the MSR write only
>> + * when a task with a different CLOSid is scheduled in.
>> + */
>> +static inline void intel_rdt_sched_in(void)
>> +{
>> + if (static_branch_likely(&rdt_enable_key) ||
>> + static_branch_unlikely(&cqm_enable_key)) {
>
> Groan. Why do you need TWO keys in order to make this decision? Just
> because the context switch path is not yet bloated enough?
Will fix this to use a common RDT enable key for both. Will also disable
it when no event is being monitored, which is the most likely case.
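Roughly along these lines, so the hot path ends up with a single branch
(the key name and branch hint below are illustrative only, not final):

DECLARE_STATIC_KEY_FALSE(rdt_enable_key);

/*
 * Illustrative sketch: one key covers both allocation (resctrl mounted)
 * and monitoring (at least one cqm/mbm event active), so the context
 * switch path checks a single static branch.
 */
static inline void intel_rdt_sched_in(void)
{
	if (static_branch_unlikely(&rdt_enable_key))
		__intel_rdt_sched_in();
}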
>
>> + __intel_rdt_sched_in();
>> + }
>> +}
>> +
>> +#else
>> +
>> +static inline void intel_rdt_sched_in(void) {}
>> +
>> +#endif
>> +#endif
>> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
>> index 95ce5c8..3b4a099 100644
>> --- a/arch/x86/include/asm/intel_rdt.h
>> +++ b/arch/x86/include/asm/intel_rdt.h
>> @@ -5,7 +5,6 @@
>>
>> #include <linux/kernfs.h>
>> #include <linux/jump_label.h>
>> -
>
> Darn. Your random whitespace changes are annoying. And this one is actually
> wrong. The new line is there on purpose to seperate linux and asm includes
> visually.
Will fix.
>
>> #include <asm/intel_rdt_common.h>
>>
>
>> diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
>> index e11ed5e..544acaa 100644
>> --- a/arch/x86/include/asm/intel_rdt_common.h
>> +++ b/arch/x86/include/asm/intel_rdt_common.h
>> @@ -18,12 +18,23 @@
>> */
>> struct intel_pqr_state {
>> u32 rmid;
>> + u32 next_task_rmid;
>> u32 closid;
>> int rmid_usecnt;
>> };
>>
>> DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>>
>> +u32 __get_rmid(int domain);
>> +bool __rmid_valid(u32 rmid);
>> +void alloc_needed_pkg_rmid(u32 *cqm_rmid);
>> +struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk);
>> +
>> +extern struct cgrp_cqm_info cqm_rootcginfo;
>
> Yet another completely pointless export.
>
>> +DECLARE_STATIC_KEY_FALSE(cqm_enable_key);
>> +DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
>> +
>> /**
>> * struct cgrp_cqm_info - perf_event cgroup metadata for cqm
>> * @cont_mon Continuous monitoring flag
>
>> diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
>> new file mode 100644
>> index 0000000..c3c50cd
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/intel_rdt_common.c
>> @@ -0,0 +1,98 @@
>> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>> +
>> +#include <linux/slab.h>
>> +#include <linux/err.h>
>> +#include <linux/cacheinfo.h>
>> +#include <linux/cpuhotplug.h>
>
> See above.
>
>> +#include <asm/intel-family.h>
>> +#include <asm/intel_rdt.h>
>> +
>> +/*
>> + * The cached intel_pqr_state is strictly per CPU and can never be
>> + * updated from a remote CPU. Both functions which modify the state
>> + * (intel_cqm_event_start and intel_cqm_event_stop) are called with
>> + * interrupts disabled, which is sufficient for the protection.
>
> Just copying comments without actually fixing them is lazy at best. There
> are other functions which modify that state.
>
>> + */
>> +DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
>> +
>> +#define pkg_id topology_physical_package_id(smp_processor_id())
>
> How often are you replicating this macro? Get rid of it completely
> everywhere instead of spreading this crap all over the place.
Will fix.
>
>> +#ifdef CONFIG_INTEL_RDT_M
>> +static inline int get_cgroup_sched_rmid(void)
>> +{
>> +#ifdef CONFIG_CGROUP_PERF
>> + struct cgrp_cqm_info *ccinfo = NULL;
>> +
>> + ccinfo = cqminfo_from_tsk(current);
>> +
>> + if (!ccinfo)
>> + return 0;
>> +
>> + /*
>> + * A cgroup is always monitoring for itself or
>> + * for an ancestor(default is root).
>> + */
>> + if (ccinfo->mon_enabled) {
>> + alloc_needed_pkg_rmid(ccinfo->rmid);
>> + return ccinfo->rmid[pkg_id];
>> + } else {
>> + alloc_needed_pkg_rmid(ccinfo->mfa->rmid);
>> + return ccinfo->mfa->rmid[pkg_id];
>> + }
>> +#endif
>> +
>> + return 0;
>> +}
>> +
>> +static inline int get_sched_in_rmid(void)
>> +{
>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> + u32 rmid = 0;
>> +
>> + rmid = state->next_task_rmid;
>> +
>> + return rmid ? rmid : get_cgroup_sched_rmid();
>
> So this effectively splits the perf sched in work into two parts. One is
> handling the perf sched in mechanics and the other is doing the eventual
> cgroup muck including rmid allocation.
>
> Why on earth can't the perf sched in work update state->rmid right away and
> this extra sched in function just handle CLOSID/RMID updates to the MSR ?
> Because that would actually make sense, right?
>
> I really can;t see how all of this hackery which goes through loops and
> hoops over and over will make context switches faster. I'm inclined to bet
> that the current code, which is halfways sane is faster even with two
> writes.
Yes, will fix this.
My mistake - I forgot to change this when I removed the continuous
monitoring. It should indeed make the sched code simpler, which is
always better.
Originally this was done because perf sched_in calls in all the cgroup
ancestor events for each event in the hierarchy, but we cannot have two
different RMIDs monitored at the same time (in cases where multiple
cgroups in the same hierarchy are monitored). So we just introduce the
_NO_RECURSION flag so that perf calls in only the event and not its
ancestors.
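With that, the hook itself shrinks to comparing the cached values and
doing at most one MSR write; roughly (the field names for the "last
written" values are made up for the example):

void __intel_rdt_sched_in(void)
{
	struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

	/*
	 * perf sched_in (RMID) and the allocation side (CLOSid) update
	 * the wanted values in pqr_state before we get here, so this
	 * only detects a change and does a single MSR write.
	 */
	if (state->rmid != state->cur_rmid ||
	    state->closid != state->cur_closid) {
		state->cur_rmid = state->rmid;
		state->cur_closid = state->closid;
		wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
	}
}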
>
>> +}
>> +#endif
>> +
>> +/*
>> + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
>> + *
>> + * Following considerations are made so that this has minimal impact
>> + * on scheduler hot path:
>> + * - This will stay as no-op unless we are running on an Intel SKU
>> + * which supports resource control and we enable by mounting the
>> + * resctrl file system or it supports resource monitoring.
>> + * - Caches the per cpu CLOSid/RMID values and does the MSR write only
>> + * when a task with a different CLOSid/RMID is scheduled in.
>> + */
>> +void __intel_rdt_sched_in(void)
>> +{
>> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
>> + int closid = 0;
>> + u32 rmid = 0;
>
> So rmid is a u32 and closid is int. Are you rolling a dice when chosing
> data types?
Will fix.
Thanks,
Vikas
>
> Thanks,
>
> tglx
>