Message-ID: <alpine.DEB.2.20.1701171509310.3495@nanos>
Date: Tue, 17 Jan 2017 22:58:38 +0100 (CET)
From: Thomas Gleixner <tglx@...utronix.de>
To: Vikas Shivappa <vikas.shivappa@...ux.intel.com>
cc: vikas.shivappa@...el.com, davidcc@...gle.com, eranian@...gle.com,
linux-kernel@...r.kernel.org, x86@...nel.org, hpa@...or.com,
mingo@...nel.org, peterz@...radead.org, ravi.v.shankar@...el.com,
tony.luck@...el.com, fenghua.yu@...el.com, andi.kleen@...el.com,
h.peter.anvin@...el.com
Subject: Re: [PATCH 07/12] x86/rdt,cqm: Scheduling support update
On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> Introduce a scheduling hook finish_arch_pre_lock_switch which is
> called just after the perf sched_in during context switch. This method
> handles both cat and cqm sched in scenarios.
Sure, we need yet another special hook. What's wrong with
finish_arch_post_lock_switch()?
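The core already calls finish_arch_post_lock_switch() from finish_task_switch();
x86 merely has to provide it, the same way arm and s390 do. Untested sketch:

/* arch/x86/include/asm/mmu_context.h */
#include <asm/intel_pqr_common.h>

static inline void finish_arch_post_lock_switch(void)
{
        intel_rdt_sched_in();
}
#define finish_arch_post_lock_switch finish_arch_post_lock_switch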
And again. This wants to be a separate patch to the core code with a proper
justification for that hook. Dammit. Changelogs are supposed to explain WHY,
not WHAT. How often do I have to explain that again?
> The IA32_PQR_ASSOC MSR is used by cat(cache allocation) and cqm and this
> patch integrates the two msr writes to one. The common sched_in patch
> checks if the per cpu cache has a different RMID or CLOSid than the task
> and does the MSR write.
>
> During sched_in the task uses the task RMID if the task is monitored or
> else uses the task's cgroup rmid.
And that's relevant for that patch because it explains the existing
behaviour of the RMID, right?
Darn, again you create an unreviewable hodgepodge of changes. The whole
split of the RMID handling into a perf part and the actual RMID update can
be done as a separate patch before switching over to the combined
RMID/CLOSID update mechanism.
> +DEFINE_STATIC_KEY_FALSE(cqm_enable_key);
> +
> /*
> * Groups of events that have the same target(s), one RMID per group.
> */
> @@ -108,7 +103,7 @@ struct sample {
> * Likewise, an rmid value of -1 is used to indicate "no rmid currently
> * assigned" and is used as part of the rotation code.
> */
> -static inline bool __rmid_valid(u32 rmid)
> +bool __rmid_valid(u32 rmid)
And once more this becomes global even though there is no user outside of cqm.c.
> {
> if (!rmid || rmid > cqm_max_rmid)
> return false;
> @@ -161,7 +156,7 @@ static inline struct cqm_rmid_entry *__rmid_entry(u32 rmid, int domain)
> *
> * We expect to be called with cache_mutex held.
> */
> -static u32 __get_rmid(int domain)
> +u32 __get_rmid(int domain)
Ditto.
> {
> struct list_head *cqm_flist;
> struct cqm_rmid_entry *entry;
> @@ -368,6 +363,23 @@ static void init_mbm_sample(u32 *rmid, u32 evt_type)
> on_each_cpu_mask(&cqm_cpumask, __intel_mbm_event_init, &rr, 1);
> }
>
> +#ifdef CONFIG_CGROUP_PERF
> +struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk)
> +{
> + struct cgrp_cqm_info *ccinfo = NULL;
> + struct perf_cgroup *pcgrp;
> +
> + pcgrp = perf_cgroup_from_task(tsk, NULL);
> +
> + if (!pcgrp)
> + return NULL;
> + else
> + ccinfo = cgrp_to_cqm_info(pcgrp);
> +
> + return ccinfo;
What the heck? Either you do:
        struct cgrp_cqm_info *ccinfo = NULL;
        struct perf_cgroup *pcgrp;

        pcgrp = perf_cgroup_from_task(tsk, NULL);
        if (pcgrp)
                ccinfo = cgrp_to_cqm_info(pcgrp);

        return ccinfo;
or
        struct perf_cgroup *pcgrp;

        pcgrp = perf_cgroup_from_task(tsk, NULL);
        if (pcgrp)
                return cgrp_to_cqm_info(pcgrp);

        return NULL;
But the above combination does not make any sense at all. Hacking at it until
it compiles and works by chance is not a good engineering principle.
> +}
> +#endif
> +
> static inline void cqm_enable_mon(struct cgrp_cqm_info *cqm_info, u32 *rmid)
> {
> if (rmid != NULL) {
> @@ -713,26 +725,27 @@ void alloc_needed_pkg_rmid(u32 *cqm_rmid)
> static void intel_cqm_event_start(struct perf_event *event, int mode)
> {
> struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> - u32 rmid;
>
> if (!(event->hw.cqm_state & PERF_HES_STOPPED))
> return;
>
> event->hw.cqm_state &= ~PERF_HES_STOPPED;
>
> - alloc_needed_pkg_rmid(event->hw.cqm_rmid);
> -
> - rmid = event->hw.cqm_rmid[pkg_id];
> - state->rmid = rmid;
> - wrmsr(MSR_IA32_PQR_ASSOC, rmid, state->closid);
> + if (is_task_event(event)) {
> + alloc_needed_pkg_rmid(event->hw.cqm_rmid);
> + state->next_task_rmid = event->hw.cqm_rmid[pkg_id];
Huch? When is this going to be evaluated? Assume the task is running on a
CPU in NOHZ full mode in user space w/o ever going through schedule. How is
that supposed to activate the event ever? Not, AFAICT.
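IOW, the start callback has to do the MSR update itself, roughly what the
removed lines above did (untested):

        alloc_needed_pkg_rmid(event->hw.cqm_rmid);
        state->rmid = event->hw.cqm_rmid[pkg_id];
        wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);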
> + }
> }
>
> static void intel_cqm_event_stop(struct perf_event *event, int mode)
> {
> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +
> if (event->hw.cqm_state & PERF_HES_STOPPED)
> return;
>
> event->hw.cqm_state |= PERF_HES_STOPPED;
> + state->next_task_rmid = 0;
Ditto.
> }
>
> static int intel_cqm_event_add(struct perf_event *event, int mode)
> @@ -1366,6 +1379,8 @@ static int __init intel_cqm_init(void)
> if (mbm_enabled)
> pr_info("Intel MBM enabled\n");
>
> + static_branch_enable(&cqm_enable_key);
> +
> +++ b/arch/x86/include/asm/intel_pqr_common.h
> @@ -0,0 +1,38 @@
> +#ifndef _ASM_X86_INTEL_PQR_COMMON_H
> +#define _ASM_X86_INTEL_PQR_COMMON_H
> +
> +#ifdef CONFIG_INTEL_RDT
> +
> +#include <linux/jump_label.h>
> +#include <linux/types.h>
Missing newline to separate the include blocks.
> +#include <asm/percpu.h>
> +#include <asm/msr.h>
> +#include <asm/intel_rdt_common.h>
> +
> +void __intel_rdt_sched_in(void);
> +
> +/*
> + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> + *
> + * Following considerations are made so that this has minimal impact
> + * on scheduler hot path:
> + * - This will stay as no-op unless we are running on an Intel SKU
> + * which supports resource control and we enable by mounting the
> + * resctrl file system.
> + * - Caches the per cpu CLOSid values and does the MSR write only
> + * when a task with a different CLOSid is scheduled in.
> + */
> +static inline void intel_rdt_sched_in(void)
> +{
> + if (static_branch_likely(&rdt_enable_key) ||
> + static_branch_unlikely(&cqm_enable_key)) {
Groan. Why do you need TWO keys in order to make this decision? Just
because the context switch path is not yet bloated enough?
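One key, enabled from both the resctrl mount path and intel_cqm_init(), would
do the job (untested):

        if (static_branch_unlikely(&rdt_enable_key))
                __intel_rdt_sched_in();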
> + __intel_rdt_sched_in();
> + }
> +}
> +
> +#else
> +
> +static inline void intel_rdt_sched_in(void) {}
> +
> +#endif
> +#endif
> diff --git a/arch/x86/include/asm/intel_rdt.h b/arch/x86/include/asm/intel_rdt.h
> index 95ce5c8..3b4a099 100644
> --- a/arch/x86/include/asm/intel_rdt.h
> +++ b/arch/x86/include/asm/intel_rdt.h
> @@ -5,7 +5,6 @@
>
> #include <linux/kernfs.h>
> #include <linux/jump_label.h>
> -
Darn. Your random whitespace changes are annoying. And this one is actually
wrong. The newline is there on purpose to separate the linux and asm includes
visually.
> #include <asm/intel_rdt_common.h>
>
> diff --git a/arch/x86/include/asm/intel_rdt_common.h b/arch/x86/include/asm/intel_rdt_common.h
> index e11ed5e..544acaa 100644
> --- a/arch/x86/include/asm/intel_rdt_common.h
> +++ b/arch/x86/include/asm/intel_rdt_common.h
> @@ -18,12 +18,23 @@
> */
> struct intel_pqr_state {
> u32 rmid;
> + u32 next_task_rmid;
> u32 closid;
> int rmid_usecnt;
> };
>
> DECLARE_PER_CPU(struct intel_pqr_state, pqr_state);
>
> +u32 __get_rmid(int domain);
> +bool __rmid_valid(u32 rmid);
> +void alloc_needed_pkg_rmid(u32 *cqm_rmid);
> +struct cgrp_cqm_info *cqminfo_from_tsk(struct task_struct *tsk);
> +
> +extern struct cgrp_cqm_info cqm_rootcginfo;
Yet another completely pointless export.
> +DECLARE_STATIC_KEY_FALSE(cqm_enable_key);
> +DECLARE_STATIC_KEY_FALSE(rdt_enable_key);
> +
> /**
> * struct cgrp_cqm_info - perf_event cgroup metadata for cqm
> * @cont_mon Continuous monitoring flag
> diff --git a/arch/x86/kernel/cpu/intel_rdt_common.c b/arch/x86/kernel/cpu/intel_rdt_common.c
> new file mode 100644
> index 0000000..c3c50cd
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/intel_rdt_common.c
> @@ -0,0 +1,98 @@
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/slab.h>
> +#include <linux/err.h>
> +#include <linux/cacheinfo.h>
> +#include <linux/cpuhotplug.h>
See above.
> +#include <asm/intel-family.h>
> +#include <asm/intel_rdt.h>
> +
> +/*
> + * The cached intel_pqr_state is strictly per CPU and can never be
> + * updated from a remote CPU. Both functions which modify the state
> + * (intel_cqm_event_start and intel_cqm_event_stop) are called with
> + * interrupts disabled, which is sufficient for the protection.
Just copying comments without actually fixing them is lazy at best. There
are other functions which modify that state.
> + */
> +DEFINE_PER_CPU(struct intel_pqr_state, pqr_state);
> +
> +#define pkg_id topology_physical_package_id(smp_processor_id())
How often are you replicating this macro? Get rid of it completely
everywhere instead of spreading this crap all over the place.
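If you really want a shorthand, make it a single inline helper in the common
header, e.g. (untested, rdt_pkg_id() is a made up name):

static inline int rdt_pkg_id(void)
{
        return topology_physical_package_id(smp_processor_id());
}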
> +#ifdef CONFIG_INTEL_RDT_M
> +static inline int get_cgroup_sched_rmid(void)
> +{
> +#ifdef CONFIG_CGROUP_PERF
> + struct cgrp_cqm_info *ccinfo = NULL;
> +
> + ccinfo = cqminfo_from_tsk(current);
> +
> + if (!ccinfo)
> + return 0;
> +
> + /*
> + * A cgroup is always monitoring for itself or
> + * for an ancestor(default is root).
> + */
> + if (ccinfo->mon_enabled) {
> + alloc_needed_pkg_rmid(ccinfo->rmid);
> + return ccinfo->rmid[pkg_id];
> + } else {
> + alloc_needed_pkg_rmid(ccinfo->mfa->rmid);
> + return ccinfo->mfa->rmid[pkg_id];
> + }
> +#endif
> +
> + return 0;
> +}
> +
> +static inline int get_sched_in_rmid(void)
> +{
> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> + u32 rmid = 0;
> +
> + rmid = state->next_task_rmid;
> +
> + return rmid ? rmid : get_cgroup_sched_rmid();
So this effectively splits the perf sched in work into two parts. One is
handling the perf sched in mechanics and the other is doing the eventual
cgroup muck including rmid allocation.
Why on earth can't the perf sched in work update state->rmid right away and
this extra sched in function just handle the CLOSID/RMID updates to the MSR?
Because that would actually make sense, right?
I really can't see how all of this hackery, which jumps through hoops over
and over, will make context switches faster. I'm inclined to bet that the
current code, which is halfway sane, is faster even with two writes.
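IOW, something like this (untested; cur_rmid/cur_closid are made up fields
caching the values last written to the MSR):

void __intel_rdt_sched_in(void)
{
        struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);

        /* perf sched in and resctrl have already stored the new values */
        if (state->rmid == state->cur_rmid &&
            state->closid == state->cur_closid)
                return;

        state->cur_rmid = state->rmid;
        state->cur_closid = state->closid;
        wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, state->closid);
}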
> +}
> +#endif
> +
> +/*
> + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> + *
> + * Following considerations are made so that this has minimal impact
> + * on scheduler hot path:
> + * - This will stay as no-op unless we are running on an Intel SKU
> + * which supports resource control and we enable by mounting the
> + * resctrl file system or it supports resource monitoring.
> + * - Caches the per cpu CLOSid/RMID values and does the MSR write only
> + * when a task with a different CLOSid/RMID is scheduled in.
> + */
> +void __intel_rdt_sched_in(void)
> +{
> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> + int closid = 0;
> + u32 rmid = 0;
So rmid is a u32 and closid is an int. Are you rolling dice when choosing
data types?
Thanks,
tglx