Message-ID: <alpine.DEB.2.10.1411211323520.30781@vshiva-Udesk>
Date: Fri, 21 Nov 2014 13:25:36 -0800 (PST)
From: Vikas Shivappa <vikas.shivappa@...el.com>
To: Vikas Shivappa <vikas.shivappa@...ux.intel.com>
cc: linux-kernel@...r.kernel.org, vikas.shivappa@...el.com,
hpa@...or.com, tglx@...utronix.de, mingo@...nel.org, tj@...nel.org,
Matt Fleming <matt.fleming@...el.com>,
"Auld, Will" <will.auld@...el.com>, peterz@...radead.org
Subject: Re: [PATCH] x86: Intel Cache Allocation Technology support
Correcting email address for Matt.
On Wed, 19 Nov 2014, Vikas Shivappa wrote:
> What is Cache Allocation Technology (CAT)
> -----------------------------------------
>
> Cache Allocation Technology provides a way for software (OS/VMM) to
> restrict cache allocation to a defined 'subset' of the cache, which may
> overlap with other 'subsets'. This feature is used when allocating a
> line in the cache, i.e. when pulling new data into the cache. The
> hardware is programmed via MSRs.
>
> The different cache subsets are identified by a CLOS (class of
> service) identifier, and each CLOS has a CBM (cache bit mask). The CBM
> is a contiguous set of bits which defines the amount of cache resource
> available to each 'subset'.
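> 
> As a purely illustrative sketch (not part of the patch), the per-CLOS
> programming boils down to writing the CBM into an MSR, assuming the
> IA32_L3_CBM_BASE (0xc90) layout used later in this patch:
> 
> 	/* Hypothetical helper: each CLOS has its own CBM MSR located at
> 	 * IA32_L3_CBM_BASE + closid.
> 	 */
> 	#define IA32_L3_CBM_BASE 0xc90
> 
> 	static void set_clos_cbm(unsigned int closid, u64 cbm)
> 	{
> 		wrmsrl(IA32_L3_CBM_BASE + closid, cbm);
> 	}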
>
> Why is CAT (cache allocation technology) needed
> ------------------------------------------------
>
> CAT enables more cache resources to be made available to
> higher-priority applications, based on guidance from the execution
> environment.
>
> The architecture also allows dynamically changing these subsets during
> runtime to further optimize the performance of the higher priority
> application with minimal degradation to the low priority app.
> Additionally, resources can be rebalanced for system throughput benefit.
>
> This technique may be useful in managing large computer systems with a
> large LLC, for example large servers running instances of web servers
> or database servers. In such complex systems, these subsets can be used
> for more careful placement of the available cache resources.
>
> The CAT kernel patch provides a basic kernel framework for users to
> implement such cache subsets.
>
> Kernel Implementation
> ---------------------
>
> This patch implements a cgroup subsystem to support cache allocation.
> Each cgroup has a CLOSid <-> CBM (cache bit mask) mapping. A CLOS
> (class of service) is represented by a CLOSid. The CLOSid is internal
> to the kernel and not exposed to the user. Each cgroup has one CBM and
> represents one cache 'subset'.
>
> The cgroup follows the cgroup hierarchy; mkdir and adding tasks to the
> cgroup never fail. When a child cgroup is created, it inherits the
> CLOSid and the CBM from its parent. When a user changes the default
> CBM for a cgroup, a new CLOSid may be allocated if the CBM was not
> used before. Changing the 'cbm' may fail with -ENOSPC once the
> kernel runs out of CLOSids it can support.
> Users can create as many cgroups as they want, but the number of
> different CBMs in use at the same time is restricted by the maximum
> number of CLOSids (multiple cgroups can have the same CBM).
> The kernel maintains a CLOSid <-> CBM mapping which keeps a reference
> count of the cgroups using each CLOSid, as sketched below.
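> 
> A rough, illustrative sketch of that bookkeeping (the patch below keeps
> the same idea in struct closcbm_map):
> 
> 	/* One entry per CLOSid, shared by all cgroups that use the same
> 	 * CBM; a CLOSid becomes free again when ref drops to zero.
> 	 */
> 	struct closcbm_map {
> 		unsigned long cbm;	/* cache bit mask for this CLOS */
> 		unsigned int ref;	/* cgroups currently using this CLOSid */
> 	};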
>
> The tasks in the cgroup are allowed to fill the part of the LLC
> represented by the cgroup's 'cbm' file.
>
> The root directory has all available bits set in its 'cbm' file by
> default.
>
> Assignment of CBM, CLOS
> ---------------------------------
>
> The 'cbm' needs to be a subset of the parent node's 'cbm'. Any
> contiguous subset of these bits (with a minimum of 2 bits) may be set
> to indicate the cache mapping desired. The 'cbm' between two directories
> can overlap. The 'cbm' represents the cache 'subset' of the CAT cgroup.
> For example, on a system with a maximum cbm of 16 bits, if the directory
> has the least significant 4 bits set in its 'cbm' file (meaning the
> 'cbm' is 0xf), it is allocated the right quarter of the last-level
> cache, which means the tasks belonging to this CAT cgroup can fill only
> the right quarter of the cache. If it has the most significant 8 bits
> set, it is allocated the left half of the cache (8 bits out of 16
> represents 50%).
>
> The cache portion defined in the CBM file is available for all tasks
> within the cgroup to fill, and these tasks are not allowed to allocate
> space in other parts of the cache.
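> 
> For reference, one way to express the validity checks implied above (at
> least two bits set, and the set bits contiguous); this is a sketch, not
> the patch's exact helpers:
> 
> 	/* Illustrative only: a CBM must have at least two bits set and
> 	 * the set bits must form one contiguous run.
> 	 */
> 	static bool cbm_is_valid(unsigned long cbm)
> 	{
> 		if (hweight_long(cbm) < 2)
> 			return false;
> 		/* Adding the lowest set bit to a contiguous run carries
> 		 * past the whole run, so ANDing with cbm leaves zero.
> 		 */
> 		return ((cbm + (cbm & -cbm)) & cbm) == 0;
> 	}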
>
> Scheduling and Context Switch
> ------------------------------
>
> During a context switch, the kernel writes the CLOSid (internally
> maintained by the kernel) of the cgroup to which the incoming task
> belongs into the CPU's IA32_PQR_ASSOC MSR.
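> 
> In rough pseudocode (the real hook, cacheqe_sched_in(), is in the patch
> below; task_closid() here is a hypothetical lookup of the task's
> cgroup CLOSid):
> 
> 	/* Sketch of the sched-in hook: only the upper half of
> 	 * IA32_PQR_ASSOC carries the CLOSid, and it is rewritten only
> 	 * when the incoming task belongs to a different CLOS.
> 	 */
> 	static void sched_in_update_pqr(struct task_struct *next)
> 	{
> 		unsigned int lo, hi;
> 
> 		rdmsr(IA32_PQR_ASSOC, lo, hi);
> 		if (hi != task_closid(next))
> 			wrmsr(IA32_PQR_ASSOC, lo, task_closid(next));
> 	}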
>
> Reviewed-by: Matt Fleming <matt.fleming@...el.com>
> Tested-by: Priya Autee <priya.v.autee@...el.com>
> Signed-off-by: Vikas Shivappa <vikas.shivappa@...ux.intel.com>
> ---
> arch/x86/include/asm/cacheqe.h | 144 +++++++++++
> arch/x86/include/asm/cpufeature.h | 4 +
> arch/x86/include/asm/processor.h | 5 +-
> arch/x86/kernel/cpu/Makefile | 5 +
> arch/x86/kernel/cpu/cacheqe.c | 487 ++++++++++++++++++++++++++++++++++++++
> arch/x86/kernel/cpu/common.c | 21 ++
> include/linux/cgroup_subsys.h | 5 +
> init/Kconfig | 22 ++
> kernel/sched/core.c | 4 +-
> kernel/sched/sched.h | 24 ++
> 10 files changed, 718 insertions(+), 3 deletions(-)
> create mode 100644 arch/x86/include/asm/cacheqe.h
> create mode 100644 arch/x86/kernel/cpu/cacheqe.c
>
> diff --git a/arch/x86/include/asm/cacheqe.h b/arch/x86/include/asm/cacheqe.h
> new file mode 100644
> index 0000000..91d175e
> --- /dev/null
> +++ b/arch/x86/include/asm/cacheqe.h
> @@ -0,0 +1,144 @@
> +#ifndef _CACHEQE_H_
> +#define _CACHEQE_H_
> +
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +#include <linux/percpu.h>
> +#include <linux/spinlock.h>
> +#include <linux/cpumask.h>
> +#include <linux/seq_file.h>
> +#include <linux/rcupdate.h>
> +#include <linux/kernel_stat.h>
> +#include <linux/err.h>
> +
> +#ifdef CONFIG_CGROUP_CACHEQE
> +
> +#define IA32_PQR_ASSOC 0xc8f
> +#define IA32_PQR_MASK(x) ((u64)(x) << 32)
> +
> +/* maximum possible cbm length */
> +#define MAX_CBM_LENGTH 32
> +
> +#define IA32_CBMMAX_MASK(x) (0xffffffff & ~(((u64)1 << (x)) - 1))
> +
> +#define IA32_CBM_MASK 0xffffffff
> +#define IA32_L3_CBM_BASE 0xc90
> +#define CQECBMMSR(x) (IA32_L3_CBM_BASE + x)
> +
> +#ifdef CONFIG_CACHEQE_DEBUG
> +#define CQE_DEBUG(X) do { pr_info X; } while (0)
> +#else
> +#define CQE_DEBUG(X)
> +#endif
> +
> +extern bool cqe_genable;
> +
> +struct cacheqe_subsys_info {
> + unsigned long *closmap;
> +};
> +
> +struct cacheqe {
> + struct cgroup_subsys_state css;
> +
> + /* class of service for the group*/
> + unsigned int clos;
> + /* corresponding cache bit mask*/
> + unsigned long *cbm;
> +
> +};
> +
> +struct closcbm_map {
> + unsigned long cbm;
> + unsigned int ref;
> +};
> +
> +extern struct cacheqe root_cqe_group;
> +
> +/*
> + * Return cacheqos group corresponding to this container.
> + */
> +static inline struct cacheqe *css_cacheqe(struct cgroup_subsys_state *css)
> +{
> + return css ? container_of(css, struct cacheqe, css) : NULL;
> +}
> +
> +static inline struct cacheqe *parent_cqe(struct cacheqe *cq)
> +{
> + return css_cacheqe(cq->css.parent);
> +}
> +
> +/*
> + * Return cacheqe group to which this task belongs.
> + */
> +static inline struct cacheqe *task_cacheqe(struct task_struct *task)
> +{
> + return css_cacheqe(task_css(task, cacheqe_cgrp_id));
> +}
> +
> +static inline void cacheqe_sched_in(struct task_struct *task)
> +{
> + struct cacheqe *cq;
> + unsigned int clos;
> + unsigned int l, h;
> +
> + if (!cqe_genable)
> + return;
> +
> + rdmsr(IA32_PQR_ASSOC, l, h);
> +
> + rcu_read_lock();
> + cq = task_cacheqe(task);
> +
> + if (cq == NULL || cq->clos == h) {
> + rcu_read_unlock();
> + return;
> + }
> +
> + clos = cq->clos;
> +
> + /*
> + * After finding the cacheqe of the task, write the PQR for the CPU.
> + * We assume the current core is the one the task is scheduled on.
> + * In unified scheduling, write the PQR each time.
> + */
> + wrmsr(IA32_PQR_ASSOC, l, clos);
> + rcu_read_unlock();
> +
> + CQE_DEBUG(("schedule in clos :0x%x,task cpu:%u, currcpu: %u,pid:%u\n",
> + clos, task_cpu(task), smp_processor_id(), task->pid));
> +
> +}
> +
> +static inline void cacheqe_sched_out(struct task_struct *task)
> +{
> + unsigned int l, h;
> +
> + if (!cqe_genable)
> + return;
> +
> + rdmsr(IA32_PQR_ASSOC, l, h);
> +
> + if (h == 0)
> + return;
> +
> + /*
> + * After finding the cacheqe of the task, write the PQR for the CPU.
> + * We assume the current core is the one the task is scheduled on.
> + * Write zero when scheduling out so that we get a more accurate
> + * cache allocation.
> + */
> +
> + wrmsr(IA32_PQR_ASSOC, l, 0);
> +
> + CQE_DEBUG(("schedule out done cpu :%u,curr cpu:%u, pid:%u\n",
> + task_cpu(task), smp_processor_id(), task->pid));
> +
> +}
> +
> +#else
> +static inline void cacheqe_sched_in(struct task_struct *task) {}
> +
> +static inline void cacheqe_sched_out(struct task_struct *task) {}
> +
> +#endif
> +#endif
> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> index 0bb1335..21290ac 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -221,6 +221,7 @@
> #define X86_FEATURE_INVPCID ( 9*32+10) /* Invalidate Processor Context ID */
> #define X86_FEATURE_RTM ( 9*32+11) /* Restricted Transactional Memory */
> #define X86_FEATURE_MPX ( 9*32+14) /* Memory Protection Extension */
> +#define X86_FEATURE_CQE (9*32+15) /* Cache QOS Enforcement */
> #define X86_FEATURE_AVX512F ( 9*32+16) /* AVX-512 Foundation */
> #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction */
> #define X86_FEATURE_ADX ( 9*32+19) /* The ADCX and ADOX instructions */
> @@ -236,6 +237,9 @@
> #define X86_FEATURE_XGETBV1 (10*32+ 2) /* XGETBV with ECX = 1 */
> #define X86_FEATURE_XSAVES (10*32+ 3) /* XSAVES/XRSTORS */
>
> +/* Intel-defined CPU features, CPUID level 0x00000010:0 (ebx), word 10 */
> +#define X86_FEATURE_CQE_L3 (10*32 + 1)
> +
> /*
> * BUG word(s)
> */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index eb71ec7..6be953f 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -111,8 +111,11 @@ struct cpuinfo_x86 {
> int x86_cache_alignment; /* In bytes */
> int x86_power;
> unsigned long loops_per_jiffy;
> + /* Cache QOS Enforcement values */
> + int x86_cqe_cbmlength;
> + int x86_cqe_closs;
> /* cpuid returned max cores value: */
> - u16 x86_max_cores;
> + u16 x86_max_cores;
> u16 apicid;
> u16 initial_apicid;
> u16 x86_clflush_size;
> diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
> index e27b49d..c2b0a6b 100644
> --- a/arch/x86/kernel/cpu/Makefile
> +++ b/arch/x86/kernel/cpu/Makefile
> @@ -8,6 +8,10 @@ CFLAGS_REMOVE_common.o = -pg
> CFLAGS_REMOVE_perf_event.o = -pg
> endif
>
> +ifdef CONFIG_CACHEQE_DEBUG
> +CFLAGS_cacheqe.o := -DDEBUG
> +endif
> +
> # Make sure load_percpu_segment has no stackprotector
> nostackp := $(call cc-option, -fno-stack-protector)
> CFLAGS_common.o := $(nostackp)
> @@ -47,6 +51,7 @@ obj-$(CONFIG_PERF_EVENTS_INTEL_UNCORE) += perf_event_intel_uncore.o \
> perf_event_intel_uncore_nhmex.o
> endif
>
> +obj-$(CONFIG_CGROUP_CACHEQE) += cacheqe.o
>
> obj-$(CONFIG_X86_MCE) += mcheck/
> obj-$(CONFIG_MTRR) += mtrr/
> diff --git a/arch/x86/kernel/cpu/cacheqe.c b/arch/x86/kernel/cpu/cacheqe.c
> new file mode 100644
> index 0000000..2ac3d4e
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/cacheqe.c
> @@ -0,0 +1,487 @@
> +
> +/*
> + * kernel/cacheqe.c
> + *
> + * Processor Cache Allocation code
> + * (Also called cache quality enforcement - cqe)
> + *
> + * Copyright (c) 2014, Intel Corporation.
> + *
> + * 2014-10-15 Written by Vikas Shivappa
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <asm/cacheqe.h>
> +
> +struct cacheqe root_cqe_group;
> +static DEFINE_MUTEX(cqe_group_mutex);
> +
> +bool cqe_genable;
> +
> +/* ccmap maintains 1:1 mapping between CLOSid and cbm.*/
> +
> +static struct closcbm_map *ccmap;
> +static struct cacheqe_subsys_info *cqess_info;
> +
> +char hsw_brandstrs[5][64] = {
> + "Intel(R) Xeon(R) CPU E5-2658 v3 @ 2.20GHz",
> + "Intel(R) Xeon(R) CPU E5-2648L v3 @ 1.80GHz",
> + "Intel(R) Xeon(R) CPU E5-2628L v3 @ 2.00GHz",
> + "Intel(R) Xeon(R) CPU E5-2618L v3 @ 2.30GHz",
> + "Intel(R) Xeon(R) CPU E5-2608L v3 @ 2.00GHz"
> +};
> +
> +#define cacheqe_for_each_child(child_cq, pos_css, parent_cq) \
> + css_for_each_child((pos_css), \
> + &(parent_cq)->css)
> +
> +#ifdef CONFIG_CACHEQE_DEBUG
> +
> +/*DUMP the closid-cbm map.*/
> +
> +static inline void cbmmap_dump(void)
> +{
> +
> + int i;
> +
> + pr_debug("CBMMAP\n");
> + for (i = 0; i < boot_cpu_data.x86_cqe_closs; i++)
> + pr_debug("cbm: 0x%x,ref: %u\n",
> + (unsigned int)ccmap[i].cbm, ccmap[i].ref);
> +
> +}
> +
> +#else
> +
> +static inline void cbmmap_dump(void) {}
> +
> +#endif
> +
> +static inline bool cqe_enabled(struct cpuinfo_x86 *c)
> +{
> +
> + int i;
> +
> + if (cpu_has(c, X86_FEATURE_CQE_L3))
> + return true;
> +
> + /*
> + * Hard code the checks and values for HSW SKUs.
> + * Unfortunately, we have to check against these brand name strings only.
> + */
> +
> + for (i = 0; i < 5; i++)
> + if (!strcmp(hsw_brandstrs[i], c->x86_model_id)) {
> + c->x86_cqe_closs = 4;
> + c->x86_cqe_cbmlength = 20;
> + return true;
> + }
> +
> + return false;
> +
> +}
> +
> +
> +static int __init cqe_late_init(void)
> +{
> +
> + struct cpuinfo_x86 *c = &boot_cpu_data;
> + size_t sizeb;
> + int maxid = boot_cpu_data.x86_cqe_closs;
> +
> + cqe_genable = false;
> +
> + /*
> + * The cqe_genable hint helps decide whether the
> + * kernel has enabled cache allocation.
> + */
> +
> + if (!cqe_enabled(c)) {
> +
> + root_cqe_group.css.ss->disabled = 1;
> + return -ENODEV;
> +
> + } else {
> +
> + cqess_info =
> + kzalloc(sizeof(struct cacheqe_subsys_info),
> + GFP_KERNEL);
> +
> + if (!cqess_info)
> + return -ENOMEM;
> +
> + sizeb = BITS_TO_LONGS(c->x86_cqe_closs) * sizeof(long);
> + cqess_info->closmap =
> + kzalloc(sizeb, GFP_KERNEL);
> +
> + if (!cqess_info->closmap) {
> + kfree(cqess_info);
> + return -ENOMEM;
> + }
> +
> + sizeb = maxid * sizeof(struct closcbm_map);
> + ccmap = kzalloc(sizeb, GFP_KERNEL);
> +
> + if (!ccmap)
> + return -ENOMEM;
> +
> + /* Allocate the CLOS for root.*/
> + set_bit(0, cqess_info->closmap);
> + root_cqe_group.clos = 0;
> +
> + /*
> + * The cbmlength is expected to be at least 1.
> + * All bits are set for the root cbm.
> + */
> +
> + ccmap[root_cqe_group.clos].cbm =
> + (u32)(((u64)1 << c->x86_cqe_cbmlength) - 1);
> + root_cqe_group.cbm = &ccmap[root_cqe_group.clos].cbm;
> + ccmap[root_cqe_group.clos].ref++;
> +
> + barrier();
> + cqe_genable = true;
> +
> + pr_info("CQE enabled cbmlength is %u\ncqe Closs : %u ",
> + c->x86_cqe_cbmlength, c->x86_cqe_closs);
> +
> + }
> +
> + return 0;
> +
> +}
> +
> +late_initcall(cqe_late_init);
> +
> +/*
> + * Allocates a new closid from unused list of closids.
> + * Called with the cqe_group_mutex held.
> + */
> +
> +static int cqe_alloc_closid(struct cacheqe *cq)
> +{
> + unsigned int tempid;
> + unsigned int maxid;
> + int err;
> +
> + maxid = boot_cpu_data.x86_cqe_closs;
> +
> + tempid = find_next_zero_bit(cqess_info->closmap, maxid, 0);
> +
> + if (tempid == maxid) {
> + err = -ENOSPC;
> + goto closidallocfail;
> + }
> +
> + set_bit(tempid, cqess_info->closmap);
> + ccmap[tempid].ref++;
> + cq->clos = tempid;
> +
> + pr_debug("cqe : Allocated a directory.closid:%u\n", cq->clos);
> +
> + return 0;
> +
> +closidallocfail:
> +
> + return err;
> +
> +}
> +
> +/*
> +* Called with the cqe_group_mutex held.
> +*/
> +
> +static void cqe_free_closid(struct cacheqe *cq)
> +{
> +
> + pr_debug("cqe :Freeing closid:%u\n", cq->clos);
> +
> + ccmap[cq->clos].ref--;
> +
> + if (!ccmap[cq->clos].ref)
> + clear_bit(cq->clos, cqess_info->closmap);
> +
> + return;
> +
> +}
> +
> +/* Create a new cacheqe cgroup.*/
> +static struct cgroup_subsys_state *
> +cqe_css_alloc(struct cgroup_subsys_state *parent_css)
> +{
> + struct cacheqe *parent = css_cacheqe(parent_css);
> + struct cacheqe *cq;
> +
> + /* This is the call before the feature is detected */
> + if (!parent) {
> + root_cqe_group.clos = 0;
> + return &root_cqe_group.css;
> + }
> +
> + /* To check if cqe is enabled.*/
> + if (!cqe_genable)
> + return ERR_PTR(-ENODEV);
> +
> + cq = kzalloc(sizeof(struct cacheqe), GFP_KERNEL);
> + if (!cq)
> + return ERR_PTR(-ENOMEM);
> +
> + /*
> + * Child inherits the ClosId and cbm from parent.
> + */
> +
> + cq->clos = parent->clos;
> + mutex_lock(&cqe_group_mutex);
> + ccmap[parent->clos].ref++;
> + mutex_unlock(&cqe_group_mutex);
> +
> + cq->cbm = parent->cbm;
> +
> + pr_debug("cqe : Allocated cgroup closid:%u,ref:%u\n",
> + cq->clos, ccmap[parent->clos].ref);
> +
> + return &cq->css;
> +
> +}
> +
> +/* Destroy an existing CAT cgroup.*/
> +static void cqe_css_free(struct cgroup_subsys_state *css)
> +{
> + struct cacheqe *cq = css_cacheqe(css);
> + int len = boot_cpu_data.x86_cqe_cbmlength;
> +
> + pr_debug("cqe : In cacheqe_css_free\n");
> +
> + mutex_lock(&cqe_group_mutex);
> +
> + /* Reset the CBM for the cgroup. Should be all 1s by default. */
> +
> + wrmsrl(CQECBMMSR(cq->clos), (((u64)1 << len) - 1));
> + cqe_free_closid(cq);
> + kfree(cq);
> +
> + mutex_unlock(&cqe_group_mutex);
> +
> +}
> +
> +/*
> + * Called during do_exit() syscall during a task exit.
> + * This assumes that the thread is running on the current
> + * cpu.
> + */
> +
> +static void cqe_exit(struct cgroup_subsys_state *css,
> + struct cgroup_subsys_state *old_css,
> + struct task_struct *task)
> +{
> +
> + cacheqe_sched_out(task);
> +
> +}
> +
> +static inline bool cbm_minbits(unsigned long var)
> +{
> +
> + unsigned long i;
> +
> + /*Minimum of 2 bits must be set.*/
> +
> + i = var & (var - 1);
> + if (!i || !var)
> + return false;
> +
> + return true;
> +
> +}
> +
> +/*
> + * Tests if only contiguous bits are set.
> + */
> +
> +static inline bool cbm_iscontiguous(unsigned long var)
> +{
> +
> + unsigned long i;
> +
> + /* Reset the least significant bit.*/
> + i = var & (var - 1);
> +
> + /*
> + * We would have a set of non-contiguous bits when
> + * there is at least one zero
> + * between the most significant 1 and least significant 1.
> + * In the below '&' operation,(var <<1) would have zero in
> + * at least 1 bit position in var apart from least
> + * significant bit if it does not have contiguous bits.
> + * Multiple sets of contiguous bits wont succeed in the below
> + * case as well.
> + */
> +
> + if (i != (var & (var << 1)))
> + return false;
> +
> + return true;
> +
> +}
> +
> +static int cqe_cbm_read(struct seq_file *m, void *v)
> +{
> + struct cacheqe *cq = css_cacheqe(seq_css(m));
> +
> + pr_debug("cqe : In cqe_cqemode_read\n");
> + seq_printf(m, "0x%x\n", (unsigned int)*(cq->cbm));
> +
> + return 0;
> +
> +}
> +
> +static int validate_cbm(struct cacheqe *cq, unsigned long cbmvalue)
> +{
> + struct cacheqe *par, *c;
> + struct cgroup_subsys_state *css;
> +
> + if (!cbm_minbits(cbmvalue) || !cbm_iscontiguous(cbmvalue)) {
> + pr_info("CQE error: minimum bits not set or non contiguous mask\n");
> + return -EINVAL;
> + }
> +
> + /*
> + * Needs to be a subset of its parent.
> + */
> + par = parent_cqe(cq);
> +
> + if (!bitmap_subset(&cbmvalue, par->cbm, MAX_CBM_LENGTH))
> + return -EINVAL;
> +
> + rcu_read_lock();
> +
> + /*
> + * Each of children should be a subset of the mask.
> + */
> +
> + cacheqe_for_each_child(c, css, cq) {
> + c = css_cacheqe(css);
> + if (!bitmap_subset(c->cbm, &cbmvalue, MAX_CBM_LENGTH)) {
> + pr_debug("cqe : Children's cbm not a subset\n");
> + rcu_read_unlock();
> + return -EINVAL;
> + }
> + }
> +
> + rcu_read_unlock();
> +
> + return 0;
> +
> +}
> +
> +static bool cbm_search(unsigned long cbm, int *closid)
> +{
> +
> + int maxid = boot_cpu_data.x86_cqe_closs;
> + unsigned int i;
> +
> + for (i = 0; i < maxid; i++)
> + if (bitmap_equal(&cbm, &ccmap[i].cbm, MAX_CBM_LENGTH)) {
> + *closid = i;
> + return true;
> + }
> +
> + return false;
> +
> +}
> +
> +static int cqe_cbm_write(struct cgroup_subsys_state *css,
> + struct cftype *cft, u64 cbmvalue)
> +{
> + struct cacheqe *cq = css_cacheqe(css);
> + ssize_t err = 0;
> + unsigned long cbm;
> + unsigned int closid;
> +
> + pr_debug("cqe : In cqe_cbm_write\n");
> +
> + if (!cqe_genable)
> + return -ENODEV;
> +
> + if (cq == &root_cqe_group || !cq)
> + return -EPERM;
> +
> + /*
> + * Need global mutex as cbm write may allocate the closid.
> + */
> +
> + mutex_lock(&cqe_group_mutex);
> + cbm = (cbmvalue & IA32_CBM_MASK);
> +
> + if (bitmap_equal(&cbm, cq->cbm, MAX_CBM_LENGTH))
> + goto cbmwriteend;
> +
> + err = validate_cbm(cq, cbm);
> + if (err)
> + goto cbmwriteend;
> +
> + /*
> + * Need to assign a CLOSid to the cgroup
> + * if it has a new cbm , or reuse.
> + * This takes care to allocate only
> + * the number of CLOSs available.
> + */
> +
> + cqe_free_closid(cq);
> +
> + if (cbm_search(cbm, &closid)) {
> + cq->clos = closid;
> + ccmap[cq->clos].ref++;
> +
> + } else {
> +
> + err = cqe_alloc_closid(cq);
> +
> + if (err)
> + goto cbmwriteend;
> +
> + wrmsrl(CQECBMMSR(cq->clos), cbm);
> +
> + }
> +
> + /*
> + * Finally store the cbm in cbm map
> + * and store a reference in the cq.
> + */
> +
> + ccmap[cq->clos].cbm = cbm;
> + cq->cbm = &ccmap[cq->clos].cbm;
> +
> + cbmmap_dump();
> +
> +cbmwriteend:
> +
> + mutex_unlock(&cqe_group_mutex);
> + return err;
> +
> +}
> +
> +static struct cftype cqe_files[] = {
> + {
> + .name = "cbm",
> + .seq_show = cqe_cbm_read,
> + .write_u64 = cqe_cbm_write,
> + .mode = 0666,
> + },
> + { } /* terminate */
> +};
> +
> +struct cgroup_subsys cacheqe_cgrp_subsys = {
> + .name = "cacheqe",
> + .css_alloc = cqe_css_alloc,
> + .css_free = cqe_css_free,
> + .exit = cqe_exit,
> + .base_cftypes = cqe_files,
> +};
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 4b4f78c..a9b277a 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -633,6 +633,27 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
> c->x86_capability[9] = ebx;
> }
>
> +/* Additional Intel-defined flags: level 0x00000010 */
> + if (c->cpuid_level >= 0x00000010) {
> + u32 eax, ebx, ecx, edx;
> +
> + cpuid_count(0x00000010, 0, &eax, &ebx, &ecx, &edx);
> +
> + c->x86_capability[10] = ebx;
> +
> + if (cpu_has(c, X86_FEATURE_CQE_L3)) {
> +
> + u32 eax, ebx, ecx, edx;
> +
> + cpuid_count(0x00000010, 1, &eax, &ebx, &ecx, &edx);
> +
> + c->x86_cqe_closs = (edx & 0xffff) + 1;
> + c->x86_cqe_cbmlength = (eax & 0x1f) + 1;
> +
> + }
> +
> + }
> +
> /* Extended state features: level 0x0000000d */
> if (c->cpuid_level >= 0x0000000d) {
> u32 eax, ebx, ecx, edx;
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index 98c4f9b..a131c1e 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -53,6 +53,11 @@ SUBSYS(hugetlb)
> #if IS_ENABLED(CONFIG_CGROUP_DEBUG)
> SUBSYS(debug)
> #endif
> +
> +#if IS_ENABLED(CONFIG_CGROUP_CACHEQE)
> +SUBSYS(cacheqe)
> +#endif
> +
> /*
> * DO NOT ADD ANY SUBSYSTEM WITHOUT EXPLICIT ACKS FROM CGROUP MAINTAINERS.
> */
> diff --git a/init/Kconfig b/init/Kconfig
> index 2081a4d..bec92a4 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -968,6 +968,28 @@ config CPUSETS
>
> Say N if unsure.
>
> +config CGROUP_CACHEQE
> + bool "Cache QoS Enforcement cgroup subsystem"
> + depends on X86 || X86_64
> + help
> + This option provides a framework to restrict the cache lines that
> + applications can allocate when they fill the cache.
> + It can be used to configure how much of the cache is available to
> + different tasks.
> +
> + Say N if unsure.
> +
> +config CACHEQE_DEBUG
> + bool "Cache QoS Enforcement cgroup subsystem debug"
> + depends on X86 || X86_64
> + help
> + This option enables debug messages for the Cache QoS Enforcement
> + cgroup subsystem.
> +
> + Say N if unsure.
> +
> config PROC_PID_CPUSET
> bool "Include legacy /proc/<pid>/cpuset file"
> depends on CPUSETS
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 240157c..afa2897 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2215,7 +2215,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
> perf_event_task_sched_out(prev, next);
> fire_sched_out_preempt_notifiers(prev, next);
> prepare_lock_switch(rq, next);
> - prepare_arch_switch(next);
> + prepare_arch_switch(prev);
> }
>
> /**
> @@ -2254,7 +2254,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
> */
> prev_state = prev->state;
> vtime_task_switch(prev);
> - finish_arch_switch(prev);
> + finish_arch_switch(current);
> perf_event_task_sched_in(prev, current);
> finish_lock_switch(rq, prev);
> finish_arch_post_lock_switch();
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 24156c84..79b9ff6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -965,12 +965,36 @@ static inline int task_on_rq_migrating(struct task_struct *p)
> return p->on_rq == TASK_ON_RQ_MIGRATING;
> }
>
> +#ifdef CONFIG_X86_64
> +#ifdef CONFIG_CGROUP_CACHEQE
> +
> +#include <asm/cacheqe.h>
> +
> +# define prepare_arch_switch(prev) cacheqe_sched_out(prev)
> +# define finish_arch_switch(current) cacheqe_sched_in(current)
> +
> +#else
> +
> #ifndef prepare_arch_switch
> # define prepare_arch_switch(next) do { } while (0)
> #endif
> #ifndef finish_arch_switch
> # define finish_arch_switch(prev) do { } while (0)
> #endif
> +
> +#endif
> +#else
> +
> +#ifndef prepare_arch_switch
> +# define prepare_arch_switch(prev) do { } while (0)
> +#endif
> +
> +#ifndef finish_arch_switch
> +# define finish_arch_switch(current) do { } while (0)
> +#endif
> +
> +#endif
> +
> #ifndef finish_arch_post_lock_switch
> # define finish_arch_post_lock_switch() do { } while (0)
> #endif
> --
> 1.9.1
>
>