linux-kernel - [PATCH v2 09/32] perf/x86/intel/cqm: basic RMID hierarchy with per package RMIDs

Message-Id: <1463007752-116802-10-git-send-email-davidcc@google.com>
Date:	Wed, 11 May 2016 16:02:09 -0700
From:	David Carrillo-Cisneros <davidcc@...gle.com>
To:	Peter Zijlstra <peterz@...radead.org>,
	Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
	Arnaldo Carvalho de Melo <acme@...nel.org>,
	Ingo Molnar <mingo@...hat.com>
Cc:	Vikas Shivappa <vikas.shivappa@...ux.intel.com>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	Tony Luck <tony.luck@...el.com>,
	Stephane Eranian <eranian@...gle.com>,
	Paul Turner <pjt@...gle.com>,
	David Carrillo-Cisneros <davidcc@...gle.com>, x86@...nel.org,
	linux-kernel@...r.kernel.org
Subject: [PATCH v2 09/32] perf/x86/intel/cqm: basic RMID hierarchy with per package RMIDs

Cgroups and/or tasks that need to be monitored using an RMID are
abstracted as MONitored Resources (monrs). A CQM event is associated
with a monr in order to monitor and read llc_occupancy (and, in the
future, other attributes such as memory bandwidth).

The monrs form a hierarchy that captures the dependency between the
monitored cgroups and/or tasks/threads. The monr of a cgroup A which
contains another monitored cgroup, B, is an ancestor of B's monr.

Each monr contains one Package MONitored Resource (pmonr) per package.
The monitoring of a monr in a package starts when its corresponding
pmonr receives an RMID for that package (a prmid).

The prmids are lazily assigned to a pmonr the first time a thread
using the monr is scheduled in the package. When a pmonr with a
valid prmid is scheduled, that pmonr's prmid's RMID is written to the
msr MSR_IA32_PQR_ASSOC. If no prmid is available, the prmid of the lowest
ancestor in the monr hierarchy with a valid prmid for that package is
used instead.
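
A minimal userspace sketch of the fallback described above, not the
kernel code. The sketch_* names, the sentinel value and the fixed package
count are hypothetical, and each node simply stores one RMID per package.
The walk goes up the parent chain until it finds a node holding a valid
RMID for the given package.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INVALID_RMID	UINT32_MAX	/* hypothetical sentinel */
#define SKETCH_NR_PKGS		2		/* hypothetical package count */

struct sketch_monr {
	struct sketch_monr *parent;		/* NULL for the root */
	uint32_t rmid[SKETCH_NR_PKGS];		/* per-package RMID or sentinel */
};

/* Return the RMID to schedule for @m on package @pkg. */
static uint32_t sketch_sched_rmid(const struct sketch_monr *m, int pkg)
{
	while (m) {
		if (m->rmid[pkg] != SKETCH_INVALID_RMID)
			return m->rmid[pkg];
		m = m->parent;			/* borrow from an ancestor */
	}
	return SKETCH_INVALID_RMID;
}
```

As long as the root holds an RMID in every package, the walk always ends
with a usable RMID.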

A pmonr can be in one of the following three states:
  - (A)ctive: When it has a prmid available.
  - (I)nherited: When no prmid is available. In this state, it "borrows"
    the prmid of its lowest ancestor in (A)ctive state during sched in
    (writes its ancestor's RMID into hw while any associated thread is
    executed). But, since the "borrowed" prmid does not monitor the
    occupancy of this monr, the monr cannot report occupancy individually.
  - (U)nused: When the pmonr does not have a prmid yet and has not failed
    to acquire one (either because no thread has been scheduled while
    monitoring for this pmonr is active, or because it has completed
    a transition to (U)state, i.e. termination of the associated
    event/cgroup).
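
The two transitions that this patch implements, (U) -> (A) and
(A) -> (U), can be sketched in userspace C as below. This is only an
illustration: the sketch_* names are hypothetical and the prmid pools are
reduced to plain counters.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_state { SKETCH_USTATE, SKETCH_ASTATE };

struct sketch_pmonr {
	enum sketch_state state;
	bool mon_active;	/* stands in for MONR_MON_ACTIVE */
};

/* Sched in: (U) -> (A) only if monitoring is active and a prmid is free. */
static enum sketch_state
sketch_sched_in(struct sketch_pmonr *p, unsigned int *free_prmids)
{
	if (p->state == SKETCH_ASTATE)
		return p->state;	/* already has a prmid */
	if (!p->mon_active || *free_prmids == 0)
		return p->state;	/* stays (U); (I)state comes later */
	(*free_prmids)--;
	p->state = SKETCH_ASTATE;
	return p->state;
}

/* Teardown: (A) -> (U); the prmid is parked in limbo, not freed directly. */
static void
sketch_to_ustate(struct sketch_pmonr *p, unsigned int *limbo_prmids)
{
	if (p->state == SKETCH_USTATE)
		return;			/* re-entering (U) is a no-op */
	(*limbo_prmids)++;
	p->state = SKETCH_USTATE;
}
```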

To avoid synchronization overhead, each pmonr contains a prmid_summary.
The union prmid_summary is a concise representation of the pmonr's state
and its raw RMIDs. Because it fits in a single 64-bit word, the
prmid_summary can be read atomically without a LOCK. Every state
transition atomically updates the prmid_summary. This avoids locking
during sched in and out of threads, except when a prmid needs to be
allocated, which only occurs the first time a monr is scheduled in a
package.
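
A rough userspace analogue of the summary word, using C11 atomics in
place of the kernel's atomic64_t. The sketch_* names and the sentinel are
hypothetical. It shows how two 32-bit RMIDs are packed into one 64-bit
value so that a reader always sees a consistent (sched_rmid, read_rmid)
pair from a single load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INVALID_RMID	UINT32_MAX	/* hypothetical sentinel */

union sketch_summary {
	uint64_t value;				/* one-word accessor */
	struct {
		uint32_t sched_rmid;		/* RMID to write to the PQR MSR */
		uint32_t read_rmid;		/* RMID to read occupancy from */
	};
};

/* Writer side: publish both fields with a single 64-bit store. */
static void sketch_publish(_Atomic uint64_t *slot, uint32_t sched,
			   uint32_t read)
{
	union sketch_summary s;

	s.sched_rmid = sched;
	s.read_rmid = read;
	atomic_store(slot, s.value);
}

/* Reader side: one 64-bit load yields a consistent pair. */
static union sketch_summary sketch_snapshot(_Atomic uint64_t *slot)
{
	union sketch_summary s;

	s.value = atomic_load(slot);
	return s;
}

/* Mirrors the encoding: (U)state with read_rmid == 0 means active. */
static bool sketch_is_mon_active(union sketch_summary s)
{
	return s.sched_rmid != SKETCH_INVALID_RMID || s.read_rmid == 0;
}
```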

This patch introduces a first iteration of the monr hierarchy
that maintains two levels: the root monr, at top, and all other monrs
as leaves. The root monr is always (A)ctive.

This patch also implements the essential mechanism of per-package lazy
allocation of RMIDs.

The (I)state and the transitions from and to it are introduced in the
next patch in this series.

Reviewed-by: Stephane Eranian <eranian@...gle.com>
Signed-off-by: David Carrillo-Cisneros <davidcc@...gle.com>
---
 arch/x86/events/intel/cqm.c | 546 +++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/events/intel/cqm.h | 158 +++++++++++++
 2 files changed, 702 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/cqm.c b/arch/x86/events/intel/cqm.c
index 225b0c8..5f2969b 100644
--- a/arch/x86/events/intel/cqm.c
+++ b/arch/x86/events/intel/cqm.c
@@ -72,6 +72,11 @@ static inline int __cqm_prmid_update(struct prmid *prmid,
 	return 1;
 }
 
+static inline int cqm_prmid_update(struct prmid *prmid)
+{
+	return __cqm_prmid_update(prmid, __rmid_min_update_time);
+}
+
 /*
  * A cache groups is a group of perf_events with the same target (thread,
  * cgroup, CPU or system-wide). Each cache group receives has one RMID.
@@ -80,8 +85,68 @@ static inline int __cqm_prmid_update(struct prmid *prmid,
 static LIST_HEAD(cache_groups);
 static DEFINE_MUTEX(cqm_mutex);
 
+struct monr *monr_hrchy_root;
+
 struct pkg_data **cqm_pkgs_data;
 
+static inline bool __pmonr__in_astate(struct pmonr *pmonr)
+{
+	lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+	return pmonr->prmid;
+}
+
+static inline bool __pmonr__in_ustate(struct pmonr *pmonr)
+{
+	lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+	return !pmonr->prmid;
+}
+
+static inline bool monr__is_root(struct monr *monr)
+{
+	return monr_hrchy_root == monr;
+}
+
+static inline bool monr__is_mon_active(struct monr *monr)
+{
+	return monr->flags & MONR_MON_ACTIVE;
+}
+
+static inline void __monr__set_summary_read_rmid(struct monr *monr, u32 rmid)
+{
+	int i;
+	struct pmonr *pmonr;
+	union prmid_summary summary;
+
+	monr_hrchy_assert_held_raw_spin_locks();
+
+	cqm_pkg_id_for_each_online(i) {
+		pmonr = monr->pmonrs[i];
+		WARN_ON_ONCE(!__pmonr__in_ustate(pmonr));
+		summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+		summary.read_rmid = rmid;
+		atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
+	}
+}
+
+static inline void __monr__set_mon_active(struct monr *monr)
+{
+	monr_hrchy_assert_held_raw_spin_locks();
+	__monr__set_summary_read_rmid(monr, 0);
+	monr->flags |= MONR_MON_ACTIVE;
+}
+
+/*
+ * All pmonrs must be in (U)state.
+ * clearing MONR_MON_ACTIVE prevents (U)state prmids from transitioning
+ * to another state.
+ */
+static inline void __monr__clear_mon_active(struct monr *monr)
+{
+	monr_hrchy_assert_held_raw_spin_locks();
+	__monr__set_summary_read_rmid(monr, INVALID_RMID);
+	monr->flags &= ~MONR_MON_ACTIVE;
+}
+
 static inline bool __valid_pkg_id(u16 pkg_id)
 {
 	return pkg_id < topology_max_packages();
@@ -125,6 +190,10 @@ static int pkg_data_init_cpu(int cpu)
 	}
 
 	INIT_LIST_HEAD(&pkg_data->free_prmids_pool);
+	INIT_LIST_HEAD(&pkg_data->active_prmids_pool);
+	INIT_LIST_HEAD(&pkg_data->nopmonr_limbo_prmids_pool);
+
+	INIT_LIST_HEAD(&pkg_data->astate_pmonrs_lru);
 
 	mutex_init(&pkg_data->pkg_data_mutex);
 	raw_spin_lock_init(&pkg_data->pkg_data_lock);
@@ -136,12 +205,156 @@ static int pkg_data_init_cpu(int cpu)
 	return 0;
 }
 
+static inline bool __valid_rmid(u16 pkg_id, u32 rmid)
+{
+	return rmid <= cqm_pkgs_data[pkg_id]->max_rmid;
+}
+
+static inline bool __valid_prmid(u16 pkg_id, struct prmid *prmid)
+{
+	struct pkg_data *pkg_data = cqm_pkgs_data[pkg_id];
+	bool valid = __valid_rmid(pkg_id, prmid->rmid);
+
+	WARN_ON_ONCE(valid && pkg_data->prmids_by_rmid[
+			prmid->rmid]->rmid != prmid->rmid);
+	return valid;
+}
+
+static inline struct prmid *
+__prmid_from_rmid(u16 pkg_id, u32 rmid)
+{
+	struct prmid *prmid;
+
+	if (!__valid_rmid(pkg_id, rmid))
+		return NULL;
+	prmid = cqm_pkgs_data[pkg_id]->prmids_by_rmid[rmid];
+	WARN_ON_ONCE(!__valid_prmid(pkg_id, prmid));
+	return prmid;
+}
+
+static struct pmonr *pmonr_alloc(int cpu)
+{
+	struct pmonr *pmonr;
+	union prmid_summary summary;
+
+	pmonr = kmalloc_node(sizeof(struct pmonr),
+			     GFP_KERNEL, cpu_to_node(cpu));
+	if (!pmonr)
+		return ERR_PTR(-ENOMEM);
+
+	pmonr->prmid = NULL;
+
+	pmonr->monr = NULL;
+	INIT_LIST_HEAD(&pmonr->rotation_entry);
+
+	pmonr->pkg_id = topology_physical_package_id(cpu);
+	summary.sched_rmid = INVALID_RMID;
+	summary.read_rmid = INVALID_RMID;
+	atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
+
+	return pmonr;
+}
+
+static void pmonr_dealloc(struct pmonr *pmonr)
+{
+	kfree(pmonr);
+}
+
+/*
+ * @root: Common ancestor.
+ * a must be distinct from b.
+ * Returns true if a is an ancestor of b.
+ */
+static inline bool
+__monr_hrchy_is_ancestor(struct monr *root,
+			 struct monr *a, struct monr *b)
+{
+	WARN_ON_ONCE(!root || !a || !b);
+	WARN_ON_ONCE(a == b);
+
+	if (root == a)
+		return true;
+	if (root == b)
+		return false;
+
+	b = b->parent;
+	/* Break at the root */
+	while (b != root) {
+		WARN_ON_ONCE(!b);
+		if (a == b)
+			return true;
+		b = b->parent;
+	}
+	return false;
+}
+
+/* helper function to finish transition to astate. */
+static inline void
+__pmonr__finish_to_astate(struct pmonr *pmonr, struct prmid *prmid)
+{
+	union prmid_summary summary;
+
+	lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+
+	pmonr->prmid = prmid;
+
+	list_move_tail(
+		&prmid->pool_entry, &__pkg_data(pmonr, active_prmids_pool));
+	list_move_tail(
+		&pmonr->rotation_entry, &__pkg_data(pmonr, astate_pmonrs_lru));
+
+	summary.sched_rmid = pmonr->prmid->rmid;
+	summary.read_rmid = pmonr->prmid->rmid;
+	atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
+}
+
+static inline void
+__pmonr__ustate_to_astate(struct pmonr *pmonr, struct prmid *prmid)
+{
+	lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+	__pmonr__finish_to_astate(pmonr, prmid);
+}
+
+static inline void
+__pmonr__to_ustate(struct pmonr *pmonr)
+{
+	union prmid_summary summary;
+
+	lockdep_assert_held(&__pkg_data(pmonr, pkg_data_lock));
+
+	/* Do not warn on re-enter state for (U)state, to simplify cleanup
+	 * of initialized states that were not scheduled.
+	 */
+	if (__pmonr__in_ustate(pmonr))
+		return;
+
+	if (__pmonr__in_astate(pmonr)) {
+		WARN_ON_ONCE(!pmonr->prmid);
+
+		list_move_tail(&pmonr->prmid->pool_entry,
+			       &__pkg_data(pmonr, nopmonr_limbo_prmids_pool));
+		pmonr->prmid =  NULL;
+	} else {
+		WARN_ON_ONCE(true);
+		return;
+	}
+	list_del_init(&pmonr->rotation_entry);
+
+	summary.sched_rmid = INVALID_RMID;
+	summary.read_rmid  =
+		monr__is_mon_active(pmonr->monr) ? 0 : INVALID_RMID;
+
+	atomic64_set(&pmonr->prmid_summary_atomic, summary.value);
+	WARN_ON_ONCE(!__pmonr__in_ustate(pmonr));
+}
+
 static int intel_cqm_setup_pkg_prmid_pools(u16 pkg_id)
 {
 	int r;
 	unsigned long flags;
 	struct prmid *prmid;
 	struct pkg_data *pkg_data = cqm_pkgs_data[pkg_id];
+	struct pmonr *root_pmonr;
 
 	if (!__valid_pkg_id(pkg_id))
 		return -EINVAL;
@@ -163,8 +376,13 @@ static int intel_cqm_setup_pkg_prmid_pools(u16 pkg_id)
 			&pkg_data->pkg_data_lock, flags, pkg_id);
 		pkg_data->prmids_by_rmid[r] = prmid;
 
+		list_add_tail(&prmid->pool_entry, &pkg_data->free_prmids_pool);
 
 		/* RMID 0 is special and makes the root of rmid hierarchy. */
+		if (r == 0) {
+			root_pmonr = monr_hrchy_root->pmonrs[pkg_id];
+			__pmonr__ustate_to_astate(root_pmonr, prmid);
+		}
 		raw_spin_unlock_irqrestore(&pkg_data->pkg_data_lock, flags);
 	}
 	return 0;
@@ -180,6 +398,238 @@ fail:
 }
 
 
+/* Alloc monr with all pmonrs in (U)state. */
+static struct monr *monr_alloc(void)
+{
+	int i;
+	struct pmonr *pmonr;
+	struct monr *monr;
+
+	monr = kmalloc(sizeof(struct monr), GFP_KERNEL);
+
+	if (!monr)
+		return ERR_PTR(-ENOMEM);
+
+	monr->flags = 0;
+	monr->parent = NULL;
+	INIT_LIST_HEAD(&monr->children);
+	INIT_LIST_HEAD(&monr->parent_entry);
+	monr->mon_event_group = NULL;
+
+	monr->pmonrs = kmalloc(
+		sizeof(struct pmonr *) * topology_max_packages(), GFP_KERNEL);
+
+	if (!monr->pmonrs)
+		return ERR_PTR(-ENOMEM);
+
+	/* Iterate over all pkgs, even uninitialized ones. */
+	for (i = 0; i < topology_max_packages(); i++) {
+		/* Do not create pmonrs for uninitialized packages. */
+		if (!cqm_pkgs_data[i]) {
+			monr->pmonrs[i] = NULL;
+			continue;
+		}
+		/* Rotation cpu is on pmonr's package. */
+		pmonr = pmonr_alloc(cqm_pkgs_data[i]->rotation_cpu);
+		if (IS_ERR(pmonr))
+			goto clean_pmonrs;
+		pmonr->monr = monr;
+		monr->pmonrs[i] = pmonr;
+	}
+	return monr;
+
+clean_pmonrs:
+	while (i--) {
+		if (cqm_pkgs_data[i])
+			kfree(monr->pmonrs[i]);
+	}
+	kfree(monr);
+	return ERR_CAST(pmonr);
+}
+
+/* Can only dealloc monrs with all pmonrs in (U)state. */
+static void monr_dealloc(struct monr *monr)
+{
+	int i;
+
+	cqm_pkg_id_for_each_online(i)
+		pmonr_dealloc(monr->pmonrs[i]);
+
+	kfree(monr);
+}
+
+/*
+ * Wrappers for monr manipulation in events.
+ *
+ */
+static inline struct monr *monr_from_event(struct perf_event *event)
+{
+	return (struct monr *) READ_ONCE(event->hw.cqm_monr);
+}
+
+static inline void event_set_monr(struct perf_event *event, struct monr *monr)
+{
+	WRITE_ONCE(event->hw.cqm_monr, monr);
+}
+
+/*
+ * Always finds a rmid_entry to schedule. To be called during scheduler.
+ * A fast path that only uses read_lock for common case when rmid for current
+ * package has been used before.
+ * On failure, verify that monr is active, if it is, try to obtain a free rmid
+ * and set pmonr to (A)state.
+ * On failure, traverse up monr_hrchy until finding one prmid for this
+ * pkg_id and set pmonr to (I)state.
+ * Called during task switch, it will set pmonr's prmid_summary to reflect the
+ * sched and read rmids that reflect pmonr's state.
+ */
+static inline void
+monr_hrchy_get_next_prmid_summary(struct pmonr *pmonr)
+{
+	union prmid_summary summary;
+
+	/*
+	 * First, do lock-free fastpath.
+	 */
+	summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+	if (summary.sched_rmid != INVALID_RMID)
+		return;
+
+	if (!prmid_summary__is_mon_active(summary))
+		return;
+
+	/*
+	 * Lock-free path failed at first attempt. Now acquire lock and repeat
+	 * in case the monr was modified in the mean time.
+	 * This time try to obtain free rmid and update pmonr accordingly,
+	 * instead of failing fast.
+	 */
+	raw_spin_lock_nested(&__pkg_data(pmonr, pkg_data_lock), pmonr->pkg_id);
+
+	summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+	if (summary.sched_rmid != INVALID_RMID) {
+		raw_spin_unlock(&__pkg_data(pmonr, pkg_data_lock));
+		return;
+	}
+
+	/* Do not try to obtain RMID if monr is not active. */
+	if (!prmid_summary__is_mon_active(summary)) {
+		raw_spin_unlock(&__pkg_data(pmonr, pkg_data_lock));
+		return;
+	}
+
+	/*
+	 * Can only fail if it was in (U)state.
+	 * Try to obtain a free prmid and go to (A)state, if not possible,
+	 * it should go to (I)state.
+	 */
+	WARN_ON_ONCE(!__pmonr__in_ustate(pmonr));
+
+	if (list_empty(&__pkg_data(pmonr, free_prmids_pool))) {
+		/* Failed to obtain a valid rmid in this package for this
+		 * monr. In next patches it will transition to (I)state.
+		 * For now, stay in (U)state (do nothing).
+		 */
+	} else {
+		/* Transition to (A)state using free prmid. */
+		__pmonr__ustate_to_astate(
+			pmonr,
+			list_first_entry(&__pkg_data(pmonr, free_prmids_pool),
+				struct prmid, pool_entry));
+	}
+	raw_spin_unlock(&__pkg_data(pmonr, pkg_data_lock));
+}
+
+static inline void __assert_monr_is_leaf(struct monr *monr)
+{
+	int i;
+
+	monr_hrchy_assert_held_mutexes();
+	monr_hrchy_assert_held_raw_spin_locks();
+
+	cqm_pkg_id_for_each_online(i)
+		WARN_ON_ONCE(!__pmonr__in_ustate(monr->pmonrs[i]));
+
+	WARN_ON_ONCE(!list_empty(&monr->children));
+}
+
+static inline void
+__monr_hrchy_insert_leaf(struct monr *monr, struct monr *parent)
+{
+	monr_hrchy_assert_held_mutexes();
+	monr_hrchy_assert_held_raw_spin_locks();
+
+	__assert_monr_is_leaf(monr);
+
+	list_add_tail(&monr->parent_entry, &parent->children);
+	monr->parent = parent;
+}
+
+static inline void
+__monr_hrchy_remove_leaf(struct monr *monr)
+{
+	/* Since root cannot be removed, monr must have a parent */
+	WARN_ON_ONCE(!monr->parent);
+
+	monr_hrchy_assert_held_mutexes();
+	monr_hrchy_assert_held_raw_spin_locks();
+
+	__assert_monr_is_leaf(monr);
+
+	list_del_init(&monr->parent_entry);
+	monr->parent = NULL;
+}
+
+static int __monr_hrchy_attach_cpu_event(struct perf_event *event)
+{
+	lockdep_assert_held(&cqm_mutex);
+	WARN_ON_ONCE(monr_from_event(event));
+
+	event_set_monr(event, monr_hrchy_root);
+	return 0;
+}
+
+/* task events are always leaves in the monr_hierarchy */
+static int __monr_hrchy_attach_task_event(struct perf_event *event,
+					  struct monr *parent_monr)
+{
+	struct monr *monr;
+	unsigned long flags;
+	int i;
+
+	lockdep_assert_held(&cqm_mutex);
+
+	monr = monr_alloc();
+	if (IS_ERR(monr))
+		return PTR_ERR(monr);
+	event_set_monr(event, monr);
+	monr->mon_event_group = event;
+
+	monr_hrchy_acquire_locks(flags, i);
+	__monr_hrchy_insert_leaf(monr, parent_monr);
+	__monr__set_mon_active(monr);
+	monr_hrchy_release_locks(flags, i);
+
+	return 0;
+}
+
+/*
+ * Find appropriate position in hierarchy and set monr. Create new
+ * monr if necessary.
+ * Locks rmid hrchy.
+ */
+static int monr_hrchy_attach_event(struct perf_event *event)
+{
+	struct monr *monr_parent;
+
+	if (!event->cgrp && !(event->attach_state & PERF_ATTACH_TASK))
+		return __monr_hrchy_attach_cpu_event(event);
+
+	/* Two-levels hierarchy: Root and all event monr underneath it. */
+	monr_parent = monr_hrchy_root;
+	return __monr_hrchy_attach_task_event(event, monr_parent);
+}
+
 /*
  * Determine if @a and @b measure the same set of tasks.
  *
@@ -228,42 +678,105 @@ static int
 intel_cqm_setup_event(struct perf_event *event, struct perf_event **group)
 {
 	struct perf_event *iter;
+	struct monr *monr;
+	*group = NULL;
 
+	lockdep_assert_held(&cqm_mutex);
 
 	list_for_each_entry(iter, &cache_groups, hw.cqm_event_groups_entry) {
+		monr = monr_from_event(iter);
 		if (__match_event(iter, event)) {
+			/* All tasks in a group share an monr. */
+			event_set_monr(event, monr);
 			*group = iter;
 			return 0;
 		}
 	}
-	return 0;
+	/*
+	 * Since no match was found, create a new monr and set this
+	 * event as head of a new cache group. All events in this cache group
+	 * will share the monr.
+	 */
+	return monr_hrchy_attach_event(event);
 }
 
 /* Read current package immediately and remote pkg (if any) from cache. */
 static void intel_cqm_event_read(struct perf_event *event)
 {
+	union prmid_summary summary;
+	struct prmid *prmid;
+	u16 pkg_id = topology_physical_package_id(smp_processor_id());
+	struct pmonr *pmonr = monr_from_event(event)->pmonrs[pkg_id];
+
+	summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+	prmid = __prmid_from_rmid(pkg_id, summary.read_rmid);
+	cqm_prmid_update(prmid);
+	local64_set(&event->count, atomic64_read(&prmid->last_read_value));
 }
 
-static void intel_cqm_event_start(struct perf_event *event, int mode)
+static inline void __intel_cqm_event_start(
+	struct perf_event *event, union prmid_summary summary)
 {
 	if (!(event->hw.state & PERF_HES_STOPPED))
 		return;
 
 	event->hw.state &= ~PERF_HES_STOPPED;
+	pqr_update_rmid(summary.sched_rmid);
+}
+
+static void intel_cqm_event_start(struct perf_event *event, int mode)
+{
+	union prmid_summary summary;
+	u16 pkg_id = topology_physical_package_id(smp_processor_id());
+	struct pmonr *pmonr = monr_from_event(event)->pmonrs[pkg_id];
+
+	/* Utilize most up to date pmonr summary. */
+	monr_hrchy_get_next_prmid_summary(pmonr);
+	summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+	__intel_cqm_event_start(event, summary);
 }
 
 static void intel_cqm_event_stop(struct perf_event *event, int mode)
 {
+	union prmid_summary summary;
+	u16 pkg_id = topology_physical_package_id(smp_processor_id());
+	struct pmonr *root_pmonr = monr_hrchy_root->pmonrs[pkg_id];
+
 	if (event->hw.state & PERF_HES_STOPPED)
 		return;
 
 	event->hw.state |= PERF_HES_STOPPED;
+
+	summary.value = atomic64_read(&root_pmonr->prmid_summary_atomic);
+	/* Occupancy of CQM events is obtained at read. No need to read
+	 * when event is stopped since reads on inactive cpus succeed.
+	 */
+	pqr_update_rmid(summary.sched_rmid);
 }
 
 static int intel_cqm_event_add(struct perf_event *event, int mode)
 {
+	struct monr *monr;
+	struct pmonr *pmonr;
+	union prmid_summary summary;
+	u16 pkg_id = topology_physical_package_id(smp_processor_id());
+
+	monr = monr_from_event(event);
+	pmonr = monr->pmonrs[pkg_id];
+
 	event->hw.state = PERF_HES_STOPPED;
 
+	/* Utilize most up to date pmonr summary. */
+	monr_hrchy_get_next_prmid_summary(pmonr);
+	summary.value = atomic64_read(&pmonr->prmid_summary_atomic);
+
+	if (!prmid_summary__is_mon_active(summary))
+		return -1;
+
+	if (mode & PERF_EF_START)
+		__intel_cqm_event_start(event, summary);
+
+	/* (I)state pmonrs cannot report occupancy for themselves. */
 	return 0;
 }
 
@@ -275,6 +788,9 @@ static inline bool cqm_group_leader(struct perf_event *event)
 static void intel_cqm_event_destroy(struct perf_event *event)
 {
 	struct perf_event *group_other = NULL;
+	struct monr *monr;
+	int i;
+	unsigned long flags;
 
 	mutex_lock(&cqm_mutex);
 	/*
@@ -292,12 +808,15 @@ static void intel_cqm_event_destroy(struct perf_event *event)
 	if (!cqm_group_leader(event))
 		goto exit;
 
+	monr = monr_from_event(event);
+
 	/*
 	 * If there was a group_other, make that leader, otherwise
 	 * destroy the group and return the RMID.
 	 */
 	if (group_other) {
 		/* Update monr reference to group head. */
+		monr->mon_event_group = group_other;
 		list_replace(&event->hw.cqm_event_groups_entry,
 			     &group_other->hw.cqm_event_groups_entry);
 		goto exit;
@@ -307,8 +826,24 @@ static void intel_cqm_event_destroy(struct perf_event *event)
 	 * Event is the only event in cache group.
 	 */
 
+	event_set_monr(event, NULL);
 	list_del(&event->hw.cqm_event_groups_entry);
 
+	if (monr__is_root(monr))
+		goto exit;
+
+	/* Transition all pmonrs to (U)state. */
+	monr_hrchy_acquire_locks(flags, i);
+
+	cqm_pkg_id_for_each_online(i)
+		__pmonr__to_ustate(monr->pmonrs[i]);
+
+	__monr__clear_mon_active(monr);
+	monr->mon_event_group = NULL;
+	__monr_hrchy_remove_leaf(monr);
+	monr_hrchy_release_locks(flags, i);
+
+	monr_dealloc(monr);
 exit:
 	mutex_unlock(&cqm_mutex);
 }
@@ -562,6 +1097,12 @@ static int __init intel_cqm_init(void)
 			goto error;
 	}
 
+	monr_hrchy_root = monr_alloc();
+	if (IS_ERR(monr_hrchy_root)) {
+		ret = PTR_ERR(monr_hrchy_root);
+		goto error;
+	}
+
 	/* Select the minimum of the maximum rmids to use as limit for
 	 * threshold. XXX: per-package threshold.
 	 */
@@ -570,6 +1111,7 @@ static int __init intel_cqm_init(void)
 			min_max_rmid = cqm_pkgs_data[i]->max_rmid;
 		intel_cqm_setup_pkg_prmid_pools(i);
 	}
+	monr_hrchy_root->flags |= MONR_MON_ACTIVE;
 
 	/*
 	 * A reasonable upper limit on the max threshold is the number
diff --git a/arch/x86/events/intel/cqm.h b/arch/x86/events/intel/cqm.h
index 7837db0..81c7af1 100644
--- a/arch/x86/events/intel/cqm.h
+++ b/arch/x86/events/intel/cqm.h
@@ -41,11 +41,117 @@ struct prmid {
 };
 
 /*
+ * Minimum time elapsed between reads of occupancy value for an RMID when
+ * traversing the monr hierarchy.
+ */
+#define RMID_DEFAULT_MIN_UPDATE_TIME 20	/* ms */
+static unsigned int __rmid_min_update_time = RMID_DEFAULT_MIN_UPDATE_TIME;
+
+static inline int cqm_prmid_update(struct prmid *prmid);
+
+/*
+ * union prmid_summary: Machine-size summary of a pmonr's prmid state.
+ * @value:		One word accessor.
+ * @rmid:		rmid for prmid.
+ * @sched_rmid:		The rmid to write in the PQR MSR.
+ * @read_rmid:		The rmid to read occupancy from.
+ *
+ * The prmid_summarys are read atomically and without the need of LOCK
+ * instructions during event and group scheduling in task context switch.
+ * They are set when a prmid changes state and allow lock-free fast paths for
+ * RMID scheduling and RMID read for the common case when prmid does not need
+ * to change state.
+ * The combination of values in sched_rmid and read_rmid indicate the state of
+ * the associated pmonr (see pmonr comments) as follows:
+ *					pmonr state
+ *	      |	 (A)state	    (U)state
+ * ----------------------------------------------------------------------------
+ * sched_rmid |	pmonr.prmid	   INVALID_RMID
+ *  read_rmid |	pmonr.prmid	   INVALID_RMID
+ *				      (or 0)
+ *
+ * The combination sched_rmid == INVALID_RMID and read_rmid == 0 for (U)state
+ * denotes that the flag MONR_MON_ACTIVE is set in the monr associated with
+ * the pmonr for this prmid_summary.
+ */
+union prmid_summary {
+	long long	value;
+	struct {
+		u32	sched_rmid;
+		u32	read_rmid;
+	};
+};
+
+/* A pmonr in (U)state has no sched_rmid, read_rmid can be 0 or INVALID_RMID
+ * depending on whether monitoring is active or not.
+ */
+inline bool prmid_summary__is_ustate(union prmid_summary summ)
+{
+	return summ.sched_rmid == INVALID_RMID;
+}
+
+inline bool prmid_summary__is_mon_active(union prmid_summary summ)
+{
+	/* If not in (U)state, then MONR_MON_ACTIVE must be set. */
+	return summ.sched_rmid != INVALID_RMID ||
+	       summ.read_rmid == 0;
+}
+
+struct monr;
+
+/* struct pmonr: Node of per-package hierarchy of MONitored Resources.
+ * @prmid:			The prmid of this pmonr -when in (A)state-.
+ * @rotation_entry:		List entry to attach to pmonr rotation lists in
+ *				pkg_data.
+ * @monr:			The monr that contains this pmonr.
+ * @pkg_id:			Auxiliary variable with pkg id for this pmonr.
+ * @prmid_summary_atomic:	Atomic accessor to store a union prmid_summary
+ *				that represents the state of this pmonr.
+ *
+ * A pmonr forms a per-package hierarchy of prmids. Each one represents a
+ * resource to be monitored and can hold a prmid. Due to rmid scarcity,
+ * rmids can be recycled and rotated. When a rmid is not available for this
+ * pmonr, the pmonr utilizes the rmid of its ancestor.
+ * A pmonr is always in one of the following states:
+ *   - (A)ctive:	Has @prmid assigned, @ancestor_pmonr must be NULL.
+ *   - (U)nused:	No @ancestor_pmonr and no @prmid, hence no available
+ *			prmid and not inheriting one either. Not in rotation list.
+ *			This state is unschedulable and a prmid
+ *			should be found (either a free one or ancestor's) before
+ *			scheduling a thread with (U)state pmonr in
+ *			a cpu in this package.
+ *
+ * The state transitions are:
+ *   (U) : The initial state. Starts there after allocation.
+ *   (U) -> (A): If on first sched (or initialization) pmonr receives a prmid.
+ *   (A) -> (U): On destruction of monr.
+ *
+ * Each pmonr is contained by a monr.
+ */
+struct pmonr {
+
+	struct prmid				*prmid;
+
+	struct monr				*monr;
+	struct list_head			rotation_entry;
+
+	u16					pkg_id;
+
+	/* all writers are sync'ed by package's lock. */
+	atomic64_t				prmid_summary_atomic;
+};
+
+/*
  * struct pkg_data: Per-package CQM data.
  * @max_rmid:			Max rmid valid for cpus in this package.
  * @prmids_by_rmid:		Utility mapping between rmid values and prmids.
  *				XXX: Make it an array of prmids.
  * @free_prmid_pool:		Free prmids.
+ * @active_prmids_pool:		prmids associated with a (A)state pmonr.
+ * @nopmonr_limbo_prmids_pool:	prmids in limbo state that are not referenced
+ *				by a pmonr.
+ * @astate_pmonrs_lru:		pmonrs in (A)state. LRU in increasing order of
+ *				pmonr.last_enter_astate.
  * @pkg_data_mutex:		Hold for stability when modifying pmonrs
  *				hierarchy.
  * @pkg_data_lock:		Hold to protect variables that may be accessed
@@ -64,6 +170,12 @@ struct pkg_data {
 	 * Pools of prmids used in rotation logic.
 	 */
 	struct list_head	free_prmids_pool;
+	/* Can be modified during task switch with (U)state -> (A)state. */
+	struct list_head	active_prmids_pool;
+	/* Only modified during rotation logic and deletion. */
+	struct list_head	nopmonr_limbo_prmids_pool;
+
+	struct list_head	astate_pmonrs_lru;
 
 	struct mutex		pkg_data_mutex;
 	raw_spinlock_t		pkg_data_lock;
@@ -71,6 +183,52 @@ struct pkg_data {
 	int			rotation_cpu;
 };
 
+/*
+ * Flags for monr.
+ */
+#define MONR_MON_ACTIVE		0x1
+
+/*
+ * struct monr: MONitored Resource.
+ * @flags:		Flags field for monr (XXX: More flags will be added
+ *			with MBM).
+ * @mon_event_group:	The head of event's group that use this monr, if any.
+ * @parent:		Parent in monr hierarchy.
+ * @children:		List of children in monr hierarchy.
+ * @parent_entry:	Entry in parent's children list.
+ * @pmonrs:		Per-package pmonr for this monr.
+ *
+ * Each cgroup or thread that requires a RMID will have a corresponding
+ * monr in the system-wide hierarchy reflecting its position in the
+ * cgroup/thread hierarchy.
+ * An monr is assigned to every CQM event and/or monitored cgroups when
+ * monitoring is activated and that instance's address does not change during
+ * the lifetime of the event or cgroup.
+ *
+ * On creation, the monr has flags cleared and all its pmonrs in (U)state.
+ * The flag MONR_MON_ACTIVE must be set to enable any transition out of
+ * (U)state to occur.
+ */
+struct monr {
+	u16				flags;
+	/* Back reference pointers */
+	struct perf_event		*mon_event_group;
+
+	struct monr			*parent;
+	struct list_head		children;
+	struct list_head		parent_entry;
+	struct pmonr			**pmonrs;
+};
+
+/*
+ * Root for system-wide hierarchy of monr.
+ * A per-package raw_spin_lock protects changes to the per-pkg elements of
+ * the monr hierarchy.
+ * To modify the monr hierarchy, must hold all locks in each package
+ * using package id as nesting parameter.
+ */
+extern struct monr *monr_hrchy_root;
+
 extern struct pkg_data **cqm_pkgs_data;
 
 static inline u16 __cqm_pkgs_data_next_online(u16 pkg_id)
-- 
2.8.0.rc3.226.g39d4020