linux-kernel - Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 1 Sep 2017 14:17:17 +0300
From:   Alexey Budankov <alexey.budankov@...ux.intel.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Kan Liang <kan.liang@...el.com>,
        Dmitri Prokhorov <Dmitry.Prohorov@...el.com>,
        Valery Cherepennikov <valery.cherepennikov@...el.com>,
        Mark Rutland <mark.rutland@....com>,
        Stephane Eranian <eranian@...gle.com>,
        David Carrillo-Cisneros <davidcc@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Vince Weaver <vince@...ter.net>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping

On 31.08.2017 20:18, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
>> On 22.08.2017 23:47, Peter Zijlstra wrote:
>>> On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>>>> The key thing in the patch is explicit updating of tstamp fields for
>>>> INACTIVE events in update_event_times().
>>>
>>>> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>>>>  	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>>>>  		return;
>>>>  
>>>> +	if (event->state == PERF_EVENT_STATE_INACTIVE)
>>>> +		perf_event_tstamp_update(event);
>>>> +
>>>>  	/*
>>>>  	 * in cgroup mode, time_enabled represents
>>>>  	 * the time the event was enabled AND active
>>>
>>> But why!? I thought the whole point was to not need to do this.
>>
>> update_event_times() is not called from timer interrupt handler 
>> thus it is not on the critical path which is optimized in this patch set.
>>
>> But update_event_times() is called in the context of read() syscall so
>> this is the place where we may update event times for INACTIVE events 
>> instead of timer interrupt.
>>
>> Also update_event_times() is called on thread context switch out so
>> we get event times also updated when the thread migrates to other CPU.
>>
>>>
>>> The thing I outlined earlier would only need to update timestamps when
>>> events change state and at no other point in time.
>>
>> But we still may request times while event is in INACTIVE state 
>> thru read() syscall and event timings need to be up-to-date. 
> 
> Sure, read() also updates.
> 
> So the below completely rewrites timekeeping (and probably breaks
> world) but does away with the need to touch events that don't get
> scheduled.
> 
> Esp the cgroup stuff is entirely untested since I simply don't know how
> to operate that. I did run Vince's tests on it, and I think it doesn't
> regress, but I'm near a migraine so I can't really see straight atm.
> 
> Vince, Stephane, could you guys have a peek?
> 
> (There's a few other bits in, I'll break up into patches and write
> comments and Changelogs later, I think its can be split in some 5
> patches).
> 
> The basic idea is really simple, we have a single timestamp and
> depending on the state we update enabled/running. This obviously only
> requires updates when we change state and when we need up-to-date
> timestamps (read).
> 
> No more weird and wonderful mind bending interaction between 3 different
> timestamps with arcane update rules.
> 
> ---
>  include/linux/perf_event.h |  25 +-
>  kernel/events/core.c       | 551 ++++++++++++++++-----------------------------
>  2 files changed, 192 insertions(+), 384 deletions(-)
> 

Tried to apply on top of this:

perf/core 1b2f76d77a277bb70d38ad0991ed7f16bbc115a9 [origin/perf/core] Merge tag 'perf-core-for-mingo-4.14-20170829' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

but failed:

Checking patch include/linux/perf_event.h...
Hunk #1 succeeded at 498 (offset 13 lines).
Hunk #2 succeeded at 591 (offset 13 lines).
Hunk #3 succeeded at 601 (offset 13 lines).
Hunk #4 succeeded at 696 (offset 13 lines).
Checking patch kernel/events/core.c...
error: while searching for:
	return event_cpu;
}

/*
 * Cross CPU call to read the hardware event
 */
static void __perf_event_read(void *info)
{
	struct perf_read_data *data = info;
	struct perf_event *sub, *event = data->event;
	struct perf_event_context *ctx = event->ctx;
	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
	struct pmu *pmu = event->pmu;

	/*
	 * If this is a task context, we need to check whether it is
	 * the current task context of this cpu.  If not it has been
	 * scheduled out before the smp call arrived.  In that case
	 * event->count would have been updated to a recent sample
	 * when the event was scheduled out.
	 */
	if (ctx->task && cpuctx->task_ctx != ctx)
		return;

	raw_spin_lock(&ctx->lock);
	if (ctx->is_active) {
		update_context_time(ctx);
		update_cgrp_time_from_event(event);
	}

	update_event_times(event);
	if (event->state != PERF_EVENT_STATE_ACTIVE)
		goto unlock;

	if (!data->group) {
		pmu->read(event);
		data->ret = 0;
		goto unlock;
	}

	pmu->start_txn(pmu, PERF_PMU_TXN_READ);

	pmu->read(event);

	list_for_each_entry(sub, &event->sibling_list, group_entry) {
		update_event_times(sub);
		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
			/*
			 * Use sibling's PMU rather than @event's since
			 * sibling could be on different (eg: software) PMU.
			 */
			sub->pmu->read(sub);
		}
	}

	data->ret = pmu->commit_txn(pmu);

unlock:
	raw_spin_unlock(&ctx->lock);
}

static inline u64 perf_event_count(struct perf_event *event)
{
	return local64_read(&event->count) + atomic64_read(&event->child_count);

error: patch failed: kernel/events/core.c:3613
error: kernel/events/core.c: patch does not apply

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8e22f24ded6a..2a6ae48a1a96 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -485,9 +485,9 @@ struct perf_addr_filters_head {
>  };
>  
>  /**
> - * enum perf_event_active_state - the states of a event
> + * enum perf_event_state - the states of a event
>   */
> -enum perf_event_active_state {
> +enum perf_event_state {
>  	PERF_EVENT_STATE_DEAD		= -4,
>  	PERF_EVENT_STATE_EXIT		= -3,
>  	PERF_EVENT_STATE_ERROR		= -2,
> @@ -578,7 +578,7 @@ struct perf_event {
>  	struct pmu			*pmu;
>  	void				*pmu_private;
>  
> -	enum perf_event_active_state	state;
> +	enum perf_event_state		state;
>  	unsigned int			attach_state;
>  	local64_t			count;
>  	atomic64_t			child_count;
> @@ -588,26 +588,10 @@ struct perf_event {
>  	 * has been enabled (i.e. eligible to run, and the task has
>  	 * been scheduled in, if this is a per-task event)
>  	 * and running (scheduled onto the CPU), respectively.
> -	 *
> -	 * They are computed from tstamp_enabled, tstamp_running and
> -	 * tstamp_stopped when the event is in INACTIVE or ACTIVE state.
>  	 */
>  	u64				total_time_enabled;
>  	u64				total_time_running;
> -
> -	/*
> -	 * These are timestamps used for computing total_time_enabled
> -	 * and total_time_running when the event is in INACTIVE or
> -	 * ACTIVE state, measured in nanoseconds from an arbitrary point
> -	 * in time.
> -	 * tstamp_enabled: the notional time when the event was enabled
> -	 * tstamp_running: the notional time when the event was scheduled on
> -	 * tstamp_stopped: in INACTIVE state, the notional time when the
> -	 *	event was scheduled off.
> -	 */
> -	u64				tstamp_enabled;
> -	u64				tstamp_running;
> -	u64				tstamp_stopped;
> +	u64				tstamp;
>  
>  	/*
>  	 * timestamp shadows the actual context timing but it can
> @@ -699,7 +683,6 @@ struct perf_event {
>  
>  #ifdef CONFIG_CGROUP_PERF
>  	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
> -	int				cgrp_defer_enabled;
>  #endif
>  
>  	struct list_head		sb_list;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 294f1927f944..e968b3eab9c7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event)
>  	return event->clock();
>  }
>  
> +/*
> + * XXX comment about timekeeping goes here
> + */
> +
> +static __always_inline enum perf_event_state
> +__perf_effective_state(struct perf_event *event)
> +{
> +	struct perf_event *leader = event->group_leader;
> +
> +	if (leader->state <= PERF_EVENT_STATE_OFF)
> +		return leader->state;
> +
> +	return event->state;
> +}
> +
> +static __always_inline void
> +__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
> +{
> +	enum perf_event_state state = __perf_effective_state(event);
> +	u64 delta = now - event->tstamp;
> +
> +	*enabled = event->total_time_enabled;
> +	if (state >= PERF_EVENT_STATE_INACTIVE)
> +		*enabled += delta;
> +
> +	*running = event->total_time_running;
> +	if (state >= PERF_EVENT_STATE_ACTIVE)
> +		*running += delta;
> +}
> +
> +static void perf_event_update_time(struct perf_event *event)
> +{
> +	u64 now = perf_event_time(event);
> +
> +	__perf_update_times(event, now, &event->total_time_enabled,
> +					&event->total_time_running);
> +	event->tstamp = now;
> +}
> +
> +static void perf_event_update_sibling_time(struct perf_event *leader)
> +{
> +	struct perf_event *sibling;
> +
> +	list_for_each_entry(sibling, &leader->sibling_list, group_entry)
> +		perf_event_update_time(sibling);
> +}
> +
> +static void
> +perf_event_set_state(struct perf_event *event, enum perf_event_state state)
> +{
> +	if (event->state == state)
> +		return;
> +
> +	perf_event_update_time(event);
> +	/*
> +	 * If a group leader gets enabled/disabled all its siblings
> +	 * are affected too.
> +	 */
> +	if ((event->state < 0) ^ (state < 0))
> +		perf_event_update_sibling_time(event);
> +
> +	WRITE_ONCE(event->state, state);
> +}
> +
>  #ifdef CONFIG_CGROUP_PERF
>  
>  static inline bool
> @@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
>  	event->shadow_ctx_time = now - t->timestamp;
>  }
>  
> -static inline void
> -perf_cgroup_defer_enabled(struct perf_event *event)
> -{
> -	/*
> -	 * when the current task's perf cgroup does not match
> -	 * the event's, we need to remember to call the
> -	 * perf_mark_enable() function the first time a task with
> -	 * a matching perf cgroup is scheduled in.
> -	 */
> -	if (is_cgroup_event(event) && !perf_cgroup_match(event))
> -		event->cgrp_defer_enabled = 1;
> -}
> -
> -static inline void
> -perf_cgroup_mark_enabled(struct perf_event *event,
> -			 struct perf_event_context *ctx)
> -{
> -	struct perf_event *sub;
> -	u64 tstamp = perf_event_time(event);
> -
> -	if (!event->cgrp_defer_enabled)
> -		return;
> -
> -	event->cgrp_defer_enabled = 0;
> -
> -	event->tstamp_enabled = tstamp - event->total_time_enabled;
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
> -			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
> -			sub->cgrp_defer_enabled = 0;
> -		}
> -	}
> -}
> -
>  /*
>   * Update cpuctx->cgrp so that it is set when first cgroup event is added and
>   * cleared when last cgroup event is removed.
> @@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
>  }
>  
>  static inline void
> -perf_cgroup_defer_enabled(struct perf_event *event)
> -{
> -}
> -
> -static inline void
> -perf_cgroup_mark_enabled(struct perf_event *event,
> -			 struct perf_event_context *ctx)
> -{
> -}
> -
> -static inline void
>  list_update_cgroup_event(struct perf_event *event,
>  			 struct perf_event_context *ctx, bool add)
>  {
> @@ -1396,60 +1415,6 @@ static u64 perf_event_time(struct perf_event *event)
>  	return ctx ? ctx->time : 0;
>  }
>  
> -/*
> - * Update the total_time_enabled and total_time_running fields for a event.
> - */
> -static void update_event_times(struct perf_event *event)
> -{
> -	struct perf_event_context *ctx = event->ctx;
> -	u64 run_end;
> -
> -	lockdep_assert_held(&ctx->lock);
> -
> -	if (event->state < PERF_EVENT_STATE_INACTIVE ||
> -	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
> -		return;
> -
> -	/*
> -	 * in cgroup mode, time_enabled represents
> -	 * the time the event was enabled AND active
> -	 * tasks were in the monitored cgroup. This is
> -	 * independent of the activity of the context as
> -	 * there may be a mix of cgroup and non-cgroup events.
> -	 *
> -	 * That is why we treat cgroup events differently
> -	 * here.
> -	 */
> -	if (is_cgroup_event(event))
> -		run_end = perf_cgroup_event_time(event);
> -	else if (ctx->is_active)
> -		run_end = ctx->time;
> -	else
> -		run_end = event->tstamp_stopped;
> -
> -	event->total_time_enabled = run_end - event->tstamp_enabled;
> -
> -	if (event->state == PERF_EVENT_STATE_INACTIVE)
> -		run_end = event->tstamp_stopped;
> -	else
> -		run_end = perf_event_time(event);
> -
> -	event->total_time_running = run_end - event->tstamp_running;
> -
> -}
> -
> -/*
> - * Update total_time_enabled and total_time_running for all events in a group.
> - */
> -static void update_group_times(struct perf_event *leader)
> -{
> -	struct perf_event *event;
> -
> -	update_event_times(leader);
> -	list_for_each_entry(event, &leader->sibling_list, group_entry)
> -		update_event_times(event);
> -}
> -
>  static enum event_type_t get_event_type(struct perf_event *event)
>  {
>  	struct perf_event_context *ctx = event->ctx;
> @@ -1492,6 +1457,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>  	WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
>  	event->attach_state |= PERF_ATTACH_CONTEXT;
>  
> +	event->tstamp = perf_event_time(event);
> +
>  	/*
>  	 * If we're a stand alone event or group leader, we go to the context
>  	 * list, group events are kept attached to the group so that
> @@ -1699,8 +1666,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>  	if (event->group_leader == event)
>  		list_del_init(&event->group_entry);
>  
> -	update_group_times(event);
> -
>  	/*
>  	 * If event was in error state, then keep it
>  	 * that way, otherwise bogus counts will be
> @@ -1709,7 +1674,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>  	 * of the event
>  	 */
>  	if (event->state > PERF_EVENT_STATE_OFF)
> -		event->state = PERF_EVENT_STATE_OFF;
> +		perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>  
>  	ctx->generation++;
>  }
> @@ -1808,38 +1773,24 @@ event_sched_out(struct perf_event *event,
>  		  struct perf_cpu_context *cpuctx,
>  		  struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
> -	u64 delta;
> +	enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
>  
>  	WARN_ON_ONCE(event->ctx != ctx);
>  	lockdep_assert_held(&ctx->lock);
>  
> -	/*
> -	 * An event which could not be activated because of
> -	 * filter mismatch still needs to have its timings
> -	 * maintained, otherwise bogus information is return
> -	 * via read() for time_enabled, time_running:
> -	 */
> -	if (event->state == PERF_EVENT_STATE_INACTIVE &&
> -	    !event_filter_match(event)) {
> -		delta = tstamp - event->tstamp_stopped;
> -		event->tstamp_running += delta;
> -		event->tstamp_stopped = tstamp;
> -	}
> -
>  	if (event->state != PERF_EVENT_STATE_ACTIVE)
>  		return;
>  
>  	perf_pmu_disable(event->pmu);
>  
> -	event->tstamp_stopped = tstamp;
>  	event->pmu->del(event, 0);
>  	event->oncpu = -1;
> -	event->state = PERF_EVENT_STATE_INACTIVE;
> +
>  	if (event->pending_disable) {
>  		event->pending_disable = 0;
> -		event->state = PERF_EVENT_STATE_OFF;
> +		state = PERF_EVENT_STATE_OFF;
>  	}
> +	perf_event_set_state(event, state);
>  
>  	if (!is_software_event(event))
>  		cpuctx->active_oncpu--;
> @@ -1859,7 +1810,9 @@ group_sched_out(struct perf_event *group_event,
>  		struct perf_event_context *ctx)
>  {
>  	struct perf_event *event;
> -	int state = group_event->state;
> +
> +	if (group_event->state != PERF_EVENT_STATE_ACTIVE)
> +		return;
>  
>  	perf_pmu_disable(ctx->pmu);
>  
> @@ -1873,7 +1826,7 @@ group_sched_out(struct perf_event *group_event,
>  
>  	perf_pmu_enable(ctx->pmu);
>  
> -	if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
> +	if (group_event->attr.exclusive)
>  		cpuctx->exclusive = 0;
>  }
>  
> @@ -1893,6 +1846,11 @@ __perf_remove_from_context(struct perf_event *event,
>  {
>  	unsigned long flags = (unsigned long)info;
>  
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
> +
>  	event_sched_out(event, cpuctx, ctx);
>  	if (flags & DETACH_GROUP)
>  		perf_group_detach(event);
> @@ -1955,14 +1913,17 @@ static void __perf_event_disable(struct perf_event *event,
>  	if (event->state < PERF_EVENT_STATE_INACTIVE)
>  		return;
>  
> -	update_context_time(ctx);
> -	update_cgrp_time_from_event(event);
> -	update_group_times(event);
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
> +
>  	if (event == event->group_leader)
>  		group_sched_out(event, cpuctx, ctx);
>  	else
>  		event_sched_out(event, cpuctx, ctx);
> -	event->state = PERF_EVENT_STATE_OFF;
> +
> +	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>  }
>  
>  /*
> @@ -2019,8 +1980,7 @@ void perf_event_disable_inatomic(struct perf_event *event)
>  }
>  
>  static void perf_set_shadow_time(struct perf_event *event,
> -				 struct perf_event_context *ctx,
> -				 u64 tstamp)
> +				 struct perf_event_context *ctx)
>  {
>  	/*
>  	 * use the correct time source for the time snapshot
> @@ -2048,9 +2008,9 @@ static void perf_set_shadow_time(struct perf_event *event,
>  	 * is cleaner and simpler to understand.
>  	 */
>  	if (is_cgroup_event(event))
> -		perf_cgroup_set_shadow_time(event, tstamp);
> +		perf_cgroup_set_shadow_time(event, event->tstamp);
>  	else
> -		event->shadow_ctx_time = tstamp - ctx->timestamp;
> +		event->shadow_ctx_time = event->tstamp - ctx->timestamp;
>  }
>  
>  #define MAX_INTERRUPTS (~0ULL)
> @@ -2063,7 +2023,6 @@ event_sched_in(struct perf_event *event,
>  		 struct perf_cpu_context *cpuctx,
>  		 struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
>  	int ret = 0;
>  
>  	lockdep_assert_held(&ctx->lock);
> @@ -2077,7 +2036,7 @@ event_sched_in(struct perf_event *event,
>  	 * is visible.
>  	 */
>  	smp_wmb();
> -	WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE);
> +	perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE);
>  
>  	/*
>  	 * Unthrottle events, since we scheduled we might have missed several
> @@ -2089,26 +2048,19 @@ event_sched_in(struct perf_event *event,
>  		event->hw.interrupts = 0;
>  	}
>  
> -	/*
> -	 * The new state must be visible before we turn it on in the hardware:
> -	 */
> -	smp_wmb();
> -
>  	perf_pmu_disable(event->pmu);
>  
> -	perf_set_shadow_time(event, ctx, tstamp);
> +	perf_set_shadow_time(event, ctx);
>  
>  	perf_log_itrace_start(event);
>  
>  	if (event->pmu->add(event, PERF_EF_START)) {
> -		event->state = PERF_EVENT_STATE_INACTIVE;
> +		perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  		event->oncpu = -1;
>  		ret = -EAGAIN;
>  		goto out;
>  	}
>  
> -	event->tstamp_running += tstamp - event->tstamp_stopped;
> -
>  	if (!is_software_event(event))
>  		cpuctx->active_oncpu++;
>  	if (!ctx->nr_active++)
> @@ -2132,8 +2084,6 @@ group_sched_in(struct perf_event *group_event,
>  {
>  	struct perf_event *event, *partial_group = NULL;
>  	struct pmu *pmu = ctx->pmu;
> -	u64 now = ctx->time;
> -	bool simulate = false;
>  
>  	if (group_event->state == PERF_EVENT_STATE_OFF)
>  		return 0;
> @@ -2163,27 +2113,13 @@ group_sched_in(struct perf_event *group_event,
>  	/*
>  	 * Groups can be scheduled in as one unit only, so undo any
>  	 * partial group before returning:
> -	 * The events up to the failed event are scheduled out normally,
> -	 * tstamp_stopped will be updated.
> -	 *
> -	 * The failed events and the remaining siblings need to have
> -	 * their timings updated as if they had gone thru event_sched_in()
> -	 * and event_sched_out(). This is required to get consistent timings
> -	 * across the group. This also takes care of the case where the group
> -	 * could never be scheduled by ensuring tstamp_stopped is set to mark
> -	 * the time the event was actually stopped, such that time delta
> -	 * calculation in update_event_times() is correct.
> +	 * The events up to the failed event are scheduled out normally.
>  	 */
>  	list_for_each_entry(event, &group_event->sibling_list, group_entry) {
>  		if (event == partial_group)
> -			simulate = true;
> +			break;
>  
> -		if (simulate) {
> -			event->tstamp_running += now - event->tstamp_stopped;
> -			event->tstamp_stopped = now;
> -		} else {
> -			event_sched_out(event, cpuctx, ctx);
> -		}
> +		event_sched_out(event, cpuctx, ctx);
>  	}
>  	event_sched_out(group_event, cpuctx, ctx);
>  
> @@ -2225,46 +2161,11 @@ static int group_can_go_on(struct perf_event *event,
>  	return can_add_hw;
>  }
>  
> -/*
> - * Complement to update_event_times(). This computes the tstamp_* values to
> - * continue 'enabled' state from @now, and effectively discards the time
> - * between the prior tstamp_stopped and now (as we were in the OFF state, or
> - * just switched (context) time base).
> - *
> - * This further assumes '@...nt->state == INACTIVE' (we just came from OFF) and
> - * cannot have been scheduled in yet. And going into INACTIVE state means
> - * '@...nt->tstamp_stopped = @now'.
> - *
> - * Thus given the rules of update_event_times():
> - *
> - *   total_time_enabled = tstamp_stopped - tstamp_enabled
> - *   total_time_running = tstamp_stopped - tstamp_running
> - *
> - * We can insert 'tstamp_stopped == now' and reverse them to compute new
> - * tstamp_* values.
> - */
> -static void __perf_event_enable_time(struct perf_event *event, u64 now)
> -{
> -	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
> -
> -	event->tstamp_stopped = now;
> -	event->tstamp_enabled = now - event->total_time_enabled;
> -	event->tstamp_running = now - event->total_time_running;
> -}
> -
>  static void add_event_to_ctx(struct perf_event *event,
>  			       struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
> -
>  	list_add_event(event, ctx);
>  	perf_group_attach(event);
> -	/*
> -	 * We can be called with event->state == STATE_OFF when we create with
> -	 * .disabled = 1. In that case the IOC_ENABLE will call this function.
> -	 */
> -	if (event->state == PERF_EVENT_STATE_INACTIVE)
> -		__perf_event_enable_time(event, tstamp);
>  }
>  
>  static void ctx_sched_out(struct perf_event_context *ctx,
> @@ -2496,28 +2397,6 @@ perf_install_in_context(struct perf_event_context *ctx,
>  }
>  
>  /*
> - * Put a event into inactive state and update time fields.
> - * Enabling the leader of a group effectively enables all
> - * the group members that aren't explicitly disabled, so we
> - * have to update their ->tstamp_enabled also.
> - * Note: this works for group members as well as group leaders
> - * since the non-leader members' sibling_lists will be empty.
> - */
> -static void __perf_event_mark_enabled(struct perf_event *event)
> -{
> -	struct perf_event *sub;
> -	u64 tstamp = perf_event_time(event);
> -
> -	event->state = PERF_EVENT_STATE_INACTIVE;
> -	__perf_event_enable_time(event, tstamp);
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		/* XXX should not be > INACTIVE if event isn't */
> -		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
> -			__perf_event_enable_time(sub, tstamp);
> -	}
> -}
> -
> -/*
>   * Cross CPU call to enable a performance event
>   */
>  static void __perf_event_enable(struct perf_event *event,
> @@ -2535,14 +2414,12 @@ static void __perf_event_enable(struct perf_event *event,
>  	if (ctx->is_active)
>  		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  
> -	__perf_event_mark_enabled(event);
> +	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  
>  	if (!ctx->is_active)
>  		return;
>  
>  	if (!event_filter_match(event)) {
> -		if (is_cgroup_event(event))
> -			perf_cgroup_defer_enabled(event);
>  		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
>  		return;
>  	}
> @@ -2862,18 +2739,10 @@ static void __perf_event_sync_stat(struct perf_event *event,
>  	 * we know the event must be on the current CPU, therefore we
>  	 * don't need to use it.
>  	 */
> -	switch (event->state) {
> -	case PERF_EVENT_STATE_ACTIVE:
> +	if (event->state == PERF_EVENT_STATE_ACTIVE)
>  		event->pmu->read(event);
> -		/* fall-through */
>  
> -	case PERF_EVENT_STATE_INACTIVE:
> -		update_event_times(event);
> -		break;
> -
> -	default:
> -		break;
> -	}
> +	perf_event_update_time(event);
>  
>  	/*
>  	 * In order to keep per-task stats reliable we need to flip the event
> @@ -3110,10 +2979,6 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>  		if (!event_filter_match(event))
>  			continue;
>  
> -		/* may need to reset tstamp_enabled */
> -		if (is_cgroup_event(event))
> -			perf_cgroup_mark_enabled(event, ctx);
> -
>  		if (group_can_go_on(event, cpuctx, 1))
>  			group_sched_in(event, cpuctx, ctx);
>  
> @@ -3121,10 +2986,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>  		 * If this pinned group hasn't been scheduled,
>  		 * put it in error state.
>  		 */
> -		if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -			update_group_times(event);
> -			event->state = PERF_EVENT_STATE_ERROR;
> -		}
> +		if (event->state == PERF_EVENT_STATE_INACTIVE)
> +			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>  	}
>  }
>  
> @@ -3146,10 +3009,6 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>  		if (!event_filter_match(event))
>  			continue;
>  
> -		/* may need to reset tstamp_enabled */
> -		if (is_cgroup_event(event))
> -			perf_cgroup_mark_enabled(event, ctx);
> -
>  		if (group_can_go_on(event, cpuctx, can_add_hw)) {
>  			if (group_sched_in(event, cpuctx, ctx))
>  				can_add_hw = 0;
> @@ -3541,7 +3400,7 @@ static int event_enable_on_exec(struct perf_event *event,
>  	if (event->state >= PERF_EVENT_STATE_INACTIVE)
>  		return 0;
>  
> -	__perf_event_mark_enabled(event);
> +	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  
>  	return 1;
>  }
> @@ -3590,12 +3449,6 @@ static void perf_event_enable_on_exec(int ctxn)
>  		put_ctx(clone_ctx);
>  }
>  
> -struct perf_read_data {
> -	struct perf_event *event;
> -	bool group;
> -	int ret;
> -};
> -
>  static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>  {
>  	u16 local_pkg, event_pkg;
> @@ -3613,64 +3466,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>  	return event_cpu;
>  }
>  
> -/*
> - * Cross CPU call to read the hardware event
> - */
> -static void __perf_event_read(void *info)
> -{
> -	struct perf_read_data *data = info;
> -	struct perf_event *sub, *event = data->event;
> -	struct perf_event_context *ctx = event->ctx;
> -	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
> -	struct pmu *pmu = event->pmu;
> -
> -	/*
> -	 * If this is a task context, we need to check whether it is
> -	 * the current task context of this cpu.  If not it has been
> -	 * scheduled out before the smp call arrived.  In that case
> -	 * event->count would have been updated to a recent sample
> -	 * when the event was scheduled out.
> -	 */
> -	if (ctx->task && cpuctx->task_ctx != ctx)
> -		return;
> -
> -	raw_spin_lock(&ctx->lock);
> -	if (ctx->is_active) {
> -		update_context_time(ctx);
> -		update_cgrp_time_from_event(event);
> -	}
> -
> -	update_event_times(event);
> -	if (event->state != PERF_EVENT_STATE_ACTIVE)
> -		goto unlock;
> -
> -	if (!data->group) {
> -		pmu->read(event);
> -		data->ret = 0;
> -		goto unlock;
> -	}
> -
> -	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> -
> -	pmu->read(event);
> -
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		update_event_times(sub);
> -		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
> -			/*
> -			 * Use sibling's PMU rather than @event's since
> -			 * sibling could be on different (eg: software) PMU.
> -			 */
> -			sub->pmu->read(sub);
> -		}
> -	}
> -
> -	data->ret = pmu->commit_txn(pmu);
> -
> -unlock:
> -	raw_spin_unlock(&ctx->lock);
> -}
> -
>  static inline u64 perf_event_count(struct perf_event *event)
>  {
>  	return local64_read(&event->count) + atomic64_read(&event->child_count);
> @@ -3733,63 +3528,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
>  	return ret;
>  }
>  
> -static int perf_event_read(struct perf_event *event, bool group)
> +struct perf_read_data {
> +	struct perf_event *event;
> +	bool group;
> +	int ret;
> +};
> +
> +static void __perf_event_read(struct perf_event *event,
> +			      struct perf_cpu_context *cpuctx,
> +			      struct perf_event_context *ctx,
> +			      void *data)
>  {
> -	int event_cpu, ret = 0;
> +	struct perf_read_data *prd = data;
> +	struct pmu *pmu = event->pmu;
> +	struct perf_event *sibling;
>  
> -	/*
> -	 * If event is enabled and currently active on a CPU, update the
> -	 * value in the event structure:
> -	 */
> -	if (event->state == PERF_EVENT_STATE_ACTIVE) {
> -		struct perf_read_data data = {
> -			.event = event,
> -			.group = group,
> -			.ret = 0,
> -		};
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
>  
> -		event_cpu = READ_ONCE(event->oncpu);
> -		if ((unsigned)event_cpu >= nr_cpu_ids)
> -			return 0;
> +	perf_event_update_time(event);
> +	if (prd->group)
> +		perf_event_update_sibling_time(event);
>  
> -		preempt_disable();
> -		event_cpu = __perf_event_read_cpu(event, event_cpu);
> +	if (event->state != PERF_EVENT_STATE_ACTIVE)
> +		return;
>  
> +	if (!prd->group) {
> +		pmu->read(event);
> +		prd->ret = 0;
> +		return;
> +	}
> +
> +	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> +
> +	pmu->read(event);
> +	list_for_each_entry(sibling, &event->sibling_list, group_entry) {
> +		if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
> +			/*
> +			 * Use sibling's PMU rather than @event's since
> +			 * sibling could be on different (eg: software) PMU.
> +			 */
> +			sibling->pmu->read(sibling);
> +		}
> +	}
> +
> +	prd->ret = pmu->commit_txn(pmu);
> +}
> +
> +static int perf_event_read(struct perf_event *event, bool group)
> +{
> +	struct perf_read_data prd = {
> +		.event = event,
> +		.group = group,
> +		.ret = 0,
> +	};
> +
> +	if (event->ctx->task) {
> +		event_function_call(event, __perf_event_read, &prd);
> +	} else {
>  		/*
> -		 * Purposely ignore the smp_call_function_single() return
> -		 * value.
> -		 *
> -		 * If event_cpu isn't a valid CPU it means the event got
> -		 * scheduled out and that will have updated the event count.
> -		 *
> -		 * Therefore, either way, we'll have an up-to-date event count
> -		 * after this.
> -		 */
> -		(void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
> -		preempt_enable();
> -		ret = data.ret;
> -	} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -		struct perf_event_context *ctx = event->ctx;
> -		unsigned long flags;
> -
> -		raw_spin_lock_irqsave(&ctx->lock, flags);
> -		/*
> -		 * may read while context is not active
> -		 * (e.g., thread is blocked), in that case
> -		 * we cannot update context time
> +		 * For uncore events (which are per definition per-cpu)
> +		 * allow a different read CPU from event->cpu.
>  		 */
> -		if (ctx->is_active) {
> -			update_context_time(ctx);
> -			update_cgrp_time_from_event(event);
> -		}
> -		if (group)
> -			update_group_times(event);
> -		else
> -			update_event_times(event);
> -		raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +		struct event_function_struct efs = {
> +			.event = event,
> +			.func = __perf_event_read,
> +			.data = &prd,
> +		};
> +		int cpu = __perf_event_read_cpu(event, event->cpu);
> +
> +		cpu_function_call(cpu, event_function, &efs);
>  	}
>  
> -	return ret;
> +	return prd.ret;
>  }
>  
>  /*
> @@ -4388,7 +4201,7 @@ static int perf_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>  
> -u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
> +static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>  {
>  	struct perf_event *child;
>  	u64 total = 0;
> @@ -4416,6 +4229,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>  
>  	return total;
>  }
> +
> +u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
> +{
> +	struct perf_event_context *ctx;
> +	u64 count;
> +
> +	ctx = perf_event_ctx_lock(event);
> +	count = __perf_event_read_value(event, enabled, running);
> +	perf_event_ctx_unlock(event, ctx);
> +
> +	return count;
> +}
>  EXPORT_SYMBOL_GPL(perf_event_read_value);
>  
>  static int __perf_read_group_add(struct perf_event *leader,
> @@ -4431,6 +4256,8 @@ static int __perf_read_group_add(struct perf_event *leader,
>  	if (ret)
>  		return ret;
>  
> +	raw_spin_lock_irqsave(&ctx->lock, flags);
> +
>  	/*
>  	 * Since we co-schedule groups, {enabled,running} times of siblings
>  	 * will be identical to those of the leader, so we only publish one
> @@ -4453,8 +4280,6 @@ static int __perf_read_group_add(struct perf_event *leader,
>  	if (read_format & PERF_FORMAT_ID)
>  		values[n++] = primary_event_id(leader);
>  
> -	raw_spin_lock_irqsave(&ctx->lock, flags);
> -
>  	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>  		values[n++] += perf_event_count(sub);
>  		if (read_format & PERF_FORMAT_ID)
> @@ -4518,7 +4343,7 @@ static int perf_read_one(struct perf_event *event,
>  	u64 values[4];
>  	int n = 0;
>  
> -	values[n++] = perf_event_read_value(event, &enabled, &running);
> +	values[n++] = __perf_event_read_value(event, &enabled, &running);
>  	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
>  		values[n++] = enabled;
>  	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> @@ -4897,8 +4722,7 @@ static void calc_timer_values(struct perf_event *event,
>  
>  	*now = perf_clock();
>  	ctx_time = event->shadow_ctx_time + *now;
> -	*enabled = ctx_time - event->tstamp_enabled;
> -	*running = ctx_time - event->tstamp_running;
> +	__perf_update_times(event, ctx_time, enabled, running);
>  }
>  
>  static void perf_event_init_userpage(struct perf_event *event)
> @@ -10516,7 +10340,7 @@ perf_event_exit_event(struct perf_event *child_event,
>  	if (parent_event)
>  		perf_group_detach(child_event);
>  	list_del_event(child_event, child_ctx);
> -	child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */
> +	perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */
>  	raw_spin_unlock_irq(&child_ctx->lock);
>  
>  	/*
> @@ -10754,7 +10578,7 @@ inherit_event(struct perf_event *parent_event,
>  	      struct perf_event *group_leader,
>  	      struct perf_event_context *child_ctx)
>  {
> -	enum perf_event_active_state parent_state = parent_event->state;
> +	enum perf_event_state parent_state = parent_event->state;
>  	struct perf_event *child_event;
>  	unsigned long flags;
>  
> @@ -11090,6 +10914,7 @@ static void __perf_event_exit_context(void *__info)
>  	struct perf_event *event;
>  
>  	raw_spin_lock(&ctx->lock);
> +	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  	list_for_each_entry(event, &ctx->event_list, event_entry)
>  		__perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
>  	raw_spin_unlock(&ctx->lock);
>