Date:   Fri, 1 Sep 2017 13:45:17 +0300
From:   Alexey Budankov <alexey.budankov@...ux.intel.com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Andi Kleen <ak@...ux.intel.com>,
        Kan Liang <kan.liang@...el.com>,
        Dmitri Prokhorov <Dmitry.Prohorov@...el.com>,
        Valery Cherepennikov <valery.cherepennikov@...el.com>,
        Mark Rutland <mark.rutland@....com>,
        Stephane Eranian <eranian@...gle.com>,
        David Carrillo-Cisneros <davidcc@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Vince Weaver <vince@...ter.net>,
        Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping

On 31.08.2017 20:18, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
>> On 22.08.2017 23:47, Peter Zijlstra wrote:
>>> On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>>>> The key thing in the patch is explicit updating of tstamp fields for
>>>> INACTIVE events in update_event_times().
>>>
>>>> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>>>>  	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>>>>  		return;
>>>>  
>>>> +	if (event->state == PERF_EVENT_STATE_INACTIVE)
>>>> +		perf_event_tstamp_update(event);
>>>> +
>>>>  	/*
>>>>  	 * in cgroup mode, time_enabled represents
>>>>  	 * the time the event was enabled AND active
>>>
>>> But why!? I thought the whole point was to not need to do this.
>>
>> update_event_times() is not called from the timer interrupt handler,
>> so it is not on the critical path that is optimized in this patch set.
>>
>> But update_event_times() is called in the context of the read() syscall,
>> so that is the place where we can update event times for INACTIVE events
>> instead of in the timer interrupt.
>>
>> update_event_times() is also called on thread context switch out, so
>> event times get updated as well when the thread migrates to another CPU.
>>
>>>
>>> The thing I outlined earlier would only need to update timestamps when
>>> events change state and at no other point in time.
>>
>> But we may still request times via the read() syscall while the event is
>> in the INACTIVE state, and the event timings need to be up-to-date.
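
(To make the read() case concrete, here is a minimal userspace sketch of
my own, not part of the patch set: it opens a plain counting event with
the TOTAL_TIME_ENABLED/RUNNING read format and reads it back. At the
moment of the read() the event may well be INACTIVE, yet the returned
timings still have to be current.)

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* layout for read_format = TOTAL_TIME_ENABLED | TOTAL_TIME_RUNNING */
struct read_value {
	uint64_t value;
	uint64_t time_enabled;
	uint64_t time_running;
};

int main(void)
{
	struct perf_event_attr attr;
	struct read_value rv;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			   PERF_FORMAT_TOTAL_TIME_RUNNING;

	/* measure the current task on any CPU */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		return 1;

	/* the event may be scheduled out (INACTIVE) right here */
	if (read(fd, &rv, sizeof(rv)) == (ssize_t)sizeof(rv))
		printf("enabled=%llu running=%llu\n",
		       (unsigned long long)rv.time_enabled,
		       (unsigned long long)rv.time_running);

	close(fd);
	return 0;
}
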
> 
> Sure, read() also updates.
> 
> So the below completely rewrites timekeeping (and probably breaks
> world) but does away with the need to touch events that don't get
> scheduled.
> 
> Esp the cgroup stuff is entirely untested since I simply don't know how
> to operate that. I did run Vince's tests on it, and I think it doesn't
> regress, but I'm near a migraine so I can't really see straight atm.
> 
> Vince, Stephane, could you guys have a peek?
> 
> (There are a few other bits in there; I'll break it up into patches and
> write comments and changelogs later. I think it can be split into some 5
> patches.)
> 
> The basic idea is really simple: we have a single timestamp and,
> depending on the state, we update enabled/running. This obviously only
> requires updates when we change state and when we need up-to-date
> timestamps (read).
> 
> No more weird and wonderful mind bending interaction between 3 different
> timestamps with arcane update rules.

Well, this looks like the "opposite" approach to event timekeeping
compared to what we currently have.
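
Just to check my reading of the basic idea: the per-event accounting
condenses to roughly the following (my own restatement of
__perf_update_times() from the patch below; it ignores the group-leader
effective state and the cgroup time source):

/*
 * event->tstamp marks the last state change; the delta since then is
 * folded into the accumulated times according to the current state.
 * For example:
 *
 *   t=0  enable    -> INACTIVE                          tstamp=0
 *   t=2  sched in  -> ACTIVE    enabled+=2              tstamp=2
 *   t=5  sched out -> INACTIVE  enabled+=3  running+=3  tstamp=5
 *   t=7  read()               : enabled=7   running=3
 */
static void sketch_update_time(struct perf_event *event, u64 now)
{
	u64 delta = now - event->tstamp;

	if (event->state >= PERF_EVENT_STATE_INACTIVE)
		event->total_time_enabled += delta;
	if (event->state >= PERF_EVENT_STATE_ACTIVE)
		event->total_time_running += delta;

	event->tstamp = now;
}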

Do you want this rework before or after the current patch set?

> 
> ---
>  include/linux/perf_event.h |  25 +-
>  kernel/events/core.c       | 551 ++++++++++++++++-----------------------------
>  2 files changed, 192 insertions(+), 384 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 8e22f24ded6a..2a6ae48a1a96 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -485,9 +485,9 @@ struct perf_addr_filters_head {
>  };
>  
>  /**
> - * enum perf_event_active_state - the states of a event
> + * enum perf_event_state - the states of a event
>   */
> -enum perf_event_active_state {
> +enum perf_event_state {
>  	PERF_EVENT_STATE_DEAD		= -4,
>  	PERF_EVENT_STATE_EXIT		= -3,
>  	PERF_EVENT_STATE_ERROR		= -2,
> @@ -578,7 +578,7 @@ struct perf_event {
>  	struct pmu			*pmu;
>  	void				*pmu_private;
>  
> -	enum perf_event_active_state	state;
> +	enum perf_event_state		state;
>  	unsigned int			attach_state;
>  	local64_t			count;
>  	atomic64_t			child_count;
> @@ -588,26 +588,10 @@ struct perf_event {
>  	 * has been enabled (i.e. eligible to run, and the task has
>  	 * been scheduled in, if this is a per-task event)
>  	 * and running (scheduled onto the CPU), respectively.
> -	 *
> -	 * They are computed from tstamp_enabled, tstamp_running and
> -	 * tstamp_stopped when the event is in INACTIVE or ACTIVE state.
>  	 */
>  	u64				total_time_enabled;
>  	u64				total_time_running;
> -
> -	/*
> -	 * These are timestamps used for computing total_time_enabled
> -	 * and total_time_running when the event is in INACTIVE or
> -	 * ACTIVE state, measured in nanoseconds from an arbitrary point
> -	 * in time.
> -	 * tstamp_enabled: the notional time when the event was enabled
> -	 * tstamp_running: the notional time when the event was scheduled on
> -	 * tstamp_stopped: in INACTIVE state, the notional time when the
> -	 *	event was scheduled off.
> -	 */
> -	u64				tstamp_enabled;
> -	u64				tstamp_running;
> -	u64				tstamp_stopped;
> +	u64				tstamp;
>  
>  	/*
>  	 * timestamp shadows the actual context timing but it can
> @@ -699,7 +683,6 @@ struct perf_event {
>  
>  #ifdef CONFIG_CGROUP_PERF
>  	struct perf_cgroup		*cgrp; /* cgroup event is attach to */
> -	int				cgrp_defer_enabled;
>  #endif
>  
>  	struct list_head		sb_list;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 294f1927f944..e968b3eab9c7 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event)
>  	return event->clock();
>  }
>  
> +/*
> + * XXX comment about timekeeping goes here
> + */
> +
> +static __always_inline enum perf_event_state
> +__perf_effective_state(struct perf_event *event)
> +{
> +	struct perf_event *leader = event->group_leader;
> +
> +	if (leader->state <= PERF_EVENT_STATE_OFF)
> +		return leader->state;
> +
> +	return event->state;
> +}
> +
> +static __always_inline void
> +__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running)
> +{
> +	enum perf_event_state state = __perf_effective_state(event);
> +	u64 delta = now - event->tstamp;
> +
> +	*enabled = event->total_time_enabled;
> +	if (state >= PERF_EVENT_STATE_INACTIVE)
> +		*enabled += delta;
> +
> +	*running = event->total_time_running;
> +	if (state >= PERF_EVENT_STATE_ACTIVE)
> +		*running += delta;
> +}
> +
> +static void perf_event_update_time(struct perf_event *event)
> +{
> +	u64 now = perf_event_time(event);
> +
> +	__perf_update_times(event, now, &event->total_time_enabled,
> +					&event->total_time_running);
> +	event->tstamp = now;
> +}
> +
> +static void perf_event_update_sibling_time(struct perf_event *leader)
> +{
> +	struct perf_event *sibling;
> +
> +	list_for_each_entry(sibling, &leader->sibling_list, group_entry)
> +		perf_event_update_time(sibling);
> +}
> +
> +static void
> +perf_event_set_state(struct perf_event *event, enum perf_event_state state)
> +{
> +	if (event->state == state)
> +		return;
> +
> +	perf_event_update_time(event);
> +	/*
> +	 * If a group leader gets enabled/disabled all its siblings
> +	 * are affected too.
> +	 */
> +	if ((event->state < 0) ^ (state < 0))
> +		perf_event_update_sibling_time(event);
> +
> +	WRITE_ONCE(event->state, state);
> +}
> +
>  #ifdef CONFIG_CGROUP_PERF
>  
>  static inline bool
> @@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now)
>  	event->shadow_ctx_time = now - t->timestamp;
>  }
>  
> -static inline void
> -perf_cgroup_defer_enabled(struct perf_event *event)
> -{
> -	/*
> -	 * when the current task's perf cgroup does not match
> -	 * the event's, we need to remember to call the
> -	 * perf_mark_enable() function the first time a task with
> -	 * a matching perf cgroup is scheduled in.
> -	 */
> -	if (is_cgroup_event(event) && !perf_cgroup_match(event))
> -		event->cgrp_defer_enabled = 1;
> -}
> -
> -static inline void
> -perf_cgroup_mark_enabled(struct perf_event *event,
> -			 struct perf_event_context *ctx)
> -{
> -	struct perf_event *sub;
> -	u64 tstamp = perf_event_time(event);
> -
> -	if (!event->cgrp_defer_enabled)
> -		return;
> -
> -	event->cgrp_defer_enabled = 0;
> -
> -	event->tstamp_enabled = tstamp - event->total_time_enabled;
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		if (sub->state >= PERF_EVENT_STATE_INACTIVE) {
> -			sub->tstamp_enabled = tstamp - sub->total_time_enabled;
> -			sub->cgrp_defer_enabled = 0;
> -		}
> -	}
> -}
> -
>  /*
>   * Update cpuctx->cgrp so that it is set when first cgroup event is added and
>   * cleared when last cgroup event is removed.
> @@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event)
>  }
>  
>  static inline void
> -perf_cgroup_defer_enabled(struct perf_event *event)
> -{
> -}
> -
> -static inline void
> -perf_cgroup_mark_enabled(struct perf_event *event,
> -			 struct perf_event_context *ctx)
> -{
> -}
> -
> -static inline void
>  list_update_cgroup_event(struct perf_event *event,
>  			 struct perf_event_context *ctx, bool add)
>  {
> @@ -1396,60 +1415,6 @@ static u64 perf_event_time(struct perf_event *event)
>  	return ctx ? ctx->time : 0;
>  }
>  
> -/*
> - * Update the total_time_enabled and total_time_running fields for a event.
> - */
> -static void update_event_times(struct perf_event *event)
> -{
> -	struct perf_event_context *ctx = event->ctx;
> -	u64 run_end;
> -
> -	lockdep_assert_held(&ctx->lock);
> -
> -	if (event->state < PERF_EVENT_STATE_INACTIVE ||
> -	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
> -		return;
> -
> -	/*
> -	 * in cgroup mode, time_enabled represents
> -	 * the time the event was enabled AND active
> -	 * tasks were in the monitored cgroup. This is
> -	 * independent of the activity of the context as
> -	 * there may be a mix of cgroup and non-cgroup events.
> -	 *
> -	 * That is why we treat cgroup events differently
> -	 * here.
> -	 */
> -	if (is_cgroup_event(event))
> -		run_end = perf_cgroup_event_time(event);
> -	else if (ctx->is_active)
> -		run_end = ctx->time;
> -	else
> -		run_end = event->tstamp_stopped;
> -
> -	event->total_time_enabled = run_end - event->tstamp_enabled;
> -
> -	if (event->state == PERF_EVENT_STATE_INACTIVE)
> -		run_end = event->tstamp_stopped;
> -	else
> -		run_end = perf_event_time(event);
> -
> -	event->total_time_running = run_end - event->tstamp_running;
> -
> -}
> -
> -/*
> - * Update total_time_enabled and total_time_running for all events in a group.
> - */
> -static void update_group_times(struct perf_event *leader)
> -{
> -	struct perf_event *event;
> -
> -	update_event_times(leader);
> -	list_for_each_entry(event, &leader->sibling_list, group_entry)
> -		update_event_times(event);
> -}
> -
>  static enum event_type_t get_event_type(struct perf_event *event)
>  {
>  	struct perf_event_context *ctx = event->ctx;
> @@ -1492,6 +1457,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
>  	WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
>  	event->attach_state |= PERF_ATTACH_CONTEXT;
>  
> +	event->tstamp = perf_event_time(event);
> +
>  	/*
>  	 * If we're a stand alone event or group leader, we go to the context
>  	 * list, group events are kept attached to the group so that
> @@ -1699,8 +1666,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>  	if (event->group_leader == event)
>  		list_del_init(&event->group_entry);
>  
> -	update_group_times(event);
> -
>  	/*
>  	 * If event was in error state, then keep it
>  	 * that way, otherwise bogus counts will be
> @@ -1709,7 +1674,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
>  	 * of the event
>  	 */
>  	if (event->state > PERF_EVENT_STATE_OFF)
> -		event->state = PERF_EVENT_STATE_OFF;
> +		perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>  
>  	ctx->generation++;
>  }
> @@ -1808,38 +1773,24 @@ event_sched_out(struct perf_event *event,
>  		  struct perf_cpu_context *cpuctx,
>  		  struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
> -	u64 delta;
> +	enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
>  
>  	WARN_ON_ONCE(event->ctx != ctx);
>  	lockdep_assert_held(&ctx->lock);
>  
> -	/*
> -	 * An event which could not be activated because of
> -	 * filter mismatch still needs to have its timings
> -	 * maintained, otherwise bogus information is return
> -	 * via read() for time_enabled, time_running:
> -	 */
> -	if (event->state == PERF_EVENT_STATE_INACTIVE &&
> -	    !event_filter_match(event)) {
> -		delta = tstamp - event->tstamp_stopped;
> -		event->tstamp_running += delta;
> -		event->tstamp_stopped = tstamp;
> -	}
> -
>  	if (event->state != PERF_EVENT_STATE_ACTIVE)
>  		return;
>  
>  	perf_pmu_disable(event->pmu);
>  
> -	event->tstamp_stopped = tstamp;
>  	event->pmu->del(event, 0);
>  	event->oncpu = -1;
> -	event->state = PERF_EVENT_STATE_INACTIVE;
> +
>  	if (event->pending_disable) {
>  		event->pending_disable = 0;
> -		event->state = PERF_EVENT_STATE_OFF;
> +		state = PERF_EVENT_STATE_OFF;
>  	}
> +	perf_event_set_state(event, state);
>  
>  	if (!is_software_event(event))
>  		cpuctx->active_oncpu--;
> @@ -1859,7 +1810,9 @@ group_sched_out(struct perf_event *group_event,
>  		struct perf_event_context *ctx)
>  {
>  	struct perf_event *event;
> -	int state = group_event->state;
> +
> +	if (group_event->state != PERF_EVENT_STATE_ACTIVE)
> +		return;
>  
>  	perf_pmu_disable(ctx->pmu);
>  
> @@ -1873,7 +1826,7 @@ group_sched_out(struct perf_event *group_event,
>  
>  	perf_pmu_enable(ctx->pmu);
>  
> -	if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive)
> +	if (group_event->attr.exclusive)
>  		cpuctx->exclusive = 0;
>  }
>  
> @@ -1893,6 +1846,11 @@ __perf_remove_from_context(struct perf_event *event,
>  {
>  	unsigned long flags = (unsigned long)info;
>  
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
> +
>  	event_sched_out(event, cpuctx, ctx);
>  	if (flags & DETACH_GROUP)
>  		perf_group_detach(event);
> @@ -1955,14 +1913,17 @@ static void __perf_event_disable(struct perf_event *event,
>  	if (event->state < PERF_EVENT_STATE_INACTIVE)
>  		return;
>  
> -	update_context_time(ctx);
> -	update_cgrp_time_from_event(event);
> -	update_group_times(event);
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
> +
>  	if (event == event->group_leader)
>  		group_sched_out(event, cpuctx, ctx);
>  	else
>  		event_sched_out(event, cpuctx, ctx);
> -	event->state = PERF_EVENT_STATE_OFF;
> +
> +	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
>  }
>  
>  /*
> @@ -2019,8 +1980,7 @@ void perf_event_disable_inatomic(struct perf_event *event)
>  }
>  
>  static void perf_set_shadow_time(struct perf_event *event,
> -				 struct perf_event_context *ctx,
> -				 u64 tstamp)
> +				 struct perf_event_context *ctx)
>  {
>  	/*
>  	 * use the correct time source for the time snapshot
> @@ -2048,9 +2008,9 @@ static void perf_set_shadow_time(struct perf_event *event,
>  	 * is cleaner and simpler to understand.
>  	 */
>  	if (is_cgroup_event(event))
> -		perf_cgroup_set_shadow_time(event, tstamp);
> +		perf_cgroup_set_shadow_time(event, event->tstamp);
>  	else
> -		event->shadow_ctx_time = tstamp - ctx->timestamp;
> +		event->shadow_ctx_time = event->tstamp - ctx->timestamp;
>  }
>  
>  #define MAX_INTERRUPTS (~0ULL)
> @@ -2063,7 +2023,6 @@ event_sched_in(struct perf_event *event,
>  		 struct perf_cpu_context *cpuctx,
>  		 struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
>  	int ret = 0;
>  
>  	lockdep_assert_held(&ctx->lock);
> @@ -2077,7 +2036,7 @@ event_sched_in(struct perf_event *event,
>  	 * is visible.
>  	 */
>  	smp_wmb();
> -	WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE);
> +	perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE);
>  
>  	/*
>  	 * Unthrottle events, since we scheduled we might have missed several
> @@ -2089,26 +2048,19 @@ event_sched_in(struct perf_event *event,
>  		event->hw.interrupts = 0;
>  	}
>  
> -	/*
> -	 * The new state must be visible before we turn it on in the hardware:
> -	 */
> -	smp_wmb();
> -
>  	perf_pmu_disable(event->pmu);
>  
> -	perf_set_shadow_time(event, ctx, tstamp);
> +	perf_set_shadow_time(event, ctx);
>  
>  	perf_log_itrace_start(event);
>  
>  	if (event->pmu->add(event, PERF_EF_START)) {
> -		event->state = PERF_EVENT_STATE_INACTIVE;
> +		perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  		event->oncpu = -1;
>  		ret = -EAGAIN;
>  		goto out;
>  	}
>  
> -	event->tstamp_running += tstamp - event->tstamp_stopped;
> -
>  	if (!is_software_event(event))
>  		cpuctx->active_oncpu++;
>  	if (!ctx->nr_active++)
> @@ -2132,8 +2084,6 @@ group_sched_in(struct perf_event *group_event,
>  {
>  	struct perf_event *event, *partial_group = NULL;
>  	struct pmu *pmu = ctx->pmu;
> -	u64 now = ctx->time;
> -	bool simulate = false;
>  
>  	if (group_event->state == PERF_EVENT_STATE_OFF)
>  		return 0;
> @@ -2163,27 +2113,13 @@ group_sched_in(struct perf_event *group_event,
>  	/*
>  	 * Groups can be scheduled in as one unit only, so undo any
>  	 * partial group before returning:
> -	 * The events up to the failed event are scheduled out normally,
> -	 * tstamp_stopped will be updated.
> -	 *
> -	 * The failed events and the remaining siblings need to have
> -	 * their timings updated as if they had gone thru event_sched_in()
> -	 * and event_sched_out(). This is required to get consistent timings
> -	 * across the group. This also takes care of the case where the group
> -	 * could never be scheduled by ensuring tstamp_stopped is set to mark
> -	 * the time the event was actually stopped, such that time delta
> -	 * calculation in update_event_times() is correct.
> +	 * The events up to the failed event are scheduled out normally.
>  	 */
>  	list_for_each_entry(event, &group_event->sibling_list, group_entry) {
>  		if (event == partial_group)
> -			simulate = true;
> +			break;
>  
> -		if (simulate) {
> -			event->tstamp_running += now - event->tstamp_stopped;
> -			event->tstamp_stopped = now;
> -		} else {
> -			event_sched_out(event, cpuctx, ctx);
> -		}
> +		event_sched_out(event, cpuctx, ctx);
>  	}
>  	event_sched_out(group_event, cpuctx, ctx);
>  
> @@ -2225,46 +2161,11 @@ static int group_can_go_on(struct perf_event *event,
>  	return can_add_hw;
>  }
>  
> -/*
> - * Complement to update_event_times(). This computes the tstamp_* values to
> - * continue 'enabled' state from @now, and effectively discards the time
> - * between the prior tstamp_stopped and now (as we were in the OFF state, or
> - * just switched (context) time base).
> - *
> - * This further assumes '@event->state == INACTIVE' (we just came from OFF) and
> - * cannot have been scheduled in yet. And going into INACTIVE state means
> - * '@event->tstamp_stopped = @now'.
> - *
> - * Thus given the rules of update_event_times():
> - *
> - *   total_time_enabled = tstamp_stopped - tstamp_enabled
> - *   total_time_running = tstamp_stopped - tstamp_running
> - *
> - * We can insert 'tstamp_stopped == now' and reverse them to compute new
> - * tstamp_* values.
> - */
> -static void __perf_event_enable_time(struct perf_event *event, u64 now)
> -{
> -	WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE);
> -
> -	event->tstamp_stopped = now;
> -	event->tstamp_enabled = now - event->total_time_enabled;
> -	event->tstamp_running = now - event->total_time_running;
> -}
> -
>  static void add_event_to_ctx(struct perf_event *event,
>  			       struct perf_event_context *ctx)
>  {
> -	u64 tstamp = perf_event_time(event);
> -
>  	list_add_event(event, ctx);
>  	perf_group_attach(event);
> -	/*
> -	 * We can be called with event->state == STATE_OFF when we create with
> -	 * .disabled = 1. In that case the IOC_ENABLE will call this function.
> -	 */
> -	if (event->state == PERF_EVENT_STATE_INACTIVE)
> -		__perf_event_enable_time(event, tstamp);
>  }
>  
>  static void ctx_sched_out(struct perf_event_context *ctx,
> @@ -2496,28 +2397,6 @@ perf_install_in_context(struct perf_event_context *ctx,
>  }
>  
>  /*
> - * Put a event into inactive state and update time fields.
> - * Enabling the leader of a group effectively enables all
> - * the group members that aren't explicitly disabled, so we
> - * have to update their ->tstamp_enabled also.
> - * Note: this works for group members as well as group leaders
> - * since the non-leader members' sibling_lists will be empty.
> - */
> -static void __perf_event_mark_enabled(struct perf_event *event)
> -{
> -	struct perf_event *sub;
> -	u64 tstamp = perf_event_time(event);
> -
> -	event->state = PERF_EVENT_STATE_INACTIVE;
> -	__perf_event_enable_time(event, tstamp);
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		/* XXX should not be > INACTIVE if event isn't */
> -		if (sub->state >= PERF_EVENT_STATE_INACTIVE)
> -			__perf_event_enable_time(sub, tstamp);
> -	}
> -}
> -
> -/*
>   * Cross CPU call to enable a performance event
>   */
>  static void __perf_event_enable(struct perf_event *event,
> @@ -2535,14 +2414,12 @@ static void __perf_event_enable(struct perf_event *event,
>  	if (ctx->is_active)
>  		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  
> -	__perf_event_mark_enabled(event);
> +	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  
>  	if (!ctx->is_active)
>  		return;
>  
>  	if (!event_filter_match(event)) {
> -		if (is_cgroup_event(event))
> -			perf_cgroup_defer_enabled(event);
>  		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
>  		return;
>  	}
> @@ -2862,18 +2739,10 @@ static void __perf_event_sync_stat(struct perf_event *event,
>  	 * we know the event must be on the current CPU, therefore we
>  	 * don't need to use it.
>  	 */
> -	switch (event->state) {
> -	case PERF_EVENT_STATE_ACTIVE:
> +	if (event->state == PERF_EVENT_STATE_ACTIVE)
>  		event->pmu->read(event);
> -		/* fall-through */
>  
> -	case PERF_EVENT_STATE_INACTIVE:
> -		update_event_times(event);
> -		break;
> -
> -	default:
> -		break;
> -	}
> +	perf_event_update_time(event);
>  
>  	/*
>  	 * In order to keep per-task stats reliable we need to flip the event
> @@ -3110,10 +2979,6 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>  		if (!event_filter_match(event))
>  			continue;
>  
> -		/* may need to reset tstamp_enabled */
> -		if (is_cgroup_event(event))
> -			perf_cgroup_mark_enabled(event, ctx);
> -
>  		if (group_can_go_on(event, cpuctx, 1))
>  			group_sched_in(event, cpuctx, ctx);
>  
> @@ -3121,10 +2986,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx,
>  		 * If this pinned group hasn't been scheduled,
>  		 * put it in error state.
>  		 */
> -		if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -			update_group_times(event);
> -			event->state = PERF_EVENT_STATE_ERROR;
> -		}
> +		if (event->state == PERF_EVENT_STATE_INACTIVE)
> +			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
>  	}
>  }
>  
> @@ -3146,10 +3009,6 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>  		if (!event_filter_match(event))
>  			continue;
>  
> -		/* may need to reset tstamp_enabled */
> -		if (is_cgroup_event(event))
> -			perf_cgroup_mark_enabled(event, ctx);
> -
>  		if (group_can_go_on(event, cpuctx, can_add_hw)) {
>  			if (group_sched_in(event, cpuctx, ctx))
>  				can_add_hw = 0;
> @@ -3541,7 +3400,7 @@ static int event_enable_on_exec(struct perf_event *event,
>  	if (event->state >= PERF_EVENT_STATE_INACTIVE)
>  		return 0;
>  
> -	__perf_event_mark_enabled(event);
> +	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
>  
>  	return 1;
>  }
> @@ -3590,12 +3449,6 @@ static void perf_event_enable_on_exec(int ctxn)
>  		put_ctx(clone_ctx);
>  }
>  
> -struct perf_read_data {
> -	struct perf_event *event;
> -	bool group;
> -	int ret;
> -};
> -
>  static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>  {
>  	u16 local_pkg, event_pkg;
> @@ -3613,64 +3466,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu)
>  	return event_cpu;
>  }
>  
> -/*
> - * Cross CPU call to read the hardware event
> - */
> -static void __perf_event_read(void *info)
> -{
> -	struct perf_read_data *data = info;
> -	struct perf_event *sub, *event = data->event;
> -	struct perf_event_context *ctx = event->ctx;
> -	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
> -	struct pmu *pmu = event->pmu;
> -
> -	/*
> -	 * If this is a task context, we need to check whether it is
> -	 * the current task context of this cpu.  If not it has been
> -	 * scheduled out before the smp call arrived.  In that case
> -	 * event->count would have been updated to a recent sample
> -	 * when the event was scheduled out.
> -	 */
> -	if (ctx->task && cpuctx->task_ctx != ctx)
> -		return;
> -
> -	raw_spin_lock(&ctx->lock);
> -	if (ctx->is_active) {
> -		update_context_time(ctx);
> -		update_cgrp_time_from_event(event);
> -	}
> -
> -	update_event_times(event);
> -	if (event->state != PERF_EVENT_STATE_ACTIVE)
> -		goto unlock;
> -
> -	if (!data->group) {
> -		pmu->read(event);
> -		data->ret = 0;
> -		goto unlock;
> -	}
> -
> -	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> -
> -	pmu->read(event);
> -
> -	list_for_each_entry(sub, &event->sibling_list, group_entry) {
> -		update_event_times(sub);
> -		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
> -			/*
> -			 * Use sibling's PMU rather than @event's since
> -			 * sibling could be on different (eg: software) PMU.
> -			 */
> -			sub->pmu->read(sub);
> -		}
> -	}
> -
> -	data->ret = pmu->commit_txn(pmu);
> -
> -unlock:
> -	raw_spin_unlock(&ctx->lock);
> -}
> -
>  static inline u64 perf_event_count(struct perf_event *event)
>  {
>  	return local64_read(&event->count) + atomic64_read(&event->child_count);
> @@ -3733,63 +3528,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value)
>  	return ret;
>  }
>  
> -static int perf_event_read(struct perf_event *event, bool group)
> +struct perf_read_data {
> +	struct perf_event *event;
> +	bool group;
> +	int ret;
> +};
> +
> +static void __perf_event_read(struct perf_event *event,
> +			      struct perf_cpu_context *cpuctx,
> +			      struct perf_event_context *ctx,
> +			      void *data)
>  {
> -	int event_cpu, ret = 0;
> +	struct perf_read_data *prd = data;
> +	struct pmu *pmu = event->pmu;
> +	struct perf_event *sibling;
>  
> -	/*
> -	 * If event is enabled and currently active on a CPU, update the
> -	 * value in the event structure:
> -	 */
> -	if (event->state == PERF_EVENT_STATE_ACTIVE) {
> -		struct perf_read_data data = {
> -			.event = event,
> -			.group = group,
> -			.ret = 0,
> -		};
> +	if (ctx->is_active & EVENT_TIME) {
> +		update_context_time(ctx);
> +		update_cgrp_time_from_cpuctx(cpuctx);
> +	}
>  
> -		event_cpu = READ_ONCE(event->oncpu);
> -		if ((unsigned)event_cpu >= nr_cpu_ids)
> -			return 0;
> +	perf_event_update_time(event);
> +	if (prd->group)
> +		perf_event_update_sibling_time(event);
>  
> -		preempt_disable();
> -		event_cpu = __perf_event_read_cpu(event, event_cpu);
> +	if (event->state != PERF_EVENT_STATE_ACTIVE)
> +		return;
>  
> +	if (!prd->group) {
> +		pmu->read(event);
> +		prd->ret = 0;
> +		return;
> +	}
> +
> +	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
> +
> +	pmu->read(event);
> +	list_for_each_entry(sibling, &event->sibling_list, group_entry) {
> +		if (sibling->state == PERF_EVENT_STATE_ACTIVE) {
> +			/*
> +			 * Use sibling's PMU rather than @event's since
> +			 * sibling could be on different (eg: software) PMU.
> +			 */
> +			sibling->pmu->read(sibling);
> +		}
> +	}
> +
> +	prd->ret = pmu->commit_txn(pmu);
> +}
> +
> +static int perf_event_read(struct perf_event *event, bool group)
> +{
> +	struct perf_read_data prd = {
> +		.event = event,
> +		.group = group,
> +		.ret = 0,
> +	};
> +
> +	if (event->ctx->task) {
> +		event_function_call(event, __perf_event_read, &prd);
> +	} else {
>  		/*
> -		 * Purposely ignore the smp_call_function_single() return
> -		 * value.
> -		 *
> -		 * If event_cpu isn't a valid CPU it means the event got
> -		 * scheduled out and that will have updated the event count.
> -		 *
> -		 * Therefore, either way, we'll have an up-to-date event count
> -		 * after this.
> -		 */
> -		(void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1);
> -		preempt_enable();
> -		ret = data.ret;
> -	} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
> -		struct perf_event_context *ctx = event->ctx;
> -		unsigned long flags;
> -
> -		raw_spin_lock_irqsave(&ctx->lock, flags);
> -		/*
> -		 * may read while context is not active
> -		 * (e.g., thread is blocked), in that case
> -		 * we cannot update context time
> +		 * For uncore events (which are per definition per-cpu)
> +		 * allow a different read CPU from event->cpu.
>  		 */
> -		if (ctx->is_active) {
> -			update_context_time(ctx);
> -			update_cgrp_time_from_event(event);
> -		}
> -		if (group)
> -			update_group_times(event);
> -		else
> -			update_event_times(event);
> -		raw_spin_unlock_irqrestore(&ctx->lock, flags);
> +		struct event_function_struct efs = {
> +			.event = event,
> +			.func = __perf_event_read,
> +			.data = &prd,
> +		};
> +		int cpu = __perf_event_read_cpu(event, event->cpu);
> +
> +		cpu_function_call(cpu, event_function, &efs);
>  	}
>  
> -	return ret;
> +	return prd.ret;
>  }
>  
>  /*
> @@ -4388,7 +4201,7 @@ static int perf_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>  
> -u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
> +static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>  {
>  	struct perf_event *child;
>  	u64 total = 0;
> @@ -4416,6 +4229,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
>  
>  	return total;
>  }
> +
> +u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
> +{
> +	struct perf_event_context *ctx;
> +	u64 count;
> +
> +	ctx = perf_event_ctx_lock(event);
> +	count = __perf_event_read_value(event, enabled, running);
> +	perf_event_ctx_unlock(event, ctx);
> +
> +	return count;
> +}
>  EXPORT_SYMBOL_GPL(perf_event_read_value);
>  
>  static int __perf_read_group_add(struct perf_event *leader,
> @@ -4431,6 +4256,8 @@ static int __perf_read_group_add(struct perf_event *leader,
>  	if (ret)
>  		return ret;
>  
> +	raw_spin_lock_irqsave(&ctx->lock, flags);
> +
>  	/*
>  	 * Since we co-schedule groups, {enabled,running} times of siblings
>  	 * will be identical to those of the leader, so we only publish one
> @@ -4453,8 +4280,6 @@ static int __perf_read_group_add(struct perf_event *leader,
>  	if (read_format & PERF_FORMAT_ID)
>  		values[n++] = primary_event_id(leader);
>  
> -	raw_spin_lock_irqsave(&ctx->lock, flags);
> -
>  	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>  		values[n++] += perf_event_count(sub);
>  		if (read_format & PERF_FORMAT_ID)
> @@ -4518,7 +4343,7 @@ static int perf_read_one(struct perf_event *event,
>  	u64 values[4];
>  	int n = 0;
>  
> -	values[n++] = perf_event_read_value(event, &enabled, &running);
> +	values[n++] = __perf_event_read_value(event, &enabled, &running);
>  	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
>  		values[n++] = enabled;
>  	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
> @@ -4897,8 +4722,7 @@ static void calc_timer_values(struct perf_event *event,
>  
>  	*now = perf_clock();
>  	ctx_time = event->shadow_ctx_time + *now;
> -	*enabled = ctx_time - event->tstamp_enabled;
> -	*running = ctx_time - event->tstamp_running;
> +	__perf_update_times(event, ctx_time, enabled, running);
>  }
>  
>  static void perf_event_init_userpage(struct perf_event *event)
> @@ -10516,7 +10340,7 @@ perf_event_exit_event(struct perf_event *child_event,
>  	if (parent_event)
>  		perf_group_detach(child_event);
>  	list_del_event(child_event, child_ctx);
> -	child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */
> +	perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */
>  	raw_spin_unlock_irq(&child_ctx->lock);
>  
>  	/*
> @@ -10754,7 +10578,7 @@ inherit_event(struct perf_event *parent_event,
>  	      struct perf_event *group_leader,
>  	      struct perf_event_context *child_ctx)
>  {
> -	enum perf_event_active_state parent_state = parent_event->state;
> +	enum perf_event_state parent_state = parent_event->state;
>  	struct perf_event *child_event;
>  	unsigned long flags;
>  
> @@ -11090,6 +10914,7 @@ static void __perf_event_exit_context(void *__info)
>  	struct perf_event *event;
>  
>  	raw_spin_lock(&ctx->lock);
> +	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  	list_for_each_entry(event, &ctx->event_list, event_entry)
>  		__perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
>  	raw_spin_unlock(&ctx->lock);
> 
