Message-ID: <CABPqkBTKYRU1uwhfWXhWKeqYmWCukXLrinvwtfC9xN4Fy=+2yg@mail.gmail.com>
Date: Fri, 27 Mar 2015 09:31:45 -0700
From: Stephane Eranian <eranian@...gle.com>
To: Thomas Gleixner <tglx@...utronix.de>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
Jiri Olsa <jolsa@...hat.com>,
Stephane Eranian <eranian@...gle.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>,
John Stultz <john.stultz@...aro.org>,
"H. Peter Anvin" <hpa@...or.com>, David Ahern <dsahern@...il.com>,
Peter Zijlstra <peterz@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Ingo Molnar <mingo@...nel.org>
Subject: Re: [tip:perf/timer] perf: Add per event clockid support
On Fri, Mar 27, 2015 at 4:48 AM, tip-bot for Peter Zijlstra
<tipbot@...or.com> wrote:
> Commit-ID: 34f439278cef7b1177f8ce24f9fc81dfc6221d3b
> Gitweb: http://git.kernel.org/tip/34f439278cef7b1177f8ce24f9fc81dfc6221d3b
> Author: Peter Zijlstra <peterz@...radead.org>
> AuthorDate: Fri, 20 Feb 2015 14:05:38 +0100
> Committer: Ingo Molnar <mingo@...nel.org>
> CommitDate: Fri, 27 Mar 2015 10:13:22 +0100
>
> perf: Add per event clockid support
>
> While thinking on the whole clock discussion it occurred to me we have
> two distinct uses of time:
>
> 1) the tracking of event/ctx/cgroup enabled/running/stopped times,
> which includes the self-monitoring support in struct
> perf_event_mmap_page.
>
> 2) the actual timestamps visible in the data records.
>
> And we've been conflating them.
>
> The first is all about tracking time deltas; nobody should really care
> in what time base that happens. It's all relative information, and as
> long as it's internally consistent it works.
>
> The second, however, is what people worry about when they have to
> merge their data with external sources. And here we have the
> discussion on MONOTONIC vs MONOTONIC_RAW etc.
>
> MONOTONIC is good for correlating between machines (static offset),
> while MONOTONIC_RAW is required for correlating against a fixed-rate
> hardware clock.
>
> This means configurability; now 1) makes that hard because it needs to
> be internally consistent across groups of unrelated events, which is
> why we had to have a global perf_clock().
>
> However, for 2) it doesn't really matter; perf itself doesn't care
> what it writes into the buffer.
>
> The below patch makes the distinction between these two cases by
> adding perf_event_clock(), which is used for the second case. It
> further makes this configurable on a per-event basis, but adds a few
> sanity checks such that we cannot combine events with different clocks
> in confusing ways.
>
> And since we then have per-event configurability we might as well
> retain the 'legacy' behaviour as a default.
>
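A minimal userspace sketch of how the new interface is meant to be
used, going by the uapi hunk below (the hand-rolled syscall wrapper is
the usual one since glibc has none, the snippet assumes headers that
already carry use_clockid/clockid, and it is untested):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* no glibc wrapper exists for perf_event_open */
static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	long fd;

	memset(&attr, 0, sizeof(attr));
	attr.size        = sizeof(attr);
	attr.type        = PERF_TYPE_SOFTWARE;
	attr.config      = PERF_COUNT_SW_TASK_CLOCK;
	attr.sample_type = PERF_SAMPLE_TIME;
	attr.use_clockid = 1;                   /* new bit added below */
	attr.clockid     = CLOCK_MONOTONIC_RAW; /* new field added below */

	fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	close((int)fd);
	return 0;
}
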
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
> Cc: Arnaldo Carvalho de Melo <acme@...hat.com>
> Cc: David Ahern <dsahern@...il.com>
> Cc: Jiri Olsa <jolsa@...hat.com>
> Cc: John Stultz <john.stultz@...aro.org>
> Cc: Linus Torvalds <torvalds@...ux-foundation.org>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Stephane Eranian <eranian@...gle.com>
> Cc: Thomas Gleixner <tglx@...utronix.de>
> Signed-off-by: Ingo Molnar <mingo@...nel.org>
> ---
> arch/x86/kernel/cpu/perf_event.c | 14 ++++++--
> include/linux/perf_event.h | 2 ++
> include/uapi/linux/perf_event.h | 6 ++--
> kernel/events/core.c | 77 ++++++++++++++++++++++++++++++++++++++--
> 4 files changed, 91 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index ac41b3a..0420ebc 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -1978,13 +1978,23 @@ void arch_perf_update_userpage(struct perf_event *event,
>
> data = cyc2ns_read_begin();
>
> + /*
> + * Internal timekeeping for enabled/running/stopped times
> + * is always in the local_clock domain.
> + */
> userpg->cap_user_time = 1;
> userpg->time_mult = data->cyc2ns_mul;
> userpg->time_shift = data->cyc2ns_shift;
> userpg->time_offset = data->cyc2ns_offset - now;
>
> - userpg->cap_user_time_zero = 1;
> - userpg->time_zero = data->cyc2ns_offset;
> + /*
> + * cap_user_time_zero doesn't make sense when we're using a different
> + * time base for the records.
> + */
> + if (event->clock == &local_clock) {
> + userpg->cap_user_time_zero = 1;
> + userpg->time_zero = data->cyc2ns_offset;
> + }
>
> cyc2ns_read_end(data);
> }
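
As an aside, the userpage math this hunk keeps alive: when
cap_user_time_zero remains set (i.e. the event still uses local_clock),
userspace can convert a raw TSC value into a perf timestamp. A hedged
sketch following the conversion documented in
include/uapi/linux/perf_event.h (the fields must be sampled
consistently under the userpg->lock seqcount; untested):

#include <linux/types.h>

/*
 * Convert a raw cycle counter value (e.g. from rdtsc) into a perf
 * timestamp using the mmap'ed userpage fields. time_mult, time_shift
 * and time_zero must all come from one consistent read of the page.
 */
static inline __u64 tsc_to_perf_time(__u64 cyc, __u32 time_mult,
				     __u16 time_shift, __u64 time_zero)
{
	__u64 quot = cyc >> time_shift;
	__u64 rem  = cyc & (((__u64)1 << time_shift) - 1);

	return time_zero + quot * time_mult +
	       ((rem * time_mult) >> time_shift);
}
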
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index b16eac5..4015540 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -173,6 +173,7 @@ struct perf_event;
> * pmu::capabilities flags
> */
> #define PERF_PMU_CAP_NO_INTERRUPT 0x01
> +#define PERF_PMU_CAP_NO_NMI 0x02
>
> /**
> * struct pmu - generic performance monitoring unit
> @@ -457,6 +458,7 @@ struct perf_event {
> struct pid_namespace *ns;
> u64 id;
>
> + u64 (*clock)(void);
> perf_overflow_handler_t overflow_handler;
> void *overflow_handler_context;
>
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 1e3cd07..3bb40dda 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -326,7 +326,8 @@ struct perf_event_attr {
> exclude_callchain_user : 1, /* exclude user callchains */
> mmap2 : 1, /* include mmap with inode data */
> comm_exec : 1, /* flag comm events that are due to an exec */
> - __reserved_1 : 39;
> + use_clockid : 1, /* use @clockid for time fields */
> + __reserved_1 : 38;
>
> union {
> __u32 wakeup_events; /* wakeup every n events */
> @@ -355,8 +356,7 @@ struct perf_event_attr {
> */
> __u32 sample_stack_user;
>
> - /* Align to u64. */
> - __u32 __reserved_2;
> + __s32 clockid;
> /*
> * Defines set of regs to dump for each sample
> * state captured on:
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index bb1a7c3..c40c2ca 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -327,6 +327,11 @@ static inline u64 perf_clock(void)
> return local_clock();
> }
>
> +static inline u64 perf_event_clock(struct perf_event *event)
> +{
> + return event->clock();
> +}
> +
> static inline struct perf_cpu_context *
> __get_cpu_context(struct perf_event_context *ctx)
> {
> @@ -4762,7 +4767,7 @@ static void __perf_event_header__init_id(struct perf_event_header *header,
> }
>
> if (sample_type & PERF_SAMPLE_TIME)
> - data->time = perf_clock();
> + data->time = perf_event_clock(event);
>
> if (sample_type & (PERF_SAMPLE_ID | PERF_SAMPLE_IDENTIFIER))
> data->id = primary_event_id(event);
> @@ -5340,6 +5345,8 @@ static void perf_event_task_output(struct perf_event *event,
> task_event->event_id.tid = perf_event_tid(event, task);
> task_event->event_id.ptid = perf_event_tid(event, current);
>
> + task_event->event_id.time = perf_event_clock(event);
> +
> perf_output_put(&handle, task_event->event_id);
>
> perf_event__output_id_sample(event, &handle, &sample);
> @@ -5373,7 +5380,7 @@ static void perf_event_task(struct task_struct *task,
> /* .ppid */
> /* .tid */
> /* .ptid */
> - .time = perf_clock(),
> + /* .time */
> },
> };
>
> @@ -5749,7 +5756,7 @@ static void perf_log_throttle(struct perf_event *event, int enable)
> .misc = 0,
> .size = sizeof(throttle_event),
> },
> - .time = perf_clock(),
> + .time = perf_event_clock(event),
> .id = primary_event_id(event),
> .stream_id = event->id,
> };
> @@ -6293,6 +6300,8 @@ static int perf_swevent_init(struct perf_event *event)
> static struct pmu perf_swevent = {
> .task_ctx_nr = perf_sw_context,
>
> + .capabilities = PERF_PMU_CAP_NO_NMI,
> +
> .event_init = perf_swevent_init,
> .add = perf_swevent_add,
> .del = perf_swevent_del,
> @@ -6636,6 +6645,8 @@ static int cpu_clock_event_init(struct perf_event *event)
> static struct pmu perf_cpu_clock = {
> .task_ctx_nr = perf_sw_context,
>
> + .capabilities = PERF_PMU_CAP_NO_NMI,
> +
> .event_init = cpu_clock_event_init,
> .add = cpu_clock_event_add,
> .del = cpu_clock_event_del,
> @@ -6715,6 +6726,8 @@ static int task_clock_event_init(struct perf_event *event)
> static struct pmu perf_task_clock = {
> .task_ctx_nr = perf_sw_context,
>
> + .capabilities = PERF_PMU_CAP_NO_NMI,
> +
> .event_init = task_clock_event_init,
> .add = task_clock_event_add,
> .del = task_clock_event_del,
> @@ -7200,6 +7213,10 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
> event->hw.target = task;
> }
>
> + event->clock = &local_clock;
> + if (parent_event)
> + event->clock = parent_event->clock;
> +
> if (!overflow_handler && parent_event) {
> overflow_handler = parent_event->overflow_handler;
> context = parent_event->overflow_handler_context;
> @@ -7422,6 +7439,12 @@ perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
> if (output_event->cpu == -1 && output_event->ctx != event->ctx)
> goto out;
>
> + /*
> + * Mixing clocks in the same buffer is trouble you don't need.
> + */
> + if (output_event->clock != event->clock)
> + goto out;
> +
> set:
> mutex_lock(&event->mmap_mutex);
> /* Can't redirect output if we've got an active mmap() */
> @@ -7454,6 +7477,43 @@ static void mutex_lock_double(struct mutex *a, struct mutex *b)
> mutex_lock_nested(b, SINGLE_DEPTH_NESTING);
> }
>
> +static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
> +{
> + bool nmi_safe = false;
> +
> + switch (clk_id) {
> + case CLOCK_MONOTONIC:
> + event->clock = &ktime_get_mono_fast_ns;
> + nmi_safe = true;
> + break;
> +
> + case CLOCK_MONOTONIC_RAW:
> + event->clock = &ktime_get_raw_fast_ns;
> + nmi_safe = true;
> + break;
> +
> + case CLOCK_REALTIME:
> + event->clock = &ktime_get_real_ns;
> + break;
> +
> + case CLOCK_BOOTTIME:
> + event->clock = &ktime_get_boot_ns;
> + break;
> +
> + case CLOCK_TAI:
> + event->clock = &ktime_get_tai_ns;
> + break;
> +
Can all those clocks be safely called from an NMI context?
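
And on the perf_event_set_output() hunk above: if I read it right,
redirecting events with different clocks into one buffer is now
rejected. A hedged sketch of what I would expect to fail (untested;
same hand-rolled syscall wrapper as in the earlier sketch):

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	int fd1, fd2;

	memset(&attr, 0, sizeof(attr));
	attr.size        = sizeof(attr);
	attr.type        = PERF_TYPE_SOFTWARE;
	attr.config      = PERF_COUNT_SW_TASK_CLOCK;
	attr.use_clockid = 1;

	attr.clockid = CLOCK_MONOTONIC;
	fd1 = (int)sys_perf_event_open(&attr, 0, -1, -1, 0);

	attr.clockid = CLOCK_MONOTONIC_RAW;
	fd2 = (int)sys_perf_event_open(&attr, 0, -1, -1, 0);

	if (fd1 < 0 || fd2 < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* redirecting fd2's output into fd1's buffer mixes clocks... */
	if (ioctl(fd2, PERF_EVENT_IOC_SET_OUTPUT, fd1) < 0)
		perror("SET_OUTPUT"); /* ...so this should fail (EINVAL) */

	return 0;
}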