Message-ID: <CABPqkBSwuNpUJGTtW_2q7MRrueAH+9UN2f_fFvJDR4AzWG5APQ@mail.gmail.com>
Date: Mon, 2 Mar 2015 12:08:59 -0500
From: Stephane Eranian <eranian@...gle.com>
To: Kan Liang <kan.liang@...el.com>
Cc: Peter Zijlstra <a.p.zijlstra@...llo.nl>,
LKML <linux-kernel@...r.kernel.org>,
Ingo Molnar <mingo@...nel.org>,
Arnaldo Carvalho de Melo <acme@...radead.org>,
Andi Kleen <andi@...stfloor.org>
Subject: Re: [PATCH V5 3/6] perf, x86: large PEBS interrupt threshold
Hi,
I spent some time looking at this patch series and testing some scenarios.
On Mon, Feb 23, 2015 at 9:25 AM, Kan Liang <kan.liang@...el.com> wrote:
>
> From: Yan, Zheng <zheng.z.yan@...el.com>
>
> PEBS always had the capability to log samples to its buffers without
> an interrupt. Traditionally perf has not used this but always set the
> PEBS threshold to one.
>
> For frequently occurring events (like cycles or branches or load/store)
> this in turn requires using a relatively high sampling period to avoid
> overloading the system with PMIs. This in turn increases the sampling
> error.
>
> For the common cases we still need to use the PMI because the PEBS
> hardware has various limitations. The biggest one is that it cannot
> supply a callgraph. It also requires setting a fixed period, as the
> hardware does not support adaptive periods. Another issue is that it
> cannot supply a time stamp and some other options. To supply a TID it
> requires flushing on context switch. It can, however, supply the IP, the
> load/store address, TSX information, registers, and some other things.
>
> So we can make PEBS work for some specific cases: basically, as long as
> you can do without a callgraph and can set a fixed period, you can use
> this new PEBS mode.
>
> The main benefit is the ability to support much lower sampling periods
> (down to -c 1000) without excessive overhead.
>
> One use case is, for example, to increase the resolution of the c2c tool.
> Another is double checking when you suspect that standard sampling has
> too much sampling error.
>
> Some numbers on the overhead, using cycle soak, comparing the elapsed
> time from "kernbench -M -H" between plain (threshold set to one) and
> multi (large threshold).
> The test command for plain:
> "perf record -e cycles:p -c $period -- kernbench -M -H"
> The test command for multi:
> "perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
> (The only difference between the multi and plain test commands is the
> time stamp option. Since time stamps are not supported with a large
> PEBS threshold, the option can serve as a flag to indicate whether the
> large threshold was enabled during the test.)
>
> period   plain(Sec)  multi(Sec)  Delta
> 10003    32.7        16.5        16.2
> 20003    30.2        16.2        14.0
> 40003    18.6        14.1         4.5
> 80003    16.8        14.6         2.2
> 100003   16.9        14.1         2.8
> 800003   15.4        15.7        -0.3
Who collects with such small periods? I believe there would be other
side effects.
>
> 1000003  15.3        15.2         0.2
> 2000003  15.3        15.1         0.1
At more reasonable periods, the benefit seems to vanish. I don't quite know
of a scenario where one would absolutely need a very small period. This is
about statistical sampling, not tracing.
I believe that you need to know exactly what you are doing, i.e., what you
are measuring, for this improvement to make a difference and still provide
useful data. I believe that in system-wide mode the benefit vanishes
quickly, because you may not know in advance what you are measuring. The
key problem is the lack of timestamps, which help order the MMAP records
vs. the sample records. I tried with my recently released jitted code patch
series, and sure enough, with --no-time you cannot correlate samples to
symbols in the jitted code anymore. This is a case where you need the
timestamps to synchronize the user-level sampling data (obtained from the
JIT compiler) with the perf samples. Unless I know I am measuring jitted
code, I cannot use the PEBS buffer improvement without losing
symbolization capabilities.
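
To make the constraint concrete: as I read the patch, the large threshold
is only engaged by a check along these lines (paraphrasing the hunk below;
use_large_threshold()/use_single_threshold() are placeholders, not real
functions):

	/*
	 * Any sample_type bit outside PEBS_FREERUNNING_FLAGS, e.g. the
	 * PERF_SAMPLE_TIME that perf record enables unless --no-time is
	 * given, falls back to a threshold of one record, i.e., one PMI
	 * per sample.
	 */
	if ((hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
	    !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS))
		use_large_threshold();		/* placeholder */
	else
		use_single_threshold();		/* placeholder */
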
I believe the patch may only be useful for per-process monitoring,
where the user knows exactly what is measured. But then maybe a
much lower overhead is not the key factor in the collection. However,
if you were to measure in a production environment, you'd need to
minimize overhead, yet most commonly this is done in system-wide
mode.
Overall, I think the patch is only useful for a small set of monitoring
scenarios and for people using "extreme" sampling periods. It requires the
user to be aware of the behavior of the monitored application (because of
the no-timestamp requirement).
>
> With periods below 100003, plain (threshold one) causes much more
> overhead. With a 10003 sampling period, the elapsed time for multi is
> 2X faster than for plain (16.5s vs. 32.7s).
>
> Signed-off-by: Yan, Zheng <zheng.z.yan@...el.com>
> Signed-off-by: Kan Liang <kan.liang@...el.com>
> ---
> arch/x86/kernel/cpu/perf_event_intel_ds.c | 40 +++++++++++++++++++++++++++----
> 1 file changed, 36 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> index 1c16700..16fdb18 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
> @@ -250,7 +250,7 @@ static int alloc_pebs_buffer(int cpu)
> {
> struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
> int node = cpu_to_node(cpu);
> - int max, thresh = 1; /* always use a single PEBS record */
> + int max;
> void *buffer, *ibuffer;
>
> if (!x86_pmu.pebs)
> @@ -280,9 +280,6 @@ static int alloc_pebs_buffer(int cpu)
> ds->pebs_absolute_maximum = ds->pebs_buffer_base +
> max * x86_pmu.pebs_record_size;
>
> - ds->pebs_interrupt_threshold = ds->pebs_buffer_base +
> - thresh * x86_pmu.pebs_record_size;
> -
> return 0;
> }
>
> @@ -667,15 +664,35 @@ struct event_constraint *intel_pebs_constraints(struct perf_event *event)
> return &emptyconstraint;
> }
>
> +/*
> + * Flags PEBS can handle without a PMI.
> + *
> + * TID can only be handled by flushing at context switch.
> + */
> +#define PEBS_FREERUNNING_FLAGS \
> + (PERF_SAMPLE_IP | PERF_SAMPLE_TID | PERF_SAMPLE_ADDR | \
> + PERF_SAMPLE_ID | PERF_SAMPLE_CPU | PERF_SAMPLE_STREAM_ID | \
> + PERF_SAMPLE_DATA_SRC | PERF_SAMPLE_IDENTIFIER | \
> + PERF_SAMPLE_TRANSACTION)
> +
> +static inline bool pebs_is_enabled(struct cpu_hw_events *cpuc)
> +{
> + return (cpuc->pebs_enabled & ((1ULL << MAX_PEBS_EVENTS) - 1));
> +}
> +
> void intel_pmu_pebs_enable(struct perf_event *event)
> {
> struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> struct hw_perf_event *hwc = &event->hw;
> + struct debug_store *ds = cpuc->ds;
> + bool first_pebs;
> + u64 threshold;
>
> hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
> if (!event->attr.freq)
> hwc->flags |= PERF_X86_EVENT_AUTO_RELOAD;
>
> + first_pebs = !pebs_is_enabled(cpuc);
> cpuc->pebs_enabled |= 1ULL << hwc->idx;
>
> if (event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT)
> @@ -683,6 +700,21 @@ void intel_pmu_pebs_enable(struct perf_event *event)
> else if (event->hw.flags & PERF_X86_EVENT_PEBS_ST)
> cpuc->pebs_enabled |= 1ULL << 63;
>
> + /*
> + * When the event is constrained enough we can use a larger
> + * threshold and run the event with less frequent PMI.
> + */
> + if (0 && /* disable this temporarily */
> + (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) &&
> + !(event->attr.sample_type & ~PEBS_FREERUNNING_FLAGS)) {
> + threshold = ds->pebs_absolute_maximum -
> + x86_pmu.max_pebs_events * x86_pmu.pebs_record_size;
> + } else {
> + threshold = ds->pebs_buffer_base + x86_pmu.pebs_record_size;
> + }
> + if (first_pebs || ds->pebs_interrupt_threshold > threshold)
> + ds->pebs_interrupt_threshold = threshold;
> +
> /* Use auto-reload if possible to save a MSR write in the PMI */
> if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD) {
> ds->pebs_event_reset[hwc->idx] =
> --
> 1.8.3.2
>