[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJN39ogefQc3i4qRuQg4HmMhqn-Ah1aHYTQ1=-fr0WSBhXiShw@mail.gmail.com>
Date: Fri, 5 Aug 2016 10:22:08 -0700
From: Brendan Gregg <bgregg@...flix.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Alexei Starovoitov <alexei.starovoitov@...il.com>,
Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
linux-kernel@...r.kernel.org, Alexei Starovoitov <ast@...nel.org>,
Wang Nan <wangnan0@...wei.com>
Subject: Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
On Fri, Aug 5, 2016 at 3:52 AM, Peter Zijlstra <peterz@...radead.org> wrote:
> On Thu, Aug 04, 2016 at 10:24:06PM -0700, Alexei Starovoitov wrote:
>> tracepoints are actually zero overhead already via static-key mechanism.
>> I don't think Peter's objection for the tracepoint was due to overhead.
>
> Almost 0, they still have some I$ footprint, but yes. My main worry is
> that we can feed tracepoints into perf, so having tracepoints in perf is
> tricky.
Coincidentally I$ footprint was my most recent use case for needing
this: I have an I$ busting workload, and wanting to profile
instructions at a very high rate to get a breakdown of I$ population.
(Normally I'd use I$ miss overflow, but none of our Linux systems have
PMCs: cloud.)
> I also don't much like this tracepoint being specific to the hrtimer
> bits, I can well imagine people wanting to do the same thing for
> hardware based samples or whatnot.
Sure, which is why I thought we'd have two in a perf category. I'm all
for PMCs events, even though we can't currently use them!
>
>> > The perf:perf_hrtimer probe point is also reading state mid-way
>> > through a function, so it's not quite as simple as wrapping the
>> > function pointer. I do like that idea, though, but for things like
>> > struct file_operations.
>
> So what additional state to you need?
I was pulling in regs after get_irq_regs(), struct perf_event *event
after it's populated. Not that hard to duplicate. Just noting it
didn't map directly to the function entry.
I wanted perf_event just for event->ctx->task->pid, so that a BPF
program can differentiate between it's samples and other concurrent
sessions.
(I was thinking of changing my patch to expose pid_t instead of
perf_event, since I was noticing it didn't add many instructions.)
[...]
>> instead of adding a tracepoint to perf_swevent_hrtimer we can replace
>> overflow_handler for that particular event with some form of bpf wrapper.
>> (probably new bpf program type). Then not only periodic events
>> will be triggering bpf prog, but pmu events as well.
>
> Exactly.
Although the timer use case is a bit different, and is via
hwc->hrtimer.function = perf_swevent_hrtimer.
[...]
>> The question is what to pass into the
>> program to make the most use out of it. 'struct pt_regs' is done deal.
>> but perf_sample_data we cannot pass as-is, since it's kernel internal.
>
> Urgh, does it have to be stable API? Can't we simply rely on the kernel
> headers to provide the right structure definition?
For timer it can be: struct pt_regs, pid_t.
So that would restrict your BPF program to one timer, since if you had
two (from one pid) you couldn't tell them apart. But I'm not sure of a
use case for two in-kernel timers. If there were, we could also add
struct perf_event_attr, which has enough info to tell things apart,
and is already exposed to user space.
I haven't looked into the PMU arguments, but perhaps that could be:
struct pt_regs, pid_t, struct perf_event_attr.
Thanks,
Brendan
Powered by blists - more mailing lists