Message-ID: <YjhUjotmo+kYvoNP@google.com>
Date: Mon, 21 Mar 2022 11:33:50 +0100
From: "Steinar H. Gunderson" <sesse@...gle.com>
To: Adrian Hunter <adrian.hunter@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...hat.com>,
Namhyung Kim <namhyung@...nel.org>,
linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] perf intel-pt: Synthesize cycle events
On Mon, Mar 21, 2022 at 11:16:56AM +0200, Adrian Hunter wrote:
> I had another look at this and it seemed *mostly* OK to me. One change
> I would make is to subject the cycle period to the logic of the 'A'
> option (approximate IPC).
>
> So what does the 'A' option do?
>
> By default, IPC is output only when the exact number of cycles and
> instructions is known for the sample. Decoding walks the instructions
> to reconstruct the control flow, so the exact instruction count is
> always known, but a cycle count (CYC packet) is produced only together
> with another packet, so it is exact only at indirect/async branches or
> at the first conditional branch of a TNT packet.
Ah, I hadn't considered that you only get a cycle count at the first
branch of each packet. That's a bit unfortunate for exact cycle counts,
since I guess TNT packets can also easily cross function boundaries?
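To check my own understanding, here is roughly the gating I picture, as
a stand-alone sketch (all names are made up by me, not the actual
decoder code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct sample_counts {
        uint64_t insn_cnt;   /* instructions since the last exact point */
        uint64_t cyc_cnt;    /* cycles since the last exact point */
        bool cyc_exact;      /* CYC packet landed exactly on this branch? */
};

/* Return the IPC if we may report it, else 0: exact samples always
 * qualify, approximate ones only when the user asked for 'A'. */
static double sample_ipc(const struct sample_counts *s, bool approx_ok)
{
        if (!s->cyc_cnt || (!s->cyc_exact && !approx_ok))
                return 0.0;
        return (double)s->insn_cnt / (double)s->cyc_cnt;
}

int main(void)
{
        struct sample_counts s = {
                .insn_cnt = 120, .cyc_cnt = 80, .cyc_exact = false,
        };

        printf("default: %.2f\n", sample_ipc(&s, false)); /* 0.00: suppressed */
        printf("with A:  %.2f\n", sample_ipc(&s, true));  /* 1.50: approximate */
        return 0;
}

Is that the right mental model?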
> So the cycle sample function looks like this:
>
> static int intel_pt_synth_cycle_sample(struct intel_pt_queue *ptq)
>
> [...]
>
> With regard to the results you got with perf report, please try:
>
> perf report --itrace=y0nse --show-total-period --stdio
>
> and see if the percentages and cycle counts for rarely executed
> functions make more sense.
I already run mostly with 0ns period, so I don't think that's it.
I tried your new version, and it's very similar to your previous one;
there are some small changes (the largest being one function going from
2.5% to about 2.2%), but the general gist of it is the same.
I am increasingly leaning towards the conclusion that my original
version is somehow wrong, though.
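For reference, this is the shape I imagine for the synthesis, with the
period gated the same way as the IPC; purely my reading, with invented
names, not the code from your patch:

#include <stdbool.h>
#include <stdint.h>

struct queue_state {
        uint64_t cyc_cnt;      /* running decoder cycle count */
        uint64_t last_cyc_cnt; /* cycle count at the last cycle sample */
        bool cyc_exact;        /* CYC packet seen exactly at this point? */
        bool approx_ipc;       /* user passed the 'A' itrace option */
};

/* Emit a cycle sample whose period is the cycles elapsed since the
 * last sample, but only when the count is exact or approximation is
 * explicitly allowed. */
static int synth_cycle_sample(struct queue_state *q,
                              int (*emit)(uint64_t period))
{
        uint64_t period = q->cyc_cnt - q->last_cyc_cnt;

        if (!period || (!q->cyc_exact && !q->approx_ipc))
                return 0; /* nothing reliable to attribute yet */

        q->last_cyc_cnt = q->cyc_cnt;
        return emit(period);
}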
By the way, I noticed that synthesized call stacks do not respect
--inline; is that on purpose? The patch seems simple enough (just
a call to add_inlines), although it exposes extreme slowness in libbfd
when run over large binaries, which I'll have to look into.
(10+ ms for each address-to-symbol lookup is rather expensive when you
have 4M samples to churn through!)
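If it stays that slow, I'll probably try memoizing the lookups;
something like this toy direct-mapped cache, with resolve_addr()
standing in for the expensive libbfd call (hypothetical, not perf's
actual symbolization code):

#include <stdint.h>

#define CACHE_SLOTS 65536

struct cache_ent {
        uint64_t addr;
        const char *sym; /* NULL means the slot is empty */
};

static struct cache_ent cache[CACHE_SLOTS];

/* Stand-in for the expensive libbfd symbolization. */
static const char *resolve_addr(uint64_t addr)
{
        (void)addr;
        return "some_function"; /* imagine ~10 ms of bfd work here */
}

static const char *resolve_addr_cached(uint64_t addr)
{
        struct cache_ent *e = &cache[(addr >> 4) % CACHE_SLOTS];

        if (e->sym && e->addr == addr)
                return e->sym; /* hit: skip libbfd entirely */

        e->addr = addr;
        e->sym = resolve_addr(addr); /* miss: pay the cost once */
        return e->sym;
}

Given how often the same addresses repeat across 4M samples, even a
dumb cache like this should take most of the sting out.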
/* Steinar */