Date:   Mon, 21 Mar 2022 11:33:50 +0100
From:   "Steinar H. Gunderson" <sesse@...gle.com>
To:     Adrian Hunter <adrian.hunter@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...hat.com>,
        Namhyung Kim <namhyung@...nel.org>,
        linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] perf intel-pt: Synthesize cycle events

On Mon, Mar 21, 2022 at 11:16:56AM +0200, Adrian Hunter wrote:
> I had another look at this and it seemed *mostly* OK to me.  One change
> I would make is to subject the cycle period to the logic of the 'A' option
> (approximate IPC).
> 
> So what does the 'A' option do?
> 
> By default, IPC is output only when the exact number of cycles and
> instructions is known for the sample.  Decoding walks instructions
> to reconstruct the control flow, so the exact number of instructions
> is always known, but a cycle count (CYC packet) is produced only in
> conjunction with certain other packets, so an exact count is available
> only at indirect/async branches or at the first conditional branch of
> a TNT packet.

Ah, I hadn't thought of the fact that you only get a cycle count at the
first branch per packet. It's a bit unfortunate for (exact) cycle counts,
since I guess TNT packets can also easily cross functions?
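
To make that default gating concrete, here is a rough illustration; this
is not the perf source, and all of the names are made up:

	#include <stdbool.h>
	#include <stdint.h>

	/* Illustrative only: the instruction count is always exact, since
	 * the decoder walks every instruction, so IPC hinges on the cycle
	 * count being exact for this sample (a CYC packet bound to it),
	 * unless the 'A' option allows falling back to the last known,
	 * approximate count. */
	struct ipc_state {
		uint64_t insn_cnt;   /* exact by construction */
		uint64_t cyc_cnt;    /* updated only when a CYC packet arrives */
		bool     cyc_exact;  /* true if cyc_cnt is exact for this IP */
	};

	static bool emit_ipc(const struct ipc_state *s, bool approx_ipc)
	{
		return s->insn_cnt && s->cyc_cnt &&
		       (s->cyc_exact || approx_ipc);
	}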

> So the cycle sample function looks like this:
> 
> static int intel_pt_synth_cycle_sample(struct intel_pt_queue *ptq)
>
> [...]
>
> With regard to the results you got with perf report, please try:
> 
> 	perf report --itrace=y0nse --show-total-period --stdio
> 
> and see if the percentages and cycle counts for rarely executed
> functions make more sense.

I already run mostly with a 0ns period, so I don't think that's it.
I tried your new version, and it's very similar to your previous one;
there are some small changes (the largest is that one function goes from
2.5% to 2.2% or so), but the general gist of it is the same.
I am increasingly leaning towards my original version being wrong
somehow, though.

By the way, I noticed that synthesized call stacks do not respect
--inline; is that on purpose? The patch seems simple enough (just
a call to add_inlines), although it exposes extreme slowness in libbfd
when run over large binaries, which I'll have to look into.
(10+ ms for each address-to-symbol lookup is rather expensive when you
have 4M samples to churn through; that works out to upwards of ten hours!)
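
For the shape of that --inline change, a hypothetical sketch follows;
add_inlines and symbol_conf.inline_name are the names in question, but
the signature and the surrounding code are assumed rather than taken
from the perf tree:

	#include <stdbool.h>
	#include <stdint.h>

	struct callchain_cursor;                /* opaque for this sketch */

	/* set when the user passes --inline */
	extern struct symbol_conf { bool inline_name; } symbol_conf;

	/* assumed signature for the helper mentioned above */
	extern int add_inlines(struct callchain_cursor *cursor, uint64_t ip);

	static int synth_callchain_entry(struct callchain_cursor *cursor,
					 uint64_t ip)
	{
		/* ...append the regular frame for ip first... */
		if (symbol_conf.inline_name)   /* honour --inline here too */
			return add_inlines(cursor, ip);
		return 0;
	}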

/* Steinar */
