Message-ID: <YjiuoEUL6jH32cBi@google.com>
Date: Mon, 21 Mar 2022 17:58:08 +0100
From: "Steinar H. Gunderson" <sesse@...gle.com>
To: Adrian Hunter <adrian.hunter@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...hat.com>,
Namhyung Kim <namhyung@...nel.org>,
linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] perf intel-pt: Synthesize cycle events
On Mon, Mar 21, 2022 at 03:09:08PM +0200, Adrian Hunter wrote:
> Yes, it can cross calls and returns. 'returns' due to "Return Compression"
> which can be switched off at record time with config term noretcomp, but
> that may cause more overflows / trace data loss.
>
> To get accurate times for a single function there is Intel PT
> address filtering.
>
> Otherwise LBRs can have cycle times.
Many interesting points; I'll be sure to look into them.
Meanwhile, should I send a new patch with your latest changes?
It's more your patch than mine now, it seems, but I think we're
converging on something useful.
>> By the way, I noticed that synthesized call stacks do not respect
>> --inline; is that on purpose? The patch seems simple enough (just
>> a call to add_inlines), although it exposes extreme slowness in libbfd
>> when run over large binaries, which I'll have to look into.
>> (10+ ms for each address-to-symbol lookup is rather expensive when you
>> have 4M samples to churn through!)
> No, not on purpose.
The patch appears to be trivial:
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -44,6 +44,7 @@
#include <linux/zalloc.h>
static void __machine__remove_thread(struct machine *machine, struct thread *th, bool lock);
+static int append_inlines(struct callchain_cursor *cursor, struct map_symbol *ms, u64 ip);
static struct dso *machine__kernel_dso(struct machine *machine)
{
@@ -2217,6 +2218,10 @@ static int add_callchain_ip(struct thread *thread,
ms.maps = al.maps;
ms.map = al.map;
ms.sym = al.sym;
+
+ if (append_inlines(cursor, &ms, ip) == 0)
+ return 0;
+
srcline = callchain_srcline(&ms, al.addr);
return callchain_cursor_append(cursor, ip, &ms,
branch, flags, nr_loop_iter,
but I'm seeing problems with “failing to process sample” when I go from
10us to 1us, so I'll have to look into that.
I've sent some libbfd patches to try to help with the slowness:
https://sourceware.org/git/?p=binutils-gdb.git;a=commitdiff;h=30cbd32aec30b4bc13427bbd87c4c63c739d4578
https://sourceware.org/pipermail/binutils/2022-March/120131.html
https://sourceware.org/pipermail/binutils/2022-March/120133.html
/* Steinar */