linux-kernel - Re: [PATCH] perf intel-pt: Synthesize cycle events

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <371faf0d-f794-4a2e-0a1c-9d454d7c8b12@intel.com>
Date:   Mon, 21 Mar 2022 11:16:56 +0200
From:   Adrian Hunter <adrian.hunter@...el.com>
To:     "Steinar H. Gunderson" <sesse@...gle.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...hat.com>,
        Namhyung Kim <namhyung@...nel.org>,
        linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] perf intel-pt: Synthesize cycle events

On 16.3.2022 14.59, Steinar H. Gunderson wrote:
> On Wed, Mar 16, 2022 at 01:19:46PM +0200, Adrian Hunter wrote:
>>> I guess the good news is that the perf report coming out of your version
>>> looks more likely to me; I have some functions that are around 1% that
>>> shouldn't intuitively be that much (and, if I write some Perl to sum up
>>> the cycles from the IPC lines in perf script, are more around 0.1%).
>>> So perhaps we should stop chasing the difference? I don't know.
>> That doesn't sound right.  I will look at it more closely in the next few days.
> 
> If you need, I can supply the perf.data and binaries, but we're talking
> a couple of gigabytes of data (and I don't know immediately if there's
> an easy way I can package up everything perf.data references) :-)
> 
> /* Steinar */

I had another look at this and it seemed *mostly* OK for me.  One change
I would make is to subject the cycle period to the logic of the 'A' option
(approximate IPC).

So what does the 'A' option do.

By default, IPC is output only when the exact number of cycles and
instructions is known for the sample.  Decoding walks instructions
to reconstruct the control flow, so the exact number of instructions
is known, but the cycle count (CYC packet) is only produced with
another packet, so only indirect/async branches or the first
conditional branch of a TNT packet.

Reporting exact IPC makes sense when sampling every branch or
instruction, but makes less sense when sampling less often.

For example with:

$ perf record -e intel_pt/cyc/u uname
Linux
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.218 MB perf.data ]

Sampling every 50us, exact IPC is reported only twice:

$ perf script --itrace=i50us -F+ipc
           uname 2007962 [005] 2426597.185314:      91866 instructions:uH:      7f3feb913deb _dl_relocate_object+0x40b (/usr/lib/x86_64-linux-gnu/ld-2.31.so)
           uname 2007962 [005] 2426597.185353:      21959 instructions:uH:      7f3feb91158f do_lookup_x+0xcf (/usr/lib/x86_64-linux-gnu/ld-2.31.so)
           uname 2007962 [005] 2426597.185670:     129834 instructions:uH:      7f3feb72e05a read_alias_file+0x23a (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
           uname 2007962 [005] 2426597.185709:      39373 instructions:uH:      7f3feb72ed52 _nl_explode_name+0x52 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)
           uname 2007962 [005] 2426597.185947:     137486 instructions:uH:      7f3feb87e5f3 __strlen_avx2+0x13 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)         IPC: 0.88 (420518/472789) 
           uname 2007962 [005] 2426597.186026:      79196 instructions:uH:      7f3feb87e5f3 __strlen_avx2+0x13 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)         IPC: 1.34 (79196/59092) 
           uname 2007962 [005] 2426597.186066:      29855 instructions:uH:      7f3feb78dee6 _int_malloc+0x446 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)

But if we relax the requirement and just use the number of cycles
counted so far, whether it is exactly correct or not, we can get
approx IPC for every sample:

$ perf script --itrace=i50usA -F+ipc
           uname 2007962 [005] 2426597.185314:      91866 instructions:uH:      7f3feb913deb _dl_relocate_object+0x40b (/usr/lib/x86_64-linux-gnu/ld-2.31.so)    IPC: 0.74 (91866/122744) 
           uname 2007962 [005] 2426597.185353:      21959 instructions:uH:      7f3feb91158f do_lookup_x+0xcf (/usr/lib/x86_64-linux-gnu/ld-2.31.so)     IPC: 0.92 (21959/23822) 
           uname 2007962 [005] 2426597.185670:     129834 instructions:uH:      7f3feb72e05a read_alias_file+0x23a (/usr/lib/x86_64-linux-gnu/libc-2.31.so)      IPC: 0.77 (129834/167753) 
           uname 2007962 [005] 2426597.185709:      39373 instructions:uH:      7f3feb72ed52 _nl_explode_name+0x52 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)      IPC: 1.01 (39373/38881) 
           uname 2007962 [005] 2426597.185947:     137486 instructions:uH:      7f3feb87e5f3 __strlen_avx2+0x13 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)         IPC: 1.14 (137486/119589) 
           uname 2007962 [005] 2426597.186026:      79196 instructions:uH:      7f3feb87e5f3 __strlen_avx2+0x13 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)         IPC: 1.34 (79196/59092) 
           uname 2007962 [005] 2426597.186066:      29855 instructions:uH:      7f3feb78dee6 _int_malloc+0x446 (/usr/lib/x86_64-linux-gnu/libc-2.31.so)          IPC: 1.33 (29855/22282) 


So the cycle sample function looks like this:

static int intel_pt_synth_cycle_sample(struct intel_pt_queue *ptq)
{
	struct intel_pt *pt = ptq->pt;
	union perf_event *event = ptq->event_buf;
	struct perf_sample sample = { .ip = 0, };
	u64 period = 0;

	if (ptq->sample_ipc)
		period = ptq->ipc_cyc_cnt - ptq->last_cy_cyc_cnt;

	if (!period || intel_pt_skip_event(pt))
		return 0;

	intel_pt_prep_sample(pt, ptq, event, &sample);

	sample.id = ptq->pt->cycles_id;
	sample.stream_id = ptq->pt->cycles_id;
	sample.period = period;

	sample.cyc_cnt = period;
	sample.insn_cnt = ptq->ipc_insn_cnt - ptq->last_cy_insn_cnt;
	ptq->last_cy_insn_cnt = ptq->ipc_insn_cnt;
	ptq->last_cy_cyc_cnt = ptq->ipc_cyc_cnt;

	return intel_pt_deliver_synth_event(pt, event, &sample, pt->cycles_sample_type);
}


With regard to the results you got with perf report, please try:

	perf report --itrace=y0nse --show-total-period --stdio

and see if the percentages and cycle counts for rarely executed
functions make more sense.