linux-kernel - Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

Message-ID: <F55800AC-73A5-46A4-9E08-1DD00691267C@fb.com>
Date:   Fri, 12 Mar 2021 18:52:39 +0000
From:   Song Liu <songliubraving@...com>
To:     Arnaldo Carvalho de Melo <acme@...nel.org>
CC:     linux-kernel <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        "acme@...hat.com" <acme@...hat.com>,
        "namhyung@...nel.org" <namhyung@...nel.org>,
        "jolsa@...nel.org" <jolsa@...nel.org>,
        "linux-perf-users@...stprotocols.net" 
        <linux-perf-users@...stprotocols.net>
Subject: Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF



> On Mar 12, 2021, at 6:24 AM, Arnaldo Carvalho de Melo <acme@...nel.org> wrote:
> 
> Em Thu, Mar 11, 2021 at 06:02:57PM -0800, Song Liu escreveu:
>> perf uses performance monitoring counters (PMCs) to monitor system
>> performance. The PMCs are limited hardware resources. For example,
>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>> 
>> Modern data center systems use these PMCs in many different ways:
>> system level monitoring, (maybe nested) container level monitoring, per
>> process monitoring, profiling (in sample mode), etc. In some cases,
>> there are more active perf_events than available hardware PMCs. To allow
>> all perf_events to have a chance to run, it is necessary to do expensive
>> time multiplexing of events.
>> 
>> On the other hand, many monitoring tools count the common metrics (cycles,
>> instructions). It is a waste to have multiple tools create multiple
>> perf_events of "cycles" and occupy multiple PMCs.
>> 
>> bperf tries to reduce such waste by allowing multiple perf_events of
>> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
>> of having each perf-stat session read its own perf_events, bperf uses
>> BPF programs to read the perf_events and aggregate readings to BPF maps.
>> Then, the perf-stat session(s) reads the values from these BPF maps.
>> 
>> Please refer to the comment before the definition of bperf_ops for the
>> description of bperf architecture.
>> 
>> bperf is off by default. To enable it, pass --use-bpf option to perf-stat.
>> bperf uses a BPF hashmap to share information about BPF programs and maps
>> used by bperf. This map is pinned to bpffs. The default address is
>> /sys/fs/bpf/bperf_attr_map. The user could change the address with option
>> --attr-map.
>> 
>> ---
>> Known limitations:
>> 1. Do not support per cgroup events;
>> 2. Do not support monitoring of BPF program (perf-stat -b);
>> 3. Do not support event groups.
> 
> Cool stuff, but I think you can break this up into more self contained
> patches, see below.
> 
> Apart from that, some suggestions/requests:
> 
> We need a shell 'perf test' that uses some synthetic workload so that we
> can count events with/without --use-bpf (--bpf-counters is my
> alternative name, see below), and then compare if the difference is
> under some acceptable range.
> 
> As a followup patch we could have something like:
> 
> perf config stat.bpf-counters=yes
> 
> That would make 'perf stat' use BPF counters for what it can, using the
> default method for the non-supported targets, emitting some 'perf stat
> -v' visible warning (i.e. a debug message), i.e. make it opt-in that the
> user wants to use BPF counters for all that is possible at that point in
> time.
> 
> Thanks for working on this,
> 
> - Arnaldo
> 
>> The following commands have been tested:
>> 
>>   perf stat --use-bpf -e cycles -a
>>   perf stat --use-bpf -e cycles -C 1,3,4
>>   perf stat --use-bpf -e cycles -p 123
>>   perf stat --use-bpf -e cycles -t 100,101
>> 
>> Signed-off-by: Song Liu <songliubraving@...com>
>> ---
>> tools/perf/Makefile.perf                      |   1 +
>> tools/perf/builtin-stat.c                     |  20 +-
>> tools/perf/util/bpf_counter.c                 | 552 +++++++++++++++++-
>> tools/perf/util/bpf_skel/bperf.h              |  14 +
>> tools/perf/util/bpf_skel/bperf_follower.bpf.c |  65 +++
>> tools/perf/util/bpf_skel/bperf_leader.bpf.c   |  46 ++
>> tools/perf/util/evsel.h                       |  20 +-
>> tools/perf/util/target.h                      |   4 +-
>> 8 files changed, 712 insertions(+), 10 deletions(-)
>> create mode 100644 tools/perf/util/bpf_skel/bperf.h
>> create mode 100644 tools/perf/util/bpf_skel/bperf_follower.bpf.c
>> create mode 100644 tools/perf/util/bpf_skel/bperf_leader.bpf.c
>> 
>> diff --git a/tools/perf/Makefile.perf b/tools/perf/Makefile.perf
>> index f6e609673de2b..ca9aa08e85a1f 100644
>> --- a/tools/perf/Makefile.perf
>> +++ b/tools/perf/Makefile.perf
>> @@ -1007,6 +1007,7 @@ python-clean:
>> SKEL_OUT := $(abspath $(OUTPUT)util/bpf_skel)
>> SKEL_TMP_OUT := $(abspath $(SKEL_OUT)/.tmp)
>> SKELETONS := $(SKEL_OUT)/bpf_prog_profiler.skel.h
>> +SKELETONS += $(SKEL_OUT)/bperf_leader.skel.h $(SKEL_OUT)/bperf_follower.skel.h
>> 
>> ifdef BUILD_BPF_SKEL
>> BPFTOOL := $(SKEL_TMP_OUT)/bootstrap/bpftool
>> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
>> index 2e2e4a8345ea2..34df713a8eea9 100644
>> --- a/tools/perf/builtin-stat.c
>> +++ b/tools/perf/builtin-stat.c
>> @@ -792,6 +792,12 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> 	}
>> 
>> 	evlist__for_each_cpu (evsel_list, i, cpu) {
>> +		/*
>> +		 * bperf calls evsel__open_per_cpu() in bperf__load(), so
>> +		 * no need to call it again here.
>> +		 */
>> +		if (target.use_bpf)
>> +			break;
>> 		affinity__set(&affinity, cpu);
>> 
>> 		evlist__for_each_entry(evsel_list, counter) {
>> @@ -925,15 +931,15 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> 	/*
>> 	 * Enable counters and exec the command:
>> 	 */
>> -	t0 = rdclock();
>> -	clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> -
>> 	if (forks) {
>> 		evlist__start_workload(evsel_list);
>> 		err = enable_counters();
>> 		if (err)
>> 			return -1;
>> 
>> +		t0 = rdclock();
>> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> +
>> 		if (interval || timeout || evlist__ctlfd_initialized(evsel_list))
>> 			status = dispatch_events(forks, timeout, interval, &times);
>> 		if (child_pid != -1) {
>> @@ -954,6 +960,10 @@ static int __run_perf_stat(int argc, const char **argv, int run_idx)
>> 		err = enable_counters();
>> 		if (err)
>> 			return -1;
>> +
>> +		t0 = rdclock();
>> +		clock_gettime(CLOCK_MONOTONIC, &ref_time);
>> +
>> 		status = dispatch_events(forks, timeout, interval, &times);
>> 	}
>> 
> 
> The above two hunks seem out of place, i.e. can they go to a different
> patch and with an explanation about why this is needed?

Actually, I am still debating whether we want the above change in a separate 
patch. It is related to the following change. 

[...]

>> +	/*
>> +	 * Attaching the skeleton takes non-trivial time (0.2s+ on a kernel
>> +	 * with some debug options enabled). This shows as a longer first
>> +	 * interval:
>> +	 *
>> +	 * # perf stat -e cycles -a -I 1000
>> +	 * #           time             counts unit events
>> +	 *      1.267634674     26,259,166,523      cycles
>> +	 *      2.271637827     22,550,822,286      cycles
>> +	 *      3.275406553     22,852,583,744      cycles
>> +	 *
>> +	 * Fix this by zeroing accum_readings after attaching the program.
>> +	 */
>> +	bperf_sync_counters(evsel);
>> +	entry_cnt = bpf_map__max_entries(skel->maps.accum_readings);
>> +	memset(values, 0, sizeof(struct bpf_perf_event_value) * num_cpu_bpf);
>> +
>> +	for (i = 0; i < entry_cnt; i++) {
>> +		bpf_map_update_elem(bpf_map__fd(skel->maps.accum_readings),
>> +				    &i, values, BPF_ANY);
>> +	}
>> +	return 0;
>> +}

Attaching the skeleton takes non-trivial time, so the first interval comes
out longer than the others (1.26s in the example above). To fix this,
__run_perf_stat() now reads t0 and ref_time after enable_counters() instead
of before it.
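
To illustrate the timestamp side of this, here is a small standalone model
(not perf code). It assumes the interval timer only starts once
enable_counters() returns and that the printed timestamp is the time elapsed
since t0. The name first_stamp_ns() and its parameters are made up for this
sketch, and the 267ms attach cost used below is only an example value chosen
to resemble the output above.

```c
#include <assert.h>

/*
 * Simplified model, not perf code: timestamp (in ns) printed for the
 * first interval.
 *
 * attach_ns:       hypothetical time spent in enable_counters()
 * interval_ns:     the -I interval
 * t0_after_enable: 0 = t0 read before enable_counters() (old order),
 *                  1 = t0 read after it (new order)
 *
 * Time is measured from the moment enable_counters() is called.
 */
long long first_stamp_ns(long long attach_ns, long long interval_ns,
			 int t0_after_enable)
{
	long long t0, first_tick;

	assert(attach_ns >= 0 && interval_ns > 0);

	/* Assumption: the first interval ends interval_ns after enabling. */
	first_tick = attach_ns + interval_ns;
	t0 = t0_after_enable ? attach_ns : 0;

	return first_tick - t0;
}
```

With attach_ns = 267000000 and a 1s interval, the old order gives
1267000000 (about 1.267s) and the new order gives 1000000000. This only
models the timestamp; the inflated count in the first interval is handled
separately by zeroing accum_readings after attaching.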

Maybe a comment in __run_perf_stat() is better than a separate patch?

Thanks,
Song