Message-ID: <CAM9d7cg+HD3-vLXX_rUSg1kWSZ3MGeyrQwdJoa5CgbZjeD2+GA@mail.gmail.com>
Date: Sat, 13 Mar 2021 11:47:51 +0900
From: Namhyung Kim <namhyung@...nel.org>
To: Song Liu <songliubraving@...com>
Cc: linux-kernel <linux-kernel@...r.kernel.org>,
Kernel Team <Kernel-team@...com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
Jiri Olsa <jolsa@...nel.org>
Subject: Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
On Sat, Mar 13, 2021 at 12:38 AM Song Liu <songliubraving@...com> wrote:
>
>
>
> > On Mar 12, 2021, at 12:36 AM, Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > Hi,
> >
> > On Fri, Mar 12, 2021 at 11:03 AM Song Liu <songliubraving@...com> wrote:
> >>
> >> perf uses performance monitoring counters (PMCs) to monitor system
> >> performance. The PMCs are limited hardware resources. For example,
> >> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>
> >> Modern data center systems use these PMCs in many different ways:
> >> system level monitoring, (maybe nested) container level monitoring, per
> >> process monitoring, profiling (in sample mode), etc. In some cases,
> >> there are more active perf_events than available hardware PMCs. To allow
> >> all perf_events to have a chance to run, it is necessary to do expensive
> >> time multiplexing of events.
> >>
> >> On the other hand, many monitoring tools count the common metrics (cycles,
> >> instructions). It is a waste to have multiple tools create multiple
> >> perf_events of "cycles" and occupy multiple PMCs.
> >>
> >> bperf tries to reduce such waste by allowing multiple perf_events of
> >> "cycles" or "instructions" (at different scopes) to share PMUs. Instead
> >> of having each perf-stat session read its own perf_events, bperf uses
> >> BPF programs to read the perf_events and aggregate readings to BPF maps.
> >> Then, the perf-stat session(s) read the values from these BPF maps.
> >>
> >> Please refer to the comment before the definition of bperf_ops for the
> >> description of bperf architecture.
> >
> > Interesting! Actually I thought about something similar before,
> > but my BPF knowledge is outdated. So I need to catch up but
> > haven't had time for it so far. ;-)
> >
> >>
> >> bperf is off by default. To enable it, pass the --use-bpf option to
> >> perf-stat. bperf uses a BPF hashmap to share information about the BPF
> >> programs and maps used by bperf. This map is pinned to bpffs. The
> >> default path is /sys/fs/bpf/bperf_attr_map. The user can change it
> >> with the --attr-map option.
> >>
> >> ---
> >> Known limitations:
> >> 1. Do not support per cgroup events;
> >> 2. Do not support monitoring of BPF program (perf-stat -b);
> >> 3. Do not support event groups.
> >
> > In my case, per cgroup event counting is very important.
> > And I'd like to do that with lots of cpus and cgroups.
>
> We can easily extend this approach to support cgroup events. I didn't
> implement it to keep the first version simple.
OK.
>
> > So I'm working on an in-kernel solution (without BPF),
> > I hope to share it soon.
>
> This is interesting! I cannot wait to see what it looks like. I spent
> quite some time trying to enable in-kernel sharing (not just cgroup
> events), but finally decided to try the BPF approach.
Well I found it hard to support generic event sharing that works
for all use cases. So I'm focusing on the per cgroup case only.
>
> >
> > And for event groups, it seems the current implementation
> > cannot handle more than one event (not even in a group).
> > That could be a serious limitation..
>
> It supports multiple events. Multiple events are independent, i.e.,
> "cycles" and "instructions" would use two independent leader programs.
OK, then do you need multiple bperf_attr_maps? Does it work
for an arbitrary number of events?
>
> >
> >>
> >> The following commands have been tested:
> >>
> >> perf stat --use-bpf -e cycles -a
> >> perf stat --use-bpf -e cycles -C 1,3,4
> >> perf stat --use-bpf -e cycles -p 123
> >> perf stat --use-bpf -e cycles -t 100,101
> >
> > Hmm... so it loads both leader and follower programs if needed, right?
> > Does it support multiple followers with different targets at the same time?
>
> Yes, the whole idea is to have one leader program and multiple follower
> programs. If we only run one of these commands at a time, it will load
> one leader and one follower. If we run multiple of them in parallel,
> they will share the same leader program and load multiple follower
> programs.
>
> I actually tested more than the commands above. The list means
> we support -a, -C, -p, and -t.
>
> Currently, this works for multiple events and different parallel
> perf-stat sessions. The two commands below will work well in parallel:
>
> perf stat --use-bpf -e ref-cycles,instructions -a
> perf stat --use-bpf -e ref-cycles,cycles -C 1,3,5
>
> Note the use of ref-cycles, which can only use one counter on Intel CPUs.
> With this approach, the above two commands will not do time multiplexing
> on ref-cycles.
Awesome!
Thanks,
Namhyung