linux-kernel - Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

Message-ID: <CAM9d7ch_axD_4E0W7MEx8ueeq9QsvhxNWaJ0J3AtVgeKqKQmbA@mail.gmail.com>
Date:   Fri, 19 Mar 2021 09:54:59 +0900
From:   Namhyung Kim <namhyung@...nel.org>
To:     Song Liu <songliubraving@...com>
Cc:     Arnaldo <arnaldo.melo@...il.com>, Jiri Olsa <jolsa@...hat.com>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        Arnaldo Carvalho de Melo <acme@...hat.com>,
        Jiri Olsa <jolsa@...nel.org>
Subject: Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF

On Fri, Mar 19, 2021 at 9:22 AM Song Liu <songliubraving@...com> wrote:
>
>
>
> > On Mar 18, 2021, at 5:09 PM, Arnaldo <arnaldo.melo@...il.com> wrote:
> >
> >
> >
> > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa <jolsa@...hat.com> wrote:
> >> On Thu, Mar 18, 2021 at 03:52:51AM +0000, Song Liu wrote:
> >>>
> >>>
> >>>> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo <acme@...nel.org> wrote:
> >>>>
> >>>> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> >>>>> Hi Song,
> >>>>>
> >>>>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu <songliubraving@...com> wrote:
> >>>>>>
> >>>>>> perf uses performance monitoring counters (PMCs) to monitor system
> >>>>>> performance. The PMCs are limited hardware resources. For example,
> >>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>>>>>
> >>>>>> Modern data center systems use these PMCs in many different ways:
> >>>>>> system level monitoring, (maybe nested) container level monitoring, per
> >>>>>> process monitoring, profiling (in sample mode), etc. In some cases,
> >>>>>> there are more active perf_events than available hardware PMCs. To allow
> >>>>>> all perf_events to have a chance to run, it is necessary to do expensive
> >>>>>> time multiplexing of events.
> >>>>>>
> >>>>>> On the other hand, many monitoring tools count the common metrics (cycles,
> >>>>>> instructions). It is a waste to have multiple tools create multiple
> >>>>>> perf_events of "cycles" and occupy multiple PMCs.
> >>>>>
> >>>>> Right, it'd be really helpful when the PMCs are frequently or mostly shared.
> >>>>> But it'd also increase the overhead for uncontended cases as BPF programs
> >>>>> need to run on every context switch.  Depending on the workload, it may
> >>>>> cause a non-negligible performance impact.  So users should be aware of it.
> >>>>
> >>>> Would be interesting to, humm, measure both cases to have a firm number
> >>>> of the impact, how many instructions are added when sharing using
> >>>> --bpf-counters?
> >>>>
> >>>> I.e. compare the "expensive time multiplexing of events" with its
> >>>> avoidance by using --bpf-counters.
> >>>>
> >>>> Song, have you performed such measurements?
> >>>
> >>> I have got some measurements with perf-bench-sched-messaging:
> >>>
> >>> The system: x86_64 with 23 cores (46 HT)
> >>>
> >>> The perf-stat command:
> >>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <target, etc.>
> >>>
> >>> The benchmark command and output:
> >>> ./perf bench sched messaging -g 40 -l 50000 -t
> >>> # Running 'sched/messaging' benchmark:
> >>> # 20 sender and receiver threads per group
> >>> # 40 groups == 1600 threads run
> >>>     Total time: 10X.XXX [sec]
> >>>
> >>>
> >>> I use the "Total time" as measurement, so smaller number is better.
> >>>
> >>> For each condition, I ran the command 5 times, and took the median of
> >>> "Total time".
> >>>
> >>> Baseline (no perf-stat)               104.873 [sec]
> >>> # global
> >>> perf stat -a                          107.887 [sec]
> >>> perf stat -a --bpf-counters           106.071 [sec]
> >>> # per task
> >>> perf stat                             106.314 [sec]
> >>> perf stat --bpf-counters              105.965 [sec]
> >>> # per cpu
> >>> perf stat -C 1,3,5                    107.063 [sec]
> >>> perf stat -C 1,3,5 --bpf-counters     106.406 [sec]
> >>
> >> I can't see why it's actually faster than normal perf ;-)
> >> would be worth finding out
> >
> > Isn't this all about contended cases?
>
> Yeah, the normal perf is doing time multiplexing; while --bpf-counters
> doesn't need it.

Yep, so for uncontended cases, normal perf should be the same as the
baseline (faster than bperf).  But for contended cases, bperf is
faster.

Thanks,
Namhyung
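
Editorial note: to make the comparison in the quoted measurements easier to
read, the sketch below converts each median "Total time" into a relative
overhead over the 104.873 s baseline. The numbers are copied from the table
quoted above; the names BASELINE, RUNS, overhead_pct and summarize are
illustrative assumptions, not part of perf. The table covers only the
contended setup discussed in the thread, so it says nothing about the
uncontended case.

```python
# Median "Total time" in seconds, copied from the quoted measurements.
BASELINE = 104.873  # no perf-stat running

# (mode, normal perf stat, perf stat --bpf-counters)
RUNS = [
    ("global (-a)", 107.887, 106.071),
    ("per task", 106.314, 105.965),
    ("per cpu (-C 1,3,5)", 107.063, 106.406),
]


def overhead_pct(total, baseline=BASELINE):
    """Slowdown of the benchmark relative to the baseline, in percent."""
    return (total - baseline) / baseline * 100.0


def summarize(runs=RUNS):
    """Return (mode, normal overhead %, --bpf-counters overhead %) per row."""
    return [(mode, overhead_pct(normal), overhead_pct(bpf))
            for mode, normal, bpf in runs]


if __name__ == "__main__":
    for mode, normal, bpf in summarize():
        print(f"{mode:<20} normal {normal:5.2f}%   --bpf-counters {bpf:5.2f}%")
```

With these inputs the overhead of normal perf stat ranges from about 1.4% to
2.9%, and that of --bpf-counters from about 1.0% to 1.5%. Each figure is a
single median of 5 runs, so the differences should be read as indicative
only.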