Message-ID: <CAM9d7ch_axD_4E0W7MEx8ueeq9QsvhxNWaJ0J3AtVgeKqKQmbA@mail.gmail.com>
Date: Fri, 19 Mar 2021 09:54:59 +0900
From: Namhyung Kim <namhyung@...nel.org>
To: Song Liu <songliubraving@...com>
Cc: Arnaldo <arnaldo.melo@...il.com>, Jiri Olsa <jolsa@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
linux-kernel <linux-kernel@...r.kernel.org>,
Kernel Team <Kernel-team@...com>,
Arnaldo Carvalho de Melo <acme@...hat.com>,
Jiri Olsa <jolsa@...nel.org>
Subject: Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
On Fri, Mar 19, 2021 at 9:22 AM Song Liu <songliubraving@...com> wrote:
>
>
>
> > On Mar 18, 2021, at 5:09 PM, Arnaldo <arnaldo.melo@...il.com> wrote:
> >
> >
> >
> > On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa <jolsa@...hat.com> wrote:
> >> On Thu, Mar 18, 2021 at 03:52:51AM +0000, Song Liu wrote:
> >>>
> >>>
> >>>> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo <acme@...nel.org> wrote:
> >>>>
> >>>> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
> >>>>> Hi Song,
> >>>>>
> >>>>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu <songliubraving@...com> wrote:
> >>>>>>
> >>>>>> perf uses performance monitoring counters (PMCs) to monitor system
> >>>>>> performance. The PMCs are limited hardware resources. For example,
> >>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
> >>>>>>
> >>>>>> Modern data center systems use these PMCs in many different ways:
> >>>>>> system level monitoring, (maybe nested) container level monitoring, per
> >>>>>> process monitoring, profiling (in sample mode), etc. In some cases,
> >>>>>> there are more active perf_events than available hardware PMCs. To allow
> >>>>>> all perf_events to have a chance to run, it is necessary to do expensive
> >>>>>> time multiplexing of events.
> >>>>>>
> >>>>>> On the other hand, many monitoring tools count the common metrics (cycles,
> >>>>>> instructions). It is a waste to have multiple tools create multiple
> >>>>>> perf_events of "cycles" and occupy multiple PMCs.
> >>>>>
> >>>>> Right, it'd be really helpful when the PMCs are frequently or mostly shared.
> >>>>> But it'd also increase the overhead for uncontended cases as BPF programs
> >>>>> need to run on every context switch. Depending on the workload, it may
> >>>>> cause a non-negligible performance impact. So users should be aware of it.
> >>>>
> >>>> Would be interesting to, humm, measure both cases to have a firm number
> >>>> of the impact, how many instructions are added when sharing using
> >>>> --bpf-counters?
> >>>>
> >>>> I.e. compare the "expensive time multiplexing of events" with its
> >>>> avoidance by using --bpf-counters.
> >>>>
> >>>> Song, have you performed such measurements?
> >>>
> >>> I have got some measurements with perf-bench-sched-messaging:
> >>>
> >>> The system: x86_64 with 23 cores (46 HT)
> >>>
> >>> The perf-stat command:
> >>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <target, etc.>
> >>>
> >>> The benchmark command and output:
> >>> ./perf bench sched messaging -g 40 -l 50000 -t
> >>> # Running 'sched/messaging' benchmark:
> >>> # 20 sender and receiver threads per group
> >>> # 40 groups == 1600 threads run
> >>> Total time: 10X.XXX [sec]
> >>>
> >>>
> >>> I use the "Total time" as the measurement, so a smaller number is better.
> >>>
> >>> For each condition, I ran the command 5 times and took the median of
> >>> "Total time".
> >>>
> >>> Baseline (no perf-stat)             104.873 [sec]
> >>> # global
> >>> perf stat -a                        107.887 [sec]
> >>> perf stat -a --bpf-counters         106.071 [sec]
> >>> # per task
> >>> perf stat                           106.314 [sec]
> >>> perf stat --bpf-counters            105.965 [sec]
> >>> # per cpu
> >>> perf stat -C 1,3,5                  107.063 [sec]
> >>> perf stat -C 1,3,5 --bpf-counters   106.406 [sec]
> >>
> >> I can't see why it's actually faster than normal perf ;-)
> >> It would be worth finding out.
> >
> > Isn't this all about contended cases?
>
> Yeah, the normal perf is doing time multiplexing; while --bpf-counters
> doesn't need it.

Yep, so for uncontended cases, normal perf should be the same as the
baseline (and faster than bperf). But for contended cases, bperf is
faster.

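For reference, the relative overhead implied by the medians Song quoted
can be worked out with a short Python sketch. The numbers are copied from
the quoted table; the helper name overhead_pct is only illustrative, and
a median of 5 runs per condition is too few to treat small differences as
firm.

```python
# Median "Total time" values in seconds, copied from the quoted measurements.
BASELINE = 104.873  # no perf-stat running

# (normal perf, --bpf-counters) pairs for each monitoring mode.
RUNS = {
    "global (-a)": (107.887, 106.071),
    "per task": (106.314, 105.965),
    "per cpu (-C 1,3,5)": (107.063, 106.406),
}


def overhead_pct(total, baseline=BASELINE):
    """Slowdown relative to the baseline run, in percent."""
    return (total - baseline) / baseline * 100.0


for mode, (normal, bpf) in RUNS.items():
    print(f"{mode:20s} normal +{overhead_pct(normal):.2f}%  "
          f"bpf-counters +{overhead_pct(bpf):.2f}%")
```

In all three modes the --bpf-counters run sits closer to the baseline
than the normal run, which matches the contended case described above.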
Thanks,
Namhyung