Message-ID: <3E65B60E-B120-4E1A-BAF2-2FAEF136A4CD@fb.com>
Date: Fri, 19 Mar 2021 00:22:07 +0000
From: Song Liu <songliubraving@...com>
To: Arnaldo <arnaldo.melo@...il.com>
CC: Jiri Olsa <jolsa@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
linux-kernel <linux-kernel@...r.kernel.org>,
Kernel Team <Kernel-team@...com>,
"Arnaldo Carvalho de Melo" <acme@...hat.com>,
Jiri Olsa <jolsa@...nel.org>
Subject: Re: [PATCH v2 0/3] perf-stat: share hardware PMCs with BPF
> On Mar 18, 2021, at 5:09 PM, Arnaldo <arnaldo.melo@...il.com> wrote:
>
>
>
> On March 18, 2021 6:14:34 PM GMT-03:00, Jiri Olsa <jolsa@...hat.com> wrote:
>> On Thu, Mar 18, 2021 at 03:52:51AM +0000, Song Liu wrote:
>>>
>>>
>>>> On Mar 17, 2021, at 6:11 AM, Arnaldo Carvalho de Melo <acme@...nel.org> wrote:
>>>>
>>>> Em Wed, Mar 17, 2021 at 02:29:28PM +0900, Namhyung Kim escreveu:
>>>>> Hi Song,
>>>>>
>>>>> On Wed, Mar 17, 2021 at 6:18 AM Song Liu <songliubraving@...com> wrote:
>>>>>>
>>>>>> perf uses performance monitoring counters (PMCs) to monitor system
>>>>>> performance. The PMCs are limited hardware resources. For example,
>>>>>> Intel CPUs have 3x fixed PMCs and 4x programmable PMCs per cpu.
>>>>>>
>>>>>> Modern data center systems use these PMCs in many different ways:
>>>>>> system level monitoring, (maybe nested) container level monitoring,
>>>>>> per process monitoring, profiling (in sample mode), etc. In some
>>>>>> cases, there are more active perf_events than available hardware
>>>>>> PMCs. To allow all perf_events to have a chance to run, it is
>>>>>> necessary to do expensive time multiplexing of events.
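
Multiplexing is easy to see from the command line; a minimal sketch,
assuming an Intel box where eight "cycles" events cannot all fit in
hardware at once:

  # more hardware events than PMCs forces time multiplexing; perf stat
  # prints each count with the fraction of time it was actually enabled
  perf stat -e cycles,cycles,cycles,cycles,cycles,cycles,cycles,cycles -a -- sleep 1
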
>>>>>>
>>>>>> On the other hand, many monitoring tools count the common metrics
>>>>>> (cycles, instructions). It is a waste to have multiple tools create
>>>>>> multiple perf_events of "cycles" and occupy multiple PMCs.
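
The sharing this series enables can be sketched as two concurrent sessions
counting the same metric; with --bpf-counters (assuming a perf binary built
with BPF skeleton support), both are served by one shared perf_event rather
than occupying two PMCs:

  # two counting sessions, one underlying "cycles" perf_event
  perf stat --bpf-counters -e cycles -a -- sleep 10 &
  perf stat --bpf-counters -e cycles -a -- sleep 10
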
>>>>>
>>>>> Right, it'd be really helpful when the PMCs are frequently or mostly
>>>>> shared. But it'd also increase the overhead for uncontended cases as
>>>>> BPF programs need to run on every context switch. Depending on the
>>>>> workload, it may cause a non-negligible performance impact. So users
>>>>> should be aware of it.
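
A switch-heavy microbenchmark should expose that per-context-switch cost
most clearly; a rough sketch (the exact invocation is an assumption, not
something measured in this thread):

  perf bench sched pipe                                          # baseline
  perf stat --bpf-counters -e cycles -a -- perf bench sched pipe # with BPF counters

Comparing the two usecs/op figures gives a rough bound on the overhead the
BPF programs add per switch.
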
>>>>
>>>> Would be interesting to, humm, measure both cases to have a firm
>>>> number of the impact, how many instructions are added when sharing
>>>> using --bpf-counters?
>>>>
>>>> I.e. compare the "expensive time multiplexing of events" with its
>>>> avoidance by using --bpf-counters.
>>>>
>>>> Song, have you performed such measurements?
>>>
>>> I have got some measurements with perf-bench-sched-messaging:
>>>
>>> The system: x86_64 with 23 cores (46 HT)
>>>
>>> The perf-stat command:
>>> perf stat -e cycles,cycles,instructions,instructions,ref-cycles,ref-cycles <target, etc.>
>>>
>>> The benchmark command and output:
>>> ./perf bench sched messaging -g 40 -l 50000 -t
>>> # Running 'sched/messaging' benchmark:
>>> # 20 sender and receiver threads per group
>>> # 40 groups == 1600 threads run
>>> Total time: 10X.XXX [sec]
>>>
>>>
>>> I use "Total time" as the measurement, so a smaller number is better.
>>>
>>> For each condition, I ran the command 5 times and took the median of
>>> "Total time".
>>>
>>> Baseline (no perf-stat)              104.873 [sec]
>>> # global
>>> perf stat -a                         107.887 [sec]
>>> perf stat -a --bpf-counters          106.071 [sec]
>>> # per task
>>> perf stat                            106.314 [sec]
>>> perf stat --bpf-counters             105.965 [sec]
>>> # per cpu
>>> perf stat -C 1,3,5                   107.063 [sec]
>>> perf stat -C 1,3,5 --bpf-counters    106.406 [sec]
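
For anyone reproducing this, the 5-run median can be scripted along these
lines (the grep pattern and field numbers are assumptions based on the
"Total time: ... [sec]" output quoted above):

  for i in 1 2 3 4 5; do
      ./perf bench sched messaging -g 40 -l 50000 -t | grep 'Total time'
  done | sort -n -k 3 | sed -n 3p   # middle of 5 sorted runs = the median
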
>>
>> I can't see why it's actually faster than normal perf ;-)
>> would be worth finding out
>
> Isn't this all about contended cases?
Yeah, normal perf does time multiplexing of the events, while
--bpf-counters doesn't need it.
Thanks,
Song