Message-ID: <d8382fd2-1c0c-5041-3c0e-cd15244ef9f6@huawei.com>
Date: Wed, 7 Feb 2024 15:25:44 +0800
From: Yicong Yang <yangyicong@...wei.com>
To: Ian Rogers <irogers@...gle.com>
CC: <yangyicong@...ilicon.com>, <acme@...nel.org>, <mark.rutland@....com>,
<peterz@...radead.org>, <mingo@...hat.com>, <james.clark@....com>,
<alexander.shishkin@...ux.intel.com>, <jolsa@...nel.org>,
<namhyung@...nel.org>, <adrian.hunter@...el.com>,
<linux-perf-users@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<Jonathan.Cameron@...wei.com>, <zhanjie9@...ilicon.com>, <21cnbao@...il.com>,
<tim.c.chen@...el.com>, <prime.zeng@...ilicon.com>, <fanghao11@...wei.com>,
<linuxarm@...wei.com>, <linux-arm-kernel@...ts.infradead.org>, Tim Chen
<tim.c.chen@...ux.intel.com>
Subject: Re: [RESEND v4] perf stat: Support per-cluster aggregation
On 2024/2/7 1:49, Ian Rogers wrote:
> On Tue, Feb 6, 2024 at 12:24 AM Yicong Yang <yangyicong@...wei.com> wrote:
>>
>> From: Yicong Yang <yangyicong@...ilicon.com>
>>
>> Some platforms have a 'cluster' topology level, and the CPUs in a
>> cluster share resources such as the L3 cache tag (L3T, on HiSilicon
>> Kunpeng SoCs) or the L2 cache (on Intel Jacobsville). Parsing and
>> building the cluster topology have been supported since [1].
>>
>> perf stat already supports aggregation at other topology levels such
>> as die or socket. Aggregating per-cluster is useful for finding
>> problems like L3T bandwidth contention.
>>
>> This patch adds support for a "--per-cluster" option for per-cluster
>> aggregation, and updates the docs and the related tests. In the output
>> below, the prefix encodes the socket, die and cluster ID (e.g.
>> S56-D0-CLS158) and the second column is the number of CPUs aggregated:
>>
>> [root@...alhost tmp]# perf stat -a -e LLC-load --per-cluster -- sleep 5
>>
>> Performance counter stats for 'system wide':
>>
>> S56-D0-CLS158             4      1,321,521,570      LLC-load
>> S56-D0-CLS594             4        794,211,453      LLC-load
>> S56-D0-CLS1030            4             41,623      LLC-load
>> S56-D0-CLS1466            4             41,646      LLC-load
>> S56-D0-CLS1902            4             16,863      LLC-load
>> S56-D0-CLS2338            4             15,721      LLC-load
>> S56-D0-CLS2774            4             22,671      LLC-load
>> [...]
>>
>> On a legacy system without clusters or without cluster support, the
>> output looks like:
>> [root@...alhost perf]# perf stat -a -e cycles --per-cluster -- sleep 1
>>
>> Performance counter stats for 'system wide':
>>
>> S56-D0-CLS0              64         18,011,485      cycles
>> S7182-D0-CLS0            64         16,548,835      cycles
>>
>> Note that this patch doesn't mix the cluster information into the
>> outputs of --per-core, to avoid breaking any tools/scripts using it.
>>
>> Note that perf recently gained "--per-cache" aggregation, but that is
>> not the same as per-cluster aggregation, although the CPUs of a cluster
>> may share some cache resources. For example, on my machine all the
>> clusters within a die share the same L3 cache:
>> $ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
>> 0-31
>> $ cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list
>> 0-3
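
As an aside (not part of the patch): the same comparison can be made from
a tiny standalone C program. A minimal sketch with no perf dependencies,
reading the two sysfs paths quoted above:

    #include <stdio.h>

    /*
     * Print one line of a sysfs file, or a marker if the file is
     * absent (e.g. on kernels without cluster support).
     */
    static void print_sysfs_line(const char *path)
    {
            char buf[256];
            FILE *f = fopen(path, "r");

            if (!f) {
                    printf("%s: <unavailable>\n", path);
                    return;
            }
            if (fgets(buf, sizeof(buf), f))
                    printf("%s: %s", path, buf);
            fclose(f);
    }

    int main(void)
    {
            print_sysfs_line("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list");
            print_sysfs_line("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list");
            return 0;
    }

When the first list is a strict superset of the second (as in the 0-31
vs. 0-3 example above), --per-cache and --per-cluster aggregate over
different CPU sets.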
>>
>> [1] commit c5e22feffdd7 ("topology: Represent clusters of CPUs within a die")
>> Tested-by: Jie Zhan <zhanjie9@...ilicon.com>
>> Reviewed-by: Tim Chen <tim.c.chen@...ux.intel.com>
>> Signed-off-by: Yicong Yang <yangyicong@...ilicon.com>
>
> Reviewed-by: Ian Rogers <irogers@...gle.com>
>
Thanks!
>> ---
>> Change since v3:
>> - Rebase on v6.7-rc4 and resolve the conflicts
>> Link: https://lore.kernel.org/all/20230404104951.27537-1-yangyicong@huawei.com/
>>
>> Change since v2:
>> - Use 0 as the cluster ID on legacy systems without cluster support,
>>   keeping consistent with what --per-die does.
>> Link: https://lore.kernel.org/all/20230328112717.19573-1-yangyicong@huawei.com/
>>
>> Change since v1:
>> - Provide information in the manual on how to map a cluster to its CPUs
>> - Thanks to Tim for the review and to Jie for the testing.
>> Link: https://lore.kernel.org/all/20230313085911.61359-1-yangyicong@huawei.com/
>>
>> tools/perf/Documentation/perf-stat.txt | 11 ++++
>> tools/perf/builtin-stat.c | 52 +++++++++++++++++--
>> .../tests/shell/lib/perf_json_output_lint.py | 4 +-
>> tools/perf/tests/shell/lib/stat_output.sh | 12 +++++
>> tools/perf/tests/shell/stat+csv_output.sh | 2 +
>> tools/perf/tests/shell/stat+json_output.sh | 13 +++++
>> tools/perf/tests/shell/stat+std_output.sh | 2 +
>> tools/perf/util/cpumap.c | 32 +++++++++++-
>> tools/perf/util/cpumap.h | 19 +++++--
>> tools/perf/util/env.h | 1 +
>> tools/perf/util/stat-display.c | 13 +++++
>> tools/perf/util/stat.h | 1 +
>> 12 files changed, 153 insertions(+), 9 deletions(-)
>>
[...]
>> diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
>> index 0581ee0fa5f2..5907456d42a2 100644
>> --- a/tools/perf/util/cpumap.c
>> +++ b/tools/perf/util/cpumap.c
>> @@ -222,6 +222,8 @@ static int aggr_cpu_id__cmp(const void *a_pointer, const void *b_pointer)
>>                  return a->socket - b->socket;
>>          else if (a->die != b->die)
>>                  return a->die - b->die;
>> +        else if (a->cluster != b->cluster)
>> +                return a->cluster - b->cluster;
>>          else if (a->cache_lvl != b->cache_lvl)
>>                  return a->cache_lvl - b->cache_lvl;
>>          else if (a->cache != b->cache)
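
A side note on this hunk: aggregation IDs now sort by socket, then die,
then cluster, then cache level and cache instance. For illustration, a
standalone sketch of the same ordering; the struct below is a trimmed
stand-in for perf's struct aggr_cpu_id, not the perf source itself:

    #include <stdio.h>
    #include <stdlib.h>

    /* Trimmed stand-in for perf's struct aggr_cpu_id. */
    struct id {
            int socket, die, cluster, cache_lvl, cache;
    };

    /* Same ordering as the hunk above: socket, die, cluster, cache. */
    static int id_cmp(const void *pa, const void *pb)
    {
            const struct id *a = pa, *b = pb;

            if (a->socket != b->socket)
                    return a->socket - b->socket;
            if (a->die != b->die)
                    return a->die - b->die;
            if (a->cluster != b->cluster)
                    return a->cluster - b->cluster;
            if (a->cache_lvl != b->cache_lvl)
                    return a->cache_lvl - b->cache_lvl;
            return a->cache - b->cache;
    }

    int main(void)
    {
            struct id ids[] = {
                    { 56, 0, 594, -1, -1 },  /* S56-D0-CLS594 */
                    { 56, 0, 158, -1, -1 },  /* S56-D0-CLS158 */
            };

            qsort(ids, 2, sizeof(ids[0]), id_cmp);
            printf("first after sort: CLS%d\n", ids[0].cluster);  /* CLS158 */
            return 0;
    }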
>> @@ -309,6 +311,29 @@ struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data)
>>          return id;
>>  }
>>
>> +int cpu__get_cluster_id(struct perf_cpu cpu)
>> +{
>> +        int value, ret = cpu__get_topology_int(cpu.cpu, "cluster_id", &value);
>
> nit: normal coding style is for a whitespace newline here.
>
Will fix. I may have referred to cpu__get_core_id() below, which is an exception.
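I.e., the fixed version would have a blank line after the declaration, as
cpu__get_die_id() does. A sketch (assuming the return follows the same
"ret ?: value" pattern used by the neighbouring helpers):

    int cpu__get_cluster_id(struct perf_cpu cpu)
    {
            int value, ret = cpu__get_topology_int(cpu.cpu, "cluster_id", &value);

            return ret ?: value;
    }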
Thanks,
Yicong