[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <462b0262-305b-4ed2-9c1c-2afeaec8646e@linaro.org>
Date: Fri, 20 Dec 2024 10:28:50 +0000
From: James Clark <james.clark@...aro.org>
To: Namhyung Kim <namhyung@...nel.org>, Ian Rogers <irogers@...gle.com>
Cc: Peter Zijlstra <peterz@...radead.org>, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Adrian Hunter <adrian.hunter@...el.com>,
Kan Liang <kan.liang@...ux.intel.com>, Sun Haiyong <sunhaiyong@...ngson.cn>,
Yanteng Si <siyanteng@...ngson.cn>, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3 1/5] perf cpumap: If the die_id is missing use
socket/physical_package_id
On 19/12/2024 9:53 pm, Namhyung Kim wrote:
> On Thu, Dec 19, 2024 at 09:20:39AM -0800, Ian Rogers wrote:
>> On Wed, Dec 18, 2024 at 9:32 PM Namhyung Kim <namhyung@...nel.org> wrote:
>>>
>>> On Wed, Dec 18, 2024 at 09:42:27AM -0800, Ian Rogers wrote:
>>>> On Wed, Dec 18, 2024 at 4:04 AM James Clark <james.clark@...aro.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 16/12/2024 11:24 pm, Ian Rogers wrote:
>>>>>> An error value for a missing die_id may be written into things like
>>>>>> the cpu_topology_map. As the topology needs to be fully written out,
>>>>>> including the die_id, to allow perf.data file features to be aligned
>>>>>> we can't allow error values to be written out. Instead base the
>>>>>> missing die_id value off of the socket/physical_package_id assuming
>>>>>> they correlate 1:1.
>>>>>>
>>>>>> Signed-off-by: Ian Rogers <irogers@...gle.com>
>>>>>> ---
>>>>>> tools/perf/util/cpumap.c | 3 ++-
>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
>>>>>> index 27094211edd8..d362272f8466 100644
>>>>>> --- a/tools/perf/util/cpumap.c
>>>>>> +++ b/tools/perf/util/cpumap.c
>>>>>> @@ -283,7 +283,8 @@ int cpu__get_die_id(struct perf_cpu cpu)
>>>>>> {
>>>>>> int value, ret = cpu__get_topology_int(cpu.cpu, "die_id", &value);
>>>>>>
>>>>>> - return ret ?: value;
>>>>>> + /* If die_id is missing fallback on using the socket/physical_package_id. */
>>>>>> + return ret || value < 0 ? cpu__get_socket_id(cpu) : value;
>>>>>> }
>>>>>>
>>>>>> struct aggr_cpu_id aggr_cpu_id__die(struct perf_cpu cpu, void *data)
>>>>>
>>>>> Hi Ian,
>>>>>
>>>>> I sent a fix for the same or a similar problem here [1]. For this one
>>>>> I'm not sure why we'd want to use the socket ID for die when it's always
>>>>> been 0 for not present. I wonder if this change is mingling two things:
>>>>> fixing the negative error value appearing and replacing die with socket ID.
>>>>>
>>>>> Personally I would prefer to keep the 0 to fix the error value, that way
>>>>> nobody gets surprised by the change.
>>>>>
>>>>> Also it looks like cpu__get_cluster_id() can suffer from the same issue,
>>>>> and if we do it this way we should drop these as they aren't valid anymore:
>>>>>
>>>>> /* There is no die_id on legacy system. */
>>>>> if (die < 0)
>>>>> die = 0;
>>>>
>>>> I think this breaks the assumption here:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/tests/expr.c?h=perf-tools-next#n244
>>>>>> Hmm.. I'm not sure how it worked before. The code is already
there and
>>> it just changed the condition from == -1 to < 0, right?
>>
>> You'd need to be testing on a multi-socket machine to see the issue.
>> If you had say a dual socket Ampere chip and the die_id was missing,
>> does it make sense for there to be two sockets/packages but only 1
>> die? I think it is best we assume 1 die per socket when the die_id is
>> missing, and to some crippled extent (because of the s390 workaround)
>> the expr test is doing the sanity check.
>
> AFAICS die ID is always used with socket ID already so I guess it means
> an ID inside a socket.
>
> Thanks,
> Namhyung
>
Don't we then need to update has_die_topology() to always return true if
we're using socket ID as a placeholder? And the #num_dies expression
should potentially become non-zero for consistency?
Seems like there are two options:
- Fully support "no die reported"
* -errno is saved into the topology header, but it's still converted
to 0 for Perf stat aggregation and display
* #num_dies == 0 as it currently is
* Anywhere else die is used it's checked for error values first
- Replace die with package
* #num_dies == #num_packages
* has_die_topology() == true
* delete code to display die as 0 in perf stat
With this patch we get somewhere between the two.
Thanks
James
Powered by blists - more mailing lists