Message-ID: <aCviJJpEYKt8wYEk@google.com>
Date: Mon, 19 May 2025 19:00:04 -0700
From: Namhyung Kim <namhyung@...nel.org>
To: Ian Rogers <irogers@...gle.com>
Cc: "Liang, Kan" <kan.liang@...ux.intel.com>,
Weilin Wang <weilin.wang@...el.com>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>,
Adrian Hunter <adrian.hunter@...el.com>,
Ravi Bangoria <ravi.bangoria@....com>,
linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v3] perf pmu intel: Adjust cpumasks for sub-NUMA clusters
on graniterapids
On Sun, May 18, 2025 at 10:45:52AM -0700, Ian Rogers wrote:
> On Sun, May 18, 2025 at 10:09 AM Namhyung Kim <namhyung@...nel.org> wrote:
> >
> > Hello,
> >
> > On Thu, May 15, 2025 at 03:35:41PM -0700, Ian Rogers wrote:
> > > On Thu, May 15, 2025 at 2:01 PM Liang, Kan <kan.liang@...ux.intel.com> wrote:
> > > >
> > > > On 2025-05-15 2:14 p.m., Ian Rogers wrote:
> > > > > On graniterapids the cache home agent (CHA) and memory controller
> > > > > (IMC) PMUs all have their cpumask set to per-socket information. In
> > > > > order for per-NUMA-node aggregation to work correctly, the PMUs' cpumasks
> > > > > need to be set to the CPUs for the relevant sub-NUMA grouping.
> > > > >
> > > > > For example, on a 2-socket graniterapids machine with sub-NUMA
> > > > > clustering of 3, the uncore_cha and uncore_imc PMUs have a cpumask of
> > > > > "0,120", leading to aggregation only on NUMA nodes 0 and 3:
> > > > > ```
> > > > > $ perf stat --per-node -e 'UNC_CHA_CLOCKTICKS,UNC_M_CLOCKTICKS' -a sleep 1
> > > > >
> > > > > Performance counter stats for 'system wide':
> > > > >
> > > > > N0 1 277,835,681,344 UNC_CHA_CLOCKTICKS
> > > > > N0 1 19,242,894,228 UNC_M_CLOCKTICKS
> > > > > N3 1 277,803,448,124 UNC_CHA_CLOCKTICKS
> > > > > N3 1 19,240,741,498 UNC_M_CLOCKTICKS
> > > > >
> > > > > 1.002113847 seconds time elapsed
> > > > > ```
> > > > >
> > > > > By updating the PMUs' cpumasks to "0,120", "40,160" and "80,200", the
> > > > > correct 6 NUMA node aggregations are achieved:
> > > > > ```
> > > > > $ perf stat --per-node -e 'UNC_CHA_CLOCKTICKS,UNC_M_CLOCKTICKS' -a sleep 1
> > > > >
> > > > > Performance counter stats for 'system wide':
> > > > >
> > > > > N0 1 92,748,667,796 UNC_CHA_CLOCKTICKS
> > > > > N0 0 6,424,021,142 UNC_M_CLOCKTICKS
> > > > > N1 0 92,753,504,424 UNC_CHA_CLOCKTICKS
> > > > > N1 1 6,424,308,338 UNC_M_CLOCKTICKS
> > > > > N2 0 92,751,170,084 UNC_CHA_CLOCKTICKS
> > > > > N2 0 6,424,227,402 UNC_M_CLOCKTICKS
> > > > > N3 1 92,745,944,144 UNC_CHA_CLOCKTICKS
> > > > > N3 0 6,423,752,086 UNC_M_CLOCKTICKS
> > > > > N4 0 92,725,793,788 UNC_CHA_CLOCKTICKS
> > > > > N4 1 6,422,393,266 UNC_M_CLOCKTICKS
> > > > > N5 0 92,717,504,388 UNC_CHA_CLOCKTICKS
> > > > > N5 0 6,421,842,618 UNC_M_CLOCKTICKS
> > > >
> > > > Is the second column the number of units?
> > > > If so, it's wrong.
> > > >
> > > > On my GNR with SNC-2, I observed a similar issue.
> > > >
> > > > $ sudo ./perf stat -e 'UNC_M_CLOCKTICKS' --per-node -a sleep 1
> > > > Performance counter stats for 'system wide':
> > > >
> > > > N0 0 6,405,811,284 UNC_M_CLOCKTICKS
> > > > N1 1 6,405,895,988 UNC_M_CLOCKTICKS
> > > > N2 0 6,152,906,692 UNC_M_CLOCKTICKS
> > > > N3 1 6,063,415,630 UNC_M_CLOCKTICKS
> > > >
> > > > Isn't it supposed to be 4?
> > >
> > > Agreed it is weird, but it is what has historically been displayed.
> > > The number is the aggregation number:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat-display.c?h=perf-tools-next#n307
> > > which comes from:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat-display.c?h=perf-tools-next#n135
> > > which comes from:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat.c?h=perf-tools-next#n435
> > > However, I think it is missing updates from:
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat.c?h=perf-tools-next#n526
> > > but there is a comment there saying "don't increase aggr.nr for
> > > aliases" and all the uncore events are aliases. I don't understand
> > > what the aggregation number is supposed to be; it is commented as
> > > "number of entries (CPUs) aggregated":
> > > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/stat.h?h=perf-tools-next#n26
> > > It would seem to make sense, in the CHA case with SNC3 and 42 evsels
> > > per NUMA node, that the value should be 42. Maybe Namhyung (who did
> > > the evsel__merge_aggr_counters cleanup) knows why it is this way, but
> > > to my eyes it just seems like something that has been broken for a
> > > long time.
> >
> > I think it's the number of aggregated entries (FDs?).
> >
> > For core events, it's the number of CPUs for the given aggregation as it
> > collects from all CPUs. But it'd be confusing with uncore events, which
> > have a cpumask to collect data from only a few CPUs.
> >
> > On my laptop, --per-socket gives different numbers depending on the
> > events/PMUs.
> >
> > For core events, it's 4.
> >
> > $ sudo ./perf stat -a --per-socket -e cycles sleep 1
> >
> > Performance counter stats for 'system wide':
> >
> > S0 4 225,297,257 cycles
> >
> > 1.002581995 seconds time elapsed
> >
> > While uncore events give 1.
> >
> > $ sudo ./perf stat -a --per-socket -e unc_mc0_rdcas_count_freerun sleep 1
> >
> > Performance counter stats for 'system wide':
> >
> > S0 1 23,665,510 unc_mc0_rdcas_count_freerun
> >
> > 1.002148012 seconds time elapsed
>
> I think we're agreeing. I wonder whether the intent of the aggregation
> number is to make it so that you can work out an average from the
> aggregated count. So for core PMUs you divide the count by the
> aggregation number and get the average count per core (CPU?). If we're
> getting an aggregated count of say uncore memory controller events
> then it would make sense to me that we show the aggregated total and
> the aggregation count is the number of memory controller PMUs, so we
> can have an average per memory controller. This should line up with
> using the number of file descriptors.
Sounds right.
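
To make that semantics concrete, here is a tiny toy program (not perf
code; the struct and names are made up, and the uncore number is
invented -- only the "cycles" value is reused from my --per-socket
example above).  Once nr really is the number of fds folded into a
line, sum / nr gives a per-CPU average for core events and a
per-PMU-instance average for uncore events:

```
/*
 * Toy illustration only -- not perf code.  The "cycles" value is the
 * one from the --per-socket example above; the uncore value is made up.
 */
#include <stdio.h>

struct aggr_line {
	const char *what;
	unsigned long long sum;	/* aggregated counter value          */
	int nr;			/* fds folded into this one line     */
};

int main(void)
{
	struct aggr_line lines[] = {
		/* core event: nr = number of CPUs in the socket       */
		{ "cycles, S0, 4 CPUs",          225297257ULL, 4 },
		/* uncore: nr = number of IMC PMU instances merged     */
		{ "data_read, S0, 2 IMC PMUs",  2673540000ULL, 2 },
	};

	for (size_t i = 0; i < sizeof(lines) / sizeof(lines[0]); i++)
		printf("%-28s sum=%12llu nr=%d avg=%llu\n",
		       lines[i].what, lines[i].sum, lines[i].nr,
		       lines[i].sum / lines[i].nr);
	return 0;
}
```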
>
> I think this isn't the current behavior. On perf v6.12:
> ```
> $ sudo perf stat --per-socket -e data_read -a sleep 1
>
> Performance counter stats for 'system wide':
>
> S0 1 2,484.96 MiB data_read
>
> 1.001365319 seconds time elapsed
>
> $ sudo perf stat -A -e data_read -a sleep 1
>
> Performance counter stats for 'system wide':
>
> CPU0 1,336.48 MiB data_read [uncore_imc_free_running_0]
> CPU0 1,337.06 MiB data_read [uncore_imc_free_running_1]
>
> 1.001049096 seconds time elapsed
> ```
> so the aggregation number shows 1 but 2 events were aggregated together.
Ugh.. right. Merging uncore PMU instances can add more confusion. :(
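
If we ever fix that, the counting probably has to happen wherever the
per-fd values get folded together.  A very rough sketch (the toy_*
names are made up; this is not the real stat code): bump nr once per
fd folded into the aggregate, alias or not, so the two merged
uncore_imc_free_running_* instances above would report nr == 2:

```
/* Rough sketch only -- hypothetical names, not the real stat code. */
#include <stdio.h>

struct toy_counts {
	unsigned long long val;
	int nr;			/* fds folded into this aggregate */
};

/* Fold one more fd's value in and count it, alias or not. */
static void toy_accumulate(struct toy_counts *aggr, unsigned long long v)
{
	aggr->val += v;
	aggr->nr++;
}

int main(void)
{
	struct toy_counts s0 = { 0, 0 };

	/* two merged uncore_imc_free_running_* instances (made-up values) */
	toy_accumulate(&s0, 1336480000ULL);
	toy_accumulate(&s0, 1337060000ULL);

	printf("S0 nr=%d val=%llu\n", s0.nr, s0.val);	/* nr == 2 now */
	return 0;
}
```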
>
> I think computing the aggregation number in the stat code is probably
> wrong. The value should be constant for an evsel and aggr_cpu_id; it's
> just that computing it for an aggr_cpu_id is a pain due to needing topology
> and/or PMU information. The code is ripe for refactoring. I'd prefer
> not to do it as part of this change though which is altering a
> particular Intel Granite Rapids issue.
That's ok. Just one more TODO item...
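
For the record, roughly what I'd imagine (a very rough sketch with
made-up helpers -- toy_cpu_to_id stands in for the real topology /
aggr_cpu_id lookup, nothing here is actual perf code): walk the
evsel's CPUs once, map each CPU to its aggregate id and keep the
per-id count, so the display code never has to recompute it:

```
/*
 * Very rough sketch, not the perf tree: given the CPUs an evsel is
 * opened on and a cpu -> aggregate-id mapping (socket, NUMA node, ...),
 * the aggregation number per id is just a count over that map, so it
 * can be computed once per evsel instead of during counter merging.
 */
#include <stdio.h>

#define MAX_IDS 8

/* stand-in for the aggr_cpu_id lookup: pretend 2 CPUs per "node" */
static int toy_cpu_to_id(int cpu)
{
	return cpu / 2;
}

/* count how many of the evsel's CPUs land in each aggregate id */
static void toy_count_per_id(const int *cpus, int ncpus, int *nr)
{
	for (int i = 0; i < ncpus; i++)
		nr[toy_cpu_to_id(cpus[i])]++;
}

int main(void)
{
	int cpus[] = { 0, 1, 2, 3, 4, 5 };	/* evsel's cpu map */
	int nr[MAX_IDS] = { 0 };

	toy_count_per_id(cpus, 6, nr);

	for (int id = 0; id < MAX_IDS; id++)
		if (nr[id])
			printf("id %d: nr=%d\n", id, nr[id]);
	return 0;
}
```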
Thanks,
Namhyung