Date:   Wed, 5 Apr 2023 22:39:02 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     <linux-perf-users@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <acme@...nel.org>, <peterz@...radead.org>, <mingo@...hat.com>,
        <mark.rutland@....com>, <alexander.shishkin@...ux.intel.com>,
        <jolsa@...nel.org>, <namhyung@...nel.org>
CC:     <ravi.bangoria@....com>, <sandipan.das@....com>,
        <ananth.narayan@....com>, <gautham.shenoy@....com>,
        <eranian@...gle.com>, <puwen@...on.cn>
Subject: [RFC PATCH v2 0/4] perf stat: Add option to aggregate data based on the cache topology

The motivation behind this feature is to aggregate data at the LLC level
for chiplet-based processors, which currently do not expose chiplet
details in the sysfs CPU topology information.

For completeness, the series adds the ability to aggregate data at any
cache level. The following is example output on a dual-socket Zen3
system (2 x 64C/128T) with 8 chiplets per socket.

  $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

   Performance counter stats for 'system wide':

  S0-D0-L3-ID0             16              4,463      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID1             16              2,962      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID2             16              2,592      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID3             16              2,508      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID4             16              1,841      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID5             16              1,764      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID6             16              1,205      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID7             16              5,806      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID8             16              1,461      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID9             16                648      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID10            16              1,443      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID11            16              1,333      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID12            16              1,167      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID13            16                640      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID14            16                601      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID15            16              3,423      ls_dmnd_fills_from_sys.ext_cache_remote

         5.017954593 seconds time elapsed

The series also adds support for perf stat record and perf stat report
to aggregate data at various cache levels. The following is an example
of recording with aggregation at the L2 level and reporting the same
data with aggregation at the L3 level.

  $ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

   Performance counter stats for 'system wide':

  S0-D0-L2-ID0              2              3,212      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID1              2                240      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID2              2                 10      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID3              2                 13      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID4              2                 13      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID5              2                319      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID6              2                348      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID7              2                648      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L2-ID8              2                284      ls_dmnd_fills_from_sys.ext_cache_remote
  ...
  S1-D1-L2-ID127            2                113      ls_dmnd_fills_from_sys.ext_cache_remote

         5.017958787 seconds time elapsed

  $ sudo perf stat report --per-cache=L3

   Performance counter stats for '/home/amd/dev/linux/tools/perf/perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5':

  S0-D0-L3-ID0             16              4,803      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID1             16              3,421      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID2             16              1,149      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID3             16              1,220      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID4             16              1,502      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID5             16              6,751      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID6             16              1,600      ls_dmnd_fills_from_sys.ext_cache_remote
  S0-D0-L3-ID7             16              1,985      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID8             16              1,566      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID9             16              1,010      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID10            16              1,337      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID11            16              2,298      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID12            16                314      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID13            16                350      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID14            16                664      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID15            16              3,834      ls_dmnd_fills_from_sys.ext_cache_remote

         5.017958787 seconds time elapsed

The sum of the L2 aggregates from S0-D0-L2-ID0 to S0-D0-L2-ID7 equals
the value for S0-D0-L3-ID0 in perf stat report with aggregation at the
L3 level, since L3-ID0 contains L2-ID0 to L2-ID7 on this machine.
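
As a quick check of the arithmetic, the eight L2 counts shown in the
record output above can be summed and compared with the L3-ID0 value
from the report output. The helper name sum_counts() and the array
l2_counts[] are illustrative only and are not part of the series.

```c
#include <stddef.h>
#include <stdint.h>

/* Counts for S0-D0-L2-ID0..ID7, copied from the perf stat record output. */
const uint64_t l2_counts[] = { 3212, 240, 10, 13, 13, 319, 348, 648 };

/*
 * Sum n per-L2 counts. This mimics folding the L2 instances that sit
 * under one L3 into a single L3 aggregate (hypothetical helper).
 */
uint64_t sum_counts(const uint64_t *counts, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += counts[i];
	return total;
}
```

Calling sum_counts(l2_counts, 8) yields 4803, which matches the 4,803
reported for S0-D0-L3-ID0.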

[New in v2]
On a kernel that does not expose the cache instance ID in sysfs, the
cache ID is set to 0. Running perf stat on the same system with the
cache instance ID hidden gives the following output:

  $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5
  
   Performance counter stats for 'system wide':
     
  S0-D0-L3-ID0            128             13,277      ls_dmnd_fills_from_sys.ext_cache_remote
  S1-D1-L3-ID0            128              9,822      ls_dmnd_fills_from_sys.ext_cache_remote
     
         5.020718145 seconds time elapsed
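
The fallback described above can be sketched roughly as follows. This is
a minimal sketch, not the code from the series: read_cache_id() is a
hypothetical helper, and the path in the comment assumes the usual sysfs
layout (/sys/devices/system/cpu/cpuN/cache/indexM/id).

```c
#include <stdio.h>

/*
 * Read a cache instance ID from a sysfs-style file, for example
 * /sys/devices/system/cpu/cpu0/cache/index3/id.
 *
 * Returns 0 when the file is missing or cannot be parsed, mirroring the
 * fallback described in the cover letter. Hypothetical helper, not the
 * function added by the series.
 */
unsigned int read_cache_id(const char *path)
{
	unsigned int id = 0;
	FILE *fp = fopen(path, "r");

	if (!fp)
		return 0;	/* ID file not exposed by this kernel */
	if (fscanf(fp, "%u", &id) != 1)
		id = 0;		/* unreadable contents: same fallback */
	fclose(fp);
	return id;
}
```

With every instance reporting ID 0, all CPUs of a socket fall into a
single group, which is consistent with the 128 CPUs per row in the
output above.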

This series makes a breaking change to how the cache details of env are
saved for recording and reporting purposes. If there is a better way to
do this, please do let me know.

The following points were not considered when designing this RFC:

- Handling multiple cache types at the same level: For example, consider
  a case where L1i is thread-local but L1d is core-wide. The logic
  currently selects the last cache instance it sees at a particular
  level when iterating over the indices. This may lead the user to
  expect a different result from the one perf reports.

- For the same example, where L1i is thread-local and L1d is core-wide,
  record and report might not give consistent results, because the
  qsort() function used to sort cache_level_data[] when saving the env
  data is not guaranteed to be stable and might not preserve the
  relative order of different caches at the same level. Since only the
  last entry seen at a given level is considered, the unstable sort
  might lead to inconsistencies.
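
One possible way to sidestep the stability concern, sketched here purely
for discussion, is to give the comparator a tie-breaker so that no two
entries at the same level compare equal. The struct and function names
below are hypothetical stand-ins and do not reflect the layout of
cache_level_data[] in perf; the type strings assume the sysfs values
"Data", "Instruction" and "Unified".

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one entry of the saved cache data. */
struct cache_entry {
	unsigned int level;	/* 1, 2, 3, ... */
	const char *type;	/* "Data", "Instruction" or "Unified" */
};

/*
 * Order by level first, then by type. Assuming at most one entry per
 * (level, type) pair, no two entries compare equal, so the sorted order
 * does not depend on whether qsort() is stable.
 */
int cache_entry_cmp(const void *a, const void *b)
{
	const struct cache_entry *ca = a;
	const struct cache_entry *cb = b;

	if (ca->level != cb->level)
		return ca->level < cb->level ? -1 : 1;
	return strcmp(ca->type, cb->type);
}

/* Sort the entries in place using the tie-breaking comparator. */
void sort_cache_entries(struct cache_entry *entries, size_t n)
{
	qsort(entries, n, sizeof(*entries), cache_entry_cmp);
}
```

With this ordering, the "last" entry at L1 would always be the
instruction cache, so record and report would at least pick the same
one. Whether that is the cache the user expects is the separate question
raised in the first point.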

I'm seeking clarification from the community on the above problems, and
on potential solutions for processors where all CPUs might not share the
same topology structure.

This series applies cleanly on top of the perf-tools branch from
Arnaldo's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools)
at:

commit e8d018dd0257 ("Linux 6.3-rc3")

--
Changelog:
o v1->v2
  - Set cache instance ID to 0 if the file cannot be read.
  - Fix cache level parsing function.
  - Updated details in cover letter.
--
K Prateek Nayak (4):
  perf: Read cache instance ID when building cache topology
  perf: Save cache instance ID when saving cache topology data
  perf: Extract building cache level for a CPU into separate function
  perf: Add option for --per-cache aggregation

 tools/lib/perf/include/perf/cpumap.h          |   5 +
 tools/lib/perf/include/perf/event.h           |   3 +-
 tools/perf/Documentation/perf-stat.txt        |  16 ++
 tools/perf/builtin-stat.c                     | 149 +++++++++++++++++-
 .../tests/shell/lib/perf_json_output_lint.py  |   4 +-
 tools/perf/tests/shell/stat+csv_output.sh     |  14 ++
 tools/perf/tests/shell/stat+json_output.sh    |  13 ++
 tools/perf/util/cpumap.c                      |  97 ++++++++++++
 tools/perf/util/cpumap.h                      |  17 ++
 tools/perf/util/env.h                         |   1 +
 tools/perf/util/event.c                       |   7 +-
 tools/perf/util/header.c                      |  77 ++++++---
 tools/perf/util/header.h                      |   4 +
 tools/perf/util/stat-display.c                |  16 ++
 tools/perf/util/stat-shadow.c                 |   1 +
 tools/perf/util/stat.h                        |   2 +
 tools/perf/util/synthetic-events.c            |   1 +
 17 files changed, 395 insertions(+), 32 deletions(-)

-- 
2.34.1
