Message-ID: <CAP-5=fUCvQNsW0Tnj7Q8sjFTqTEC9YUbFxAedRFtA=5zUe7BVA@mail.gmail.com>
Date:   Wed, 17 May 2023 10:58:01 -0700
From:   Ian Rogers <irogers@...gle.com>
To:     K Prateek Nayak <kprateek.nayak@....com>
Cc:     linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org,
        acme@...nel.org, peterz@...radead.org, mingo@...hat.com,
        mark.rutland@....com, alexander.shishkin@...ux.intel.com,
        jolsa@...nel.org, namhyung@...nel.org, ravi.bangoria@....com,
        sandipan.das@....com, ananth.narayan@....com,
        gautham.shenoy@....com, eranian@...gle.com, puwen@...on.cn
Subject: Re: [PATCH v4 0/5] perf stat: Add option to aggregate data based on
 the cache topology

On Wed, May 17, 2023 at 10:22 AM K Prateek Nayak <kprateek.nayak@....com> wrote:
>
> The motivation behind this feature is to aggregate data at the LLC level
> for chiplet-based processors, which currently do not expose chiplet
> details in the sysfs CPU topology information.
>
> For completeness, the series adds the ability to aggregate data at any
> cache level. The following is an example of the output on a dual-socket
> Zen3 system (2 x 64C/128T) with 8 chiplets per socket.
>
>   $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
>     taskset -c 0-15,64-79,128-143,192-207\
>     perf bench sched messaging -p -t -l 100000 -g 8
>
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver threads per group
>     # 8 groups == 320 threads run
>
>     Total time: 7.648 [sec]
>
>     Performance counter stats for 'system wide':
>
>     S0-D0-L3-ID0             16         17,145,912      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID8             16         14,977,628      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID16            16            262,539      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID24            16              3,140      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID32            16             27,403      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID40            16             17,026      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID48            16              7,292      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID56            16              2,464      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID64            16         22,489,306      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID72            16         21,455,257      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID80            16             11,619      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID88            16             30,978      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID96            16             37,628      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID104           16             13,594      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID112           16             10,164      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID120           16             11,259      ls_dmnd_fills_from_sys.ext_cache_remote
>
>           7.779171484 seconds time elapsed
>
> The series also adds support for perf stat record and perf stat report
> to aggregate data at various cache levels. The following is an example
> of recording with aggregation at the L2 level and reporting the same
> data with aggregation at the L3 level.
>
>   $ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
>     taskset -c 0-15,64-79,128-143,192-207\
>     perf bench sched messaging -p -t -l 100000 -g 8
>
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver threads per group
>     # 8 groups == 320 threads run
>
>     Total time: 7.318 [sec]
>
>     Performance counter stats for 'system wide':
>
>     S0-D0-L2-ID0              2          2,171,980      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID1              2          2,048,494      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID2              2          2,120,293      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID3              2          2,224,725      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID4              2          2,021,618      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID5              2          1,995,331      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID6              2          2,163,029      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID7              2          2,104,623      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L2-ID8              2          1,948,776      ls_dmnd_fills_from_sys.ext_cache_remote
>     ...
>     S0-D0-L2-ID63             2              2,648      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID64             2          2,963,323      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID65             2          2,856,629      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID66             2          2,901,725      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID67             2          3,046,120      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID68             2          2,637,971      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID69             2          2,680,029      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID70             2          2,672,259      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID71             2          2,638,768      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID72             2          3,308,642      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID73             2          3,064,473      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID74             2          3,023,379      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID75             2          2,975,119      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID76             2          2,952,677      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID77             2          2,981,695      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID78             2          3,455,916      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID79             2          2,959,540      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L2-ID80             2              4,977      ls_dmnd_fills_from_sys.ext_cache_remote
>     ...
>     S1-D1-L2-ID127            2              3,359      ls_dmnd_fills_from_sys.ext_cache_remote
>
>           7.451725897 seconds time elapsed
>
>   $ sudo perf stat report --per-cache=L3
>
>     Performance counter stats for '...':
>
>     S0-D0-L3-ID0             16         16,850,093      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID8             16         16,001,493      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID16            16            301,011      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID24            16             26,276      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID32            16             48,958      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID40            16             43,799      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID48            16             16,771      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID56            16             12,544      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID64            16         22,396,824      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID72            16         24,721,441      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID80            16             29,426      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID88            16             54,348      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID96            16             41,557      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID104           16             10,084      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID112           16             14,361      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID120           16             24,446      ls_dmnd_fills_from_sys.ext_cache_remote
>
>            7.451725897 seconds time elapsed
>
> The aggregate at S0-D0-L3-ID0 is the sum of S0-D0-L2-ID0 to S0-D0-L2-ID7,
> as the L3 containing CPU0 contains the L2 instances of CPU0 to CPU7.
>
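
The roll-up described above can be checked by hand against the numbers in the
two outputs. The following is only a sketch of that arithmetic, not the code
from the series: the counts are copied from the S0-D0-L2-ID0 to S0-D0-L2-ID7
rows, while the `roll_up` helper and the L3 domain membership (CPUs 0-7 plus
assumed SMT siblings 128-135) are illustrative assumptions.

```python
from collections import defaultdict

# Counts copied from the --per-cache=L2 rows S0-D0-L2-ID0 .. S0-D0-L2-ID7.
l2_counts = {
    0: 2_171_980, 1: 2_048_494, 2: 2_120_293, 3: 2_224_725,
    4: 2_021_618, 5: 1_995_331, 6: 2_163_029, 7: 2_104_623,
}

# Assumption: the L3 that contains CPU0 spans CPUs 0-7 and their SMT
# siblings 128-135 (16 CPUs, matching the count column in the L3 output).
l3_domains = [set(range(0, 8)) | set(range(128, 136))]


def roll_up(l2_counts, l3_domains):
    """Sum L2 counts into the L3 domain that contains each L2 ID.

    Hypothetical helper: both L2 and L3 IDs are taken to be the lowest
    CPU number in the respective cache domain.
    """
    totals = defaultdict(int)
    for l2_id, count in l2_counts.items():
        for domain in l3_domains:
            if l2_id in domain:
                totals[min(domain)] += count
                break
    return dict(totals)


l3_totals = roll_up(l2_counts, l3_domains)
print(l3_totals)  # {0: 16850093}
```

The total, 16,850,093, matches the S0-D0-L3-ID0 row printed by
`perf stat report --per-cache=L3`.
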
> Cache IDs are derived from the shared_cpu_list file in the cache
> topology. This allows for --per-cache aggregation of data on a kernel
> which does not expose the cache instance ID in sysfs. Running perf
> stat gives the following output on the same system with the cache
> instance ID hidden:
>
>   $ ls /sys/devices/system/cpu/cpu0/cache/index0/
>
>     coherency_line_size  level  number_of_sets  physical_line_partition
>     shared_cpu_list  shared_cpu_map  size  type  uevent
>     ways_of_associativity
>
>   $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
>     taskset -c 0-15,64-79,128-143,192-207\
>     perf bench sched messaging -p -t -l 100000 -g 8
>
>     # Running 'sched/messaging' benchmark:
>     # 20 sender and receiver threads per group
>     # 8 groups == 320 threads run
>
>          Total time: 6.949 [sec]
>
>      Performance counter stats for 'system wide':
>
>     S0-D0-L3-ID0             16          5,297,615      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID8             16          4,347,868      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID16            16            416,593      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID24            16              4,346      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID32            16              5,506      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID40            16             15,845      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID48            16             24,164      ls_dmnd_fills_from_sys.ext_cache_remote
>     S0-D0-L3-ID56            16              4,543      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID64            16         41,610,374      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID72            16         38,393,688      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID80            16             22,188      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID88            16             22,918      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID96            16             39,230      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID104           16              6,236      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID112           16             66,846      ls_dmnd_fills_from_sys.ext_cache_remote
>     S1-D1-L3-ID120           16             72,713      ls_dmnd_fills_from_sys.ext_cache_remote
>
>            7.098471410 seconds time elapsed
>
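
As a rough illustration of deriving an ID from `shared_cpu_list`, the sketch
below parses the kernel's CPU list format and takes the lowest CPU number as
the cache ID, following the changelog note that IDs come from the first online
CPU in the cache domain. The function names and the sample strings are
assumptions made for this example; they are not taken from the patches, and
the sketch does not check whether a CPU is online.

```python
def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0-7,128-135" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        # A bare number like "3" has no upper bound, so reuse the lower one.
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def cache_id(shared_cpu_list):
    """Hypothetical helper: use the lowest CPU in the domain as the ID."""
    return min(parse_cpu_list(shared_cpu_list))


# Assumed contents of .../cpu0/cache/index3/shared_cpu_list and the
# equivalent file for cpu8 on the system shown above.
print(cache_id("0-7,128-135\n"))
print(cache_id("8-15,136-143\n"))
```

With these assumed inputs the two calls yield 0 and 8, in line with the
S0-D0-L3-ID0 and S0-D0-L3-ID8 labels in the output.
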
> A few notes:
>
> - This series makes a breaking change when saving the aggregation
>   details, as the cache level needs to be saved along with the
>   aggregation method.
>
> - This series assumes that caches at the same level are shared by the
>   same set of threads. The implementation will run into an issue if, say,
>   L1i is thread local but L1d is shared by the SMT siblings on the core.
>
> This series applies cleanly on top of the perf-tools branch from Arnaldo's tree
> (https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools)
> at commit 760ebc45746b ("perf lock contention: Add empty 'struct rq' to
> satisfy libbpf 'runqueue' type verification")
> ---
> Changelog:
> o v3->v4:
>   - Dropped the RFC tag.
>   - Broke down Patch 2 from v3 into smaller patches (kind of!)
>   - Fixed a couple of errors in docs and comments.
>
> o v2->v3:
>   - Dropped patches 1 and 2 that saved and retrieved the cache instance
>     ID when saving the cache data.
>   - The above is unnecessary as the IDs are being derived from the first
>     online CPU in the cache domain for a given cache instance.
>   - Improvements to handling cases where a cache level is not present
>     but the level is allowed by MAX_CACHE_LVL.
>   - Updated details in cover letter.
>
> o v1->v2
>   - Set cache instance ID to 0 if the file cannot be read.
>   - Fix cache level parsing function.
>   - Updated details in cover letter.
> ---
> K Prateek Nayak (5):
>   perf: Extract building cache level for a CPU into separate function
>   perf stat: Setup the foundation to allow aggregation based on cache
>     topology
>   perf stat: Save cache level information when running perf stat record
>   perf stat: Add "--per-cache" aggregation option and document the same
>   perf stat: Add tests for the "--per-cache" option

Acked-by: Ian Rogers <irogers@...gle.com>

Thanks,
Ian

>  tools/lib/perf/include/perf/cpumap.h          |   5 +
>  tools/lib/perf/include/perf/event.h           |   3 +-
>  tools/perf/Documentation/perf-stat.txt        |  16 ++
>  tools/perf/builtin-stat.c                     | 144 +++++++++++++++++-
>  .../tests/shell/lib/perf_json_output_lint.py  |   4 +-
>  tools/perf/tests/shell/stat+csv_output.sh     |  14 ++
>  tools/perf/tests/shell/stat+json_output.sh    |  13 ++
>  tools/perf/util/cpumap.c                      | 119 +++++++++++++++
>  tools/perf/util/cpumap.h                      |  28 ++++
>  tools/perf/util/event.c                       |   7 +-
>  tools/perf/util/header.c                      |  62 +++++---
>  tools/perf/util/header.h                      |   4 +
>  tools/perf/util/stat-display.c                |  17 +++
>  tools/perf/util/stat.h                        |   2 +
>  tools/perf/util/synthetic-events.c            |   1 +
>  15 files changed, 409 insertions(+), 30 deletions(-)
>
> --
> 2.34.1
>
