linux-kernel - [PATCH v2 0/6] perf sched: Introduce stats tool

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241122084452.1064968-1-swapnil.sapkal@amd.com>
Date: Fri, 22 Nov 2024 08:44:46 +0000
From: Swapnil Sapkal <swapnil.sapkal@....com>
To: <peterz@...radead.org>, <mingo@...hat.com>, <acme@...nel.org>,
	<namhyung@...nel.org>, <irogers@...gle.com>, <james.clark@....com>
CC: <ravi.bangoria@....com>, <yu.c.chen@...el.com>, <mark.rutland@....com>,
	<alexander.shishkin@...ux.intel.com>, <jolsa@...nel.org>,
	<rostedt@...dmis.org>, <vincent.guittot@...aro.org>,
	<adrian.hunter@...el.com>, <kan.liang@...ux.intel.com>,
	<gautham.shenoy@....com>, <kprateek.nayak@....com>, <juri.lelli@...hat.com>,
	<yangjihong@...edance.com>, <void@...ifault.com>, <tj@...nel.org>,
	<vineethr@...ux.ibm.com>, <linux-kernel@...r.kernel.org>,
	<linux-perf-users@...r.kernel.org>, <santosh.shukla@....com>,
	<ananth.narayan@....com>, <sandipan.das@....com>, Swapnil Sapkal
	<swapnil.sapkal@....com>
Subject: [PATCH v2 0/6] perf sched: Introduce stats tool

MOTIVATION
----------

Existing `perf sched` is quite exhaustive and provides lot of insights
into scheduler behavior but it quickly becomes impractical to use for
long running or scheduler intensive workload. For ex, `perf sched record`
has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC Server), and it
generates huge 56G perf.data for which perf takes ~137 mins to prepare
and write it to disk [1].

Unlike `perf sched record`, which hooks onto set of scheduler tracepoints
and generates samples on a tracepoint hit, `perf sched stats record` takes
snapshot of the /proc/schedstat file before and after the workload, i.e.
there is almost zero interference on workload run. Also, it takes very
minimal time to parse /proc/schedstat, convert it into perf samples and
save those samples into perf.data file. Result perf.data file is much
smaller. So, overall `perf sched stats record` is much more light weight
compare to `perf sched record`.

We, internally at AMD, have been using this (a variant of this, known as
"sched-scoreboard"[2]) and found it to be very useful to analyse impact
of any scheduler code changes[3][4].

Please note that, this is not a replacement of perf sched record/report.
The intended users of the new tool are scheduler developers, not regular
users.

USAGE
-----

  # perf sched stats record
  # perf sched stats report
  # perf sched stats diff

Note: Although `perf sched stats` tool supports workload profiling syntax
(i.e. -- <workload> ), the recorded profile is still systemwide since the
/proc/schedstat is a systemwide file.

HOW TO INTERPRET THE REPORT
---------------------------

The `perf sched stats report` starts with description of the columns
present in the report. These column names are gievn before cpu and
domain stats to improve the readability of the report.

  ----------------------------------------------------------------------------------------------------
  DESC                    -> Description of the field
  COUNT                   -> Value of the field
  PCT_CHANGE              -> Percent change with corresponding base value
  AVG_JIFFIES             -> Avg time in jiffies between two consecutive occurrence of event
  ----------------------------------------------------------------------------------------------------

Next is the total profiling time in terms of jiffies:

  ----------------------------------------------------------------------------------------------------
  Time elapsed (in jiffies)                                   :       24537
  ----------------------------------------------------------------------------------------------------

Next is CPU scheduling statistics. These are simple diffs of
/proc/schedstat CPU lines along with description. The report also
prints % relative to base stat.

In the example below, schedule() left the CPU0 idle 98.19% of the time.
16.54% of total try_to_wake_up() was to wakeup local CPU. And, the total
waittime by tasks on CPU0 is 0.49% of the total runtime by tasks on the
same CPU.

  ----------------------------------------------------------------------------------------------------
  CPU 0
  ----------------------------------------------------------------------------------------------------
  DESC                                                                COUNT  PCT_CHANGE
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                         :           0
  Legacy counter can be ignored                               :           0
  schedule() called                                           :       17138
  schedule() left the processor idle                          :       16827  (  98.19% )
  try_to_wake_up() was called                                 :         508
  try_to_wake_up() was called to wake up the local cpu        :          84  (  16.54% )
  total runtime by tasks on this processor (in jiffies)       :  2408959243
  total waittime by tasks on this processor (in jiffies)      :    11731825  (  0.49% )
  total timeslices run on this cpu                            :         311
  ----------------------------------------------------------------------------------------------------

Next is load balancing statistics. For each of the sched domains
(eg: `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
the following three categories:

  1) Idle Load Balance: Load balancing performed on behalf of a long
                        idling CPU by some other CPU.
  2) Busy Load Balance: Load balancing performed when the CPU was busy.
  3) New Idle Balance : Load balancing performed when a CPU just became
                        idle.

Under each of these three categories, sched stats report provides
different load balancing statistics. Along with direct stats, the
report also contains derived metrics prefixed with *. Example:

  ----------------------------------------------------------------------------------------------------
  CPU 0 DOMAIN SMT CPUS <0, 64>
  ----------------------------------------------------------------------------------------------------
  DESC                                                                     COUNT     AVG_JIFFIES
  ----------------------------------------- <Category idle> ------------------------------------------
  load_balance() count on cpu idle                                 :          50   $      490.74 $
  load_balance() found balanced on cpu idle                        :          42   $      584.21 $
  load_balance() move task failed on cpu idle                      :           8   $     3067.12 $
  imbalance sum on cpu idle                                        :           8
  pull_task() count on cpu idle                                    :           0
  pull_task() when target task was cache-hot on cpu idle           :           0
  load_balance() failed to find busier queue on cpu idle           :           0   $        0.00 $
  load_balance() failed to find busier group on cpu idle           :          42   $      584.21 $
  *load_balance() success count on cpu idle                        :           0
  *avg task pulled per successful lb attempt (cpu idle)            :        0.00
  ----------------------------------------- <Category busy> ------------------------------------------
  load_balance() count on cpu busy                                 :           2   $    12268.50 $
  load_balance() found balanced on cpu busy                        :           2   $    12268.50 $
  load_balance() move task failed on cpu busy                      :           0   $        0.00 $
  imbalance sum on cpu busy                                        :           0
  pull_task() count on cpu busy                                    :           0
  pull_task() when target task was cache-hot on cpu busy           :           0
  load_balance() failed to find busier queue on cpu busy           :           0   $        0.00 $
  load_balance() failed to find busier group on cpu busy           :           1   $    24537.00 $
  *load_balance() success count on cpu busy                        :           0
  *avg task pulled per successful lb attempt (cpu busy)            :        0.00
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :         427   $       57.46 $
  load_balance() found balanced on cpu newly idle                  :         382   $       64.23 $
  load_balance() move task failed on cpu newly idle                :          45   $      545.27 $
  imbalance sum on cpu newly idle                                  :          48
  pull_task() count on cpu newly idle                              :           0
  pull_task() when target task was cache-hot on cpu newly idle     :           0
  load_balance() failed to find busier queue on cpu newly idle     :           0   $        0.00 $
  load_balance() failed to find busier group on cpu newly idle     :         382   $       64.23 $
  *load_balance() success count on cpu newly idle                  :           0
  *avg task pulled per successful lb attempt (cpu newly idle)      :        0.00
  ----------------------------------------------------------------------------------------------------

Consider following line:

  load_balance() found balanced on cpu newly idle                  :         382    $      64.23 $

While profiling was active, the load-balancer found 382 times the load
needs to be balanced on a newly idle CPU 0. Following value encapsulated
inside $ is average jiffies between two events (24537 / 382 = 64.23).

Next are active_load_balance() stats. alb did not trigger while the
profiling was active, hence it's all 0s.

  --------------------------------- <Category active_load_balance()> ---------------------------------
  active_load_balance() count                                      :           0
  active_load_balance() move task failed                           :           0
  active_load_balance() successfully moved a task                  :           0
  ----------------------------------------------------------------------------------------------------

Next are sched_balance_exec() and sched_balance_fork() stats. They are
not used but we kept it in RFC just for legacy purpose. Unless opposed,
we plan to remove them in next revision.

Next are wakeup statistics. For every domain, the report also shows
task-wakeup statistics. Example:

  ------------------------------------------- <Wakeup Info> ------------------------------------------
  try_to_wake_up() awoke a task that last ran on a diff cpu       :       12070
  try_to_wake_up() moved task because cache-cold on own cpu       :        3324
  try_to_wake_up() started passive balancing                      :           0
  ----------------------------------------------------------------------------------------------------

Same set of stats are reported for each CPU and each domain level.

HOW TO INTERPRET THE DIFF
-------------------------

The `perf sched stats diff` will also start with explaining the columns
present in the diff. Then it will show the diff in time in terms of
jiffies. The order of the values depends on the order of input data
files. Example:

  ----------------------------------------------------------------------------------------------------
  Time elapsed (in jiffies)                                        :        2009,       2001
  ----------------------------------------------------------------------------------------------------

Below is the sample representing the difference in cpu and domain stats of
two runs. Here third column or the values enclosed in `|...|` shows the
percent change between the two. Second and fourth columns shows the
side-by-side representions of the corresponding fields from `perf sched
stats report`.

  ----------------------------------------------------------------------------------------------------
  CPU 0
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2  PCT_CHANGE  PCT_CHANGE1 PCT_CHANGE2
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                              :           0,          0  |    0.00% |
  Legacy counter can be ignored                                    :           0,          0  |    0.00% |
  schedule() called                                                :      442939,     447305  |    0.99% |
  schedule() left the processor idle                               :      154012,     174657  |   13.40% |  (   34.77,      39.05 )
  try_to_wake_up() was called                                      :      306810,     258076  |  -15.88% |
  try_to_wake_up() was called to wake up the local cpu             :       21313,      14130  |  -33.70% |  (    6.95,       5.48 )
  total runtime by tasks on this processor (in jiffies)            :  6235330010, 5463133934  |  -12.38% |
  total waittime by tasks on this processor (in jiffies)           :  8349785693, 5755097654  |  -31.07% |  (  133.91,     105.34 )
  total timeslices run on this cpu                                 :      288869,     272599  |   -5.63% |
  ----------------------------------------------------------------------------------------------------

Below is the sample of domain stats diff:

  ----------------------------------------------------------------------------------------------------
  CPU 0, DOMAIN SMT CPUS <0, 64>
  ----------------------------------------------------------------------------------------------------
  DESC                                                                    COUNT1      COUNT2  PCT_CHANGE     AVG_JIFFIES1  AVG_JIFFIES2
  ----------------------------------------- <Category busy> ------------------------------------------
  load_balance() count on cpu busy                                 :         154,         80  |  -48.05% |  $       13.05,       25.01 $
  load_balance() found balanced on cpu busy                        :         120,         66  |  -45.00% |  $       16.74,       30.32 $
  load_balance() move task failed on cpu busy                      :           0,          4  |    0.00% |  $        0.00,      500.25 $
  imbalance sum on cpu busy                                        :        1640,        299  |  -81.77% |
  pull_task() count on cpu busy                                    :          55,         18  |  -67.27% |
  pull_task() when target task was cache-hot on cpu busy           :           0,          0  |    0.00% |
  load_balance() failed to find busier queue on cpu busy           :           0,          0  |    0.00% |  $        0.00,        0.00 $
  load_balance() failed to find busier group on cpu busy           :         120,         66  |  -45.00% |  $       16.74,       30.32 $
  *load_balance() success count on cpu busy                        :          34,         10  |  -70.59% |
  *avg task pulled per successful lb attempt (cpu busy)            :        1.62,       1.80  |   11.27% |
  ----------------------------------------- <Category idle> ------------------------------------------
  load_balance() count on cpu idle                                 :         299,        481  |   60.87% |  $        6.72,        4.16 $
  load_balance() found balanced on cpu idle                        :         197,        331  |   68.02% |  $       10.20,        6.05 $
  load_balance() move task failed on cpu idle                      :           1,          2  |  100.00% |  $     2009.00,     1000.50 $
  imbalance sum on cpu idle                                        :         145,        222  |   53.10% |
  pull_task() count on cpu idle                                    :         133,        199  |   49.62% |
  pull_task() when target task was cache-hot on cpu idle           :           0,          0  |    0.00% |
  load_balance() failed to find busier queue on cpu idle           :           0,          0  |    0.00% |  $        0.00,        0.00 $
  load_balance() failed to find busier group on cpu idle           :         197,        331  |   68.02% |  $       10.20,        6.05 $
  *load_balance() success count on cpu idle                        :         101,        148  |   46.53% |
  *avg task pulled per successful lb attempt (cpu idle)            :        1.32,       1.34  |    2.11% |
  ---------------------------------------- <Category newidle> ----------------------------------------
  load_balance() count on cpu newly idle                           :       21791,      15976  |  -26.69% |  $        0.09,        0.13 $
  load_balance() found balanced on cpu newly idle                  :       16226,      12125  |  -25.27% |  $        0.12,        0.17 $
  load_balance() move task failed on cpu newly idle                :         236,         88  |  -62.71% |  $        8.51,       22.74 $
  imbalance sum on cpu newly idle                                  :        6655,       4628  |  -30.46% |
  pull_task() count on cpu newly idle                              :        5329,       3763  |  -29.39% |
  pull_task() when target task was cache-hot on cpu newly idle     :           0,          0  |    0.00% |
  load_balance() failed to find busier queue on cpu newly idle     :           0,          0  |    0.00% |  $        0.00,        0.00 $
  load_balance() failed to find busier group on cpu newly idle     :       12649,       9914  |  -21.62% |  $        0.16,        0.20 $
  *load_balance() success count on cpu newly idle                  :        5329,       3763  |  -29.39% |
  *avg task pulled per successful lb attempt (cpu newly idle)      :        1.00,       1.00  |    0.00% |
  --------------------------------- <Category active_load_balance()> ---------------------------------
  active_load_balance() count                                      :           0,          0  |    0.00% |
  active_load_balance() move task failed                           :           0,          0  |    0.00% |
  active_load_balance() successfully moved a task                  :           0,          0  |    0.00% |
  --------------------------------- <Category sched_balance_exec()> ----------------------------------
  sbe_count is not used                                            :           0,          0  |    0.00% |
  sbe_balanced is not used                                         :           0,          0  |    0.00% |
  sbe_pushed is not used                                           :           0,          0  |    0.00% |
  --------------------------------- <Category sched_balance_fork()> ----------------------------------
  sbf_count is not used                                            :           0,          0  |    0.00% |
  sbf_balanced is not used                                         :           0,          0  |    0.00% |
  sbf_pushed is not used                                           :           0,          0  |    0.00% |
  ------------------------------------------ <Wakeup Info> -------------------------------------------
  try_to_wake_up() awoke a task that last ran on a diff cpu        :       16606,      10214  |  -38.49% |
  try_to_wake_up() moved task because cache-cold on own cpu        :        3184,       2534  |  -20.41% |
  try_to_wake_up() started passive balancing                       :           0,          0  |    0.00% |
  ----------------------------------------------------------------------------------------------------

v1: https://lore.kernel.org/lkml/20240916164722.1838-1-ravi.bangoria@amd.com
v1->v2
 - Add the support for `perf sched stats diff`
 - Add column header in report for better readability. Use
   procfs__mountpoint for consistency. Add hint for enabling
   CONFIG_SCHEDSTAT if disabled. [James Clark]
 - Use a single header file for both cpu and domain fileds. Change
   the layout of structs to minimise the padding. I tried changing
   `v15` to `15` in the header files but it was not giving any
   benefits so drop the idea. [Namhyung Kim]
 - Add tested-by.

RFC: https://lore.kernel.org/r/20240508060427.417-1-ravi.bangoria@amd.com
RFC->v1:
 - [Kernel] Print domain name along with domain number in /proc/schedstat
   file.
 - s/schedstat/stats/ for the subcommand.
 - Record domain name and cpumask details, also show them in report.
 - Add CPU filtering capability at record and report time.
 - Add /proc/schedstat v16 support.
 - Live mode support. Similar to perf stat command, live mode prints the
   sched stats on the stdout.
 - Add pager support in `perf sched stats report` for better scrolling.
 - Some minor cosmetic changes in report output to improve readability.
 - Rebase to latest perf-tools-next/perf-tools-next (1de5b5dcb835).

TODO:
 - Add perf unit tests to test basic sched stats functionalities
 - Describe new tool, it's usage and interpretation of report data in the
   perf-sched man page.
 - perf sched stats records /proc/schedstat which is a CPU and domain
   level scheduler statistic. We are planning to add taskstat tool which
   reads task stats from procfs and generate scheduler statistic report
   at task granularity. this will probably a standalone tool, something
   like `perf sched taskstat record/report`.
 - Except pre-processor related checkpatch warnings, we have addressed
   most of the other possible warnings.

Patches are prepared on v6.12 (adc218676eef).

[1] https://youtu.be/lg-9aG2ajA0?t=283
[2] https://github.com/AMDESE/sched-scoreboard
[3] https://lore.kernel.org/lkml/c50bdbfe-02ce-c1bc-c761-c95f8e216ca0@amd.com/
[4] https://lore.kernel.org/lkml/3e32bec6-5e59-c66a-7676-7d15df2c961c@amd.com/


K Prateek Nayak (1):
  sched/stats: Print domain name in /proc/schedstat

Swapnil Sapkal (5):
  perf sched stats: Add record and rawdump support
  perf sched stats: Add schedstat v16 support
  perf sched stats: Add support for report subcommand
  perf sched stats: Add support for live mode
  perf sched stats: Add support for diff subcommand

 Documentation/scheduler/sched-stats.rst     |   8 +-
 kernel/sched/stats.c                        |   6 +-
 tools/lib/perf/Documentation/libperf.txt    |   2 +
 tools/lib/perf/Makefile                     |   2 +-
 tools/lib/perf/include/perf/event.h         |  56 ++
 tools/lib/perf/include/perf/schedstat-v15.h | 142 +++
 tools/lib/perf/include/perf/schedstat-v16.h | 143 +++
 tools/perf/builtin-inject.c                 |   2 +
 tools/perf/builtin-sched.c                  | 965 +++++++++++++++++++-
 tools/perf/util/event.c                     | 104 +++
 tools/perf/util/event.h                     |   2 +
 tools/perf/util/session.c                   |  20 +
 tools/perf/util/synthetic-events.c          | 256 ++++++
 tools/perf/util/synthetic-events.h          |   3 +
 tools/perf/util/tool.c                      |  20 +
 tools/perf/util/tool.h                      |   4 +-
 16 files changed, 1729 insertions(+), 6 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/schedstat-v15.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-v16.h

-- 
2.43.0