Message-ID: <f022afad-793d-4a3d-acad-13fd27dd5e62@linaro.org>
Date: Tue, 17 Sep 2024 11:35:46 +0100
From: James Clark <james.clark@...aro.org>
To: Ravi Bangoria <ravi.bangoria@....com>, swapnil.sapkal@....com
Cc: yu.c.chen@...el.com, mark.rutland@....com,
alexander.shishkin@...ux.intel.com, jolsa@...nel.org, rostedt@...dmis.org,
vincent.guittot@...aro.org, bristot@...hat.com, adrian.hunter@...el.com,
james.clark@....com, kan.liang@...ux.intel.com, gautham.shenoy@....com,
kprateek.nayak@....com, juri.lelli@...hat.com, yangjihong@...edance.com,
void@...ifault.com, tj@...nel.org, linux-kernel@...r.kernel.org,
linux-perf-users@...r.kernel.org, santosh.shukla@....com,
ananth.narayan@....com, sandipan.das@....com, peterz@...radead.org,
mingo@...hat.com, acme@...nel.org, namhyung@...nel.org, irogers@...gle.com
Subject: Re: [PATCH 0/5] perf sched: Introduce stats tool
On 16/09/2024 17:47, Ravi Bangoria wrote:
> MOTIVATION
> ----------
>
> The existing `perf sched` is quite exhaustive and provides a lot of insight
> into scheduler behavior, but it quickly becomes impractical to use for
> long-running or scheduler-intensive workloads. For example, `perf sched record`
> has ~7.77% overhead on hackbench (with 25 groups each running 700K loops
> on a 2-socket 128-core 256-thread 3rd Generation EPYC server), and it
> generates a huge 56G perf.data file, which perf takes ~137 mins to prepare
> and write to disk [1].
>
> Unlike `perf sched record`, which hooks onto a set of scheduler tracepoints
> and generates a sample on each tracepoint hit, `perf sched stats record` takes
> a snapshot of the /proc/schedstat file before and after the workload, i.e.
> there is almost zero interference with the workload run. It also takes very
> little time to parse /proc/schedstat, convert it into perf samples and
> save those samples into the perf.data file. The resulting perf.data file is
> much smaller. So, overall, `perf sched stats record` is much more lightweight
> than `perf sched record`.
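(For anyone else reading along: the core idea is just a before/after diff of
the schedstat counters. A rough concept sketch in Python, assuming only the
documented cpuN line format of /proc/schedstat; this is not how the tool is
actually implemented:

    import subprocess
    import sys

    def read_schedstat():
        """Return {"cpuN": [counters...]} from the cpu lines of /proc/schedstat."""
        stats = {}
        with open("/proc/schedstat") as f:
            for line in f:
                fields = line.split()
                if fields and fields[0].startswith("cpu"):
                    stats[fields[0]] = [int(v) for v in fields[1:]]
        return stats

    before = read_schedstat()
    # run the workload given on the command line, e.g. hackbench -g 25 -l 700000
    subprocess.run(sys.argv[1:])
    after = read_schedstat()

    for cpu, end in after.items():
        start = before.get(cpu, [0] * len(end))
        print(cpu, [e - s for e, s in zip(end, start)])

The real tool additionally parses the domain lines and writes the snapshots as
perf samples into perf.data, but the overhead argument above follows directly
from this model: two small file reads instead of a sample per tracepoint hit.)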
>
> We, internally at AMD, have been using this (a variant of it, known as
> "sched-scoreboard"[2]) and found it to be very useful for analysing the
> impact of scheduler code changes[3][4].
>
> Please note that this is not a replacement for perf sched record/report.
> The intended users of the new tool are scheduler developers, not regular
> users.
>
> USAGE
> -----
>
> # perf sched stats record
> # perf sched stats report
>
> Note: Although the `perf sched stats` tool supports the workload profiling
> syntax (i.e. -- <workload>), the recorded profile is still systemwide since
> /proc/schedstat is a systemwide file.
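Just to spell the workload form out for anyone trying this, something like the
following should work (the hackbench invocation is only an illustrative guess
based on the setup described in the motivation section):

  # perf sched stats record -- hackbench -g 25 -l 700000
  # perf sched stats report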
>
> HOW TO INTERPRET THE REPORT
> ---------------------------
>
> The `perf sched stats report` output starts with the total time profiling
> was active, in jiffies:
>
> ----------------------------------------------------------------------------------------------------
> Time elapsed (in jiffies) : 24537
> ----------------------------------------------------------------------------------------------------
>
> Next come the CPU scheduling statistics. These are simple diffs of the
> /proc/schedstat CPU lines, along with a description of each. The report also
> prints percentages relative to a base stat.
>
> In the example below, schedule() left CPU0 idle 98.19% of the time,
> 16.54% of all try_to_wake_up() calls were to wake up the local CPU, and the
> total waittime of tasks on CPU0 is 0.49% of the total runtime of tasks on
> the same CPU.
>
> ----------------------------------------------------------------------------------------------------
> CPU 0
> ----------------------------------------------------------------------------------------------------
> sched_yield() count : 0
> Legacy counter can be ignored : 0
> schedule() called : 17138
> schedule() left the processor idle : 16827 ( 98.19% )
> try_to_wake_up() was called : 508
> try_to_wake_up() was called to wake up the local cpu : 84 ( 16.54% )
> total runtime by tasks on this processor (in jiffies) : 2408959243
> total waittime by tasks on this processor (in jiffies) : 11731825 ( 0.49% )
> total timeslices run on this cpu : 311
> ----------------------------------------------------------------------------------------------------
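As a sanity check for anyone interpreting the report, the percentages in that
block are plain ratios of the counters above them:

  16827 / 17138         = 98.19%   (schedule() left idle / schedule() called)
  84 / 508              = 16.54%   (local-CPU wakeups / try_to_wake_up() calls)
  11731825 / 2408959243 =  0.49%   (waittime / runtime on this CPU)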
>
> Next come the load-balancing statistics. For each of the sched domains
> (e.g. `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
> the following three categories:
>
> 1) Idle Load Balance: Load balancing performed on behalf of a long-idling
> CPU by some other CPU.
> 2) Busy Load Balance: Load balancing performed when the CPU was busy.
> 3) New Idle Balance: Load balancing performed when a CPU just became
> idle.
>
> Under each of these three categories, the sched stats report provides
> various load-balancing statistics. Along with the direct stats, the
> report also contains derived metrics prefixed with *. Example:
>
> ----------------------------------------------------------------------------------------------------
> CPU 0 DOMAIN SMT CPUS <0, 64>
> ----------------------------------------- <Category idle> ------------------------------------------
> load_balance() count on cpu idle : 50 $ 490.74 $
> load_balance() found balanced on cpu idle : 42 $ 584.21 $
> load_balance() move task failed on cpu idle : 8 $ 3067.12 $
> imbalance sum on cpu idle : 8
> pull_task() count on cpu idle : 0
> pull_task() when target task was cache-hot on cpu idle : 0
> load_balance() failed to find busier queue on cpu idle : 0 $ 0.00 $
> load_balance() failed to find busier group on cpu idle : 42 $ 584.21 $
> *load_balance() success count on cpu idle : 0
> *avg task pulled per successful lb attempt (cpu idle) : 0.00
> ----------------------------------------- <Category busy> ------------------------------------------
> load_balance() count on cpu busy : 2 $ 12268.50 $
> load_balance() found balanced on cpu busy : 2 $ 12268.50 $
> load_balance() move task failed on cpu busy : 0 $ 0.00 $
> imbalance sum on cpu busy : 0
> pull_task() count on cpu busy : 0
> pull_task() when target task was cache-hot on cpu busy : 0
> load_balance() failed to find busier queue on cpu busy : 0 $ 0.00 $
> load_balance() failed to find busier group on cpu busy : 1 $ 24537.00 $
> *load_balance() success count on cpu busy : 0
> *avg task pulled per successful lb attempt (cpu busy) : 0.00
> ---------------------------------------- <Category newidle> ----------------------------------------
> load_balance() count on cpu newly idle : 427 $ 57.46 $
> load_balance() found balanced on cpu newly idle : 382 $ 64.23 $
> load_balance() move task failed on cpu newly idle : 45 $ 545.27 $
> imbalance sum on cpu newly idle : 48
> pull_task() count on cpu newly idle : 0
> pull_task() when target task was cache-hot on cpu newly idle : 0
> load_balance() failed to find busier queue on cpu newly idle : 0 $ 0.00 $
> load_balance() failed to find busier group on cpu newly idle : 382 $ 64.23 $
> *load_balance() success count on cpu newly idle : 0
> *avg task pulled per successful lb attempt (cpu newly idle) : 0.00
> ----------------------------------------------------------------------------------------------------
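On the derived metrics: my reading (an assumption on my part, inferred from the
numbers above rather than from the code, so please correct me if it's wrong) is
that they are computed roughly like this:

    def derived_lb_metrics(count, found_balanced, move_failed, pulled):
        # *load_balance() success count: attempts that were neither
        # "found balanced" nor failed to move a task
        success = count - found_balanced - move_failed
        # *avg task pulled per successful lb attempt
        avg_pulled = pulled / success if success else 0.0
        return success, avg_pulled

    # newidle numbers from the block above: 427 - 382 - 45 = 0 successes
    print(derived_lb_metrics(427, 382, 45, 0))   # -> (0, 0.0)

which is consistent with all three categories in this example reporting 0
successes and 0.00 average tasks pulled.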
>
> Consider the following line:
>
> load_balance() found balanced on cpu newly idle : 382 $ 64.23 $
>
> While profiling was active, the load balancer found, 382 times, that the
> load did not need to be balanced on a newly idle CPU0. The value enclosed
> in $ is the average number of jiffies between two such events
> (24537 / 382 = 64.23).
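(In other words, each $ value is just time-elapsed / counter, which also checks
out against the other lines above, e.g. 24537 / 427 = 57.46 for the newidle
count line and 24537 / 50 = 490.74 for the idle count line.)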
This explanation of the $ fields is quite buried. Is there a way of
making it clearer with a column header in the report? I think even if it
were documented in the man pages it might not be that useful.
There are also other jiffies fields that don't use $. Maybe if it were
formatted like this it could be semi self-documenting:
----------------------------------------------------------------------
Time elapsed (in jiffies) : $ 24537 $
----------------------------------------------------------------------
------------------<Category newidle> ---------------------------------
load_balance() count on cpu newly idle : 427 $ 57.46 avg $
----------------------------------------------------------------------
Other than that:
Tested-by: James Clark <james.clark@...aro.org>