Message-ID: <124e9602-77c6-4e85-907c-907dfdc42087@amd.com>
Date: Wed, 18 Sep 2024 18:49:52 +0530
From: "Sapkal, Swapnil" <swapnil.sapkal@....com>
To: James Clark <james.clark@...aro.org>, Ravi Bangoria
<ravi.bangoria@....com>
CC: <yu.c.chen@...el.com>, <mark.rutland@....com>,
<alexander.shishkin@...ux.intel.com>, <jolsa@...nel.org>,
<rostedt@...dmis.org>, <vincent.guittot@...aro.org>, <bristot@...hat.com>,
<adrian.hunter@...el.com>, <james.clark@....com>,
<kan.liang@...ux.intel.com>, <gautham.shenoy@....com>,
<kprateek.nayak@....com>, <juri.lelli@...hat.com>,
<yangjihong@...edance.com>, <void@...ifault.com>, <tj@...nel.org>,
<linux-kernel@...r.kernel.org>, <linux-perf-users@...r.kernel.org>,
<santosh.shukla@....com>, <ananth.narayan@....com>, <sandipan.das@....com>,
<peterz@...radead.org>, <mingo@...hat.com>, <acme@...nel.org>,
<namhyung@...nel.org>, <irogers@...gle.com>
Subject: Re: [PATCH 0/5] perf sched: Introduce stats tool
Hi James,
On 9/17/2024 4:05 PM, James Clark wrote:
>
>
> On 16/09/2024 17:47, Ravi Bangoria wrote:
>> MOTIVATION
>> ----------
>>
>> Existing `perf sched` is quite exhaustive and provides a lot of insight
>> into scheduler behavior, but it quickly becomes impractical to use for
>> long-running or scheduler-intensive workloads. For example, `perf sched
>> record` has ~7.77% overhead on hackbench (with 25 groups each running
>> 700K loops on a 2-socket 128-Core 256-Thread 3rd Generation EPYC
>> Server), and it generates a huge 56G perf.data for which perf takes
>> ~137 mins to prepare and write it to disk [1].
>>
>> Unlike `perf sched record`, which hooks onto a set of scheduler
>> tracepoints and generates samples on a tracepoint hit, `perf sched
>> stats record` takes a snapshot of the /proc/schedstat file before and
>> after the workload, i.e. there is almost zero interference with the
>> workload run. Also, it takes very minimal time to parse
>> /proc/schedstat, convert it into perf samples and save those samples
>> into the perf.data file. The resulting perf.data file is also much
>> smaller. So, overall, `perf sched stats record` is much more
>> lightweight compared to `perf sched record`.
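[Editorial note: the snapshot-and-diff idea described above can be sketched as
follows. This is illustrative only, not perf's actual implementation; the
sample snapshot strings and counter layout are simplified assumptions.]

```python
# Sketch of the before/after snapshot diff behind `perf sched stats record`.
# /proc/schedstat "cpuN" lines are space-separated monotonically increasing
# counters; the report is essentially the element-wise difference of two
# snapshots, so the workload itself runs with almost no interference.

def parse_cpu_lines(snapshot):
    """Map 'cpuN' -> list of integer counters from a schedstat snapshot."""
    stats = {}
    for line in snapshot.strip().splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            stats[fields[0]] = [int(v) for v in fields[1:]]
    return stats

def diff_snapshots(before, after):
    """Per-CPU counter deltas between two snapshots."""
    b, a = parse_cpu_lines(before), parse_cpu_lines(after)
    return {cpu: [x - y for x, y in zip(a[cpu], b[cpu])] for cpu in a}

# Hypothetical snapshots taken before and after a workload run:
before = "cpu0 0 0 100 90 10 5 1000 500 20\n"
after  = "cpu0 0 0 17238 16917 518 89 2408960243 11732325 331\n"

print(diff_snapshots(before, after)["cpu0"])
# -> [0, 0, 17138, 16827, 508, 84, 2408959243, 11731825, 311]
```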
>>
>> We, internally at AMD, have been using this (a variant of this, known as
>> "sched-scoreboard"[2]) and found it to be very useful for analysing the
>> impact of any scheduler code changes[3][4].
>>
>> Please note that this is not a replacement for perf sched record/report.
>> The intended users of the new tool are scheduler developers, not regular
>> users.
>>
>> USAGE
>> -----
>>
>> # perf sched stats record
>> # perf sched stats report
>>
>> Note: Although the `perf sched stats` tool supports workload profiling
>> syntax (i.e. -- <workload>), the recorded profile is still systemwide
>> since /proc/schedstat is a systemwide file.
>>
>> HOW TO INTERPRET THE REPORT
>> ---------------------------
>>
>> The `perf sched stats report` starts with the total time profiling was
>> active, in jiffies:
>>
>>
>> ----------------------------------------------------------------------------------------------------
>> Time elapsed (in jiffies)                                   :         24537
>> ----------------------------------------------------------------------------------------------------
>>
>> Next come the CPU scheduling statistics. These are simple diffs of the
>> /proc/schedstat CPU lines, along with a description. The report also
>> prints each stat as a percentage relative to its base stat.
>>
>> In the example below, schedule() left CPU0 idle 98.19% of the time,
>> 16.54% of all try_to_wake_up() calls were to wake up the local CPU,
>> and the total waittime of tasks on CPU0 is 0.49% of the total runtime
>> of tasks on the same CPU.
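[Editorial note: a quick sketch of how the percentage column can be derived
from the raw deltas. The field pairings are read off the example report; this
is not the tool's source.]

```python
# Each derived percentage is simply (stat / its base stat) * 100,
# using the raw counter deltas from the CPU 0 example below.
def pct(part, base):
    return round(part / base * 100, 2)

print(pct(16827, 17138))         # schedule() left the CPU idle -> 98.19
print(pct(84, 508))              # wakeups targeting the local CPU -> 16.54
print(pct(11731825, 2408959243)) # waittime vs. runtime -> 0.49
```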
>>
>>
>> ----------------------------------------------------------------------------------------------------
>> CPU 0
>> ----------------------------------------------------------------------------------------------------
>> sched_yield() count                                         :             0
>> Legacy counter can be ignored                               :             0
>> schedule() called                                           :         17138
>> schedule() left the processor idle                          :         16827  (  98.19% )
>> try_to_wake_up() was called                                 :           508
>> try_to_wake_up() was called to wake up the local cpu        :            84  (  16.54% )
>> total runtime by tasks on this processor (in jiffies)       :    2408959243
>> total waittime by tasks on this processor (in jiffies)      :      11731825  (   0.49% )
>> total timeslices run on this cpu                            :           311
>> ----------------------------------------------------------------------------------------------------
>>
>> Next come the load balancing statistics. For each of the sched domains
>> (e.g. `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
>> the following three categories:
>>
>> 1) Idle Load Balance: Load balancing performed on behalf of a
>> long-idling CPU by some other CPU.
>> 2) Busy Load Balance: Load balancing performed when the CPU was busy.
>> 3) New Idle Balance: Load balancing performed when a CPU just became
>> idle.
>>
>> Under each of these three categories, sched stats report provides
>> different load balancing statistics. Along with direct stats, the
>> report also contains derived metrics prefixed with *. Example:
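[Editorial note: the formula below for the starred success-count metric is
inferred from the example numbers, not taken from the tool's source, so treat
it as an assumption.]

```python
# Inferred derivation of "*load_balance() success count": attempts that were
# neither "found balanced" nor "move task failed" must have moved a task.
def lb_success(count, found_balanced, move_failed):
    return count - found_balanced - move_failed

print(lb_success(50, 42, 8))     # idle category in the example -> 0
print(lb_success(2, 2, 0))       # busy category -> 0
print(lb_success(427, 382, 45))  # newidle category -> 0
```

All three categories in the example report indeed show a success count of 0,
which is why the "*avg task pulled per successful lb attempt" rows read 0.00.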
>>
>>
>> ----------------------------------------------------------------------------------------------------
>> CPU 0 DOMAIN SMT CPUS <0, 64>
>> ----------------------------------------- <Category idle> ------------------------------------------
>> load_balance() count on cpu idle                            :            50  $    490.74 $
>> load_balance() found balanced on cpu idle                   :            42  $    584.21 $
>> load_balance() move task failed on cpu idle                 :             8  $   3067.12 $
>> imbalance sum on cpu idle                                   :             8
>> pull_task() count on cpu idle                               :             0
>> pull_task() when target task was cache-hot on cpu idle      :             0
>> load_balance() failed to find busier queue on cpu idle      :             0  $      0.00 $
>> load_balance() failed to find busier group on cpu idle      :            42  $    584.21 $
>> *load_balance() success count on cpu idle                   :             0
>> *avg task pulled per successful lb attempt (cpu idle)       :          0.00
>> ----------------------------------------- <Category busy> ------------------------------------------
>> load_balance() count on cpu busy                            :             2  $  12268.50 $
>> load_balance() found balanced on cpu busy                   :             2  $  12268.50 $
>> load_balance() move task failed on cpu busy                 :             0  $      0.00 $
>> imbalance sum on cpu busy                                   :             0
>> pull_task() count on cpu busy                               :             0
>> pull_task() when target task was cache-hot on cpu busy      :             0
>> load_balance() failed to find busier queue on cpu busy      :             0  $      0.00 $
>> load_balance() failed to find busier group on cpu busy      :             1  $  24537.00 $
>> *load_balance() success count on cpu busy                   :             0
>> *avg task pulled per successful lb attempt (cpu busy)       :          0.00
>> ---------------------------------------- <Category newidle> ----------------------------------------
>> load_balance() count on cpu newly idle                      :           427  $     57.46 $
>> load_balance() found balanced on cpu newly idle             :           382  $     64.23 $
>> load_balance() move task failed on cpu newly idle           :            45  $    545.27 $
>> imbalance sum on cpu newly idle                             :            48
>> pull_task() count on cpu newly idle                         :             0
>> pull_task() when target task was cache-hot on cpu newly idle:             0
>> load_balance() failed to find busier queue on cpu newly idle:             0  $      0.00 $
>> load_balance() failed to find busier group on cpu newly idle:           382  $     64.23 $
>> *load_balance() success count on cpu newly idle             :             0
>> *avg task pulled per successful lb attempt (cpu newly idle) :          0.00
>>
>> ----------------------------------------------------------------------------------------------------
>>
>> Consider the following line:
>>
>> load_balance() found balanced on cpu newly idle             :           382  $     64.23 $
>>
>> While profiling was active, the load balancer found the load already
>> balanced 382 times on the newly idle CPU 0. The value enclosed in $
>> signs is the average number of jiffies between two such events
>> (24537 / 382 = 64.23).
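[Editorial note: the $-enclosed averages are just elapsed jiffies divided by
the event count, which can be checked against the example report.]

```python
# Average jiffies between two occurrences of an event:
#   elapsed jiffies / event count
elapsed = 24537  # "Time elapsed (in jiffies)" from the report header

print(round(elapsed / 382, 2))  # found balanced, newly idle -> 64.23
print(round(elapsed / 427, 2))  # lb count, newly idle       -> 57.46
print(round(elapsed / 45, 2))   # move task failed           -> 545.27
```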
>
> This explanation of the $ fields is quite buried. Is there a way of
> making it clearer with a column header in the report? I think even if it
> was documented in the man pages it might not be that useful.
>
Thank you for the suggestion. I will add a header in the report to
explain what each column value represents.
> There are also other jiffies fields that don't use $. Maybe if it was
> like this it could be semi self documenting:
>
> ----------------------------------------------------------------------
> Time elapsed (in jiffies) : $ 24537 $
> ----------------------------------------------------------------------
>
> ------------------<Category newidle> ---------------------------------
> load_balance() count on cpu newly idle : 427 $ 57.46 avg $
> ----------------------------------------------------------------------
>
>
> Other than that:
>
> Tested-by: James Clark <james.clark@...aro.org>
Ack.
--
Thanks and Regards,
Swapnil