Message-ID: <0778a20a-00cb-4a90-9e8e-99ff033dc23d@amd.com>
Date: Fri, 23 Jan 2026 21:49:45 +0530
From: Swapnil Sapkal <swapnil.sapkal@....com>
To: Shrikanth Hegde <sshegde@...ux.ibm.com>
CC: <ravi.bangoria@....com>, <yu.c.chen@...el.com>, <mark.rutland@....com>,
<alexander.shishkin@...ux.intel.com>, <jolsa@...nel.org>,
<rostedt@...dmis.org>, <vincent.guittot@...aro.org>,
<adrian.hunter@...el.com>, <kan.liang@...ux.intel.com>,
<gautham.shenoy@....com>, <kprateek.nayak@....com>, <juri.lelli@...hat.com>,
<yangjihong@...edance.com>, <void@...ifault.com>, <tj@...nel.org>,
<ctshao@...gle.com>, <quic_zhonhan@...cinc.com>, <thomas.falcon@...el.com>,
<blakejones@...gle.com>, <ashelat@...hat.com>, <leo.yan@....com>,
<dvyukov@...gle.com>, <ak@...ux.intel.com>, <yujie.liu@...el.com>,
<graham.woodward@....com>, <ben.gainey@....com>, <vineethr@...ux.ibm.com>,
<tim.c.chen@...ux.intel.com>, <linux@...blig.org>, <santosh.shukla@....com>,
<sandipan.das@....com>, <linux-kernel@...r.kernel.org>,
<linux-perf-users@...r.kernel.org>, <peterz@...radead.org>,
<mingo@...hat.com>, <acme@...nel.org>, <namhyung@...nel.org>,
<irogers@...gle.com>, <james.clark@....com>
Subject: Re: [PATCH v5 00/10] perf sched: Introduce stats tool
Hi Shrikanth,
On 21-01-2026 23:22, Shrikanth Hegde wrote:
>
>
> On 1/19/26 11:28 PM, Swapnil Sapkal wrote:
>> MOTIVATION
>> ----------
>>
>> The existing `perf sched` is quite exhaustive and provides a lot of
>> insight into scheduler behavior, but it quickly becomes impractical to
>> use for long-running or scheduler-intensive workloads. For example,
>> `perf sched record` has ~7.77% overhead on hackbench (with 25 groups
>> each running 700K loops on a 2-socket 128 Cores 256 Threads 3rd
>> Generation EPYC Server), and it generates a huge 56G perf.data file
>> which perf takes ~137 mins to prepare and write to disk [1].
>>
>> Unlike `perf sched record`, which hooks onto a set of scheduler
>> tracepoints and generates samples on every tracepoint hit, `perf sched
>> stats record` takes a snapshot of the /proc/schedstat file before and
>> after the workload, i.e. there is almost zero interference with the
>> workload run. Also, it takes very little time to parse /proc/schedstat,
>> convert it into perf samples and save those samples into the perf.data
>> file. The resulting perf.data file is much smaller. So, overall, `perf
>> sched stats record` is much more lightweight compared to `perf sched
>> record`.
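>>
>> Conceptually, the record step amounts to something like the following
>> (a simplified illustration only, not the actual implementation; the
>> real tool stores the snapshots as perf samples in perf.data rather
>> than as text files):
>>
>>   # cat /proc/schedstat > schedstat.before
>>   # <run workload>
>>   # cat /proc/schedstat > schedstat.after
>>
>> and the report step is essentially a per-field diff of the two
>> snapshots.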
>>
>> We, internally at AMD, have been using this (a variant of it, known as
>> "sched-scoreboard"[2]) and found it to be very useful for analysing the
>> impact of scheduler code changes[3][4]. Prateek used v2[5] of this
>> patch series to report his analysis[6][7].
>>
>> Please note that this is not a replacement for perf sched record/report.
>> The intended users of the new tool are scheduler developers, not regular
>> users.
>>
>> USAGE
>> -----
>>
>> # perf sched stats record
>> # perf sched stats report
>> # perf sched stats diff
>>
>> Note: Although the `perf sched stats` tool supports the workload
>> profiling syntax (i.e. -- <workload>), the recorded profile is still
>> systemwide since /proc/schedstat is a systemwide file.
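>>
>> For example, to profile a single benchmark run (the collected stats
>> still cover every CPU in the system):
>>
>>   # perf sched stats record -- perf bench sched messaging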
>>
>> HOW TO INTERPRET THE REPORT
>> ---------------------------
>>
>> The `perf sched stats report` starts with a description of the columns
>> present in the report. These column names are given before the CPU and
>> domain stats to improve the readability of the report.
>>
>>
>> ----------------------------------------------------------------------------------------------------
>> DESC        -> Description of the field
>> COUNT       -> Value of the field
>> PCT_CHANGE  -> Percent change with corresponding base value
>> AVG_JIFFIES -> Avg time in jiffies between two consecutive occurrence of event
>> ----------------------------------------------------------------------------------------------------
>>
>> Next is the total profiling time in terms of jiffies:
>>
>>
>> ----------------------------------------------------------------------------------------------------
>> Time elapsed (in jiffies)                                 :        24537
>> ----------------------------------------------------------------------------------------------------
>>
>
> nit:
> Is there a way to export the HZ value here too?
As far as I know, we can get this value from /proc/config.gz, but that
depends on 'CONFIG_IKCONFIG_PROC' being enabled.

Peter, is it okay to export the HZ value through a debugfs file, say
something like '/sys/kernel/debug/sched/hz_value'? Though I am not sure
if this is useful for anything else.
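
For reference, with CONFIG_IKCONFIG_PROC enabled, the value can be read
with something like:

  # zcat /proc/config.gz | grep -w CONFIG_HZ
  CONFIG_HZ=250

(the 250 above is just an example; the actual value depends on the
kernel config).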
>
>> Next are the CPU scheduling statistics. These are simple diffs of the
>> /proc/schedstat CPU lines along with their description. The report also
>> prints each stat as a percentage relative to its base stat.
>>
>> In the example below, schedule() left CPU0 idle 36.58% of the time.
>> 0.45% of all try_to_wake_up() calls were to wake up the local CPU. And
>> the total wait time of tasks on CPU0 is 48.70% of their total runtime
>> on the same CPU.
>>
>>
>> ----------------------------------------------------------------------------------------------------
>> CPU 0
>> ----------------------------------------------------------------------------------------------------
>> DESC                                                    COUNT      PCT_CHANGE
>> ----------------------------------------------------------------------------------------------------
>> yld_count                          :                0
>> array_exp                          :                0
>> sched_count                        :           402267
>> sched_goidle                       :           147161    (  36.58% )
>> ttwu_count                         :           236309
>> ttwu_local                         :             1062    (   0.45% )
>> rq_cpu_time                        :       7083791148
>> run_delay                          :       3449973971    (  48.70% )
>> pcount                             :           255035
>> ----------------------------------------------------------------------------------------------------
>>
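>> The percentages in the PCT_CHANGE column are relative to the
>> corresponding base counter, i.e. for the example above:
>>
>>   sched_goidle / sched_count  =     147161 / 402267      = 36.58%
>>   ttwu_local   / ttwu_count   =       1062 / 236309      =  0.45%
>>   run_delay    / rq_cpu_time  = 3449973971 / 7083791148  = 48.70%
>>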
>> Next are the load balancing statistics. For each of the sched domains
>> (e.g. `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
>> the following three categories:
>>
>> 1) Idle Load Balance: Load balancing performed on behalf of a long
>> idling CPU by some other CPU.
>> 2) Busy Load Balance: Load balancing performed when the CPU was busy.
>> 3) New Idle Balance : Load balancing performed when a CPU just became
>> idle.
>>
>> Under each of these three categories, the sched stats report provides
>> different load balancing statistics. Along with the direct stats, the
>> report also contains derived metrics prefixed with *. Example:
>>
>>
>> ----------------------------------------------------------------------------------------------------
>> CPU: 0 | DOMAIN: SMT | DOMAIN_CPUS: 0,64
>> ----------------------------------------------------------------------------------------------------
>> DESC                                                    COUNT      AVG_JIFFIES
>> ------------------------------------------ <Category busy> ------------------------------------------
>> busy_lb_count                      :              136    $    17.08 $
>> busy_lb_balanced                   :              131    $    17.73 $
>> busy_lb_failed                     :                0    $     0.00 $
>> busy_lb_imbalance_load             :               58
>> busy_lb_imbalance_util             :                0
>> busy_lb_imbalance_task             :                0
>> busy_lb_imbalance_misfit           :                0
>> busy_lb_gained                     :                7
>> busy_lb_hot_gained                 :                0
>> busy_lb_nobusyq                    :                2    $  1161.50 $
>> busy_lb_nobusyg                    :              129    $    18.01 $
>> *busy_lb_success_count             :                5
>> *busy_lb_avg_pulled                :             1.40
>> ------------------------------------------ <Category idle> ------------------------------------------
>> idle_lb_count                      :              449    $     5.17 $
>> idle_lb_balanced                   :              382    $     6.08 $
>> idle_lb_failed                     :                3    $   774.33 $
>> idle_lb_imbalance_load             :                0
>> idle_lb_imbalance_util             :                0
>> idle_lb_imbalance_task             :               71
>> idle_lb_imbalance_misfit           :                0
>> idle_lb_gained                     :               67
>> idle_lb_hot_gained                 :                0
>> idle_lb_nobusyq                    :                0    $     0.00 $
>> idle_lb_nobusyg                    :              382    $     6.08 $
>> *idle_lb_success_count             :               64
>> *idle_lb_avg_pulled                :             1.05
>> ---------------------------------------- <Category newidle> -----------------------------------------
>> newidle_lb_count                   :            30471    $     0.08 $
>> newidle_lb_balanced                :            28490    $     0.08 $
>> newidle_lb_failed                  :              633    $     3.67 $
>> newidle_lb_imbalance_load          :                0
>> newidle_lb_imbalance_util          :                0
>> newidle_lb_imbalance_task          :             2040
>> newidle_lb_imbalance_misfit        :                0
>> newidle_lb_gained                  :             1348
>> newidle_lb_hot_gained              :                0
>> newidle_lb_nobusyq                 :                6    $   387.17 $
>> newidle_lb_nobusyg                 :            26634    $     0.09 $
>> *newidle_lb_success_count          :             1348
>> *newidle_lb_avg_pulled             :             1.00
>> ----------------------------------------------------------------------------------------------------
>>
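>> The derived (*) metrics follow from the direct stats of the same
>> category. Taking the newidle category above as an example:
>>
>>   *newidle_lb_success_count = newidle_lb_count - newidle_lb_balanced - newidle_lb_failed
>>                             = 30471 - 28490 - 633 = 1348
>>   *newidle_lb_avg_pulled    = newidle_lb_gained / *newidle_lb_success_count
>>                             = 1348 / 1348 = 1.00
>>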
>> Consider the following line:
>>
>> newidle_lb_balanced : 28490 $ 0.08 $
>>
>> While profiling was active, the load-balancer found 28490 times the load
>> needs to be balanced on a newly idle CPU 0. Following value encapsulated
>> inside $ is average jiffies between two events (28490 / 24537 = 0.08).
>>
>
> Could you please explain this? I couldn't understand.
>
> IIUC, you are parsing two instances of /proc/schedstat,
> once at the beginning and once at the end.
>
> newidle_lb_balanced is a counter. In the beginning, every iteration could
> have decided the domain is imbalanced, and once the load stabilized, it
> could have decided the domain is balanced more often, i.e. initially the
> counter would increase quickly and then stay at more or less the same value.
>
> Also, what is the logic here: (28490 / 24537 = 0.08)?
>
Thanks for catching this. This is a mistake on my part while writing the
cover letter. Here the jiffies value and the counter values are from two
different runs; the total jiffies for the run shown above was 2323.

The value inside $ .. $ is the average number of jiffies between two
consecutive occurrences of the event. The calculation is
(total_jiffies / counter_value), which for the line above is
(2323 / 28490 = 0.08).
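The other $ .. $ values in the same report are consistent with this, e.g.
2323 / 136 = 17.08 for busy_lb_count and 2323 / 449 = 5.17 for
idle_lb_count.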
I will fix this in the man page as well.
--
Thanks and Regards,
Swapnil