linux-kernel - Re: [RFC] Documentation/scheduler/schedutil.txt

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <jhjzh3cuu9q.mognet@arm.com>
Date:   Fri, 20 Nov 2020 11:45:05 +0000
From:   Valentin Schneider <valentin.schneider@....com>
To:     Peter Zijlstra <peterz@...radead.org>
Cc:     "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Ingo Molnar <mingo@...nel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Morten Rasmussen <morten.rasmussen@....com>,
        dietmar.eggemann@....com, patrick.bellasi@...bug.net,
        lenb@...nel.org, linux-kernel@...r.kernel.org,
        ionela.voinescu@....com, qperret@...gle.com,
        viresh.kumar@...aro.org
Subject: Re: [RFC] Documentation/scheduler/schedutil.txt


Hi,

On 20/11/20 07:55, Peter Zijlstra wrote:
> Frequency- / Heterogeneous Invariance
> -------------------------------------
>
> Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
> for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
> a big CPU, we allow architectures to scale the time delta with two ratios, one
> DVFS ratio and one microarch ratio.
>
> For simple DVFS architectures (where software is in full control) we trivially
> compute the ratio as:
>
> 	    f_cur
>   r_dvfs := -----
>             f_max
>
> For more dynamic systems where the hardware is in control of DVFS (Intel,
> ARMv8.4-AMU) we use hardware counters to provide us this ratio. In specific,
> for Intel, we use:
>
> 	   APERF
>   f_cur := ----- * P0
> 	   MPERF
>
> 	     4C-turbo;	if available and turbo enabled
>   f_max := { 1C-turbo;	if turbo enabled
> 	     P0;	otherwise
>
>                     f_cur
>   r_dvfs := min( 1, ----- )
>                     f_max
>
> We pick 4C turbo over 1C turbo to make it slightly more sustainable.
>
> r_het is determined as the average performance difference between a big and
> LITTLE core when running at max frequency over 'relevant' benchmarks.

Welcome to our wonderful world where there can be more than just two types
of CPUs! A perhaps safer statement would be:

  r_het is determined as the ratio of highest performance level of the
  current CPU vs the highest performance level of any other CPU in the 
  system.


Also; do we want to further state the obvious?

  r_tot := r_het * r_dvfs

>
> The result is that the above 'running' and 'runnable' metrics become invariant
> of DVFS and Heterogenous state. IOW. we can transfer and compare them between
> CPUs.
>
> For more detail see:
>
>  - kernel/sched/pelt.h:update_rq_clock_pelt()
>  - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
>

Some of that is rephrased in

  Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"

(with added diagrams crafted with love by yours truly); I suppose a cross
reference can't hurt.

>
> UTIL_EST / UTIL_EST_FASTUP
> --------------------------
>
> Because periodic tasks have their averages decayed while they sleep, even
> though when running their expected utilization will be the same, they suffer a
> (DVFS) ramp-up after they become runnable again.
>
> To alleviate this (a default enabled option) UTIL_EST drives an (IIR) EWMA
> with the 'running' value on dequeue -- when it is highest. A further default
> enabled option UTIL_EST_FASTUP modifies the IIR filter to instantly increase
> and only decay on decrease.
>
> A further runqueue wide sum (of runnable tasks) is maintained of:
>
>   util_est := \Sum_t max( t_running, t_util_est_ewma )
>
> For more detail see: kernel/sched/fair.h:util_est_dequeue()
>
>
> UCLAMP
> ------
>
> It is possible to set effective u_min and u_max clamps on each task; the

Nit: effective clamps are the task clamps clamped by task group clamps
(yes, that is 4 times 'clamp' in a single line).

> runqueue keeps an max aggregate of these clamps for all running tasks.
>
> For more detail see: include/uapi/linux/sched/types.h
>
>
> Schedutil / DVFS
> ----------------
>
> Every time the scheduler load tracking is updated (task wakeup, task
> migration, time progression) we call out to schedutil to update the hardware
> DVFS state.
>
> The basis is the CPU runqueue's 'running' metric, which per the above it is
> the frequency invariant utilization estimate of the CPU. From this we compute
> a desired frequency like:
>
>              max( running, util_est );	if UTIL_EST
>   u_cfs := { running;			otherwise
>
>   u_clamp := clamp( u_cfs, u_min, u_max )
>
>   u := u_cfs + u_rt + u_irq + u_dl;	[approx. see source for more detail]
>
>   f_des := min( f_max, 1.25 u * f_max )
>
> XXX IO-wait; when the update is due to a task wakeup from IO-completion we
> boost 'u' above.
>

IIRC the boost is fiddled with during the above, but can be applied at
different subsequent updates (even if the task that triggered the boost is
no longer here).

> This frequency is then used to select a P-state/OPP or directly munged into a
> CPPC style request to the hardware.
>
> XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
> required to satisfy the workload.
>
> Because these callbacks are directly from the scheduler, the DVFS hardware
> interaction should be 'fast' and non-blocking. Schedutil supports
> rate-limiting DVFS requests for when hardware interaction is slow and
> expensive, this reduces effectiveness.
>
> For more information see: kernel/sched/cpufreq_schedutil.c
>
>
> NOTES
> -----
>
>  - On low-load scenarios, where DVFS is most relevant, the 'running' numbers
>    will closely reflect utilization.
>
>  - In saturated scenarios task movement will cause some transient dips,
>    suppose we have a CPU saturated with 4 tasks, then when we migrate a task
>    to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
>    new CPU will gain 0.25. This is inevitable and time progression will
>    correct this. XXX do we still guarantee f_max due to no idle-time?
>
>  - Much of the above is about avoiding DVFS dips, and independent DVFS domains
>    having to re-learn / ramp-up when load shifts.