linux-kernel - Re: [RFC PATCH 06/16] sched/schedutil: Add a new tunable to dictate response time

Message-ID: <c55339cd-85d6-4777-beec-41c4d9931b9a@arm.com>
Date: Tue, 17 Sep 2024 00:22:15 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...nel.org>,
 Peter Zijlstra <peterz@...radead.org>,
 Vincent Guittot <vincent.guittot@...aro.org>,
 "Rafael J. Wysocki" <rafael@...nel.org>,
 Viresh Kumar <viresh.kumar@...aro.org>
Cc: Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>,
 John Stultz <jstultz@...gle.com>, linux-pm@...r.kernel.org,
 linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 06/16] sched/schedutil: Add a new tunable to dictate
 response time

On 20/08/2024 18:35, Qais Yousef wrote:
> The new tunable, response_time_ms, allows us to speed up or slow down
> the response time of the policy to meet the perf, power and thermal
> characteristic desired by the user/sysadmin. There's no single universal
> trade-off that we can apply for all systems even if they use the same
> SoC. The form factor of the system, the dominant use case, and in case
> of battery powered systems, the size of the battery and presence or
> absence of active cooling can play a big role on what would be best to
> use.
> 
> The new tunable provides sensible defaults, but yet gives the power to
> control the response time to the user/sysadmin, if they wish to.
> 
> This tunable is applied before we apply the DVFS headroom.
> 
> The default behavior of applying 1.25 headroom can be re-instated easily
> now. But we continue to keep the min required headroom to overcome
> hardware limitation in its speed to change DVFS. And any additional
> headroom to speed things up must be applied by userspace to match their
> expectation for best perf/watt as it dictates a type of policy that will
> be better for some systems, but worse for others.
> 
> There's a whitespace clean up included in sugov_start().
> 
> Signed-off-by: Qais Yousef <qyousef@...alina.io>
> ---
>  Documentation/admin-guide/pm/cpufreq.rst |  17 +++-
>  drivers/cpufreq/cpufreq.c                |   4 +-
>  include/linux/cpufreq.h                  |   3 +
>  kernel/sched/cpufreq_schedutil.c         | 115 ++++++++++++++++++++++-
>  4 files changed, 132 insertions(+), 7 deletions(-)
> 
> diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
> index 6adb7988e0eb..fa0d602a920e 100644
> --- a/Documentation/admin-guide/pm/cpufreq.rst
> +++ b/Documentation/admin-guide/pm/cpufreq.rst
> @@ -417,7 +417,7 @@ is passed by the scheduler to the governor callback which causes the frequency
>  to go up to the allowed maximum immediately and then draw back to the value
>  returned by the above formula over time.
>  
> -This governor exposes only one tunable:
> +This governor exposes two tunables:
>  
>  ``rate_limit_us``
>  	Minimum time (in microseconds) that has to pass between two consecutive
> @@ -427,6 +427,21 @@ This governor exposes only one tunable:
>  	The purpose of this tunable is to reduce the scheduler context overhead
>  	of the governor which might be excessive without it.
>  
> +``respone_time_ms``
> +	Amount of time (in milliseconds) required to ramp the policy from
> +	lowest to highest frequency. Can be decreased to speed up the
                  ^^^^^^^^^^^^^^^^^

IMHO this has changed. It should now be the time to ramp from the lowest
(or better, 0) to the second highest frequency.

https://lkml.kernel.org/r/20230827233203.1315953-6-qyousef@layalina.io

[...]

> @@ -59,6 +63,70 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>  
>  /************************ Governor internals ***********************/
>  
> +static inline u64 sugov_calc_freq_response_ms(struct sugov_policy *sg_policy)
> +{
> +	int cpu = cpumask_first(sg_policy->policy->cpus);
> +	unsigned long cap = arch_scale_cpu_capacity(cpu);
> +	unsigned int max_freq, sec_max_freq;
> +
> +	max_freq = sg_policy->policy->cpuinfo.max_freq;
> +	sec_max_freq = __resolve_freq(sg_policy->policy,
> +				      max_freq - 1,
> +				      CPUFREQ_RELATION_H);
> +
> +	/*
> +	 * We will request max_freq as soon as util crosses the capacity at
> +	 * second highest frequency. So effectively our response time is the
> +	 * util at which we cross the cap@2nd_highest_freq.
> +	 */
> +	cap = sec_max_freq * cap / max_freq;
> +
> +	return approximate_runtime(cap + 1);
> +}

This still uses the CPU capacity value taken from the dt entry

  capacity-dmips-mhz = <578> (CPU0 on juno-r0)
                        ^^^

i.e. frequency invariance is not considered.

[    1.943356] CPU0 max_freq=850000 sec_max_freq=775000 cap=578 cap_at_sec_max_opp=527 runtime=34
                                                        ^^^^^^^    
[    1.957593] CPU1 max_freq=1100000 sec_max_freq=950000 cap=1024 cap_at_sec_max_opp=884 runtime=92
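
For reference, runtime=34 is roughly what an idealized PELT ramp predicts
for util = cap_at_sec_max_opp + 1. The sketch below is a userspace
approximation, not the kernel's approximate_runtime(): it assumes util
converges geometrically towards 1024 with a half-life of 32 periods of
~1ms each, and it uses floating point. The name pelt_ramp_ms() and both
constants are made up for illustration.

```c
/* Assumed decay factor per ~1ms PELT period: y^32 = 0.5, y = 2^(-1/32). */
#define PELT_Y		0.97857206
#define PELT_MAX_UTIL	1024.0

/*
 * Hypothetical model: count ~1ms periods until a task that runs
 * continuously, starting from util 0, reaches @util. Idealized floating
 * point; the kernel uses integer PELT sums, so results can differ slightly.
 */
unsigned int pelt_ramp_ms(unsigned long util)
{
	double avg = 0.0;
	unsigned int ms = 0;

	/* The asymptote itself is never reached, so clamp the target. */
	if (util >= PELT_MAX_UTIL)
		util = PELT_MAX_UTIL - 1;

	while (avg < (double)util) {
		avg = avg * PELT_Y + PELT_MAX_UTIL * (1.0 - PELT_Y);
		ms++;
	}

	return ms;
}
```

pelt_ramp_ms(527 + 1) gives 34, in line with the CPU0 line above. For CPU1
(884 + 1) this model gives 93 against the logged 92, so treat it as an
approximation only.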


# cat /sys/devices/system/cpu/cpu*/cpu_capacity
446
^^^
1024
1024
446
446
446
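
The effect on the scaling step can be checked with a small userspace
sketch (not kernel code). It redoes the cap = sec_max_freq * cap / max_freq
computation from the hunk above with the frequencies logged for CPU0. The
helper name cap_at_freq() is made up for illustration.

```c
/*
 * Hypothetical helper mirroring the scaling step quoted above:
 *     cap = sec_max_freq * cap / max_freq;
 * Integer division truncates, as it does in the kernel expression.
 */
unsigned long cap_at_freq(unsigned long cap, unsigned long freq,
			  unsigned long max_freq)
{
	return freq * cap / max_freq;
}
```

cap_at_freq(578, 775000, 850000) gives 527, matching cap_at_sec_max_opp in
the log, whereas the 446 reported in sysfs would give 406.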

[...]