Message-ID: <63c76af4-6451-4d6a-8aeb-0bc4812c4101@arm.com>
Date: Tue, 2 Jul 2024 11:23:58 +0100
From: Hongyan Xia <hongyan.xia2@....com>
To: Tejun Heo <tj@...nel.org>, rafael@...nel.org, viresh.kumar@...aro.org
Cc: linux-pm@...r.kernel.org, void@...ifault.com,
linux-kernel@...r.kernel.org, kernel-team@...a.com, mingo@...hat.com,
peterz@...radead.org, David Vernet <dvernet@...a.com>,
"Rafael J . Wysocki" <rafael.j.wysocki@...el.com>
Subject: Re: [PATCH 2/2] sched_ext: Add cpuperf support
On 19/06/2024 04:12, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.
>
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
>
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
> different CPUs in the system.
>
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
> relative to its max performance.
>
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
>
> This gives direct control over CPU performance setting to the BPF scheduler.
> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling frequency holding heuristics as it may
> not apply well to sched_ext schedulers which may have a lot weaker
> connection between tasks and their current / last CPU.
>
> With cpuperf support added, there is no reason to block uclamp. Enable while
> at it.
>
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
>
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Reviewed-by: David Vernet <dvernet@...a.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@...el.com>
> Cc: Viresh Kumar <viresh.kumar@...aro.org>
> ---
> kernel/sched/cpufreq_schedutil.c | 12 +-
> kernel/sched/ext.c | 83 ++++++++++++-
> kernel/sched/ext.h | 9 ++
> kernel/sched/sched.h | 1 +
> tools/sched_ext/include/scx/common.bpf.h | 3 +
> tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++-
> tools/sched_ext/scx_qmap.c | 8 ++
> 7 files changed, 252 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 972b7dd65af2..12174c0137a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>
> static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> {
> - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> + unsigned long min, max;
> + unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> + scx_cpuperf_target(sg_cpu->cpu);
>
> util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> util = max(util, boost);
> @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
> unsigned long idle_calls;
> bool ret;
>
> + /*
> + * The heuristics in this function are for the fair class. For SCX, the
> + * performance target comes directly from the BPF scheduler. Let's just
> + * follow it.
> + */
> + if (scx_switched_all())
> + return false;
> +
> /* if capped by uclamp_max, always update to be in compliance */
> if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
> return false;
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f814e84ceeb3..04fb0eeee5ec 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -16,6 +16,8 @@ enum scx_consts {
> SCX_EXIT_BT_LEN = 64,
> SCX_EXIT_MSG_LEN = 1024,
> SCX_EXIT_DUMP_DFL_LEN = 32768,
> +
> + SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
> };
>
> enum scx_exit_kind {
> @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
> .update_curr = update_curr_scx,
>
> #ifdef CONFIG_UCLAMP_TASK
> - .uclamp_enabled = 0,
> + .uclamp_enabled = 1,
> #endif
> };
>
Hi. I know this is a bit late, but this one-line change has quite
interesting implications.
With this patch applied but without flipping this knob from 0 to 1, the
series would give me the perfect opportunity to implement a custom
uclamp within sched_ext on top of the cpufreq support it adds. I think
this is what some vendors looking at sched_ext would want as well. But
with .uclamp_enabled == 1, the mainline uclamp implementation is in
effect regardless of which ext scheduler is loaded. In fact,
uclamp_{inc,dec}() are called before {enqueue,dequeue}_task(), so now
there is no easy way to circumvent it.
What would be really nice is to have cpufreq support in sched_ext
without forcing uclamp_enabled. That said, there will also be people who
are happy with the current uclamp implementation and just want to reuse
it. The best option would be to let the loaded scheduler decide,
somehow, though I don't yet see an easy way to do that.
> [...]