Message-ID: <9c6f1e12-ed49-4b3d-ad20-cf2b32741836@arm.com>
Date: Wed, 1 Nov 2023 23:34:18 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Hongyan Xia <Hongyan.Xia2@....com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>
Cc: Qais Yousef <qyousef@...alina.io>,
Morten Rasmussen <morten.rasmussen@....com>,
Lukasz Luba <lukasz.luba@....com>,
Christian Loehle <christian.loehle@....com>,
linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org
Subject: Re: [RFC PATCH 3/6] sched/fair: Use CFS util_avg_uclamp for
utilization and frequency
On 04/10/2023 11:04, Hongyan Xia wrote:
> From: Hongyan Xia <hongyan.xia2@....com>
>
> Switch to the new util_avg_uclamp for task and runqueue utilization.
> Since util_est() calls task_util(), this means util_est is now also a
> clamped value.
s/util_est()/task_util_est()/
but task_util_est() is max(task_util(p), _task_util_est(p))
So I don't immediately spot why util_est is a clamped value as well now.
We have a naming mismatch between the CPU- and task-related functions on
this level: cpu_util() vs. task_util_est().
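To illustrate the concern with a minimal userspace sketch (made-up
numbers; the names only mirror the kernel helpers, this is not the real
implementation): the max() can still return an un-clamped value even if
task_util() itself is clamped.

#include <stdio.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

/* task_util(p) after this series: already uclamp'ed. */
static unsigned long task_util_clamped = 200;
/* _task_util_est(p): the enqueued/ewma estimate, not obviously clamped. */
static unsigned long _task_util_est_raw = 600;

int main(void)
{
	/*
	 * task_util_est() is the max() of both, so the result can still
	 * exceed the clamped task_util() value.
	 */
	printf("task_util_est() = %lu\n",
	       max(task_util_clamped, _task_util_est_raw));
	return 0;
}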
> Now that we have the sum aggregated CFS util value, we do not need to
> consult uclamp buckets to know how the frequency should be clamped. We
> simply look at the aggregated top level root_cfs_util_uclamp to know
> what frequency to choose. Because we simulate PELT decay in
> root_cfs_util_uclamp anyway, there's no need in cpufreq_schedutil.c to
> avoid premature frequency drops.
>
> Consequently, there is no need for uclamp_rq_util_with(). This function
> takes the un-clamped util value and sends it through various clamping
> filters to get the final value. However, util_avg_uclamp is propagated
> with clamping in mind already, so it does not need to be clamped again.
>
> TODO: There are two major caveats in this patch.
> 1. At the moment sum aggregation does not consider RT tasks. The avg_rt
> signal considers all RT tasks on this rq as a single entity, which
> means the utilization of individual RT tasks is not tracked
> separately. If we want to use sum aggregation, we might have to track
> utilization of RT tasks individually.
Not sure if the RT class will accept PELT task tracking (plus there is
CONFIG_RT_GROUP_SCHED too).
> 2. Busy time accounting in compute_energy() now takes the uclamp'ed
> value. Ideally, it should reflect reality and use the un-clamp'ed
> values. However, that would require maintaining both the normal and
> uclamp'ed values for util_est. This needs to be revisited if it
> causes real problems in practice.
You could use your new rq->root_cfs_util_uclamp for eenv_pd_max_util()
(FREQUENCY_UTIL) and use rq->cfs.avg.util_avg in eenv_pd_busy_time()
(ENERGY_UTIL).
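Roughly what I have in mind, as a userspace model rather than a real
diff (the struct is a hypothetical stand-in; the field names are the
ones this series introduces):

#include <stdio.h>

/* Stand-in for the relevant rq fields after this series. */
struct rq_model {
	unsigned long cfs_util_avg;		/* rq->cfs.avg.util_avg */
	unsigned long root_cfs_util_uclamp;	/* rq->root_cfs_util_uclamp */
};

/* ENERGY_UTIL: busy time keeps reflecting the un-clamped signal. */
static unsigned long pd_busy_time(struct rq_model *rq)
{
	return rq->cfs_util_avg;
}

/* FREQUENCY_UTIL: frequency selection follows the uclamp'ed sum. */
static unsigned long pd_max_util(struct rq_model *rq)
{
	return rq->root_cfs_util_uclamp;
}

int main(void)
{
	struct rq_model rq = { .cfs_util_avg = 700,
			       .root_cfs_util_uclamp = 300 };

	printf("busy_time=%lu max_util=%lu\n",
	       pd_busy_time(&rq), pd_max_util(&rq));
	return 0;
}

That way busy time keeps reflecting reality while frequency selection
follows the clamped sum, without maintaining two util_est variants.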
[...]
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index efe3848978a0..32511ee63f01 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7402,10 +7402,12 @@ int sched_core_idle_cpu(int cpu)
> * The DL bandwidth number otoh is not a measured metric but a value computed
> * based on the task model parameters and gives the minimal utilization
> * required to meet deadlines.
> + *
> + * The util_cfs parameter has already taken uclamp into account (unless uclamp
> + * support is not compiled in).
> */
> unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> - enum cpu_util_type type,
> - struct task_struct *p)
> + enum cpu_util_type type)
There are changes proposed in the area of uclamping right now in:
https://lkml.kernel.org/r/20231026170913.32605-2-vincent.guittot@linaro.org
[...]
> /**
> @@ -282,7 +281,11 @@ static void sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> * into the same scale so we can compare.
> */
> boost = (sg_cpu->iowait_boost * max_cap) >> SCHED_CAPACITY_SHIFT;
> - boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);
> + /*
> + * TODO: Investigate what should be done here. In sum aggregation there
> + * is no such thing as uclamp_max on a rq, so how do we cap the boost
> + * value, or do we want to cap the boost frequency here at all?
> + */
https://lkml.kernel.org/r/20231026170913.32605-3-vincent.guittot@linaro.org
is proposing to cap iowait boost with max (set in effective_cpu_util(),
where max depends on uclamp_rq_get(rq, UCLAMP_MAX) too).
So you could cap iowait boost in case uclamp_rq_is_capped(), i.e. when:
if (rq->cfs.avg.util_avg > rq->root_cfs_util_uclamp + margin)
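A userspace sketch of that capping (margin and numbers are made up, only
meant to show the condition):

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10

/* Hypothetical stand-in for the relevant rq state after this series. */
struct rq_model {
	unsigned long cfs_util_avg;		/* rq->cfs.avg.util_avg */
	unsigned long root_cfs_util_uclamp;	/* rq->root_cfs_util_uclamp */
};

/*
 * The condition above: treat the rq as capped when the raw CFS util
 * exceeds the clamped sum by more than some margin.
 */
static int rq_is_capped(struct rq_model *rq, unsigned long margin)
{
	return rq->cfs_util_avg > rq->root_cfs_util_uclamp + margin;
}

int main(void)
{
	struct rq_model rq = { .cfs_util_avg = 800,
			       .root_cfs_util_uclamp = 256 };
	unsigned long boost = (512 * 1024) >> SCHED_CAPACITY_SHIFT;

	/* Cap the boost at the clamped sum when the rq is capped. */
	if (rq_is_capped(&rq, 16) && boost > rq.root_cfs_util_uclamp)
		boost = rq.root_cfs_util_uclamp;

	printf("boost=%lu\n", boost);
	return 0;
}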
[...]
> @@ -7468,11 +7459,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> static unsigned long
> cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
> {
> - struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
> - unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
> + struct rq *rq = cpu_rq(cpu);
> + struct cfs_rq *cfs_rq = &rq->cfs;
> + unsigned long util = root_cfs_util(rq);
> + bool capped = uclamp_rq_is_capped(rq);
> unsigned long runnable;
>
> - if (boost) {
> + if (boost && !capped) {
> runnable = READ_ONCE(cfs_rq->avg.runnable_avg);
> util = max(util, runnable);
> }
IMHO, this makes sense. Only allow runnable boosting when the rq is not
uclamp_max capped.
[...]