Message-ID: <ede83265-4107-4c21-bc33-f733326daafe@arm.com>
Date: Tue, 28 May 2024 12:38:11 +0100
From: Hongyan Xia <hongyan.xia2@....com>
To: Dietmar Eggemann <dietmar.eggemann@....com>,
Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Viresh Kumar <viresh.kumar@...aro.org>
Cc: Qais Yousef <qyousef@...alina.io>,
Morten Rasmussen <morten.rasmussen@....com>,
Lukasz Luba <lukasz.luba@....com>,
Christian Loehle <christian.loehle@....com>, pierre.gondois@....com,
linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org
Subject: Re: [RFC PATCH v3 3/6] sched/fair: Use util biases for utilization
and frequency
On 26/05/2024 23:52, Dietmar Eggemann wrote:
> On 07/05/2024 14:50, Hongyan Xia wrote:
>> Use the new util_avg_bias for task and runqueue utilization. We also
>> maintain separate util_est and util_est_uclamp signals.
>>
>> Now that we have the sum aggregated CFS util value, we do not need to
>
> There is the word uclamp missing somehow here.
Ack.
>> consult uclamp buckets to know how the frequency should be clamped. We
>> simply look at the aggregated top level rq->cfs.avg.util_avg +
>> rq->cfs.avg.util_avg_bias and rq->cfs.avg.util_est_uclamp to know what
>> frequency to choose and how to place tasks.
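To make the frequency side concrete, here is a minimal sketch of the
value the governor ends up reading (field names as in this series; the
helper itself is only illustrative, not part of the patch):

	/*
	 * Illustrative only: the aggregate CFS utilization seen by
	 * cpufreq under sum aggregation. util_avg_bias can be negative,
	 * so clamp the sum at 0 before taking the max with
	 * util_est_uclamp.
	 */
	static unsigned long cfs_util_uclamp_sketch(struct rq *rq)
	{
		long util = READ_ONCE(rq->cfs.avg.util_avg) +
			    READ_ONCE(rq->cfs.avg.util_avg_bias);

		util = max(util, 0L);

		return max_t(unsigned long, util,
			     READ_ONCE(rq->cfs.avg.util_est_uclamp));
	}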
>
> [...]
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 571c8de59508..14376f23a8b7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4819,14 +4819,14 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
>>
>> static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf);
>>
>> -static inline unsigned long task_util(struct task_struct *p)
>> +static inline unsigned long task_runnable(struct task_struct *p)
>> {
>> - return READ_ONCE(p->se.avg.util_avg);
>> + return READ_ONCE(p->se.avg.runnable_avg);
>> }
>>
>> -static inline unsigned long task_runnable(struct task_struct *p)
>> +static inline unsigned long task_util(struct task_struct *p)
>> {
>> - return READ_ONCE(p->se.avg.runnable_avg);
>> + return READ_ONCE(p->se.avg.util_avg);
>> }
>>
>> static inline unsigned long _task_util_est(struct task_struct *p)
>> @@ -4839,6 +4839,52 @@ static inline unsigned long task_util_est(struct task_struct *p)
>> return max(task_util(p), _task_util_est(p));
>> }
>
> IMHO, we now have two complete hierarchies of util signals:
>
> (a) (b) (uclamp version of (a))
>
> (1) task_util() task_util_uclamp() (+ task_util_bias())
>
> (2) _task_util_est() _task_util_est_uclamp()
>
> (3) task_util_est() task_util_est_uclamp()
>
> (4) cpu_util() cpu_util_uclamp()
>
> (5) cpu_util_cfs() cpu_util_cfs_uclamp()
>
> (6) cpu_util_cfs_boost() cpu_util_cfs_boost_uclamp()
>
>
> For (4), (5), (6) we now have:
>
> --- cpu_util() cpu_util_uclamp()
>
> eenv_pd_busy_time() x
>
> eenv_pd_max_util() x
>
> find_energy_efficient_cpu() x x <-- (A)
>
> cpu_util_without() x
>
> --- cpu_util_cfs() cpu_util_cfs_uclamp()
>
> cpu_overutilized() x x <-- (B)
>
> update_sg_lb_stats() x
>
> update_numa_stats() x
>
> sched_cpu_util() x
>
> --- cpu_util_cfs_boost() cpu_util_cfs_boost_uclamp()
>
> sched_balance_find_src_rq() x
>
> sugov_get_util() x
>
> The uclamp version falls back to the non-uclamp version for
> !CONFIG_UCLAMP_TASK. So it looks like in that case we calculate the
> same CPU utilization value twice in (A) and (B).
That is indeed the case. I'll see how I can avoid this duplication,
hopefully without too much #ifdef mess.
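One rough idea (a sketch only; the helper name and the exact shape are
TBD) would be to compute both values in one pass, so callers like
find_energy_efficient_cpu() and cpu_overutilized() only evaluate the rq
fields once:

	/*
	 * Sketch: return both the plain and the uclamp'ed CPU
	 * utilization in one go. For !CONFIG_UCLAMP_TASK the uclamp'ed
	 * value simply aliases the plain one, avoiding the duplicate
	 * calculation in (A) and (B).
	 */
	static inline void cpu_util_cfs_both(int cpu, unsigned long *util,
					     unsigned long *util_uclamp)
	{
		*util = cpu_util_cfs(cpu);

		if (IS_ENABLED(CONFIG_UCLAMP_TASK))
			*util_uclamp = cpu_util_cfs_uclamp(cpu);
		else
			*util_uclamp = *util;
	}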
> [...]
>
>> @@ -4970,134 +5026,29 @@ static inline unsigned long get_actual_cpu_capacity(int cpu)
>> }
>>
>> static inline int util_fits_cpu(unsigned long util,
>> - unsigned long uclamp_min,
>> - unsigned long uclamp_max,
>> + unsigned long util_uclamp,
>> int cpu)
>> {
>> unsigned long capacity = capacity_of(cpu);
>> - unsigned long capacity_orig;
>> - bool fits, uclamp_max_fits;
>> -
>> - /*
>> - * Check if the real util fits without any uclamp boost/cap applied.
>> - */
>> - fits = fits_capacity(util, capacity);
>> -
>> - if (!uclamp_is_used())
>> - return fits;
>> -
>> - /*
>> - * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
>> - * uclamp_max. We only care about capacity pressure (by using
>> - * capacity_of()) for comparing against the real util.
>> - *
>> - * If a task is boosted to 1024 for example, we don't want a tiny
>> - * pressure to skew the check whether it fits a CPU or not.
>> - *
>> - * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
>> - * should fit a little cpu even if there's some pressure.
>> - *
>> - * Only exception is for HW or cpufreq pressure since it has a direct impact
>> - * on available OPP of the system.
>> - *
>> - * We honour it for uclamp_min only as a drop in performance level
>> - * could result in not getting the requested minimum performance level.
>> - *
>> - * For uclamp_max, we can tolerate a drop in performance level as the
>> - * goal is to cap the task. So it's okay if it's getting less.
>> - */
>> - capacity_orig = arch_scale_cpu_capacity(cpu);
>>
>> - /*
>> - * We want to force a task to fit a cpu as implied by uclamp_max.
>> - * But we do have some corner cases to cater for..
>> - *
>> - *
>> - * C=z
>> - * | ___
>> - * | C=y | |
>> - * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
>> - * | C=x | | | |
>> - * | ___ | | | |
>> - * | | | | | | | (util somewhere in this region)
>> - * | | | | | | |
>> - * | | | | | | |
>> - * +----------------------------------------
>> - * CPU0 CPU1 CPU2
>> - *
>> - * In the above example if a task is capped to a specific performance
>> - * point, y, then when:
>> - *
>> - * * util = 80% of x then it does not fit on CPU0 and should migrate
>> - * to CPU1
>> - * * util = 80% of y then it is forced to fit on CPU1 to honour
>> - * uclamp_max request.
>> - *
>> - * which is what we're enforcing here. A task always fits if
>> - * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
>> - * the normal upmigration rules should withhold still.
>> - *
>> - * Only exception is when we are on max capacity, then we need to be
>> - * careful not to block overutilized state. This is so because:
>> - *
>> - * 1. There's no concept of capping at max_capacity! We can't go
>> - * beyond this performance level anyway.
>> - * 2. The system is being saturated when we're operating near
>> - * max capacity, it doesn't make sense to block overutilized.
>> - */
>> - uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
>> - uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
>> - fits = fits || uclamp_max_fits;
>> + if (fits_capacity(util_uclamp, capacity))
>> + return 1;
>>
>> - /*
>> - *
>> - * C=z
>> - * | ___ (region a, capped, util >= uclamp_max)
>> - * | C=y | |
>> - * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
>> - * | C=x | | | |
>> - * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
>> - * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
>> - * | | | | | | |
>> - * | | | | | | | (region c, boosted, util < uclamp_min)
>> - * +----------------------------------------
>> - * CPU0 CPU1 CPU2
>> - *
>> - * a) If util > uclamp_max, then we're capped, we don't care about
>> - * actual fitness value here. We only care if uclamp_max fits
>> - * capacity without taking margin/pressure into account.
>> - * See comment above.
>> - *
>> - * b) If uclamp_min <= util <= uclamp_max, then the normal
>> - * fits_capacity() rules apply. Except we need to ensure that we
>> - * enforce we remain within uclamp_max, see comment above.
>> - *
>> - * c) If util < uclamp_min, then we are boosted. Same as (b) but we
>> - * need to take into account the boosted value fits the CPU without
>> - * taking margin/pressure into account.
>> - *
>> - * Cases (a) and (b) are handled in the 'fits' variable already. We
>> - * just need to consider an extra check for case (c) after ensuring we
>> - * handle the case uclamp_min > uclamp_max.
>> - */
>> - uclamp_min = min(uclamp_min, uclamp_max);
>> - if (fits && (util < uclamp_min) &&
>> - (uclamp_min > get_actual_cpu_capacity(cpu)))
>> + if (fits_capacity(util, capacity))
>> return -1;
>>
>> - return fits;
>> + return 0;
>
> The possible return values stay the same it seems: 1 if the uclamp'ed
> util (and so util_avg) fits, -1 if only util_avg fits, and 0 if
> neither the uclamp'ed util nor util_avg fits.
Yes, this is the intended change to address some comments elsewhere and
in previous revisions.
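In other words, callers keep seeing the same tri-state as before
(illustrative snippet, mirroring the select_idle_capacity() use further
down):

	int fits = util_fits_cpu(task_util, task_util_uclamp, cpu);

	if (fits > 0) {
		/* uclamp'ed utilization fits: ideal candidate */
	} else if (fits < 0) {
		/* only the raw util fits: acceptable fallback */
	} else {
		/* neither value fits this CPU: keep searching */
	}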
> [...]
>
>> @@ -6914,6 +6861,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> "0 tasks on CFS of CPU %d, but util_avg_bias is %d\n",
>> rq->cpu, rq->cfs.avg.util_avg_bias);
>> WRITE_ONCE(rq->cfs.avg.util_avg_bias, 0);
>> + WARN_ONCE(rq->cfs.avg.util_est_uclamp,
>> + "0 tasks on CFS of CPU %d, but util_est_uclamp is %u\n",
>> + rq->cpu, rq->cfs.avg.util_est_uclamp);
>> + WRITE_ONCE(rq->cfs.avg.util_est_uclamp, 0);
>
> Maybe better:
>
> if (WARN_ONCE(...
> WRITE_ONCE(rq->cfs.avg.util_est_uclamp, 0);
>
> How can this happen, that there are 0 tasks on the rq->cfs hierarchy
> and rq->cfs.avg.util_est_uclamp is !0? Is there a specific code path
> where this happens?
Ack.
This should never happen. Triggering the warning here means something
has gone very wrong in the maths and should be reported.
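For reference, the suggested form above would look like this
(WARN_ONCE() returns the condition, so the write only happens when the
warning actually fires):

	if (WARN_ONCE(rq->cfs.avg.util_est_uclamp,
		      "0 tasks on CFS of CPU %d, but util_est_uclamp is %u\n",
		      rq->cpu, rq->cfs.avg.util_est_uclamp))
		WRITE_ONCE(rq->cfs.avg.util_est_uclamp, 0);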
>> }
>> #endif
>> }
>> @@ -7485,7 +7436,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>> static int
>> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>> {
>> - unsigned long task_util, util_min, util_max, best_cap = 0;
>> + unsigned long task_util, task_util_uclamp, best_cap = 0;
>> int fits, best_fits = 0;
>> int cpu, best_cpu = -1;
>> struct cpumask *cpus;
>> @@ -7494,8 +7445,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
>>
>> task_util = task_util_est(p);
>> - util_min = uclamp_eff_value(p, UCLAMP_MIN);
>> - util_max = uclamp_eff_value(p, UCLAMP_MAX);
>> + task_util_uclamp = task_util_est_uclamp(p);
>
> select_idle_sibling() could pass task_util and task_util_uclamp into
> select_idle_capacity(). This way we would save these calls to
> task_util_est() and task_util_est_uclamp() here.
Good idea.
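Something like this (sketch only; the final signature is open):

	/* In select_idle_sibling(), compute both values once ... */
	unsigned long task_util = task_util_est(p);
	unsigned long task_util_uclamp = task_util_est_uclamp(p);

	/* ... and hand them down to avoid recomputing per call. */
	i = select_idle_capacity(p, sd, target, task_util,
				 task_util_uclamp);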
>> for_each_cpu_wrap(cpu, cpus, target) {
>> unsigned long cpu_cap = capacity_of(cpu);
>> @@ -7503,7 +7453,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>> if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
>> continue;
>>
>> - fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>> + fits = util_fits_cpu(task_util, task_util_uclamp, cpu);
>>
>> /* This CPU fits with all requirements */
>> if (fits > 0)
>
> [...]
>