linux-kernel - Re: [PATCH v2 9/9] sched/fair: Consider capacity inversion in util_fits

Message-ID: <20221105204141.3tno6fzuh536ye4e@airbuntu>
Date:   Sat, 5 Nov 2022 20:41:41 +0000
From:   Qais Yousef <qyousef@...alina.io>
To:     Valentin Schneider <vschneid@...hat.com>
Cc:     Qais Yousef <qais.yousef@....com>, Ingo Molnar <mingo@...nel.org>,
        "Peter Zijlstra (Intel)" <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        linux-kernel@...r.kernel.org, Xuewen Yan <xuewen.yan94@...il.com>,
        Lukasz Luba <lukasz.luba@....com>, Wei Wang <wvw@...gle.com>,
        Jonathan JMChen <Jonathan.JMChen@...iatek.com>,
        Hank <han.lin@...iatek.com>
Subject: Re: [PATCH v2 9/9] sched/fair: Consider capacity inversion in
 util_fits_cpu()

On 11/04/22 17:35, Valentin Schneider wrote:
> On 04/08/22 15:36, Qais Yousef wrote:
> > We do consider thermal pressure in util_fits_cpu() for uclamp_min only.
> > With the exception of the biggest cores which by definition are the max
> > performance point of the system and all tasks by definition should fit.
> >
> > Even under thermal pressure, the capacity of the biggest CPU is the
> > highest in the system and should still fit every task. Except when it
> > reaches capacity inversion point, then this is no longer true.
> >
> > We can handle this by using the inverted capacity as capacity_orig in
> > util_fits_cpu(). Which not only addresses the problem above, but also
> > ensure uclamp_max now considers the inverted capacity. Force fitting
> > a task when a CPU is in this adverse state will contribute to making the
> > thermal throttling last longer.
> >
> > Signed-off-by: Qais Yousef <qais.yousef@....com>
> > ---
> >  kernel/sched/fair.c | 14 +++++++++-----
> >  1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index cb32dc9a057f..77ae343e32a3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4293,12 +4293,16 @@ static inline int util_fits_cpu(unsigned long util,
> >        * For uclamp_max, we can tolerate a drop in performance level as the
> >        * goal is to cap the task. So it's okay if it's getting less.
> >        *
> > -	 * In case of capacity inversion, which is not handled yet, we should
> > -	 * honour the inverted capacity for both uclamp_min and uclamp_max all
> > -	 * the time.
> > +	 * In case of capacity inversion we should honour the inverted capacity
> > +	 * for both uclamp_min and uclamp_max all the time.
> >        */
> > -	capacity_orig = capacity_orig_of(cpu);
> > -	capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
> > +	capacity_orig = cpu_in_capacity_inversion(cpu);
> > +	if (capacity_orig) {
> > +		capacity_orig_thermal = capacity_orig;
> > +	} else {
> > +		capacity_orig = capacity_orig_of(cpu);
> > +		capacity_orig_thermal = capacity_orig - arch_scale_thermal_pressure(cpu);
> > +	}
> >
> 
> IIUC the rq->cpu_capacity_inverted computation in update_cpu_capacity() can be
> summarised as:
> 
> - If there is a PD with equal cap_orig, but higher effective (orig - thermal)
>   capacity
>   OR
>   there is a PD with pd_cap_orig > cpu_effective_cap:
>   rq->cpu_capacity_inverted = capacity_orig - thermal_load_avg(rq)
> 
> - Else:
>   rq->cpu_capacity_inverted = 0
> 
> Then, the code above uses either rq->cpu_capacity_inverted if it is
> non-zero, otherwise:
> 
>   capacity_orig - arch_scale_thermal_pressure(cpu);
> 
> Why use average thermal pressure in one case, and use instantaneous
> thermal pressure in the other?

There was a big debate on [1] about using avg vs instantaneous.

I used the average for detecting inversion to be consistent with the use of
the average in scale_rt_capacity(). I also didn't want the inversion state to
flip too quickly.

I used the instantaneous value in the other check based on that discussion.
Using the average seemed harmful when, for example, the medium drops an OPP:
by not reacting quickly at wake up we lose the chance to place the task on
a big; which, if my memory doesn't fail me, is what Xuewen was seeing.

[1] https://lore.kernel.org/lkml/24631a27-42d9-229f-d9b0-040ac993b749@arm.com/
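
To illustrate the lag I'm referring to, here is a minimal user-space sketch.
It is an assumption-laden model, not kernel code: struct thermal_model and
thermal_model_update() are hypothetical names, and the smoothing is a plain
exponentially weighted moving average rather than the PELT-based signal behind
thermal_load_avg(). It only shows that a smoothed value trails a step change
in the instantaneous one.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for a decaying average of thermal pressure.
 * This is a plain exponentially weighted moving average, NOT the
 * kernel's thermal_load_avg(); it only illustrates the lag.
 */
struct thermal_model {
	unsigned long instant;	/* latest sample, models arch_scale_thermal_pressure() */
	unsigned long avg;	/* smoothed value, models thermal_load_avg() */
};

/* Feed one new sample; avg moves 1/2^shift of the way towards it. */
void thermal_model_update(struct thermal_model *t,
			  unsigned long sample, unsigned int shift)
{
	assert(shift < 8 * sizeof(unsigned long));

	t->instant = sample;
	if (sample >= t->avg)
		t->avg += (sample - t->avg) >> shift;
	else
		t->avg -= (t->avg - sample) >> shift;
}
```

Starting from zero and feeding a constant sample of 200 with shift = 2, the
smoothed value is still well below 200 after two updates. A wake up path that
reads the smoothed value in that window would see less thermal pressure than
is actually applied, while an inversion detector reading it would not flip on
a single spike.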

> 
> Can't we get rid of rq->cpu_capacity_inverted and replace this whole thing
> with an unconditional
> 
>   capacity_orig_thermal = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu));
> 
> ?

I can't see how we'd end up with equivalent behavior then, or how that would
address the concerns raised by Xuewen and Lukasz on the RT thread regarding
avg vs instantaneous.

Specifically, if we don't use the new rq->cpu_capacity_inverted we can't handle
the case where the task is requesting to run at maximum performance but a small
amount of thermal pressure means it won't fit anywhere. The biggest PD is still
the best fit until it hits an inversion.
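
A rough user-space sketch of that case follows. It models only the uclamp_min
part of the check, and everything in it is an assumption made for illustration:
struct cpu_model, its field names and both function names are hypothetical, a
single thermal value is used without distinguishing avg from instantaneous,
and the "unconditional" variant is my reading of the suggested alternative
rather than a quote of it.

```c
#include <stdbool.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Hypothetical per-CPU snapshot; field names are illustrative only. */
struct cpu_model {
	unsigned long capacity_orig;	 /* models capacity_orig_of(cpu) */
	unsigned long thermal;		 /* thermal pressure, assumed <= capacity_orig */
	unsigned long capacity_inverted; /* models rq->cpu_capacity_inverted, 0 if not inverted */
};

/* uclamp_min check following the capacity selection done in the patch. */
bool uclamp_min_fits_patch(const struct cpu_model *c, unsigned long uclamp_min)
{
	unsigned long capacity_orig = c->capacity_inverted;
	unsigned long capacity_orig_thermal;

	if (capacity_orig) {
		/* Inverted: the inverted capacity is used for both values. */
		capacity_orig_thermal = capacity_orig;
	} else {
		capacity_orig = c->capacity_orig;
		capacity_orig_thermal = capacity_orig - c->thermal;
	}

	/* Biggest CPUs always fit unless they are in capacity inversion. */
	if (capacity_orig == SCHED_CAPACITY_SCALE)
		return true;

	return uclamp_min <= capacity_orig_thermal;
}

/* Alternative being discussed: always subtract thermal pressure. */
bool uclamp_min_fits_unconditional(const struct cpu_model *c,
				   unsigned long uclamp_min)
{
	return uclamp_min <= c->capacity_orig - c->thermal;
}
```

With a big CPU at capacity_orig = 1024, thermal = 10 and no inversion, a task
with uclamp_min = 1024 fits under the first function but not under the second.
Once capacity_inverted is set to a non-zero value below 1024, the first
function stops force fitting the task on that CPU.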

Originally I wanted to defer handling thermal pressure into a different series.
But Vincent thought it's better to handle it now. We want more data points from
more systems tbh. But I think what we have now is still a good improvement over
what we had before.

Lukasz had a patch [2] which could allow making thermal_load_avg() more
acceptable for systems that care about faster response times.

[2] https://lore.kernel.org/lkml/20220429091245.12423-1-lukasz.luba@arm.com/


Thanks

--
Qais Yousef