linux-kernel - Re: [PATCH 1/2] sched/schedutil: rework performance estimation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtC77NKhz2QKwM68kS8Z9CsB8y588VER8aJeBLRwpdGALA@mail.gmail.com>
Date:   Fri, 20 Oct 2023 15:58:55 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Dietmar Eggemann <dietmar.eggemann@....com>
Cc:     mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, rafael@...nel.org,
        viresh.kumar@...aro.org, qyousef@...alina.io,
        linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org,
        lukasz.luba@....com
Subject: Re: [PATCH 1/2] sched/schedutil: rework performance estimation

On Fri, 20 Oct 2023 at 11:48, Dietmar Eggemann <dietmar.eggemann@....com> wrote:
>
> On 13/10/2023 17:14, Vincent Guittot wrote:
> > The current method to take into account uclamp hints when estimating the
> > target frequency can end into situation where the selected target
> > frequency is finally higher than uclamp hints whereas there are no real
> > needs. Such cases mainly happen because we are currently mixing the
> > traditional scheduler utilization signal with the uclamp performance
> > hints. By adding these 2 metrics, we loose an important information when
> > it comes to select the target frequency and we have to make some
> > assumptions which can't fit all cases.
> >
> > Rework the interface between the scheduler and schedutil governor in order
> > to propagate all information down to the cpufreq governor.
>
> So we change from:
>
> max(util -> uclamp, iowait_boost -> uclamp) -> head_room()
>
> to:
>
> util = max(util, iowait_boost) -> util =
>                                   head_room(util)
>
> _min = max(irq + cpu_bw_dl,
>            uclamp_min)         ->                  -> max(_min, _max)
>
> _max = min(scale, uclamp_max)  -> _max =
>                                   min(util, _max)
>
> > effective_cpu_util() interface changes and now returns the actual
> > utilization of the CPU with 2 optional inputs:
> > - The minimum performance for this CPU; typically the capacity to handle
> >   the deadline task and the interrupt pressure. But also uclamp_min
> >   request when available.
> > - The maximum targeting performance for this CPU which reflects the
> >   maximum level that we would like to not exceed. By default it will be
> >   the CPU capacity but can be reduced because of some performance hints
> >   set with uclamp. The value can be lower than actual utilization and/or
> >   min performance level.
> >
> > A new sugov_effective_cpu_perf() interface is also available to compute
> > the final performance level that is targeted for the CPU after applying
> > some cpufreq headroom and taking into account all inputs.
> >
> > With these 2 functions, schedutil is now able to decide when it must go
> > above uclamp hints. It now also have a generic way to get the min
> > perfromance level.
> >
> > The dependency between energy model and cpufreq governor and its headroom
> > policy doesn't exist anymore.
>
> But the dependency that both are doing the same thing still exists, right?

For the energy model itself, it is now fully removed; only EAS still
has to estimate which perf level will be selected by schedutil but it
uses now a schedutil function without having to care about headroom
and cpufreq governor policy

>
> sugov_get_util() and eenv_pd_max_util() are calling the same functions:
>
>   util = effective_cpu_util(cpu, util, &min, &max)
>
>   /* ioboost, bw_min = head_room(min) resp. uclamp tsk handling */
>
>   util = sugov_effective_cpu_perf(cpu, util, min, max)
>
> [...]
>
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index a3f9cd52eec5..78228abd1219 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7381,18 +7381,13 @@ int sched_core_idle_cpu(int cpu)
> >   * required to meet deadlines.
> >   */
> >  unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > -                              enum cpu_util_type type,
> > -                              struct task_struct *p)
> > +                              unsigned long *min,
> > +                              unsigned long *max)
>
> FREQUENCY_UTIL relates to *min != NULL and *max != NULL
>
> ENERGY_UTIL relates to *min == NULL and *max == NULL
>
> so both must be either NULL or !NULL.
>
> Calling it with one equa NULL and the other with !NULL should be
> undefined, right?

At now there is no user but one could consider only asking for min or
max. So I would not say undefined but unused

>
> [...]
>
> > @@ -7400,45 +7395,36 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> >        * update_irq_load_avg().
> >        */
> >       irq = cpu_util_irq(rq);
> > -     if (unlikely(irq >= max))
> > -             return max;
> > +     if (unlikely(irq >= scale)) {
> > +             if (min)
> > +                     *min = scale;
> > +             if (max)
> > +                     *max = scale;
> > +             return scale;
> > +     }
> > +
> > +     /* The minimum utilization returns the highest level between:
> > +      * - the computed DL bandwidth needed with the irq pressure which
> > +      *   steals time to the deadline task.
> > +      * - The minimum bandwidth requirement for CFS.
>
> rq UCLAMP_MIN can also be driven by RT, not only CFS.

yes

>
> > +      */
> > +     if (min)
> > +             *min = max(irq + cpu_bw_dl(rq), uclamp_rq_get(rq, UCLAMP_MIN));
> >
> >       /*
> >        * Because the time spend on RT/DL tasks is visible as 'lost' time to
> >        * CFS tasks and we use the same metric to track the effective
> >        * utilization (PELT windows are synchronized) we can directly add them
> >        * to obtain the CPU's actual utilization.
> > -      *
> > -      * CFS and RT utilization can be boosted or capped, depending on
> > -      * utilization clamp constraints requested by currently RUNNABLE
> > -      * tasks.
> > -      * When there are no CFS RUNNABLE tasks, clamps are released and
> > -      * frequency will be gracefully reduced with the utilization decay.
> >        */
> >       util = util_cfs + cpu_util_rt(rq);
> > -     if (type == FREQUENCY_UTIL)
> > -             util = uclamp_rq_util_with(rq, util, p);
> > -
> > -     dl_util = cpu_util_dl(rq);
> > -
> > -     /*
> > -      * For frequency selection we do not make cpu_util_dl() a permanent part
> > -      * of this sum because we want to use cpu_bw_dl() later on, but we need
> > -      * to check if the CFS+RT+DL sum is saturated (ie. no idle time) such
> > -      * that we select f_max when there is no idle time.
> > -      *
> > -      * NOTE: numerical errors or stop class might cause us to not quite hit
> > -      * saturation when we should -- something for later.
> > -      */
> > -     if (util + dl_util >= max)
> > -             return max;
> > +     util += cpu_util_dl(rq);
> >
> > -     /*
> > -      * OTOH, for energy computation we need the estimated running time, so
> > -      * include util_dl and ignore dl_bw.
> > -      */
> > -     if (type == ENERGY_UTIL)
> > -             util += dl_util;
> > +     if (util >= scale) {
> > +             if (max)
> > +                     *max = scale;
>
> But that means that ucamp_max cannot constrain a system in which the
> 'util > ucamp_max'. I guess that's related to you saying uclamp_min is a
> hard req and uclamp_max is a soft req. I don't think that's in sync with
> the rest of the uclamp_max implantation.

That's a mistake, I made a shortcut here. I wanted to save the
scale_irq_capacity() step but forgot to update max 1st.

Will fix it


>
> > +             return scale;
> > +     }
> >
> >       /*
> >        * There is still idle time; further improve the number by using the
> > @@ -7449,28 +7435,21 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> >        *   U' = irq + --------- * U
> >        *                 max
> >        */
> > -     util = scale_irq_capacity(util, irq, max);
> > +     util = scale_irq_capacity(util, irq, scale);
> >       util += irq;
> >
> > -     /*
> > -      * Bandwidth required by DEADLINE must always be granted while, for
> > -      * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
> > -      * to gracefully reduce the frequency when no tasks show up for longer
> > -      * periods of time.
> > -      *
> > -      * Ideally we would like to set bw_dl as min/guaranteed freq and util +
> > -      * bw_dl as requested freq. However, cpufreq is not yet ready for such
> > -      * an interface. So, we only do the latter for now.
> > +     /* The maximum hint is a soft bandwidth requirement which can be lower
> > +      * than the actual utilization because of max uclamp requirments
> >        */
> > -     if (type == FREQUENCY_UTIL)
> > -             util += cpu_bw_dl(rq);
> > +     if (max)
> > +             *max = min(scale, uclamp_rq_get(rq, UCLAMP_MAX));
> >
> > -     return min(max, util);
> > +     return min(scale, util);
> >  }
>
> effective_cpu_util for FREQUENCY_UTIL (i.e. (*min != NULL && *max !=
> NULL)) is slightly different.
>
>   missing:
>
>   if (!uclamp_is_used() && rt_rq_is_runnable(&rq->rt)
>     return max
>
>   probably moved into sugov_effective_cpu_perf() (which is only called
>   for `FREQUENCY_UTIL`) ?

yes, it's in sugov_effective_cpu_perf()

>
>
>   old:
>
>   irq_cap_scaling(util_cfs, util_rt) + irq + cpu_bw_dl()
>                                              ^^^^^^^^^^^
>
>   new:
>
>   irq_cap_scaling(util_cfs + util_rt + util_dl) + irq
>                                        ^^^^^^^

yes, cpu_bw_dl() input moved in the min value instead

>
> [...]
>
> > +unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
> > +                              unsigned long min,
> > +                              unsigned long max)
> > +{
> > +     unsigned long target;
> > +     struct rq *rq = cpu_rq(cpu);
> > +
> > +     if (rt_rq_is_runnable(&rq->rt))
> > +             return max;
> > +
> > +     /* Provide at least enough capacity for DL + irq */
> > +     target =  min;
> > +
> > +     actual = map_util_perf(actual);
> > +     /* Actually we don't need to target the max performance */
> > +     if (actual < max)
> > +             max = actual;
> > +
> > +     /*
> > +      * Ensure at least minimum performance while providing more compute
> > +      * capacity when possible.
> > +      */
> > +     return max(target, max);
>
> Can you not just use:
>
>        return max(min, max)
>
> and skip target?

yes. I had an intermediate value that I removed in the final version
but forgot to remove the intermediate variable

>
> > +}
> > +
> >  static void sugov_get_util(struct sugov_cpu *sg_cpu)
> >  {
> > -     unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu);
> > -     struct rq *rq = cpu_rq(sg_cpu->cpu);
> > +     unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> >
> > -     sg_cpu->bw_dl = cpu_bw_dl(rq);
> > -     sg_cpu->util = effective_cpu_util(sg_cpu->cpu, util,
> > -                                       FREQUENCY_UTIL, NULL);
> > +     util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
> > +     sg_cpu->bw_min = map_util_perf(min);
> > +     sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> >  }
> >
> >  /**
> > @@ -306,7 +329,7 @@ static inline bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu) { return false; }
> >   */
> >  static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu)
> >  {
> > -     if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
> > +     if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_min)
>
> bw_min is more than DL right?

yes

Interruptions are preempting DL so we should include them
And now that we can take into account uclamp_min, use it when
computing the min perf parameter of cpufreq_driver_adjust_perf()


>
> bw_min = head_room(max(irq + cpu_bw_dl, rq's UCLAMP_MIN)
>
> [...]