linux-kernel - Re: [PATCH v2] sched: Consolidate cpufreq updates

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtDvBAFauUfyWZhYRUz6f42iMAJcwcdDDQh+V8+QfDwc2Q@mail.gmail.com>
Date: Tue, 7 May 2024 10:58:42 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Qais Yousef <qyousef@...alina.io>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>, Viresh Kumar <viresh.kumar@...aro.org>, 
	Ingo Molnar <mingo@...nel.org>, Peter Zijlstra <peterz@...radead.org>, 
	Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall <bsegall@...gle.com>, 
	Mel Gorman <mgorman@...e.de>, Daniel Bristot de Oliveira <bristot@...hat.com>, 
	Valentin Schneider <vschneid@...hat.com>, Christian Loehle <christian.loehle@....com>, linux-pm@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched: Consolidate cpufreq updates

On Mon, 6 May 2024 at 01:31, Qais Yousef <qyousef@...alina.io> wrote:
>
> Improve the interaction with cpufreq governors by making the
> cpufreq_update_util() calls more intentional.
>
> At the moment we send them when load is updated for CFS, bandwidth for
> DL and at enqueue/dequeue for RT. But this can lead to too many updates
> sent in a short period of time and potentially be ignored at a critical
> moment due to the rate_limit_us in schedutil.
>
> For example, simultaneous task enqueue on the CPU where 2nd task is
> bigger and requires higher freq. The trigger to cpufreq_update_util() by
> the first task will lead to dropping the 2nd request until tick. Or
> another CPU in the same policy triggers a freq update shortly after.
>
> Updates at enqueue for RT are not strictly required. Though they do help
> to reduce the delay for switching the frequency and the potential
> observation of lower frequency during this delay. But current logic
> doesn't intentionally (at least to my understanding) try to speed up the
> request.
>
> To help reduce the amount of cpufreq updates and make them more
> purposeful, consolidate them into these locations:
>
> 1. context_switch()

I don't see any cpufreq update when switching from idle to CFS. We
have to wait for the next tick to get a freq update whatever the value
of util_est and uclamp

> 2. task_tick_fair()

Updating only during tick is ok with a tick at 1000hz/1000us when we
compare it with the1048us slice of pelt but what about 4ms or even
10ms tick ? we can have an increase of almost 200 in 10ms

> 3. {attach, detach}_entity_load_avg()

At enqueue/dequeue, the util_est will be updated and can make cpu
utilization quite different especially with long sleeping tasks. The
same applies for uclamp_min/max hints of a newly enqueued task. We
might end up waiting 4/10ms depending of the tick period.

> 4. update_blocked_averages()
>
> The update at context switch should help guarantee that DL and RT get
> the right frequency straightaway when they're RUNNING. As mentioned
> though the update will happen slightly after enqueue_task(); though in
> an ideal world these tasks should be RUNNING ASAP and this additional
> delay should be negligible. For fair tasks we need to make sure we send
> a single update for every decay for the root cfs_rq. Any changes to the
> rq will be deferred until the next task is ready to run, or we hit TICK.
> But we are guaranteed the task is running at a level that meets its
> requirements after enqueue.
>
> To guarantee RT and DL tasks updates are never missed, we add a new
> SCHED_CPUFREQ_FORCE_UPDATE to ignore the rate_limit_us. If we are
> already running at the right freq, the governor will end up doing
> nothing, but we eliminate the risk of the task ending up accidentally
> running at the wrong freq due to rate_limit_us.
>
> Similarly for iowait boost, we ignore rate limits. We also handle a case
> of a boost reset prematurely by adding a guard in sugov_iowait_apply()
> to reduce the boost after 1ms which seems iowait boost mechanism relied
> on rate_limit_us and cfs_rq.decay preventing any updates to happen soon
> after iowait boost.
>
> The new SCHED_CPUFREQ_FORCE_UPDATE should not impact the rate limit
> time stamps otherwise we can end up delaying updates for normal
> requests.
>
> As a simple optimization, we avoid sending cpufreq updates when
> switching from RT to another RT as RT tasks run at max freq by default.
> If CONFIG_UCLAMP_TASK is enabled, we can do a simple check to see if
> uclamp_min is different to avoid unnecessary cpufreq update as most RT
> tasks are likely to be running at the same performance level, so we can
> avoid unnecessary overhead of forced updates when there's nothing to do.
>
> We also also ensure to ignore cpufreq udpates for sugov workers at
> context switch. It doesn't make sense for the kworker that applies the
> frequency update (which is a DL task) to trigger a frequency update
> itself.
>
> The update at task_tick_fair will guarantee that the governor will
> follow any updates to load for tasks/CPU or due to new enqueues/dequeues
> to the rq. Since DL and RT always run at constant frequencies and have
> no load tracking, this is only required for fair tasks.
>
> The update at attach/detach_entity_load_avg() will ensure we adapt to
> big changes when tasks are added/removed from cgroups.
>
> The update at update_blocked_averages() will ensure we decay frequency
> as the CPU becomes idle for long enough.
>
> Results of
>
>         taskset 1 perf stat --repeat 10 -e cycles,instructions,task-clock perf bench sched pipe
>
> on AMD 3900X to verify any potential overhead because of the addition at
> context switch against v6.8.7 stable kernel
>
> v6.8.7: schedutil:
> ------------------
>
>  Performance counter stats for 'perf bench sched pipe' (10 runs):
>
>        850,276,689      cycles:u                  #    0.078 GHz                      ( +-  0.88% )
>         82,724,245      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
>          10,881.41 msec task-clock:u              #    0.995 CPUs utilized            ( +-  0.12% )
>
>            10.9377 +- 0.0135 seconds time elapsed  ( +-  0.12% )
>
> v6.8.7: performance:
> --------------------
>
>  Performance counter stats for 'perf bench sched pipe' (10 runs):
>
>        874,154,415      cycles:u                  #    0.080 GHz                      ( +-  0.78% )
>         82,724,420      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
>          10,916.47 msec task-clock:u              #    0.999 CPUs utilized            ( +-  0.09% )
>
>            10.9308 +- 0.0100 seconds time elapsed  ( +-  0.09% )
>
> v6.8.7+patch: schedutil:
> ------------------------
>
>  Performance counter stats for 'perf bench sched pipe' (10 runs):
>
>        816,938,281      cycles:u                  #    0.075 GHz                      ( +-  0.84% )
>         82,724,163      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
>          10,907.62 msec task-clock:u              #    1.004 CPUs utilized            ( +-  0.11% )
>
>            10.8627 +- 0.0121 seconds time elapsed  ( +-  0.11% )
>
> v6.8.7+patch: performance:
> --------------------------
>
>  Performance counter stats for 'perf bench sched pipe' (10 runs):
>
>        814,038,416      cycles:u                  #    0.074 GHz                      ( +-  1.21% )
>         82,724,356      instructions:u            #    0.10  insn per cycle           ( +-  0.00% )
>          10,886.69 msec task-clock:u              #    0.996 CPUs utilized            ( +-  0.17% )
>
>            10.9298 +- 0.0181 seconds time elapsed  ( +-  0.17% )
>
> Note worthy that we still have the following race condition on systems
> that have shared policy:
>
> * CPUs with shared policy can end up sending simultaneous cpufreq
>   updates requests where the 2nd one will be unlucky and get blocked by
>   the rate_limit_us (schedutil).
>
> We can potentially address this limitation later, but it is out of the
> scope of this patch.
>
> Signed-off-by: Qais Yousef <qyousef@...alina.io>
> ---
>
> Changes since v1:
>
>         * Use taskset and measure with performance governor as Ingo suggested
>         * Remove the static key as I found out we always register a function
>           for cpu_dbs in cpufreq_governor.c; and as Christian pointed out it
>           trigger a lock debug warning.
>         * Improve detection of sugov workers by using SCHED_FLAG_SUGOV
>         * Guard against NSEC_PER_MSEC instead of TICK_USEC to avoid prematurely
>           reducing iowait boost as the latter was a NOP and like
>           sugov_iowait_reset() like Christian pointed out.
>
> v1 discussion: https://lore.kernel.org/all/20240324020139.1032473-1-qyousef@layalina.io/
>
>  include/linux/sched/cpufreq.h    |  3 +-
>  kernel/sched/core.c              | 68 +++++++++++++++++++++++++++++++-
>  kernel/sched/cpufreq_schedutil.c | 55 +++++++++++++++++++-------
>  kernel/sched/deadline.c          |  4 --
>  kernel/sched/fair.c              | 53 ++++---------------------
>  kernel/sched/rt.c                |  8 +---
>  kernel/sched/sched.h             |  5 +++
>  7 files changed, 122 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
> index bdd31ab93bc5..2d0a45aba16f 100644
> --- a/include/linux/sched/cpufreq.h
> +++ b/include/linux/sched/cpufreq.h
> @@ -8,7 +8,8 @@
>   * Interface between cpufreq drivers and the scheduler:
>   */
>
> -#define SCHED_CPUFREQ_IOWAIT   (1U << 0)
> +#define SCHED_CPUFREQ_IOWAIT           (1U << 0)
> +#define SCHED_CPUFREQ_FORCE_UPDATE     (1U << 1) /* ignore transition_delay_us */
>
>  #ifdef CONFIG_CPU_FREQ
>  struct cpufreq_policy;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1a914388144a..e6fe7dbd1f89 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5134,6 +5134,65 @@ static inline void balance_callbacks(struct rq *rq, struct balance_callback *hea
>
>  #endif
>
> +static inline void update_cpufreq_ctx_switch(struct rq *rq, struct task_struct *prev)
> +{
> +#ifdef CONFIG_CPU_FREQ
> +       unsigned int flags = 0;
> +
> +#ifdef CONFIG_SMP
> +       if (unlikely(current->sched_class == &stop_sched_class))
> +               return;
> +#endif
> +
> +       if (unlikely(current->sched_class == &idle_sched_class))
> +               return;
> +
> +       if (unlikely(task_has_idle_policy(current)))
> +               return;
> +
> +       if (likely(fair_policy(current->policy))) {
> +
> +               if (unlikely(current->in_iowait)) {
> +                       flags |= SCHED_CPUFREQ_IOWAIT | SCHED_CPUFREQ_FORCE_UPDATE;
> +                       goto force_update;
> +               }
> +
> +#ifdef CONFIG_SMP
> +               /*
> +                * Allow cpufreq updates once for every update_load_avg() decay.
> +                */
> +               if (unlikely(rq->cfs.decayed)) {
> +                       rq->cfs.decayed = false;
> +                       goto force_update;
> +               }
> +#endif
> +               return;
> +       }
> +
> +       /*
> +        * RT and DL should always send a freq update. But we can do some
> +        * simple checks to avoid it when we know it's not necessary.
> +        */
> +       if (rt_task(current) && rt_task(prev)) {
> +#ifdef CONFIG_UCLAMP_TASK
> +               unsigned long curr_uclamp_min = uclamp_eff_value(current, UCLAMP_MIN);
> +               unsigned long prev_uclamp_min = uclamp_eff_value(prev, UCLAMP_MIN);
> +
> +               if (curr_uclamp_min == prev_uclamp_min)
> +#endif
> +                       return;
> +       } else if (dl_task(current) && current->dl.flags & SCHED_FLAG_SUGOV) {
> +               /* Ignore sugov kthreads, they're responding to our requests */
> +               return;
> +       }
> +
> +       flags |= SCHED_CPUFREQ_FORCE_UPDATE;
> +
> +force_update:
> +       cpufreq_update_util(rq, flags);
> +#endif
> +}
> +
>  static inline void
>  prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
>  {
> @@ -5151,7 +5210,7 @@ prepare_lock_switch(struct rq *rq, struct task_struct *next, struct rq_flags *rf
>  #endif
>  }
>
> -static inline void finish_lock_switch(struct rq *rq)
> +static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
>  {
>         /*
>          * If we are tracking spinlock dependencies then we have to
> @@ -5160,6 +5219,11 @@ static inline void finish_lock_switch(struct rq *rq)
>          */
>         spin_acquire(&__rq_lockp(rq)->dep_map, 0, 0, _THIS_IP_);
>         __balance_callbacks(rq);
> +       /*
> +        * Request freq update after __balance_callbacks to take into account
> +        * any changes to rq.
> +        */
> +       update_cpufreq_ctx_switch(rq, prev);
>         raw_spin_rq_unlock_irq(rq);
>  }
>
> @@ -5278,7 +5342,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>         perf_event_task_sched_in(prev, current);
>         finish_task(prev);
>         tick_nohz_task_switch();
> -       finish_lock_switch(rq);
> +       finish_lock_switch(rq, prev);
>         finish_arch_post_lock_switch();
>         kcov_finish_switch(current);
>         /*
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eece6244f9d2..e8b65b75e7f3 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -59,7 +59,8 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>
>  /************************ Governor internals ***********************/
>
> -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
> +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time,
> +                                    unsigned int flags)
>  {
>         s64 delta_ns;
>
> @@ -87,13 +88,16 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
>                 return true;
>         }
>
> +       if (unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> +               return true;
> +
>         delta_ns = time - sg_policy->last_freq_update_time;
>
>         return delta_ns >= sg_policy->freq_update_delay_ns;
>  }
>
>  static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
> -                                  unsigned int next_freq)
> +                                  unsigned int next_freq, unsigned int flags)
>  {
>         if (sg_policy->need_freq_update)
>                 sg_policy->need_freq_update = cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
> @@ -101,7 +105,9 @@ static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
>                 return false;
>
>         sg_policy->next_freq = next_freq;
> -       sg_policy->last_freq_update_time = time;
> +
> +       if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> +               sg_policy->last_freq_update_time = time;
>
>         return true;
>  }
> @@ -249,9 +255,10 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
>                                unsigned int flags)
>  {
>         bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> +       bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
>
>         /* Reset boost if the CPU appears to have been idle enough */
> -       if (sg_cpu->iowait_boost &&
> +       if (sg_cpu->iowait_boost && !forced_update &&
>             sugov_iowait_reset(sg_cpu, time, set_iowait_boost))
>                 return;
>
> @@ -294,17 +301,34 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
>   * being more conservative on tasks which does sporadic IO operations.
>   */
>  static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
> -                              unsigned long max_cap)
> +                              unsigned long max_cap, unsigned int flags)
>  {
> +       bool forced_update = flags & SCHED_CPUFREQ_FORCE_UPDATE;
> +       s64 delta_ns = time - sg_cpu->last_update;
> +
>         /* No boost currently required */
>         if (!sg_cpu->iowait_boost)
>                 return 0;
>
> +       if (forced_update)
> +               goto apply_boost;
> +
>         /* Reset boost if the CPU appears to have been idle enough */
>         if (sugov_iowait_reset(sg_cpu, time, false))
>                 return 0;
>
>         if (!sg_cpu->iowait_boost_pending) {
> +               /*
> +                * This logic relied on PELT signal decays happening once every
> +                * 1ms. But due to changes to how updates are done now, we can
> +                * end up with more request coming up leading to iowait boost
> +                * to be prematurely reduced. Make the assumption explicit
> +                * until we improve the iowait boost logic to be better in
> +                * general as it is due for an overhaul.
> +                */
> +               if (delta_ns <= NSEC_PER_MSEC)
> +                       goto apply_boost;
> +
>                 /*
>                  * No boost pending; reduce the boost value.
>                  */
> @@ -315,6 +339,7 @@ static unsigned long sugov_iowait_apply(struct sugov_cpu *sg_cpu, u64 time,
>                 }
>         }
>
> +apply_boost:
>         sg_cpu->iowait_boost_pending = false;
>
>         /*
> @@ -358,10 +383,10 @@ static inline bool sugov_update_single_common(struct sugov_cpu *sg_cpu,
>
>         ignore_dl_rate_limit(sg_cpu);
>
> -       if (!sugov_should_update_freq(sg_cpu->sg_policy, time))
> +       if (!sugov_should_update_freq(sg_cpu->sg_policy, time, flags))
>                 return false;
>
> -       boost = sugov_iowait_apply(sg_cpu, time, max_cap);
> +       boost = sugov_iowait_apply(sg_cpu, time, max_cap, flags);
>         sugov_get_util(sg_cpu, boost);
>
>         return true;
> @@ -397,7 +422,7 @@ static void sugov_update_single_freq(struct update_util_data *hook, u64 time,
>                 sg_policy->cached_raw_freq = cached_freq;
>         }
>
> -       if (!sugov_update_next_freq(sg_policy, time, next_f))
> +       if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
>                 return;
>
>         /*
> @@ -449,10 +474,12 @@ static void sugov_update_single_perf(struct update_util_data *hook, u64 time,
>         cpufreq_driver_adjust_perf(sg_cpu->cpu, sg_cpu->bw_min,
>                                    sg_cpu->util, max_cap);
>
> -       sg_cpu->sg_policy->last_freq_update_time = time;
> +       if (!unlikely(flags & SCHED_CPUFREQ_FORCE_UPDATE))
> +               sg_cpu->sg_policy->last_freq_update_time = time;
>  }
>
> -static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
> +static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time,
> +                                          unsigned int flags)
>  {
>         struct sugov_policy *sg_policy = sg_cpu->sg_policy;
>         struct cpufreq_policy *policy = sg_policy->policy;
> @@ -465,7 +492,7 @@ static unsigned int sugov_next_freq_shared(struct sugov_cpu *sg_cpu, u64 time)
>                 struct sugov_cpu *j_sg_cpu = &per_cpu(sugov_cpu, j);
>                 unsigned long boost;
>
> -               boost = sugov_iowait_apply(j_sg_cpu, time, max_cap);
> +               boost = sugov_iowait_apply(j_sg_cpu, time, max_cap, flags);
>                 sugov_get_util(j_sg_cpu, boost);
>
>                 util = max(j_sg_cpu->util, util);
> @@ -488,10 +515,10 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags)
>
>         ignore_dl_rate_limit(sg_cpu);
>
> -       if (sugov_should_update_freq(sg_policy, time)) {
> -               next_f = sugov_next_freq_shared(sg_cpu, time);
> +       if (sugov_should_update_freq(sg_policy, time, flags)) {
> +               next_f = sugov_next_freq_shared(sg_cpu, time, flags);
>
> -               if (!sugov_update_next_freq(sg_policy, time, next_f))
> +               if (!sugov_update_next_freq(sg_policy, time, next_f, flags))
>                         goto unlock;
>
>                 if (sg_policy->policy->fast_switch_enabled)
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index a04a436af8cc..02c9c2488091 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -252,8 +252,6 @@ void __add_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
>         dl_rq->running_bw += dl_bw;
>         SCHED_WARN_ON(dl_rq->running_bw < old); /* overflow */
>         SCHED_WARN_ON(dl_rq->running_bw > dl_rq->this_bw);
> -       /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> -       cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
>  }
>
>  static inline
> @@ -266,8 +264,6 @@ void __sub_running_bw(u64 dl_bw, struct dl_rq *dl_rq)
>         SCHED_WARN_ON(dl_rq->running_bw > old); /* underflow */
>         if (dl_rq->running_bw > old)
>                 dl_rq->running_bw = 0;
> -       /* kick cpufreq (see the comment in kernel/sched/sched.h). */
> -       cpufreq_update_util(rq_of_dl_rq(dl_rq), 0);
>  }
>
>  static inline
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9eb63573110c..cbe79c8ac2ed 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3982,29 +3982,6 @@ static inline void update_cfs_group(struct sched_entity *se)
>  }
>  #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> -{
> -       struct rq *rq = rq_of(cfs_rq);
> -
> -       if (&rq->cfs == cfs_rq) {
> -               /*
> -                * There are a few boundary cases this might miss but it should
> -                * get called often enough that that should (hopefully) not be
> -                * a real problem.
> -                *
> -                * It will not get called when we go idle, because the idle
> -                * thread is a different class (!fair), nor will the utilization
> -                * number include things like RT tasks.
> -                *
> -                * As is, the util number is not freq-invariant (we'd have to
> -                * implement arch_scale_freq_capacity() for that).
> -                *
> -                * See cpu_util_cfs().
> -                */
> -               cpufreq_update_util(rq, flags);
> -       }
> -}
> -
>  #ifdef CONFIG_SMP
>  static inline bool load_avg_is_decayed(struct sched_avg *sa)
>  {
> @@ -4682,7 +4659,7 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
>         add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
>
> -       cfs_rq_util_change(cfs_rq, 0);
> +       cpufreq_update_util(rq_of(cfs_rq), 0);
>
>         trace_pelt_cfs_tp(cfs_rq);
>  }
> @@ -4712,7 +4689,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
>         add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
>
> -       cfs_rq_util_change(cfs_rq, 0);
> +       cpufreq_update_util(rq_of(cfs_rq), 0);
>
>         trace_pelt_cfs_tp(cfs_rq);
>  }
> @@ -4729,7 +4706,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  {
>         u64 now = cfs_rq_clock_pelt(cfs_rq);
> -       int decayed;
>
>         /*
>          * Track task load average for carrying it to new CPU after migrated, and
> @@ -4738,8 +4714,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>         if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
>                 __update_load_avg_se(now, cfs_rq, se);
>
> -       decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> -       decayed |= propagate_entity_load_avg(se);
> +       cfs_rq->decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> +       cfs_rq->decayed |= propagate_entity_load_avg(se);
>
>         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
>
> @@ -4760,11 +4736,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>                  */
>                 detach_entity_load_avg(cfs_rq, se);
>                 update_tg_load_avg(cfs_rq);
> -       } else if (decayed) {
> -               cfs_rq_util_change(cfs_rq, 0);
> -
> -               if (flags & UPDATE_TG)
> -                       update_tg_load_avg(cfs_rq);
> +       } else if (cfs_rq->decayed && (flags & UPDATE_TG)) {
> +               update_tg_load_avg(cfs_rq);
>         }
>  }
>
> @@ -5139,7 +5112,6 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>
>  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
>  {
> -       cfs_rq_util_change(cfs_rq, 0);
>  }
>
>  static inline void remove_entity_load_avg(struct sched_entity *se) {}
> @@ -6754,14 +6726,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>          */
>         util_est_enqueue(&rq->cfs, p);
>
> -       /*
> -        * If in_iowait is set, the code below may not trigger any cpufreq
> -        * utilization updates, so do it here explicitly with the IOWAIT flag
> -        * passed.
> -        */
> -       if (p->in_iowait)
> -               cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -
>         for_each_sched_entity(se) {
>                 if (se->on_rq)
>                         break;
> @@ -9351,10 +9315,6 @@ static bool __update_blocked_others(struct rq *rq, bool *done)
>         unsigned long hw_pressure;
>         bool decayed;
>
> -       /*
> -        * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
> -        * DL and IRQ signals have been updated before updating CFS.
> -        */
>         curr_class = rq->curr->sched_class;
>
>         hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> @@ -12685,6 +12645,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>
>         update_misfit_status(curr, rq);
>         check_update_overutilized_status(task_rq(curr));
> +       cpufreq_update_util(rq, 0);
>
>         task_tick_core(rq, curr);
>  }
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 3261b067b67e..fe6d8b0ffa95 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -556,11 +556,8 @@ static void sched_rt_rq_dequeue(struct rt_rq *rt_rq)
>
>         rt_se = rt_rq->tg->rt_se[cpu];
>
> -       if (!rt_se) {
> +       if (!rt_se)
>                 dequeue_top_rt_rq(rt_rq, rt_rq->rt_nr_running);
> -               /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> -               cpufreq_update_util(rq_of_rt_rq(rt_rq), 0);
> -       }
>         else if (on_rt_rq(rt_se))
>                 dequeue_rt_entity(rt_se, 0);
>  }
> @@ -1065,9 +1062,6 @@ enqueue_top_rt_rq(struct rt_rq *rt_rq)
>                 add_nr_running(rq, rt_rq->rt_nr_running);
>                 rt_rq->rt_queued = 1;
>         }
> -
> -       /* Kick cpufreq (see the comment in kernel/sched/sched.h). */
> -       cpufreq_update_util(rq, 0);
>  }
>
>  #if defined CONFIG_SMP
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index cb3792c04eea..86cec2145221 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -632,6 +632,11 @@ struct cfs_rq {
>                 unsigned long   runnable_avg;
>         } removed;
>
> +       /*
> +        * Store whether last update_load_avg() has decayed
> +        */
> +       bool                    decayed;
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>         u64                     last_update_tg_load_avg;
>         unsigned long           tg_load_avg_contrib;
> --
> 2.34.1
>