Message-ID: <CAB8ipk9LpbiUDnbcV6+59+Sa=Ai7tFzO===mpLD3obNdV4=J-A@mail.gmail.com>
Date: Thu, 6 Mar 2025 19:38:32 +0800
From: Xuewen Yan <xuewen.yan94@...il.com>
To: Hongyan Xia <hongyan.xia2@....com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
Morten Rasmussen <morten.rasmussen@....com>, Lukasz Luba <lukasz.luba@....com>,
Christian Loehle <christian.loehle@....com>, Pierre Gondois <pierre.gondois@....com>,
linux-kernel@...r.kernel.org, Xuewen Yan <xuewen.yan@...soc.com>
Subject: Re: [PATCH v2 0/8] uclamp sum aggregation
On Thu, Mar 6, 2025 at 7:32 PM Hongyan Xia <hongyan.xia2@....com> wrote:
>
> Hi Xuewen,
>
> On 06/03/2025 11:12, Xuewen Yan wrote:
> > Hi Hongyan,
> >
> > On Tue, Mar 4, 2025 at 10:26 PM Hongyan Xia <hongyan.xia2@....com> wrote:
> >>
> >> This series gives an alternative implementation that addresses some of
> >> the problems in uclamp max aggregation. Sum aggregation mostly gives:
> >>
> >> 1. Simplicity. Sum aggregation implements uclamp with less than half
> >> the code of max aggregation.
> >> 2. Effectiveness. Sum aggregation shows better uclamp effectiveness,
> >> in benchmark scores and sometimes in energy efficiency.
> >>
> >> The key idea of sum aggregation is fairly simple. Each task has a
> >> util_avg_bias, which is obtained by:
> >>
> >> util_avg_bias = clamp(util_avg, uclamp_min, uclamp_max) - util_avg;
> >>
> >> If a CPU has N tasks, p1, p2, p3... pN, then we sum the biases up and
> >> obtain a rq total bias:
> >>
> >> rq_bias = util_avg_bias1 + util_avg_bias2... + util_avg_biasN;
> >>
> >> Then we use the biased rq utilization rq_util + rq_bias to select OPP
> >> and to schedule tasks.
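> >>
> >> As a standalone illustration of this arithmetic (hypothetical names,
> >> not the kernel implementation), consider two boosted small tasks and
> >> one capped big task on the same rq:
> >>
> >>   #include <stdio.h>
> >>
> >>   /* Clamp util into [umin, umax] and return the resulting bias. */
> >>   static long util_bias(long util, long umin, long umax)
> >>   {
> >>           long clamped = util < umin ? umin : (util > umax ? umax : util);
> >>
> >>           return clamped - util; /* negative for a capped task */
> >>   }
> >>
> >>   int main(void)
> >>   {
> >>           long rq_util = 100 + 150 + 800;
> >>           long rq_bias = util_bias(100, 200, 1024) + /* +100 */
> >>                          util_bias(150, 200, 1024) + /*  +50 */
> >>                          util_bias(800,   0,  300);  /* -500 */
> >>
> >>           /* rq_util + rq_bias = 700 drives OPP selection. */
> >>           printf("biased rq utilization: %ld\n", rq_util + rq_bias);
> >>           return 0;
> >>   }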
> >>
> >> PATCH BREAKDOWN:
> >>
> >> Patch 1 reverts a patch that accommodates uclamp_max tasks under max
> >> aggregation. That patch is not needed by sum aggregation and creates
> >> other problems for it. It has been discussed elsewhere that the patch
> >> will be improved, so there may be no need to revert it in the future.
> >>
> >> Patch 2, 3 and 4 implement sum aggregation.
> >>
> >> Patch 5 and 6 remove max aggregation.
> >>
> >> Patch 7 applies PELT decay on negative util_avg_bias. This improves
> >> energy efficiency and task placement, but is not strictly necessary.
> >>
> >> Patch 8 addresses sum aggregation under-utilization problem.
> >>
> >> TESTING:
> >>
> >> Two notebooks are shared at
> >>
> >> https://nbviewer.org/github/honxia02/notebooks/blob/aac12d9becae2b2fe4690cbb672439fd884ede30/whitebox/max.ipynb
> >> https://nbviewer.org/github/honxia02/notebooks/blob/aac12d9becae2b2fe4690cbb672439fd884ede30/whitebox/sum-offset.ipynb
> >>
> >> The experiments done in notebooks are on Arm Juno r2 board. CPU0-3 are
> >> little cores with capacity of 383. CPU4-5 are big cores. The rt-app
> >> profiles used for these experiments are included in the notebooks.
> >>
> >> Scenario 1: Scheduling 4 tasks with UCLAMP_MAX at 110.
> >>
> >> The scheduling decisions are plotted in Out[11]. Both max and sum
> >> aggregation understand the UCLAMP_MAX hint and schedule all 4 tasks on
> >> the little cluster. Max aggregation sometimes schedules 2 tasks on 1
> >> CPU, which is why this series reverts the commit in Patch 1.
> >>
> >> Scenario 2: Scheduling 4 tasks with UCLAMP_MIN and UCLAMP_MAX at a value
> >> slightly above the capacity of the little CPU.
> >>
> >> Results are in Out[17]. The purpose is to use UCLAMP_MIN to place tasks
> >> on the big core. Both max and sum aggregation handle this correctly.
> >>
> >> Scenario 3: Task A is a task with a small utilization pinned to CPU4.
> >> Task B is an always-running task pinned to CPU5, but UCLAMP_MAX capped
> >> at 300. After a while, task A is then pinned to CPU5, joining B.
> >>
> >> Results are in Out[23]. Max aggregation sees a frequency spike at
> >> 873.64s. When zoomed in, one can see square-wave-like utilization values
> >> because A periodically goes to sleep. When A wakes up, its default
> >> UCLAMP_MAX of 1024 uncaps B and the CPU runs at its highest frequency.
> >> When A sleeps, B's UCLAMP_MAX takes effect and reduces rq
> >> utilization. This happens repeatedly, hence the square wave. In
> >> contrast, sum aggregation sees a normal increase in utilization when A
> >> joins B, without any square-wave behavior.
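> >>
> >> A small self-contained sketch of why (our reading of the two schemes,
> >> not code from either implementation):
> >>
> >>   #include <stdio.h>
> >>
> >>   #define MIN(a, b) ((a) < (b) ? (a) : (b))
> >>   #define MAX(a, b) ((a) > (b) ? (a) : (b))
> >>
> >>   int main(void)
> >>   {
> >>           long util_A = 100, util_B = 1024; /* B is always-running */
> >>           int a_runnable;
> >>
> >>           for (a_runnable = 0; a_runnable <= 1; a_runnable++) {
> >>                   /* Max aggregation: the rq cap is the max UCLAMP_MAX
> >>                    * among runnable tasks; A's default 1024 uncaps B. */
> >>                   long cap = a_runnable ? MAX(1024, 300) : 300;
> >>                   long max_aggr = MIN((a_runnable ? util_A : 0) + util_B, cap);
> >>
> >>                   /* Sum aggregation: B always carries bias 300 - util_B. */
> >>                   long sum_aggr = (a_runnable ? util_A : 0) + util_B
> >>                                   + (300 - util_B);
> >>
> >>                   printf("A %s: max=%ld sum=%ld\n",
> >>                          a_runnable ? "awake" : "asleep",
> >>                          max_aggr, sum_aggr);
> >>           }
> >>           return 0;
> >>   }
> >>
> >> Max aggregation flips between 300 and 1024 (the square wave), while
> >> sum aggregation moves smoothly from 300 to 400 when A joins.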
> >>
> >> Scenario 4: 4 always-running tasks with UCLAMP_MAX of 110 pinned to the
> >> little PD (CPU0-3), and 4 identical tasks pinned to the big PD (CPU4-5).
> >> After a while, remove the CPU pinning of the 4 tasks on the big PD.
> >>
> >> Results are in Out[29]. After unpinning, max aggregation moves all 8
> >> tasks to the little cluster, but schedules 5 tasks on CPU0 and 1 each on
> >> CPU1-3. In contrast, sum aggregation schedules 2 on each little CPU
> >> after unpinning, which is the desired balanced task placement.
> >>
> >> EVALUATION:
> >>
> >> We backport the patches to the GKI v6.1 kernel on a Pixel 9 and run Android
> >> benchmarks.
> >>
> >> Speedometer:
> >>
> >> We run Speedometer 2.1 on Chrome v131 to test ADPF/uclamp effectiveness.
> >> Because sum aggregation does not circumvent the 25% OPP margin, we
> >> reduce uclamp values to 80% to be fair.
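> >>
> >> (The 25% margin means schedutil requests roughly util * 1.25; since
> >> 1 / 1.25 = 0.8, scaling uclamp values to 80% makes the requested OPPs
> >> comparable between the two schemes.)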
> >>
> >> |           | score | score % | CPU power % |
> >> | max       | 192.4 |         |             |
> >> | sum_0.8   | 230.8 |  +19.96 |      +31.54 |
> >> | sum_tuned | 201.8 |   +4.89 |       -0.41 |
> >>
> >> We see a consistently higher score and higher average power consumption.
> >> Note that a higher score also means a reduction in run-time, so the
> >> total energy increase for sum_0.8 is only 9.65%.
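> >>
> >> (Energy scales as power times run-time: 1.3154 / 1.1996 ~= 1.0965,
> >> i.e. about +9.65%.)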
> >>
> >> We then reduce uclamp values so that power consumption is roughly
> >> the same. If we do so, then sum aggregation achieves slightly better
> >> scores, shown in the sum_tuned row.
> >>
> >> UIBench:
> >>
> >> |         | jank percentage | %      | CPU power (mW) | %     |
> >> | max     |          0.115% |        |          158.1 |       |
> >> | sum_0.8 |          0.129% | +11.96 |          154.9 | -4.19 |
> >>
> >> UIBench on Pixel 9 by default already has a low enough jank percentage.
> >> Moving to sum aggregation gives a slightly higher jank percentage and
> >> lower power consumption.
> >>
> >> ---
> >> Changed in v2:
> >> - Completely separate uclamp component from PELT and util_est.
> >> - Separate util_est_uclamp into an individual patch.
> >> - Address the under-utilization problem.
> >> - Update Python notebooks to reflect the latest sched/tip.
> >>
> >> Hongyan Xia (8):
> >> Revert "sched/uclamp: Set max_spare_cap_cpu even if max_spare_cap is
> >> 0"
> >> sched/uclamp: Track a new util_avg_bias signal
> >> sched/uclamp: Add util_est_uclamp
> >> sched/fair: Use util biases for utilization and frequency
> >> sched/uclamp: Remove all uclamp bucket logic
> >
> > I’ve recently been looking into the issue with uclamp and
> > delayed dequeue, and I found that uclamp_rq_inc() should be placed
> > before enqueue_task(), which led me to write a patch.
> > Before sending it, I came across your series. I haven’t fully
> > understood your patches yet, but it seems that with them
> > uclamp_rq_inc() is no longer needed.
> > Do you think the patch below is still necessary?
> >
>
> I posted a fix for the issue you mentioned a few days ago here:
>
> https://lore.kernel.org/lkml/84441660bef0a5e67fd09dc3787178d0276dad31.1740664400.git.hongyan.xia2@arm.com/
>
> I think we found the same issue, but our approaches are different. My
> feeling is that as more complexity like delayed dequeue goes into each
> sched_class, it's better to let the sched_class handle how uclamp is
> enqueued and dequeued within itself rather than leak those details into
> core.c. It would be nice if you could take a look at my fix.
>
> Your patch is definitely necessary. The thing with this uclamp sum
> aggregation series is that there are still debates around it and it
> might take a while before everything is settled. So, at the moment we
> should view this series and the uclamp enqueue fix as separate things.
Okay, I will look at the patch.
Thanks!
>
> > --->
> >
> > Subject: [PATCH] sched/uclamp: Update the rq's uclamp before enqueue task
> >
> > When a task's uclamp is set, we want the CPU frequency to be able
> > to increase as quickly as possible when the task is enqueued.
> > Because the CPU frequency update happens during enqueue_task(),
> > the rq's uclamp needs to be updated before the task is enqueued.
> > For sched-delayed tasks, the rq uclamp should only be updated
> > when they are enqueued upon being woken up.
> >
> > Signed-off-by: Xuewen Yan <xuewen.yan@...soc.com>
> > ---
> > kernel/sched/core.c | 14 ++++++--------
> > 1 file changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 67189907214d..b07e78910221 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -1747,7 +1747,7 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> >          }
> >  }
> >
> > -static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> > +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p, int flags)
> >  {
> >          enum uclamp_id clamp_id;
> >
> > @@ -1763,7 +1763,8 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> >          if (unlikely(!p->sched_class->uclamp_enabled))
> >                  return;
> >
> > -        if (p->se.sched_delayed)
> > +        /* Only inc the delayed task which is being woken up. */
> > +        if (p->se.sched_delayed && !(flags & ENQUEUE_DELAYED))
> >                  return;
> >
> >          for_each_clamp_id(clamp_id)
> > @@ -2031,7 +2032,7 @@ static void __init init_uclamp(void)
> >  }
> >
> >  #else /* !CONFIG_UCLAMP_TASK */
> > -static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { }
> > +static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p, int flags) { }
> >  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p) { }
> >  static inline void uclamp_fork(struct task_struct *p) { }
> >  static inline void uclamp_post_fork(struct task_struct *p) { }
> > @@ -2067,12 +2068,9 @@ void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> >          if (!(flags & ENQUEUE_NOCLOCK))
> >                  update_rq_clock(rq);
> >
> > +        uclamp_rq_inc(rq, p, flags);
> > +
> >          p->sched_class->enqueue_task(rq, p, flags);
> > -        /*
> > -         * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
> > -         * ->sched_delayed.
> > -         */
> > -        uclamp_rq_inc(rq, p);
> >
> >          psi_enqueue(p, flags);
> >
> > --
> >
> > Thanks!
> >
> > BR
> > ---
> >
> >> sched/uclamp: Simplify uclamp_eff_value()
> >> sched/uclamp: Propagate negative bias
> >> sched/uclamp: Solve under-utilization problem
> >>
> >> include/linux/sched.h | 8 +-
> >> init/Kconfig | 32 ---
> >> kernel/sched/core.c | 308 ++--------------------
> >> kernel/sched/cpufreq_schedutil.c | 6 +-
> >> kernel/sched/debug.c | 2 +-
> >> kernel/sched/fair.c | 430 ++++++++++++++++---------------
> >> kernel/sched/pelt.c | 62 +++++
> >> kernel/sched/rt.c | 4 -
> >> kernel/sched/sched.h | 132 +++-------
> >> kernel/sched/syscalls.c | 2 +
> >> 10 files changed, 341 insertions(+), 645 deletions(-)
> >>
> >> --
> >> 2.34.1
> >>
> >>
>