Message-ID: <CAG2Kctq_bqxYTVuvvQAQhgDmJge+Tdq4KC1C=agFpv5cYF4jbg@mail.gmail.com>
Date: Thu, 15 Jan 2026 15:27:30 -0800
From: Samuel Wu <wusamuel@...gle.com>
To: Qais Yousef <qyousef@...alina.io>
Cc: Vincent Guittot <vincent.guittot@...aro.org>, mingo@...hat.com, peterz@...radead.org, 
	juri.lelli@...hat.com, dietmar.eggemann@....com, rostedt@...dmis.org, 
	bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, 
	linux-kernel@...r.kernel.org, Android Kernel Team <kernel-team@...roid.com>
Subject: Re: [PATCH] sched/fair: Fix pelt lost idle time detection

On Mon, Jan 12, 2026 at 7:11 AM Qais Yousef <qyousef@...alina.io> wrote:
>
> On 01/05/26 12:08, Samuel Wu wrote:
> > On Tue, Dec 23, 2025 at 10:49 AM Qais Yousef <qyousef@...alina.io> wrote:
> > >
> > > On 12/23/25 17:27, Qais Yousef wrote:
> > > > On 12/13/25 04:54, Vincent Guittot wrote:
> > > >
> > > > > > For completeness, here are some Perfetto traces that show threads
> > > > > > running, CPU frequency, and PELT-related stats. I've pinned the
> > > > > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > > > > shows an obvious increase (~66 vs ~3 with and without the patch,
> > > > > > respectively).
> > > > >
> > > > > I was focusing on the update of rq->lost_idle_time, but it can't be
> > > > > related because the CPUs are often idle in your trace. However, the patch
> > > > > also updates rq->clock_idle and rq->clock_pelt_idle, which are used to
> > > > > sync a cfs task's util_avg at wakeup when it is about to migrate and the
> > > > > prev cpu is idle.
> > > > >
> > > > > Before the patch we could have stale clock_pelt_idle and clock_idle values
> > > > > that were used to decay the util_avg of cfs tasks before migrating them,
> > > > > which would end up decaying util_avg too much.
> > > > >
> > > > > But I noticed that you included util_avg_rt, which doesn't use the 2
> > > > > fields above in mainline. Does the Android kernel make any changes to rt
> > > > > util_avg tracking?
> > > >
> > > > We shouldn't be doing that. I think we were not updating RT pressure
> > > > correctly before the patch. The new values make more sense to me: the RT
> > > > tasks are running 2ms every 10ms, and a util_avg_rt in the ~150 range seems
> > > > more plausible than the previous values of 5-6. If we add the 20% headroom,
> > > > that can easily saturate the little core.
> > > >
> > > > update_rt_rq_load_avg() uses rq_clock_pelt(), which takes into account the
> > > > lost_idle_time that we now ensure is updated in this corner case, right?
> > > >
> > > > I guess the first question is which do you think is the right behavior for the
> > > > RT pressure?
> > > >
> > > > And the 2nd question: does it make sense to take RT pressure into account in
> > > > schedutil if there are no fair tasks? It is supposed to help compensate for
> > > > the time stolen by RT so that fair tasks run faster. But if there are no fair
> > > > tasks, the RT pressure is meaningless on its own, as RT tasks should run at
> > > > max or whatever value is specified by uclamp_min. I think in this test
> > > > uclamp_min is set to 0 by default for RT, so it is expected not to cause the
> > > > frequency to rise on its own.
> > >
> > > Something like this
> > >
> > > --->8---
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index da46c3164537..80b526c40dab 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8059,7 +8059,7 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > >                                  unsigned long *min,
> > >                                  unsigned long *max)
> > >  {
> > > -       unsigned long util, irq, scale;
> > > +       unsigned long util = 0, irq, scale;
> > >         struct rq *rq = cpu_rq(cpu);
> > >
> > >         scale = arch_scale_cpu_capacity(cpu);
> > > @@ -8100,9 +8100,14 @@ unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
> > >          * CFS tasks and we use the same metric to track the effective
> > >          * utilization (PELT windows are synchronized) we can directly add them
> > >          * to obtain the CPU's actual utilization.
> > > +        *
> > > +        * Only applicable if there are fair tasks queued. When a new fair task
> > > +        * wakes up it should trigger a freq update.
> > >          */
> > > -       util = util_cfs + cpu_util_rt(rq);
> > > -       util += cpu_util_dl(rq);
> > > +       if (rq->cfs.h_nr_queued) {
> > > +               util = util_cfs + cpu_util_rt(rq);
> > > +               util += cpu_util_dl(rq);
> > > +       }
> > >
> > >         /*
> > >          * The maximum hint is a soft bandwidth requirement, which can be lower
> >
> > I tested Qais's patch with the same use case. There are 3 builds to
> > reference now:
> > 1. baseline
> > 2. baseline + Vincent's patch
> > 3. baseline + Vincent's patch + Qais's patch
> >
> > Scheduling behavior seems more reasonable now. I agree with Qais that the RT
> > values seemed a little off in build 1, and also that the CPU freq values in
> > build 2 seemed too high for the given workload. With build 3, the Wattson
> > power values are back to the build 1 baseline, while RT util_avg in build 3
> > stays similar to that of build 2.
> >
> > - build 1: https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f
> > - build 2: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > - build 3: https://ui.perfetto.dev/#!/?s=c0a1585c31e51ab3e6b38d948829a4c0196f338c
>
> Thanks for verifying! I think we can handle it better with uclamp_max; could
> you give that other approach a try if possible please? Thanks!
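
For anyone following along, here is a rough userspace toy model of the
estimation Vincent describes above (not kernel code; the 0.3 scaling factor
and all the numbers are invented purely for illustration). Mainline fair.c's
migrate_se_pelt_lag() builds the prev rq's current PELT time from
rq->clock_pelt_idle and rq->clock_idle, so if those snapshots are stale the
decay window applied to a migrating task's util_avg can get blown up:

/*
 * est_now ~= rq->clock_pelt_idle + (sched_clock_cpu() - rq->clock_idle)
 *
 * i.e. "PELT clock at idle entry" + "wall time spent idle since then".
 * With stale snapshots, that wall-clock span also covers busy time,
 * during which the real PELT clock advanced slower than wall time on a
 * small/low-freq CPU, so est_now overshoots.
 *
 * Build with: cc pelt_lag_toy.c -lm
 */
#include <stdio.h>
#include <math.h>

/* PELT: util roughly halves every 32ms of pure decay. */
static double decay(double util, double window_ms)
{
	return util * pow(0.5, window_ms / 32.0);
}

int main(void)
{
	double scale = 0.3;	/* PELT time / wall time while busy (invented) */
	double busy_ms = 498.0;	/* wall time spent busy since the stale snapshot */
	double idle_ms = 2.0;	/* wall time the CPU has actually been idle now */
	double util = 200.0;	/* task's util_avg at its last PELT update */

	/* Last PELT update of the task happened when the CPU went idle. */
	double lut = busy_ms * scale;

	/* Fresh snapshots: taken at the most recent idle entry. */
	double est_fresh = busy_ms * scale + idle_ms;
	/* Stale snapshots: the whole wall span is treated as idle time. */
	double est_stale = busy_ms + idle_ms;

	printf("fresh snapshots: decay window %3.0fms -> util ~%.0f\n",
	       est_fresh - lut, decay(util, est_fresh - lut));
	printf("stale snapshots: decay window %3.0fms -> util ~%.0f\n",
	       est_stale - lut, decay(util, est_stale - lut));
	return 0;
}

With the stale snapshots the toy decays util from 200 down to roughly zero,
which lines up with the kind of collapse visible in the earlier traces
(~3 vs ~66 util_avg without/with the fix).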

Thanks for the analysis and the experimental patch, Qais. Here is a trace
with baseline + the uclamp min/max patch, labeled as build 4.
build 4: https://ui.perfetto.dev/#!/?s=79cc53a2a13886571c498dea07e35525869a0657

It's hard for me to say what is right or wrong now, but the rt
util_avg and Wattson power estimates are both high and comparable to
those of build 2, which is the baseline with just Vincent's patch. One
new element with this build is that there seems to be more variance in
both the rt util_avg and power estimate metrics, which doesn't seem
desirable, especially for a relatively constant workload.
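
As a rough sanity check on the RT util magnitudes discussed earlier in the
thread (back-of-the-envelope only, ignoring frequency/capacity scaling): an
RT task running 2ms out of every 10ms is a ~20% duty cycle, and a periodic
PELT-style signal with a 32ms half-life converges to somewhere around
duty_cycle * 1024 ~= 205, oscillating a bit below and above that within each
period. A tiny, purely illustrative simulation:

/*
 * Toy PELT-style convergence check (userspace, not kernel code):
 * a task running 2ms out of every 10ms, geometric decay with a
 * 32ms half-life.  Build with: cc pelt_toy.c -lm
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* per-1ms decay factor */
	double util = 0.0;
	int ms;

	for (ms = 0; ms < 1000; ms++) {
		int running = (ms % 10) < 2;	/* 2ms busy, 8ms idle */

		util = util * y + (running ? 1024.0 * (1.0 - y) : 0.0);
	}
	printf("converged util_avg ~= %.0f\n", util);	/* ~190 here */
	return 0;
}

So a util_avg_rt somewhere in the 150-220 range is the plausible ballpark for
this workload, while values of ~5-6 would correspond to well under 1%
utilization. And since update_rt_rq_load_avg() reads its time from
rq_clock_pelt(), which in mainline is essentially
rq->clock_pelt - rq->lost_idle_time, a lost_idle_time that isn't maintained
in this corner case skews that signal directly.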
