linux-kernel - Re: [PATCH] sched/fair: Fix pelt lost idle time detection

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260112151016.oweyxcej4fw6k6kr@airbuntu>
Date: Mon, 12 Jan 2026 15:10:16 +0000
From: Qais Yousef <qyousef@...alina.io>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Samuel Wu <wusamuel@...gle.com>, mingo@...hat.com, peterz@...radead.org,
	juri.lelli@...hat.com, dietmar.eggemann@....com,
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
	vschneid@...hat.com, linux-kernel@...r.kernel.org,
	Android Kernel Team <kernel-team@...roid.com>
Subject: Re: [PATCH] sched/fair: Fix pelt lost idle time detection

On 01/07/26 08:50, Vincent Guittot wrote:
> On Tue, 23 Dec 2025 at 18:27, Qais Yousef <qyousef@...alina.io> wrote:
> >
> > On 12/13/25 04:54, Vincent Guittot wrote:
> >
> > > > For completeness, here are some Perfetto traces that show threads
> > > > running, CPU frequency, and PELT related stats. I've pinned the
> > > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > > > respectively).
> > >
> > > I was focusing on the update of rq->lost_idle_time but It can't be
> > > related because the CPUs are often idle in your trace. But it also
> > > updates the rq->clock_idle and rq->clock_pelt_idle which are used to
> > > sync cfs task util_avg at wakeup when it is about to migrate and prev
> > > cpu is idle.
> > >
> > > before the patch we could have old clock_pelt_idle and clock_idle that
> > > were used to decay the util_avg of cfs task before migrating them
> > > which would ends up with decaying too much util_avg
> > >
> > > But I noticed that you put the util_avg_rt which doesn't use the 2
> > > fields above in mainline. Does android kernel make some changes for rt
> > > util_avg tracking ?
> >
> > We shouldn't be doing that. I think we were not updating RT pressure correctly
> > before the patch. The new values make more sense to me as RT tasks are running
> > 2ms every 10ms and a util_avg_rt of ~150 range makes more sense than the
> > previous 5-6 values? If we add the 20% headroom that can easily saturate the
> > little core.
> >
> > update_rt_rq_load_avg() uses rq_clock_pelt() which takes into account the
> > lost_idle_time which we now ensure is updated in this corner case?
> 
> But according to the Samuel's trace, there is a lot of idle time and
> the CPUs are far from being over utilized so we don't miss any lost
> idle time which happens when util_avg reaches 1024 (util_sum >
> 47791490)

Hmm I didn't look deeply but something is 'fixing' the values. clocks being
updated?

> 
> >
> > I guess the first question is which do you think is the right behavior for the
> > RT pressure?
> 
> RT pressure reflects the utilization of CPU by RT tasks. The tracking
> is simpler than cfs (no per task tracking, no migration tracking, no
> util est ..) but it is the utilization of CPU by RT and as a result
> should be used when selecting OPP. The reality is that this RT
> utilization is almost always hidden because we jump to a high OPP as
> soon as a RT task is runnable and the max(rt util_avg, rt uclamp_min)
> often/always returns uclamp_min

Yes, but this value shouldn't be driving RT tasks frequency selection.

> 
> >
> > And 2nd question, does it make sense to take RT pressure into account in
> > schedutil if there are no fair tasks? It is supposed to help compensate for the
> > stolen time by RT so we make fair run faster. But if there are no fair tasks,
> > the RT pressure is meaningless on its own as they should run at max or whatever
> > value specified by uclamp_min? I think in this test uclamp_min is set to 0 by
> > default for RT, so expected not to cause frequency to rise on their own.
> 
> Interesting that uclamp_min is set to 0 :-). And this is another
> reason to keep rt util_avg otherwise we might not have enough capacity
> to run RT task
> 
> I'm worried that if we don't take into account rt util_avg the
> execution of RT task will be too much delayed when cfs task will wake
> up and the increase of OPP will not be enough.

Oh, we should absolutely take it into account. I wasn't proposing not to take
it into account for fair. Just to handle the corner cases when there are no
fair tasks and RT is running at higher frequency (to what it has requested).
The design of RT is that it has no DVFS support but runs at a constant
frequency which can be controlled by uclamp_min. When there are no fair tasks
running, RT util_avg shouldn't cause RT to run higher as the util_avg is for
fair OPP selection, not RT.

I think I see better where the real problem lies. uclamp_max is set to 1024 by
default for RT which means they can run higher than what they requested.
I think RT tasks should ensure that uclamp_max is always equal to uclamp_min to
deliver on the constant freq request design. Which I think a more graceful way
to handle this corner case. If uclamp_max is set to 0, then this RT tasks can't
cause the frequency to rise on its own. As I wrote the patch I realized we
don't have explicit handling of tasks switching_to/from RT tasks. I need to
think about it. We should enforce the default rt uclamp_min value when we
switch to RT if the task is !user_defined. For uclamp_max the logic and the
rule will becoming complicated with this proposal. Maybe we can simplify the
rules by forcing policy switch to restore the task's uclamp_min/max to the
policy default values?

Only compile tested the patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 41ba0be16911..0df03a419a53 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1575,6 +1575,11 @@ static void __uclamp_update_util_min_rt_default(struct task_struct *p)

        default_util_min = sysctl_sched_uclamp_util_min_rt_default;
        uclamp_se_set(uc_se, default_util_min, false);
+       /*
+        * UCLAMP_MAX should always follow UCLAMP_MIN for RT so that it
+        * requests a constant frequency which is the intended design
+        */
+       uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], default_util_min, false);
 }

 static void uclamp_update_util_min_rt_default(struct task_struct *p)
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 0496dc29ed0f..94ec08a0693e 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -330,6 +330,13 @@ static int uclamp_validate(struct task_struct *p,
        if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
                util_max = attr->sched_util_max;

+               /*
+                * UCLAMP_MAX follows UCLAMP_MIN for RT to guarantee constant
+                * freq request.
+                */
+               if (unlikely(rt_task(p)))
+                       return -EINVAL;
+
                if (util_max + 1 > SCHED_CAPACITY_SCALE + 1)
                        return -EINVAL;
        }
@@ -388,9 +395,10 @@ static void __setscheduler_uclamp(struct task_struct *p,

                /*
                 * RT by default have a 100% boost value that could be modified
-                * at runtime.
+                * at runtime. RT should also always request a constant value,
+                * which means UCLAMP_MIN === UCLAMP_MAX.
                 */
-               if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
+               if (unlikely(rt_task(p)))
                        value = sysctl_sched_uclamp_util_min_rt_default;
                else
                        value = uclamp_none(clamp_id);
@@ -406,6 +414,12 @@ static void __setscheduler_uclamp(struct task_struct *p,
            attr->sched_util_min != -1) {
                uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
                              attr->sched_util_min, true);
+
+               /* UCLAMP_MAX follows UCLAMP_MIN for RT */
+               if (unlikely(rt_task(p)) {
+                       uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
+                                     attr->sched_util_max, true);
+               }
        }

        if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&

> 
> Let's take the case of a RT task that runs 2ms every 10ms.
> now we have a cfs task that runs 2ms every 10ms but wakes up 8ms after
> the RT task
> 
> In order to run these 2 tasks: 4ms every 10ms, we need to use a
> minimum compute capacity which is rt util_avg + cfs util_avg
> 
> >
> > >
> > > >
> > > > - with patch: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > > > - without patch:
> > > > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f