linux-kernel - Re: [PATCH] sched/fair: Fix pelt lost idle time detection

Message-ID: <CAKfTPtDqobG0YqhDOWF_VR1JoLcEeLLz1U-ZeWVLK7Gk6jWZnw@mail.gmail.com>
Date: Fri, 16 Jan 2026 15:02:57 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Qais Yousef <qyousef@...alina.io>
Cc: Samuel Wu <wusamuel@...gle.com>, mingo@...hat.com, peterz@...radead.org, 
	juri.lelli@...hat.com, dietmar.eggemann@....com, rostedt@...dmis.org, 
	bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, 
	linux-kernel@...r.kernel.org, Android Kernel Team <kernel-team@...roid.com>
Subject: Re: [PATCH] sched/fair: Fix pelt lost idle time detection

On Mon, 12 Jan 2026 at 16:10, Qais Yousef <qyousef@...alina.io> wrote:
>
> On 01/07/26 08:50, Vincent Guittot wrote:
> > On Tue, 23 Dec 2025 at 18:27, Qais Yousef <qyousef@...alina.io> wrote:
> > >
> > > On 12/13/25 04:54, Vincent Guittot wrote:
> > >
> > > > > For completeness, here are some Perfetto traces that show threads
> > > > > running, CPU frequency, and PELT related stats. I've pinned the
> > > > > util_avg track for a CPU on the little cluster, as the util_avg metric
> > > > > shows an obvious increase (~66 vs ~3 for with patch and without patch
> > > > > respectively).
> > > >
> > > > I was focusing on the update of rq->lost_idle_time, but it can't be
> > > > related because the CPUs are often idle in your trace. But the patch
> > > > also updates rq->clock_idle and rq->clock_pelt_idle, which are used to
> > > > sync a cfs task's util_avg at wakeup when it is about to migrate and
> > > > the prev cpu is idle.
> > > >
> > > > Before the patch we could have an old clock_pelt_idle and clock_idle
> > > > that were used to decay the util_avg of cfs tasks before migrating
> > > > them, which would end up decaying util_avg too much.
> > > >
> > > > But I noticed that you put util_avg_rt, which doesn't use the 2
> > > > fields above in mainline. Does the Android kernel make some changes
> > > > to RT util_avg tracking?
> > >
> > > We shouldn't be doing that. I think we were not updating RT pressure correctly
> > > before the patch. The new values make more sense to me as RT tasks are running
> > > 2ms every 10ms and a util_avg_rt of ~150 range makes more sense than the
> > > previous 5-6 values? If we add the 20% headroom that can easily saturate the
> > > little core.
> > >
> > > update_rt_rq_load_avg() uses rq_clock_pelt() which takes into account the
> > > lost_idle_time which we now ensure is updated in this corner case?
> >
> > But according to Samuel's trace, there is a lot of idle time and
> > the CPUs are far from being overutilized, so we don't miss any lost
> > idle time, which only happens when util_avg reaches 1024 (util_sum >
> > 47791490)
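
For reference, the util_sum figure quoted above can be reproduced from the PELT constants. This is a minimal sketch, assuming the mainline values LOAD_AVG_MAX = 47742 and SCHED_CAPACITY_SHIFT = 10 and the same expression that update_idle_rq_clock_pelt() uses for its divider. The helper name pelt_lost_idle_threshold() is hypothetical and does not exist in the kernel.

```c
#include <stdint.h>

/* Values copied from mainline kernel/sched; treat them as assumptions here. */
#define LOAD_AVG_MAX          47742
#define SCHED_CAPACITY_SHIFT  10

/*
 * Hypothetical helper: util_sum threshold above which an rq going idle
 * is considered to have been fully utilized (util_avg at 1024), which is
 * the only case where lost idle time has to be accounted.
 */
static inline uint32_t pelt_lost_idle_threshold(void)
{
	return ((uint32_t)(LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
}
```

With these constants the helper evaluates to 47791490, matching the number in the mail: (47742 - 1024) * 1024 - 47742.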
>
> Hmm, I didn't look deeply but something is 'fixing' the values. Clocks being
> updated?
>
> >
> > >
> > > I guess the first question is which do you think is the right behavior for the
> > > RT pressure?
> >
> > RT pressure reflects the utilization of the CPU by RT tasks. The tracking
> > is simpler than cfs (no per task tracking, no migration tracking, no
> > util est ..) but it is the utilization of the CPU by RT and as a result
> > should be used when selecting the OPP. The reality is that this RT
> > utilization is almost always hidden because we jump to a high OPP as
> > soon as an RT task is runnable and the max(rt util_avg, rt uclamp_min)
> > often/always returns uclamp_min
>
> Yes, but this value shouldn't be driving RT tasks frequency selection.

I would add: as long as the min freq for RT is high enough.

>
> >
> > >
> > > And 2nd question, does it make sense to take RT pressure into account in
> > > schedutil if there are no fair tasks? It is supposed to help compensate for the
> > > stolen time by RT so we make fair run faster. But if there are no fair tasks,
> > > the RT pressure is meaningless on its own as they should run at max or whatever
> > > value specified by uclamp_min? I think in this test uclamp_min is set to 0 by
> > > default for RT, so expected not to cause frequency to rise on their own.
> >
> > Interesting that uclamp_min is set to 0 :-). And this is another
> > reason to keep rt util_avg, otherwise we might not have enough capacity
> > to run the RT task.
> >
> > I'm worried that if we don't take rt util_avg into account, the
> > execution of the RT task will be delayed too much when a cfs task wakes
> > up, and the increase of OPP will not be enough.
>
> Oh, we should absolutely take it into account. I wasn't proposing not to take
> it into account for fair. Just to handle the corner cases when there are no
> fair tasks and RT is running at a higher frequency (than what it has requested).
> The design of RT is that it has no DVFS support but runs at a constant
> frequency which can be controlled by uclamp_min. When there are no fair tasks
> running, RT util_avg shouldn't cause RT to run higher as the util_avg is for
> fair OPP selection, not RT.

fair enough

>
> I think I see better where the real problem lies. uclamp_max is set to 1024 by
> default for RT, which means they can run higher than what they requested.
> I think RT tasks should ensure that uclamp_max is always equal to uclamp_min to
> deliver on the constant freq request design, which I think is a more graceful
> way to handle this corner case. If uclamp_max is set to 0, then this RT task
> can't cause the frequency to rise on its own. As I wrote the patch I realized we
> don't have explicit handling of tasks switching to/from RT. I need to
> think about it. We should enforce the default rt uclamp_min value when we
> switch to RT if the task is !user_defined. For uclamp_max the logic and the
> rules will become complicated with this proposal. Maybe we can simplify the
> rules by forcing a policy switch to restore the task's uclamp_min/max to the
> policy default values?
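
To illustrate the proposal above, here is a simplified model of how a utilization value is clamped into [uclamp_min, uclamp_max]. It is only a sketch: rt_clamped_util() is a hypothetical helper, not a kernel function, and it ignores rq-level aggregation of clamps across tasks.

```c
/*
 * Hypothetical helper: clamp a utilization value to the task's
 * [uclamp_min, uclamp_max] range, on the usual 0..1024 capacity scale.
 */
static unsigned int rt_clamped_util(unsigned int util,
				    unsigned int uclamp_min,
				    unsigned int uclamp_max)
{
	if (util < uclamp_min)
		return uclamp_min;
	if (util > uclamp_max)
		return uclamp_max;
	return util;
}
```

In this model, with the defaults discussed in the thread (uclamp_min = 0, uclamp_max = 1024), an rt util_avg of 150 passes through unchanged and can raise the frequency on its own. With uclamp_max forced equal to uclamp_min, the result is constant regardless of util_avg: 0 for min = max = 0, or 512 for min = max = 512.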
>
> Only compile tested the patch
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 41ba0be16911..0df03a419a53 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1575,6 +1575,11 @@ static void __uclamp_update_util_min_rt_default(struct task_struct *p)
>
>         default_util_min = sysctl_sched_uclamp_util_min_rt_default;
>         uclamp_se_set(uc_se, default_util_min, false);
> +       /*
> +        * UCLAMP_MAX should always follow UCLAMP_MIN for RT so that it
> +        * requests a constant frequency which is the intended design
> +        */
> +       uclamp_se_set(&p->uclamp_req[UCLAMP_MAX], default_util_min, false);
>  }
>
>  static void uclamp_update_util_min_rt_default(struct task_struct *p)
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index 0496dc29ed0f..94ec08a0693e 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -330,6 +330,13 @@ static int uclamp_validate(struct task_struct *p,
>         if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX) {
>                 util_max = attr->sched_util_max;
>
> +               /*
> +                * UCLAMP_MAX follows UCLAMP_MIN for RT to guarantee constant
> +                * freq request.
> +                */
> +               if (unlikely(rt_task(p)))
> +                       return -EINVAL;
> +
>                 if (util_max + 1 > SCHED_CAPACITY_SCALE + 1)
>                         return -EINVAL;
>         }
> @@ -388,9 +395,10 @@ static void __setscheduler_uclamp(struct task_struct *p,
>
>                 /*
>                  * RT by default have a 100% boost value that could be modified
> -                * at runtime.
> +                * at runtime. RT should also always request a constant value,
> +                * which means UCLAMP_MIN === UCLAMP_MAX.
>                  */
> -               if (unlikely(rt_task(p) && clamp_id == UCLAMP_MIN))
> +               if (unlikely(rt_task(p)))
>                         value = sysctl_sched_uclamp_util_min_rt_default;
>                 else
>                         value = uclamp_none(clamp_id);
> @@ -406,6 +414,12 @@ static void __setscheduler_uclamp(struct task_struct *p,
>             attr->sched_util_min != -1) {
>                 uclamp_se_set(&p->uclamp_req[UCLAMP_MIN],
>                               attr->sched_util_min, true);
> +
> +               /* UCLAMP_MAX follows UCLAMP_MIN for RT */
> +               if (unlikely(rt_task(p))) {
> +                       uclamp_se_set(&p->uclamp_req[UCLAMP_MAX],
> +                                     attr->sched_util_min, true);
> +               }
>         }
>
>         if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP_MAX &&
>
> >
> > Let's take the case of an RT task that runs 2ms every 10ms.
> > Now we have a cfs task that runs 2ms every 10ms but wakes up 8ms after
> > the RT task.
> >
> > In order to run these 2 tasks (4ms every 10ms), we need a minimum
> > compute capacity of rt util_avg + cfs util_avg
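
The arithmetic behind this example can be sketched as follows. The numbers are illustrative assumptions: each task is modelled with a flat utilization of 205 (about 20% of 1024), whereas real PELT signals oscillate around such a value, and the headroom is modelled as util + util/4, which leaves util at 80% of the requested capacity. Both function names are hypothetical.

```c
/* Hypothetical helper: add a 25% margin on top of the utilization. */
static unsigned long add_headroom(unsigned long util)
{
	return util + (util >> 2);
}

/*
 * Hypothetical helper: minimum capacity needed to serve both the RT
 * and the cfs task, i.e. the sum of their utilizations plus headroom.
 */
static unsigned long required_capacity(unsigned long rt_util,
				       unsigned long cfs_util)
{
	return add_headroom(rt_util + cfs_util);
}
```

Under these assumptions required_capacity(205, 205) is 512, while considering the cfs task alone gives add_headroom(205) = 256, which is the shortfall the paragraph above warns about.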
> >
> > >
> > > >
> > > > >
> > > > > - with patch: https://ui.perfetto.dev/#!/?s=964594d07a5a5ba51a159ba6c90bb7ab48e09326
> > > > > - without patch:
> > > > > https://ui.perfetto.dev/#!/?s=6ff6854c87ea187e4ca488acd2e6501b90ec9f6f