linux-kernel - Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger

Message-ID: <CAKfTPtAyWMfpLwK_YTrMC67oo02-qOihcEau53wgxSAt1H+A-w@mail.gmail.com>
Date: Wed, 3 Dec 2025 14:32:06 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Hillf Danton <hdanton@...a.com>
Cc: peterz@...radead.org, linux-kernel@...r.kernel.org, pierre.gondois@....com, 
	kprateek.nayak@....com, qyousef@...alina.io, christian.loehle@....com, 
	luis.machado@....com
Subject: Re: [RFC PATCH 6/6 v7] sched/fair: Add EAS and idle cpu push trigger

On Wed, 3 Dec 2025 at 10:00, Hillf Danton <hdanton@...a.com> wrote:
>
> On Tue, 2 Dec 2025 14:01:39 +0100 Vincent Guittot wrote:
> >On Tue, 2 Dec 2025 at 10:45, Hillf Danton <hdanton@...a.com> wrote:
> >> On Mon,  1 Dec 2025 10:13:08 +0100 Vincent Guittot wrote:
> >> > EAS is based on wakeup events to efficiently place tasks on the system, but
> >> > there are cases where a task doesn't have wakeup events anymore or at a far
> >> > too low pace. For such cases, we check if it's worth pushing the task to
> >> > another CPU instead of putting it back in the enqueued list.
> >> >
> >> > Wakeup events remain the main way to migrate tasks but we now detect
> >> > situations where a task is stuck on a CPU by checking that its utilization
> >> > is larger than the max available compute capacity (max cpu capacity or
> >> > uclamp max setting).
> >> >
> >> > When the system becomes overutilized and some CPUs are idle, we try to
> >> > push tasks instead of waiting for periodic load balance.
> >> >
> >> > Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> >> > ---
> >> >  kernel/sched/fair.c     | 65 +++++++++++++++++++++++++++++++++++++++++
> >> >  kernel/sched/topology.c |  3 ++
> >> >  2 files changed, 68 insertions(+)
> >> >
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index 9af8d0a61856..e9e1d0c05805 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -6990,6 +6990,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >> >  }
> >> >
> >> >  static void fair_remove_pushable_task(struct rq *rq, struct task_struct *p);
> >> > +
> >> >  /*
> >> >   * Basically dequeue_task_fair(), except it can deal with dequeue_entity()
> >> >   * failing half-way through and resume the dequeue later.
> >> > @@ -8499,8 +8500,72 @@ static inline bool sched_push_task_enabled(void)
> >> >       return static_branch_unlikely(&sched_push_task);
> >> >  }
> >> >
> >> > +static inline bool task_stuck_on_cpu(struct task_struct *p, int cpu)
> >> > +{
> >> > +     unsigned long max_capa, util;
> >> > +
> >> > +     max_capa = min(get_actual_cpu_capacity(cpu),
> >> > +                    uclamp_eff_value(p, UCLAMP_MAX));
> >> > +     util = max(task_util_est(p), task_runnable(p));
> >> > +
> >> > +     /*
> >> > +      * Return true only if the task might not sleep/wakeup because of a low
> >> > +      * compute capacity. Tasks, which wake up regularly, will be handled by
> >> > +      * feec().
> >> > +      */
> >> > +     return (util > max_capa);
> >> > +}
> >> > +
> >> > +static inline bool sched_energy_push_task(struct task_struct *p, struct rq *rq)
> >> > +{
> >> > +     if (!sched_energy_enabled())
> >> > +             return false;
> >> > +
> >> > +     if (is_rd_overutilized(rq->rd))
> >> > +             return false;
> >> > +
> >> > +     if (task_stuck_on_cpu(p, cpu_of(rq)))
> >> > +             return true;
> >> > +
> >> > +     if (!task_fits_cpu(p, cpu_of(rq)))
> >> > +             return true;
> >> > +
> >> > +     return false;
> >> > +}
> >> > +
> >> > +static inline bool sched_idle_push_task(struct task_struct *p, struct rq *rq)
> >> > +{
> >> > +     if (rq->nr_running == 1)
> >> > +             return false;
> >> > +
> >> > +     if (!is_rd_overutilized(rq->rd))
> >> > +             return false;
> >> > +
> >> > +     /* If there are idle cpus in the llc then try to push the task on it */
> >> > +     if (test_idle_cores(cpu_of(rq)))
> >> > +             return true;
> >> > +
> >> > +     return false;
> >> > +}
> >> > +
> >> > +
> >> >  static bool fair_push_task(struct rq *rq, struct task_struct *p)
> >> >  {
> >> > +     if (!task_on_rq_queued(p))
> >> > +             return false;
> >>
> >> Task is queued on rq.
> >> > +
> >> > +     if (p->se.sched_delayed)
> >> > +             return false;
> >> > +
> >> > +     if (p->nr_cpus_allowed == 1)
> >> > +             return false;
> >> > +
> >> > +     if (sched_energy_push_task(p, rq))
> >> > +             return true;
> >>
> >> If task is stuck on CPU, it could not be on rq. Weird.
> >
> > Maybe it comes from my description and I should use task_stuck_on_rq.
> > By stuck, I mean that the task doesn't have any opportunity to migrate
> > to another cpu/rq and stays "forever" (at least until its next sleep) on
> > this cpu/rq because load balancing is disabled/bypassed w/ EAS.
> > Here stuck does not mean blocked/sleeping.
> >
> Given task queued on rq, I find the correct phrase, stack, in the cover
> letter instead of stuck, and the long-standing stacking tasks mean load
> balancer fails to cure that stack. 1/7 fixes that failure, no?

It's not just stacking, because we sometimes/often want to stack tasks
on the same CPU. EAS is based on the assumption that tasks will sleep
and wake up regularly and that EAS will select a new CPU at each wakeup,
but that's not always true. We can have situations where task A has been
put on CPU0 when waking up, sharing the CPU with other tasks. But after
some time, task A would be better placed on CPU1, not because it no
longer fits on CPU0 but just because the system state has changed
since its wakeup. Because task A shares CPU0 with other tasks, it
can take dozens/hundreds of ms to finish its work and go to sleep, and
we don't want to wait those hundreds of ms when CPU1 might be a better
choice now.
Patch 1 fixes a case where a CPU was wrongly classified as overloaded
when it is not (because of uclamp max, for example).
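
For illustration only, the two triggers in the quoted hunk can be modelled
in user space roughly as below. This is a sketch, not the kernel code:
struct toy_task, struct toy_rq and every toy_* helper are hypothetical
stand-ins, the task_on_rq_queued(), sched_delayed and task_fits_cpu()
checks are left out, and since the quoted hunk is cut off after the energy
check, combining the two triggers with a logical OR is an assumption.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for kernel state (not kernel types). */
struct toy_task {
	unsigned long util_est;    /* estimated utilization of the task */
	unsigned long runnable;    /* runnable signal of the task */
	unsigned long uclamp_max;  /* effective uclamp max of the task */
	int nr_cpus_allowed;       /* number of CPUs in the affinity mask */
};

struct toy_rq {
	unsigned long cpu_capacity; /* actual capacity of this CPU */
	unsigned int nr_running;    /* runnable tasks on this rq */
	bool eas_enabled;           /* models sched_energy_enabled() */
	bool rd_overutilized;       /* models is_rd_overutilized() */
	bool llc_has_idle_cores;    /* models test_idle_cores() */
};

static unsigned long toy_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long toy_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* "Stuck": utilization exceeds what this CPU (or uclamp max) can provide. */
static bool toy_task_stuck_on_cpu(const struct toy_task *p,
				  const struct toy_rq *rq)
{
	unsigned long max_capa = toy_min(rq->cpu_capacity, p->uclamp_max);
	unsigned long util = toy_max(p->util_est, p->runnable);

	return util > max_capa;
}

/* EAS trigger: only while EAS is on and the root domain is not overutilized. */
static bool toy_energy_push(const struct toy_task *p, const struct toy_rq *rq)
{
	if (!rq->eas_enabled || rq->rd_overutilized)
		return false;

	return toy_task_stuck_on_cpu(p, rq);
}

/* Idle trigger: overutilized, more than one task here, idle cores in the LLC. */
static bool toy_idle_push(const struct toy_rq *rq)
{
	if (rq->nr_running == 1 || !rq->rd_overutilized)
		return false;

	return rq->llc_has_idle_cores;
}

/* Assumption: either trigger alone is enough to mark the task pushable. */
static bool toy_fair_push(const struct toy_rq *rq, const struct toy_task *p)
{
	if (p->nr_cpus_allowed == 1)
		return false;

	return toy_energy_push(p, rq) || toy_idle_push(rq);
}
```

In this model the two triggers are mutually exclusive by construction: the
energy one applies only while the root domain is not overutilized, the
idle one only once it is.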