Message-ID: <CAKfTPtCyL=ofg4yLtfG7zsoBMDnP48KCeUELT_Hddd3gnWeYEw@mail.gmail.com>
Date: Thu, 11 Sep 2025 10:01:44 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: hupu <hupu.gm@...il.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation

On Wed, 10 Sept 2025 at 10:43, hupu <hupu.gm@...il.com> wrote:
>
> Currently, the load_sum to be propagated is estimated as
> (removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
> runnable_avg as an approximation. This approach can introduce precision
> loss due to the shift operation, and the error may become more visible
> when small tasks frequently enter and leave the queue.

Do you have an estimate of the level of error? Do you have a typical use case?
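
For reference, a back-of-envelope sketch of where the rounding in the
quoted estimate enters. This is not kernel code; the divider and
load_sum values are made up, and it assumes a task entity for which
runnable_sum is roughly load_sum << SCHED_CAPACITY_SHIFT, which is how
PELT accumulates them while the task is runnable:

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT 10

int main(void)
{
	/* Illustrative values only: a typical PELT divider and a made-up
	 * se->avg.load_sum for a task that has run for a while. */
	unsigned long divider  = 47742;
	unsigned long load_sum = 2000;

	/* For a task entity, runnable_sum tracks the same time scaled by
	 * SCHED_CAPACITY, and runnable_avg is the floored division by the
	 * divider (as in ___update_load_avg()). */
	unsigned long runnable_sum = load_sum << SCHED_CAPACITY_SHIFT;
	unsigned long runnable_avg = runnable_sum / divider;

	/* Reconstruction currently done in update_cfs_rq_load_avg(). */
	unsigned long estimate = (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT;

	printf("direct load_sum = %lu, reconstructed = %lu\n",
	       load_sum, estimate);
	return 0;
}

With these numbers the reconstruction undershoots by 42; the loss comes
from the floored division hidden in runnable_avg and is bounded by
roughly divider >> SCHED_CAPACITY_SHIFT per removed entity.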

>
> This patch introduces removed.load_sum to directly accumulate
> se->avg.load_sum when tasks dequeue, and uses it during load
> propagation. By doing so:
>
>   a) Avoid relying on runnable_avg-based approximation and obtain
>      higher precision in load_sum propagation;
>   b) Eliminate precision loss from the shift operation, especially
>      with frequent short-lived tasks;
>   c) Reduce one multiplication and shift in the hotpath, which
>      theoretically lowers overhead (though the impact is minor).

This doesn't work because the rq's load_sum tracks the current weight,
whereas the se's load_sum doesn't include the weight, which is only
applied when computing the se's load_avg. So you can't directly
add/subtract the se's load_sum and the rq's load_sum; only the load_avg
of the se and the rq use the same unit.

Also, we don't want to track both load_sum and load_avg; one is enough,
and per the above it has to be load_avg.
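
A minimal sketch of that unit difference. It only loosely mirrors
___update_load_avg() in kernel/sched/pelt.c; the names and numbers
below are illustrative, not kernel code:

#include <stdio.h>

/* se: the weight is applied when converting load_sum into load_avg. */
static unsigned long se_load_avg(unsigned long weight,
				 unsigned long se_load_sum,
				 unsigned long divider)
{
	return weight * se_load_sum / divider;
}

/* cfs_rq: the weight is already folded into load_sum itself. */
static unsigned long rq_load_avg(unsigned long rq_load_sum,
				 unsigned long divider)
{
	return rq_load_sum / divider;
}

int main(void)
{
	unsigned long divider = 47742, weight = 1024, running = 20000;

	unsigned long se_sum = running;          /* unweighted time */
	unsigned long rq_sum = weight * running; /* weight * time   */

	printf("se load_avg = %lu\n", se_load_avg(weight, se_sum, divider));
	printf("rq load_avg = %lu\n", rq_load_avg(rq_sum, divider));
	/* The two load_avg values match; the two load_sum values differ
	 * by the weight factor, which is why they can't be mixed. */
	return 0;
}

Adding se->avg.load_sum straight into the rq-side propagation therefore
mixes those two units.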

Then, why is it a problem only for load and not for util or runnable?

>
> Signed-off-by: hupu <hupu.gm@...il.com>
> ---
>  kernel/sched/fair.c  | 8 +++++---
>  kernel/sched/sched.h | 1 +
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c..b8cf3687804b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4561,7 +4561,8 @@ static void migrate_se_pelt_lag(struct sched_entity *se) {}
>  static inline int
>  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
> -       unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> +       unsigned long removed_load_sum = 0, removed_load = 0;
> +       unsigned long removed_util = 0, removed_runnable = 0;
>         struct sched_avg *sa = &cfs_rq->avg;
>         int decayed = 0;
>
> @@ -4572,6 +4573,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                 raw_spin_lock(&cfs_rq->removed.lock);
>                 swap(cfs_rq->removed.util_avg, removed_util);
>                 swap(cfs_rq->removed.load_avg, removed_load);
> +               swap(cfs_rq->removed.load_sum, removed_load_sum);
>                 swap(cfs_rq->removed.runnable_avg, removed_runnable);
>                 cfs_rq->removed.nr = 0;
>                 raw_spin_unlock(&cfs_rq->removed.lock);
> @@ -4609,8 +4611,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                  * removed_runnable is the unweighted version of removed_load so we
>                  * can use it to estimate removed_load_sum.
>                  */
> -               add_tg_cfs_propagate(cfs_rq,
> -                       -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> +               add_tg_cfs_propagate(cfs_rq, -(long)removed_load_sum);
>
>                 decayed = 1;
>         }
> @@ -4792,6 +4793,7 @@ static void remove_entity_load_avg(struct sched_entity *se)
>         ++cfs_rq->removed.nr;
>         cfs_rq->removed.util_avg        += se->avg.util_avg;
>         cfs_rq->removed.load_avg        += se->avg.load_avg;
> +       cfs_rq->removed.load_sum        += se->avg.load_sum;
>         cfs_rq->removed.runnable_avg    += se->avg.runnable_avg;
>         raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index be9745d104f7..659935a5c694 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -682,6 +682,7 @@ struct cfs_rq {
>         struct {
>                 raw_spinlock_t  lock ____cacheline_aligned;
>                 int             nr;
> +               unsigned long   load_sum;
>                 unsigned long   load_avg;
>                 unsigned long   util_avg;
>                 unsigned long   runnable_avg;
> --
> 2.43.0
>
