Message-ID: <1081f8ca-3c1b-46fe-a7d4-31303e5e2298@arm.com>
Date: Tue, 30 Sep 2025 09:45:19 +0200
From: Pierre Gondois <pierre.gondois@....com>
To: hupu <hupu.gm@...il.com>, vincent.guittot@...aro.org
Cc: peterz@...radead.org, juri.lelli@...hat.com, mingo@...hat.com,
rostedt@...dmis.org, dietmar.eggemann@....com, vschneid@...hat.com,
bsegall@...gle.com, linux-kernel@...r.kernel.org, mgorman@...e.de
Subject: Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load
propagation
Hello Hupu,
On 9/10/25 10:43, hupu wrote:
> Currently, load_sum to be propagated is estimated from
> (removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
> runnable_avg as an approximation. This approach can introduce precision
> loss due to the shift operation, and the error may become more visible
> when small tasks frequently enter and leave the queue.
>
> This patch introduces removed.load_sum to directly accumulate
> se->avg.load_sum when tasks dequeue, and uses it during load
> propagation. By doing so:
>
> a) Avoid relying on runnable_avg-based approximation and obtain
> higher precision in load_sum propagation;
(runnable_sum == load_sum) is not exactly accurate anymore since:
static inline long se_runnable(struct sched_entity *se)
{
        if (se->sched_delayed)
                return false;
        return !!se->on_rq;
}
So obtaining load_[sum|avg] from the runnable_avg signal seems compromised.
It is possible to compute the load_sum value without the runnable signal, cf.
40f5aa4c5eae ("sched/pelt: Fix attach_entity_load_avg() corner case")
https://lore.kernel.org/all/20220414090229.342-1-kuyo.chang@mediatek.com/T/#u
I.e.:
+	se->avg.load_sum = se->avg.load_avg * divider;
+	if (se_weight(se) < se->avg.load_sum)
+		se->avg.load_sum = div_u64(se->avg.load_sum, se_weight(se));
+	else
+		se->avg.load_sum = 1;
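For context on why that reconstruction works: ___update_load_avg() computes
load_avg roughly as weight * load_sum / divider, so multiplying load_avg back
by the divider and dividing by the weight recovers an approximation of
load_sum. A standalone sketch (userspace code, illustrative values only):

#include <stdio.h>

int main(void)
{
        unsigned long weight   = 1024;   /* nice 0 weight */
        unsigned long divider  = 47742;  /* approximate PELT divider */
        unsigned long load_sum = 30000;  /* made-up value */

        /* Forward step, as in ___update_load_avg(). */
        unsigned long load_avg = weight * load_sum / divider;

        /* Inverse step, mirroring the snippet quoted above. */
        unsigned long tmp = load_avg * divider;
        unsigned long recovered = (weight < tmp) ? tmp / weight : 1;

        printf("load_sum=%lu load_avg=%lu recovered=%lu\n",
               load_sum, load_avg, recovered);
        return 0;
}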
As a side note, and as a counterpart to the above patch: the lower the priority
(i.e. the higher the nice value), the lower the weight in sched_prio_to_weight[]
and the lower the task's load signal.
This means that the unweighted load_sum value loses granularity.
E.g.:
A task with weight=15 can have load_avg values in [0:15]. So all the values
for load_sum in the range [X * (47742/15) : (X + 1) * (47742/15)]
are floored to load_avg=X, but load_sum is not reset when computing
load_avg.
attach_entity_load_avg() however resets load_sum to X * (47742/15).
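To make that granularity loss concrete, a quick sketch (userspace code,
illustrative values: weight=15 is the nice 19 entry of sched_prio_to_weight[],
47742 approximates the PELT divider):

#include <stdio.h>

int main(void)
{
        unsigned long weight  = 15;
        unsigned long divider = 47742;

        /* ~3183 consecutive load_sum values floor to the same load_avg. */
        for (unsigned long load_sum = 3180; load_sum <= 3186; load_sum++)
                printf("load_sum=%lu -> load_avg=%lu\n",
                       load_sum, weight * load_sum / divider);
        return 0;
}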
> b) Eliminate precision loss from the shift operation, especially
> with frequent short-lived tasks;
It might also help when aggregating multiple tasks. Right now, if 1000 tasks
with load_sum=1 and load_avg=0 are removed, the rq's load signal is not
impacted at all.
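A toy illustration of that aggregation effect (not kernel code, the per-task
values are hypothetical):

#include <stdio.h>

int main(void)
{
        unsigned long weight = 15, divider = 47742;
        unsigned long removed_load_avg = 0, removed_load_sum = 0;

        /* 1000 tiny tasks, each with load_sum = 1 (load_avg floors to 0). */
        for (int i = 0; i < 1000; i++) {
                unsigned long task_load_sum = 1;
                removed_load_avg += weight * task_load_sum / divider;  /* += 0 */
                removed_load_sum += task_load_sum;                     /* += 1 */
        }

        /* An avg-based estimate propagates nothing; the sum keeps 1000. */
        printf("avg-based: %lu, sum-based: %lu\n",
               removed_load_avg, removed_load_sum);
        return 0;
}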
On the other hand, there might also be questions about whether the PELT windows
of all these tasks and of the rq are aligned, cf.
https://lore.kernel.org/all/20170512171336.148578996@infradead.org/
> c) Reduce one multiplication and shift in the hotpath, which
> theoretically lowers overhead (though the impact is minor).
>
> Signed-off-by: hupu <hupu.gm@...il.com>
> ---
> kernel/sched/fair.c | 8 +++++---
> kernel/sched/sched.h | 1 +
> 2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c..b8cf3687804b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4561,7 +4561,8 @@ static void migrate_se_pelt_lag(struct sched_entity *se) {}
>  static inline int
>  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>  {
> -        unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> +        unsigned long removed_load_sum = 0, removed_load = 0;
> +        unsigned long removed_util = 0, removed_runnable = 0;
>          struct sched_avg *sa = &cfs_rq->avg;
>          int decayed = 0;
>
> @@ -4572,6 +4573,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                  raw_spin_lock(&cfs_rq->removed.lock);
>                  swap(cfs_rq->removed.util_avg, removed_util);
>                  swap(cfs_rq->removed.load_avg, removed_load);
> +                swap(cfs_rq->removed.load_sum, removed_load_sum);
>                  swap(cfs_rq->removed.runnable_avg, removed_runnable);
>                  cfs_rq->removed.nr = 0;
>                  raw_spin_unlock(&cfs_rq->removed.lock);
> @@ -4609,8 +4611,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>                   * removed_runnable is the unweighted version of removed_load so we
>                   * can use it to estimate removed_load_sum.
>                   */
> -                add_tg_cfs_propagate(cfs_rq,
> -                        -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> +                add_tg_cfs_propagate(cfs_rq, -(long)removed_load_sum);
>
>                  decayed = 1;
>          }
> @@ -4792,6 +4793,7 @@ static void remove_entity_load_avg(struct sched_entity *se)
>          ++cfs_rq->removed.nr;
>          cfs_rq->removed.util_avg += se->avg.util_avg;
>          cfs_rq->removed.load_avg += se->avg.load_avg;
> +        cfs_rq->removed.load_sum += se->avg.load_sum;
>          cfs_rq->removed.runnable_avg += se->avg.runnable_avg;
>          raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index be9745d104f7..659935a5c694 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -682,6 +682,7 @@ struct cfs_rq {
>          struct {
>                  raw_spinlock_t  lock ____cacheline_aligned;
>                  int             nr;
> +                unsigned long   load_sum;
>                  unsigned long   load_avg;
>                  unsigned long   util_avg;
>                  unsigned long   runnable_avg;