Message-ID: <CADHxFxTkexicChcg3To4=AsX8c+s2RNWZ5NfA9UBLMfYRZtmKg@mail.gmail.com>
Date: Fri, 12 Sep 2025 14:44:29 +0800
From: hupu <hupu.gm@...il.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation

Hi Vincent Guittot,
Thank you very much for your reply.

On Thu, Sep 11, 2025 at 4:01 PM Vincent Guittot
<vincent.guittot@...aro.org> wrote:
>
> On Wed, 10 Sept 2025 at 10:43, hupu <hupu.gm@...il.com> wrote:
> >
> > Currently, load_sum to be propagated is estimated from
> > (removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
> > runnable_avg as an approximation. This approach can introduce precision
> > loss due to the shift operation, and the error may become more visible
> > when small tasks frequently enter and leave the queue.
>
> Do you have a level of error ? Do you have a typical use case ?
>

In fact, I derived the error here from the underlying mathematical
relationship. The error mainly comes from two sources:
a) The approximation of load_sum using runnable_avg;
b) The truncation introduced by the right shift (SCHED_CAPACITY_SHIFT).

Below is the detailed derivation and explanation.

removed_runnable records the sum of se->avg.runnable_avg for tasks
that have migrated to another CPU. It represents the decayed
cumulative contribution of a task’s runtime, with the unit being
microseconds (μs). Right-shifting by SCHED_CAPACITY_SHIFT (10 bits) is
equivalent to truncating the low part below 1024 μs. In other words,
if a task has accumulated less than 1024 μs of runtime before
dequeueing, its load contribution will be completely discarded by the
shift operation. Even if the accumulated runtime exceeds 1024 μs, the
shift may still introduce up to nearly 1024 μs of truncation error
(about 1 ms).

For example, suppose a task has accumulated 4095 μs (4.095 ms) of
runtime on CPU0, then goes to sleep and is migrated to CPU1 upon
wakeup. Ideally, CPU0 should remove that task’s contribution in full.
After the shift, however, only 4095 >> 10 = 3 units (about 3072 μs,
i.e. ~3 ms) of contribution are removed, so CPU0 still retains about
1023 μs (~1 ms) of the task’s contribution.
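
To make the arithmetic concrete, here is a tiny userspace sketch (plain
C, not kernel code) that reproduces the simplified example above; it
models only the bare right shift, not the full kernel computation:

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT 10	/* same shift width as the kernel's */

int main(void)
{
	unsigned long accumulated_us = 4095;	/* decayed contribution from the example above */

	/* what the shift keeps, scaled back to microseconds */
	unsigned long removed_us  = (accumulated_us >> SCHED_CAPACITY_SHIFT) << SCHED_CAPACITY_SHIFT;
	/* what is left behind on CPU0 */
	unsigned long retained_us = accumulated_us - removed_us;

	printf("accumulated=%lu us removed=%lu us retained=%lu us\n",
	       accumulated_us, removed_us, retained_us);	/* 4095 / 3072 / 1023 */
	return 0;
}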

Experimental results also confirm this. By adding debug output before
add_tg_cfs_propagate and comparing the actual removed_load_sum (the
true value to be removed) with the approximate value obtained through
the shift, we observed that the latter is smaller in most cases. This
discrepancy is exactly due to the approximation and truncation
introduced by the shift.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
old mode 100644
new mode 100755
index b173a059315c..92396da04520
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4561,7 +4561,8 @@ static void migrate_se_pelt_lag(struct sched_entity *se) {}
 static inline int
 update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
 {
-       unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
+       unsigned long removed_load_sum = 0, removed_load = 0;
+       unsigned long removed_util = 0, removed_runnable = 0;
        struct sched_avg *sa = &cfs_rq->avg;
        int decayed = 0;

@@ -4572,6 +4573,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
                raw_spin_lock(&cfs_rq->removed.lock);
                swap(cfs_rq->removed.util_avg, removed_util);
                swap(cfs_rq->removed.load_avg, removed_load);
+               swap(cfs_rq->removed.load_sum, removed_load_sum);
                swap(cfs_rq->removed.runnable_avg, removed_runnable);
                cfs_rq->removed.nr = 0;
                raw_spin_unlock(&cfs_rq->removed.lock);
@@ -4609,8 +4611,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
                 * removed_runnable is the unweighted version of removed_load so we
                 * can use it to estimate removed_load_sum.
                 */
-               add_tg_cfs_propagate(cfs_rq,
-                       -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
+               trace_printk("DEBUG BYHP: removed_load_sum=%lu, raw_removed_runnable_sum=%lu\n",
+                               (long)removed_load_sum,
+                               (long)((removed_runnable * divider) >> SCHED_CAPACITY_SHIFT));
+               add_tg_cfs_propagate(cfs_rq, -(long)removed_load_sum);

                decayed = 1;
        }
@@ -4792,6 +4796,7 @@ static void remove_entity_load_avg(struct sched_entity *se)
        ++cfs_rq->removed.nr;
        cfs_rq->removed.util_avg        += se->avg.util_avg;
        cfs_rq->removed.load_avg        += se->avg.load_avg;
+       cfs_rq->removed.load_sum        += se->avg.load_sum;
        cfs_rq->removed.runnable_avg    += se->avg.runnable_avg;
        raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index be9745d104f7..659935a5c694 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -682,6 +682,7 @@ struct cfs_rq {
        struct {
                raw_spinlock_t  lock ____cacheline_aligned;
                int             nr;
+               unsigned long   load_sum;
                unsigned long   load_avg;
                unsigned long   util_avg;
                unsigned long   runnable_avg;

The logs are as follows: raw_removed_runnable_sum is often smaller
than removed_load_sum. This difference is exactly caused by the
approximate calculation and the truncation introduced by bit shifting.

   stress-ng-cpu-183     (-------) [001] dnh1.   144.338335: update_load_avg: DEBUG BYHP: removed_load_sum=35463, raw_removed_runnable_sum=35429
   stress-ng-cpu-184     (-------) [007] dNs1.   144.346203: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=20607, raw_removed_runnable_sum=20496
   stress-ng-cpu-185     (-------) [001] d.h1.   144.568803: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
   stress-ng-cpu-183     (-------) [000] d.h1.   145.526897: update_load_avg: DEBUG BYHP: removed_load_sum=11103, raw_removed_runnable_sum=11072
   stress-ng-cpu-183     (-------) [000] d.h1.   145.563980: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
   stress-ng-cpu-185     (-------) [002] d..2.   145.593563: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
   stress-ng-cpu-181     (-------) [005] d.s1.   145.653525: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=2537, raw_removed_runnable_sum=2508
   stress-ng-cpu-183     (-------) [003] d.s1.   145.657599: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=28510, raw_removed_runnable_sum=28473
   stress-ng-cpu-180     (-------) [007] d.h1.   146.049167: update_load_avg: DEBUG BYHP: removed_load_sum=9548, raw_removed_runnable_sum=9526
   stress-ng-cpu-184     (-------) [005] d.h1.   146.057200: update_load_avg: DEBUG BYHP: removed_load_sum=5974, raw_removed_runnable_sum=5963
   stress-ng-cpu-182     (-------) [000] d.s1.   146.062025: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=55, raw_removed_runnable_sum=45
      kcompactd0-65      (-------) [001] d..2.   146.095334: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
      kcompactd0-65      (-------) [001] d..2.   146.095433: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=17493, raw_removed_runnable_sum=17461
   stress-ng-cpu-186     (-------) [006] d.h1.   146.118910: update_load_avg: DEBUG BYHP: removed_load_sum=11404, raw_removed_runnable_sum=11389
   stress-ng-cpu-186     (-------) [000] d.h1.   147.112614: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
             cat-234     (-------) [005] d.s2.   147.161900: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=36778, raw_removed_runnable_sum=36768
   stress-ng-cpu-181     (-------) [004] d.h1.   147.406979: update_load_avg: DEBUG BYHP: removed_load_sum=3029, raw_removed_runnable_sum=3014
   stress-ng-cpu-185     (-------) [003] d.s1.   147.474502: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=1242, raw_removed_runnable_sum=1205
   stress-ng-cpu-186     (-------) [000] d.h1.   147.533368: update_load_avg: DEBUG BYHP: removed_load_sum=11, raw_removed_runnable_sum=0
   stress-ng-cpu-181     (-------) [001] d.s1.   148.341639: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=15837, raw_removed_runnable_sum=15804
     migration/7-51      (-------) [007] d..2.   148.384219: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
          <idle>-0       (-------) [004] d.s2.   148.431501: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=4292, raw_removed_runnable_sum=1924
             cat-234     (-------) [007] d.h1.   148.434474: update_load_avg: DEBUG BYHP: removed_load_sum=10380, raw_removed_runnable_sum=9945
   stress-ng-cpu-184     (-------) [001] d.h1.   148.853949: update_load_avg: DEBUG BYHP: removed_load_sum=15896, raw_removed_runnable_sum=15869
   stress-ng-cpu-185     (-------) [007] d.s1.   148.862267: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=31, raw_removed_runnable_sum=0
   stress-ng-cpu-183     (-------) [006] d.h1.   149.157805: update_load_avg: DEBUG BYHP: removed_load_sum=20553, raw_removed_runnable_sum=20527
   stress-ng-cpu-179     (-------) [007] d.h1.   149.330189: update_load_avg: DEBUG BYHP: removed_load_sum=9204, raw_removed_runnable_sum=9177
   stress-ng-cpu-185     (-------) [002] d.h1.   149.434768: update_load_avg: DEBUG BYHP: removed_load_sum=11198, raw_removed_runnable_sum=11176
          <idle>-0       (-------) [006] dNs2.   149.456004: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=2826, raw_removed_runnable_sum=465
   stress-ng-cpu-184     (-------) [005] d.s1.   149.483636: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=11607, raw_removed_runnable_sum=11595
   stress-ng-cpu-186     (-------) [001] d.h1.   149.668063: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
          <idle>-0       (-------) [001] d.h2.   149.672477: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
   stress-ng-cpu-183     (-------) [007] d.s1.   149.684045: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=5657, raw_removed_runnable_sum=5458
     ksoftirqd/1-22      (-------) [001] d.s1.   149.700089: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
   stress-ng-cpu-183     (-------) [004] d.h1.   149.807666: update_load_avg: DEBUG BYHP: removed_load_sum=10481, raw_removed_runnable_sum=10474
   stress-ng-cpu-184     (-------) [000] d.h1.   149.817148: update_load_avg: DEBUG BYHP: removed_load_sum=3, raw_removed_runnable_sum=0
          <idle>-0       (-------) [001] d.s2.   149.866309: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=4010, raw_removed_runnable_sum=3999
   stress-ng-cpu-184     (-------) [000] d.s1.   149.914423: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=1920, raw_removed_runnable_sum=1876




> >
> > This patch introduces removed.load_sum to directly accumulate
> > se->avg.load_sum when tasks dequeue, and uses it during load
> > propagation. By doing so:
> >
> >   a) Avoid relying on runnable_avg-based approximation and obtain
> >      higher precision in load_sum propagation;
> >   b) Eliminate precision loss from the shift operation, especially
> >      with frequent short-lived tasks;
> >   c) Reduce one multiplication and shift in the hotpath, which
> >      theoretically lowers overhead (though the impact is minor).
>
> This doesn't work because rq's load_sum tracks current weight whereas
> se's load_sum doesn't include the weight which is only applied on se's
> load_avg. So you can't directly add/sub se's load_sum and rq's
> load_sum. Only load_avg of both se and rq use the same unit.
>

I understand and agree with your point: cfs_rq->avg.load_sum includes
the weight while se->avg.load_sum does not, so the two are indeed in
different units and cannot be directly added or subtracted.

However, in this patch we DO NOT directly add or subtract
se->avg.load_sum to/from cfs_rq->avg.load_sum. Instead, the load_sum
of dequeued tasks is accumulated into cfs_rq->prop_runnable_sum.
Later, in update_tg_cfs_load, this prop_runnable_sum is used to
recompute the load_sum and load_avg of both gse and cfs_rq.

In other words, the update here is performed via recomputation rather
than direct arithmetic, so the “unit mismatch” issue you mentioned
does not occur.
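
For reference, add_tg_cfs_propagate itself only accumulates the passed
value into prop_runnable_sum and marks the cfs_rq for propagation
(quoted from memory of recent kernel/sched/fair.c, so the exact form
may differ across versions):

static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum)
{
	cfs_rq->propagate = 1;
	cfs_rq->prop_runnable_sum += runnable_sum;
}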

update_tg_cfs_load
  |-- long delta_avg, runnable_sum = gcfs_rq->prop_runnable_sum;
  |
  |   runnable_sum += gse->avg.load_sum;
  |
  |   load_sum = se_weight(gse) * runnable_sum;
  |   load_avg = div_u64(load_sum, divider);
  |
  |   delta_avg = load_avg - gse->avg.load_avg;
  |   delta_sum = load_sum - (s64)se_weight(gse) * gse->avg.load_sum;
  |
  |-- /* Recalculate the load_sum and load_avg of gse. */
  |   gse->avg.load_sum = runnable_sum;
  |   gse->avg.load_avg = load_avg;
  |
  |-- /* Recalculate the load_sum and load_avg of cfs_rq. */
  |   add_positive(&cfs_rq->avg.load_avg, delta_avg);
  |   add_positive(&cfs_rq->avg.load_sum, delta_sum);




>
> Then, why is it a problem only for load and not util or runnable ?
>

I broke this question into three parts and derived the related
formulas step by step.


Q1: Why doesn’t util_avg have the same problem?
=================================================
A1: Because the formulas of util_avg / util_sum are exactly the same
for both se and cfs_rq. Neither involves weight, and the units are
consistent, so direct addition and subtraction are valid.

The formulas for a sched_entity (tse) are:

             util_sum
util_avg = ------------
              divider


    decay(history) + contrib(running) * 1024
  = ----------------------------------------
                  divider


    (decay(history) + contrib(running)) * 1024
 ~= ------------------------------------------
                  divider


    decay(history) + contrib(running)
  = --------------------------------- * 1024
                  divider


Where:
util_sum = decay(history) + contrib(running) * 1024
util_avg < 1024
decay(history): represents geometric decay of the historical contribution (time)
contrib(running): contribution added when the task is in running state


For cfs_rq, the formulas of util_avg / util_sum are:

             util_sum
util_avg = ------------
              divider


    decay(history) + contrib(running) * 1024
  = ----------------------------------------
                  divider


    (decay(history) + contrib(running)) * 1024
 ~= ------------------------------------------
                      divider


    decay(history) + contrib(running)
  = --------------------------------- * 1024
                 divider


Where:
util_sum = decay(history) + contrib(running) * 1024
util_avg < 1024
decay(history): represents geometric decay of the historical contribution (time)
contrib(running): contribution added when the task is in running state


Therefore, se and cfs_rq share the same units for util_avg / util_sum,
which makes direct addition and subtraction valid. This is also why
update_tg_cfs_util performs direct subtraction when updating:

update_tg_cfs_util
  |-- /* Calculate the delta between gse and gcfs_rq directly by subtraction. */
  |   long delta_avg = gcfs_rq->avg.util_avg - se->avg.util_avg;
  |
  |-- /* Set new sched_entity's utilization */
  |   se->avg.util_avg = gcfs_rq->avg.util_avg;
  |   new_sum = se->avg.util_avg * divider;
  |   delta_sum = (long)new_sum - (long)se->avg.util_sum;
  |   se->avg.util_sum = new_sum;
  |
  |-- /* Update parent cfs_rq utilization */
  |   add_positive(&cfs_rq->avg.util_avg, delta_avg);
  |   add_positive(&cfs_rq->avg.util_sum, delta_sum);


Q2: Why doesn’t runnable_avg have the same problem?
===================================================
A2: Similarly, the runnable_avg / runnable_sum of se and cfs_rq also
do not include weight.

The calculation formulas for tse's runnable_avg and runnable_sum are as follows:

                 runnable_sum
runnable_avg = ----------------
                   divider


    decay(history) + contrib(running + runnable) * 1024
  = ---------------------------------------------------
                        divider


    (decay(history) + contrib(running + runnable)) * 1024
 ~= -----------------------------------------------------
                         divider


    decay(history) + contrib(running + runnable)
  = -------------------------------------------- * 1024
                      divider


Where:
runnable_sum = decay(history) + contrib(running + runnable) * 1024
runnable_avg < 1024
decay(history): represents geometric decay of the historical contribution (time)
contrib(running + runnable): contribution added when the task is in
running and runnable states


The calculation formulas for cfs_rq's runnable_avg and runnable_sum
are as follows:

                 runnable_sum
runnable_avg = ----------------
                   divider


    decay(history) + contrib(running + runnable) * cfs_rq->h_nr_running * 1024
  = --------------------------------------------------------------------------
                                     divider


  (decay(history) + contrib(running + runnable)) * cfs_rq->h_nr_running * 1024
 ~= --------------------------------------------------------------------------
                                   divider


    decay(history) + contrib(running + runnable)
  = -------------------------------------------- * cfs_rq->h_nr_running * 1024
                      divider


Where:
runnable_sum = decay(history) + contrib(running + runnable) * cfs_rq->h_nr_running * 1024
runnable_avg < cfs_rq->h_nr_running * 1024
decay(history): represents geometric decay of the historical contribution (time)
contrib(running + runnable): contribution added when the task is in
running and runnable states


The runnable statistic of a cfs_rq is represented by h_nr_running, the
number of runnable tasks, which can be regarded as the accumulated
runnable contribution of all its sched entities. Therefore,
update_tg_cfs_runnable can likewise perform the update by direct
subtraction.

update_tg_cfs_runnable
  |-- /* Calculate the delta directly by subtraction. */
  |   long delta_avg = gcfs_rq->avg.runnable_avg - gse->avg.runnable_avg;
  |
  |-- /* Set new sched_entity's runnable */
  |   gse->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
  |   new_sum = gse->avg.runnable_avg * divider;
  |   delta_sum = (long)new_sum - (long)gse->avg.runnable_sum;
  |   gse->avg.runnable_sum = new_sum;
  |
  |-- /* Update parent cfs_rq runnable */
  |   add_positive(&cfs_rq->avg.runnable_avg, delta_avg);
  |   add_positive(&cfs_rq->avg.runnable_sum, delta_sum);


Q3: Why does load_avg have this problem?
========================================
A3: The key difference is that cfs_rq’s load_avg and load_sum include
weight information, while a task’s load_sum does not. Moreover, we
cannot reconstruct load_sum from a task’s load_avg.

For a task (tse), the formulas are:

             load_sum
load_avg = ------------ * se_weight(se)
              divider


   decay(history) + contrib(running + runnable)
 = -------------------------------------------- * se_weight(se)
                     divider


Where:
load_sum = decay(history) + contrib(running + runnable)
load_avg < se_weight(se)
decay(history): represents geometric decay of the historical contribution (time)
contrib(running + runnable): contribution added when the task is in
running and runnable states


For a cfs_rq, the formulas are:

             load_sum
load_avg = ------------
              divider


    decay(history) + contrib(running + runnable) * cfs_rq->load.weight
  = ------------------------------------------------------------------
                                  divider


    (decay(history) + contrib(running + runnable)) * cfs_rq->load.weight
 ~= --------------------------------------------------------------------
                                 divider


    decay(history) + contrib(running + runnable)
  = -------------------------------------------- * cfs_rq->load.weight
                      divider


Where:
load_sum = decay(history) + contrib(running + runnable) * cfs_rq->load.weight
load_avg < cfs_rq->load.weight
decay(history): represents geometric decay of the historical contribution (time)
contrib(running + runnable): contribution added when the task is in
running and runnable states


From these formulas, we can see that tse’s load_sum does not include
weight, while cfs_rq’s does. Their units differ, so they cannot be
directly added or subtracted. In addition, weight is a "historical
variable" that changes over time, which means load_sum cannot be
derived from se’s load_avg.

To propagate load_sum changes from a child cfs_rq, the upstream
implementation uses runnable_avg to APPROXIMATE load_sum. The
derivation is as follows.

For tse:

                 runnable_sum
runnable_avg = ----------------
                   divider

    decay(history) + contrib(running + runnable) * 1024
  = ---------------------------------------------------
                     divider

     decay(history) + contrib(running + runnable)
 ~= --------------------------------------------- * 1024
                     divider


Thus:

decay(history) + contrib(running + runnable)


     runnable_avg * divider
 ~= -----------------------
             1024


 ~= (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT        (1)


Equation (1) represents the contribution when a task is in running or
runnable state. The kernel refers to this as the "unweighted version
of removed_load", which is essentially time with exponential decay
applied.

It is worth noting that equation (1) happens to be equal to tse’s
load_sum. Therefore, for tse we have:

load_sum ~= (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT     (2)

However, equation (2) itself is only an approximation, and the
right-shift operation introduces further truncation error.
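
As a rough illustration of where the gap in the trace data comes from,
here is a userspace toy model built only from the formulas above (the
divider value is illustrative and the real kernel path differs in
detail): runnable_avg is already rounded down by the integer division,
and equation (2) rounds down again through the shift, so the
approximation ends up slightly smaller than the true load_sum.

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT 10

int main(void)
{
	unsigned long divider  = 47742;	/* illustrative PELT divider, assumed for this demo */
	unsigned long load_sum = 35463;	/* pretend this is the true decayed contribution */

	/* runnable_avg ~= load_sum * 1024 / divider (per the formulas above), rounded down */
	unsigned long runnable_avg = (load_sum << SCHED_CAPACITY_SHIFT) / divider;

	/* equation (2): the upstream approximation, rounded down once more by the shift */
	unsigned long approx = (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT;

	printf("load_sum=%lu approx=%lu lost=%lu\n",
	       load_sum, approx, load_sum - approx);	/* e.g. 35463 / 35433 / 30 */
	return 0;
}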

My idea is that since the task’s load_sum already exists when it
dequeues, it is more reasonable to record this value directly. By
doing so, we can avoid both the approximation and the shift-induced
error. This is the motivation behind introducing removed.load_sum in
my patch.




> Also we don't want to track both load_sum and load_avg, only one is
> enough and by the above it is load_avg
>

My view is consistent with yours, but I personally lean towards
keeping load_sum and deprecating load_avg, since load_avg does not
seem to be accurate.

That said, this is only my personal perspective and may not be
comprehensive. I would like to further discuss the feasibility of this
approach with you.

Thanks.
hupu
