Message-ID: <CAKfTPtDCaTP_mX+5RKc9Xcw0m6ZxzyUk1FNPNjO1OOQ+zp_DkQ@mail.gmail.com>
Date: Tue, 23 Sep 2025 14:14:51 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: hupu <hupu.gm@...il.com>
Cc: mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com, 
	dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com, 
	mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org
Subject: Re: [RESEND][RFC] sched: Introduce removed.load_sum for precise load propagation

Hi hupu,

I have been busy with other stuff so far and haven't been able to look
at your reply, but it's on my todo list.

Vincent

On Tue, 23 Sept 2025 at 13:17, hupu <hupu.gm@...il.com> wrote:
>
> Hi Vincent Guittot and fellow maintainers,
>
> As I mentioned in my previous email, I think there may be a
> misunderstanding on your part regarding the modification in this
> patch.
>
> This change does *NOT* directly add `se->avg.load_sum` to
> `cfs_rq->avg.load_sum`. In the end, it will also *NOT* result in
> tracking both `load_avg` and `load_sum` simultaneously. My current
> preference is to retain `load_sum` and eventually deprecate
> `load_avg`, but only after we’ve confirmed through discussion that
> this approach is reasonable.
>
> The intent of this proposal lies in two key aspects, which I believe
> can improve the accuracy of `cfs_rq` load statistics:
> 1. Avoid using `runnable_avg` to approximate `load_sum`, as such
> approximations inherently introduce errors. Instead, recording the
> `load_sum` of the task at the time of dequeue is more reasonable.
> 2. Avoid the additional error introduced by right-shift operations.
>
> Could you please review the feasibility of this proposal again when
> you have time?
> Your feedback would be greatly appreciated.
>
> Thanks and best regards,
> hupu
>
> On Fri, Sep 12, 2025 at 2:44 PM hupu <hupu.gm@...il.com> wrote:
> >
> > Hi Vincent Guittot,
> > Thank you very much for your reply.
> >
> > On Thu, Sep 11, 2025 at 4:01 PM Vincent Guittot
> > <vincent.guittot@...aro.org> wrote:
> > >
> > > On Wed, 10 Sept 2025 at 10:43, hupu <hupu.gm@...il.com> wrote:
> > > >
> > > > Currently, load_sum to be propagated is estimated from
> > > > (removed_runnable * divider) >> SCHED_CAPACITY_SHIFT, which relies on
> > > > runnable_avg as an approximation. This approach can introduce precision
> > > > loss due to the shift operation, and the error may become more visible
> > > > when small tasks frequently enter and leave the queue.
> > >
> > > Do you have a level of error ? Do you have a typical use case ?
> > >
> >
> > In fact, I derived the error here from the underlying mathematical
> > relationship. The error mainly comes from two sources:
> > a) The approximation of load_sum using runnable_avg;
> > b) The truncation introduced by the right shift (SCHED_CAPACITY_SHIFT).
> >
> > Below is the detailed derivation and explanation.
> >
> > removed_runnable records the sum of se->avg.runnable_avg for tasks
> > that have migrated to another CPU. It represents the decayed
> > cumulative contribution of a task’s runtime, with the unit being
> > microseconds (μs). Right-shifting by SCHED_CAPACITY_SHIFT (10 bits) is
> > equivalent to truncating the low part below 1024 μs. In other words,
> > if a task has accumulated less than 1024 μs of runtime before
> > dequeueing, its load contribution will be completely discarded by the
> > shift operation. Even if the accumulated runtime exceeds 1024 μs, the
> > shift may still introduce up to nearly 1024 μs of truncation error
> > (about 1 ms).
> >
> > For example, suppose a task has accumulated 4095 μs (4.095 ms) of
> > runtime on CPU0, then goes to sleep and is migrated to CPU1 upon
> > wakeup. Ideally, CPU0 should remove that task’s contribution fully.
> > However, after the shift, the result is 4095 >> 10 = 3 (about 3 ms),
> > which means CPU0 will still retain about 1023 μs (~1.023 ms) of the
> > task’s contribution.
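> >
> > As a rough self-contained illustration (a userspace sketch only, not
> > kernel code; the divider value below is just an illustrative PELT
> > divider), both error sources can be reproduced like this:
> >
> > #include <stdio.h>
> >
> > int main(void)
> > {
> >         /* Hypothetical numbers, for illustration only. */
> >         unsigned long load_sum = 4095;   /* exact se->avg.load_sum at dequeue */
> >         unsigned long divider  = 47742;  /* illustrative PELT divider */
> >
> >         /* runnable_avg is roughly load_sum scaled by 1024 / divider. */
> >         unsigned long runnable_avg = (load_sum << 10) / divider;
> >
> >         /* Estimate used today: (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT */
> >         unsigned long approx = (runnable_avg * divider) >> 10;
> >
> >         printf("exact=%lu approx=%lu lost=%lu\n",
> >                load_sum, approx, load_sum - approx);
> >         return 0;
> > }
> >
> > With these made-up numbers the estimate comes out as 4056 while the
> > exact value is 4095, i.e. about 39 units of contribution are silently
> > left behind on the old CPU.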
> >
> > Experimental results also confirm this. By adding debug output before
> > add_tg_cfs_propagate and comparing the actual removed_load_sum (the
> > true value to be removed) with the approximate value obtained through
> > the shift, we observed that the latter is smaller in most cases. This
> > discrepancy is exactly due to the approximation and truncation
> > introduced by the shift.
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > old mode 100644
> > new mode 100755
> > index b173a059315c..92396da04520
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4561,7 +4561,8 @@ static void migrate_se_pelt_lag(struct sched_entity *se) {}
> >  static inline int
> >  update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >  {
> > -       unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > +       unsigned long removed_load_sum = 0, removed_load = 0;
> > +       unsigned long removed_util = 0, removed_runnable = 0;
> >         struct sched_avg *sa = &cfs_rq->avg;
> >         int decayed = 0;
> >
> > @@ -4572,6 +4573,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >                 raw_spin_lock(&cfs_rq->removed.lock);
> >                 swap(cfs_rq->removed.util_avg, removed_util);
> >                 swap(cfs_rq->removed.load_avg, removed_load);
> > +               swap(cfs_rq->removed.load_sum, removed_load_sum);
> >                 swap(cfs_rq->removed.runnable_avg, removed_runnable);
> >                 cfs_rq->removed.nr = 0;
> >                 raw_spin_unlock(&cfs_rq->removed.lock);
> > @@ -4609,8 +4611,10 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >                  * removed_runnable is the unweighted version of removed_load so we
> >                  * can use it to estimate removed_load_sum.
> >                  */
> > -               add_tg_cfs_propagate(cfs_rq,
> > -                       -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > +               trace_printk("DEBUG BYHP: removed_load_sum=%lu, raw_removed_runnable_sum=%lu\n",
> > +                               (long)removed_load_sum,
> > +                               (long)((removed_runnable * divider) >> SCHED_CAPACITY_SHIFT));
> > +               add_tg_cfs_propagate(cfs_rq, -(long)removed_load_sum);
> >
> >                 decayed = 1;
> >         }
> > @@ -4792,6 +4796,7 @@ static void remove_entity_load_avg(struct sched_entity *se)
> >         ++cfs_rq->removed.nr;
> >         cfs_rq->removed.util_avg        += se->avg.util_avg;
> >         cfs_rq->removed.load_avg        += se->avg.load_avg;
> > +       cfs_rq->removed.load_sum        += se->avg.load_sum;
> >         cfs_rq->removed.runnable_avg    += se->avg.runnable_avg;
> >         raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
> >  }
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index be9745d104f7..659935a5c694 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -682,6 +682,7 @@ struct cfs_rq {
> >         struct {
> >                 raw_spinlock_t  lock ____cacheline_aligned;
> >                 int             nr;
> > +               unsigned long   load_sum;
> >                 unsigned long   load_avg;
> >                 unsigned long   util_avg;
> >                 unsigned long   runnable_avg;
> >
> > The logs are as follows: raw_removed_runnable_sum is often smaller
> > than removed_load_sum. This difference is exactly caused by the
> > approximate calculation and the truncation introduced by bit shifting.
> >
> >    stress-ng-cpu-183     (-------) [001] dnh1.   144.338335: update_load_avg: DEBUG BYHP: removed_load_sum=35463, raw_removed_runnable_sum=35429
> >    stress-ng-cpu-184     (-------) [007] dNs1.   144.346203: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=20607, raw_removed_runnable_sum=20496
> >    stress-ng-cpu-185     (-------) [001] d.h1.   144.568803: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
> >    stress-ng-cpu-183     (-------) [000] d.h1.   145.526897: update_load_avg: DEBUG BYHP: removed_load_sum=11103, raw_removed_runnable_sum=11072
> >    stress-ng-cpu-183     (-------) [000] d.h1.   145.563980: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
> >    stress-ng-cpu-185     (-------) [002] d..2.   145.593563: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
> >    stress-ng-cpu-181     (-------) [005] d.s1.   145.653525: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=2537, raw_removed_runnable_sum=2508
> >    stress-ng-cpu-183     (-------) [003] d.s1.   145.657599: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=28510, raw_removed_runnable_sum=28473
> >    stress-ng-cpu-180     (-------) [007] d.h1.   146.049167: update_load_avg: DEBUG BYHP: removed_load_sum=9548, raw_removed_runnable_sum=9526
> >    stress-ng-cpu-184     (-------) [005] d.h1.   146.057200: update_load_avg: DEBUG BYHP: removed_load_sum=5974, raw_removed_runnable_sum=5963
> >    stress-ng-cpu-182     (-------) [000] d.s1.   146.062025: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=55, raw_removed_runnable_sum=45
> >       kcompactd0-65      (-------) [001] d..2.   146.095334: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
> >       kcompactd0-65      (-------) [001] d..2.   146.095433: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=17493, raw_removed_runnable_sum=17461
> >    stress-ng-cpu-186     (-------) [006] d.h1.   146.118910: update_load_avg: DEBUG BYHP: removed_load_sum=11404, raw_removed_runnable_sum=11389
> >    stress-ng-cpu-186     (-------) [000] d.h1.   147.112614: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
> >              cat-234     (-------) [005] d.s2.   147.161900: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=36778, raw_removed_runnable_sum=36768
> >    stress-ng-cpu-181     (-------) [004] d.h1.   147.406979: update_load_avg: DEBUG BYHP: removed_load_sum=3029, raw_removed_runnable_sum=3014
> >    stress-ng-cpu-185     (-------) [003] d.s1.   147.474502: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=1242, raw_removed_runnable_sum=1205
> >    stress-ng-cpu-186     (-------) [000] d.h1.   147.533368: update_load_avg: DEBUG BYHP: removed_load_sum=11, raw_removed_runnable_sum=0
> >    stress-ng-cpu-181     (-------) [001] d.s1.   148.341639: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=15837, raw_removed_runnable_sum=15804
> >      migration/7-51      (-------) [007] d..2.   148.384219: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
> >           <idle>-0       (-------) [004] d.s2.   148.431501: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=4292, raw_removed_runnable_sum=1924
> >              cat-234     (-------) [007] d.h1.   148.434474: update_load_avg: DEBUG BYHP: removed_load_sum=10380, raw_removed_runnable_sum=9945
> >    stress-ng-cpu-184     (-------) [001] d.h1.   148.853949: update_load_avg: DEBUG BYHP: removed_load_sum=15896, raw_removed_runnable_sum=15869
> >    stress-ng-cpu-185     (-------) [007] d.s1.   148.862267: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=31, raw_removed_runnable_sum=0
> >    stress-ng-cpu-183     (-------) [006] d.h1.   149.157805: update_load_avg: DEBUG BYHP: removed_load_sum=20553, raw_removed_runnable_sum=20527
> >    stress-ng-cpu-179     (-------) [007] d.h1.   149.330189: update_load_avg: DEBUG BYHP: removed_load_sum=9204, raw_removed_runnable_sum=9177
> >    stress-ng-cpu-185     (-------) [002] d.h1.   149.434768: update_load_avg: DEBUG BYHP: removed_load_sum=11198, raw_removed_runnable_sum=11176
> >           <idle>-0       (-------) [006] dNs2.   149.456004: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=2826, raw_removed_runnable_sum=465
> >    stress-ng-cpu-184     (-------) [005] d.s1.   149.483636: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=11607, raw_removed_runnable_sum=11595
> >    stress-ng-cpu-186     (-------) [001] d.h1.   149.668063: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
> >           <idle>-0       (-------) [001] d.h2.   149.672477: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
> >    stress-ng-cpu-183     (-------) [007] d.s1.   149.684045: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=5657, raw_removed_runnable_sum=5458
> >      ksoftirqd/1-22      (-------) [001] d.s1.   149.700089: update_load_avg: DEBUG BYHP: removed_load_sum=0, raw_removed_runnable_sum=0
> >    stress-ng-cpu-183     (-------) [004] d.h1.   149.807666: update_load_avg: DEBUG BYHP: removed_load_sum=10481, raw_removed_runnable_sum=10474
> >    stress-ng-cpu-184     (-------) [000] d.h1.   149.817148: update_load_avg: DEBUG BYHP: removed_load_sum=3, raw_removed_runnable_sum=0
> >           <idle>-0       (-------) [001] d.s2.   149.866309: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=4010, raw_removed_runnable_sum=3999
> >    stress-ng-cpu-184     (-------) [000] d.s1.   149.914423: sched_balance_update_blocked_averages: DEBUG BYHP: removed_load_sum=1920, raw_removed_runnable_sum=1876
> >
> >
> >
> >
> > > >
> > > > This patch introduces removed.load_sum to directly accumulate
> > > > se->avg.load_sum when tasks dequeue, and uses it during load
> > > > propagation. By doing so:
> > > >
> > > >   a) Avoid relying on runnable_avg-based approximation and obtain
> > > >      higher precision in load_sum propagation;
> > > >   b) Eliminate precision loss from the shift operation, especially
> > > >      with frequent short-lived tasks;
> > > >   c) Reduce one multiplication and shift in the hotpath, which
> > > >      theoretically lowers overhead (though the impact is minor).
> > >
> > > This doesn't work because rq's load_sum tracks current weight whereas
> > > se's load_sum doesn't include the weight which is only applied on se's
> > > load_avg. So you can't directly add/sub se's load_sum and rq's
> > > load_sum. Only load_avg of both se and rq use the same unit.
> > >
> >
> > I understand and agree with your point: cfs_rq->avg.load_sum includes
> > the weight while se->avg.load_sum does not, so the two are indeed in
> > different units and cannot be directly added or subtracted.
> >
> > However, in this patch we DO NOT directly add or subtract
> > se->avg.load_sum to/from cfs_rq->avg.load_sum. Instead, the load_sum
> > of dequeued tasks is accumulated into cfs_rq->prop_runnable_sum.
> > Later, in update_tg_cfs_load, this prop_runnable_sum is used to
> > recompute the load_sum and load_avg of both gse and cfs_rq.
> >
> > In other words, the update here is performed via recomputation rather
> > than direct arithmetic, so the “unit mismatch” issue you mentioned
> > does not occur.
> >
> > update_tg_cfs_load
> >   |-- long delta_avg, runnable_sum = gcfs_rq->prop_runnable_sum;
> >   |
> >   |   runnable_sum += gse->avg.load_sum;
> >   |
> >   |   load_sum = se_weight(gse) * runnable_sum;
> >   |   load_avg = div_u64(load_sum, divider);
> >   |
> >   |   delta_avg = load_avg - gse->avg.load_avg;
> >   |   delta_sum = load_sum - (s64)se_weight(gse) * gse->avg.load_sum;
> >   |
> >   |-- /* Recalculate the load_sum and load_avg of gse. */
> >   |   gse->avg.load_sum = runnable_sum;
> >   |   gse->avg.load_avg = load_avg;
> >   |
> >   |-- /* Recalculate the load_sum and load_avg of cfs_rq. */
> >   |   add_positive(&cfs_rq->avg.load_avg, delta_avg);
> >   |   add_positive(&cfs_rq->avg.load_sum, delta_sum);
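> >
> > To make this concrete, below is a minimal userspace sketch of that
> > recomputation with made-up numbers (a simplified struct, not the
> > kernel's sched_avg; the divider and weight are purely illustrative).
> > The point is that se_weight(gse) is applied only at this step, so the
> > unweighted prop_runnable_sum never mixes units with the parent's
> > weighted load:
> >
> > #include <stdio.h>
> >
> > struct avg { long load_sum; long load_avg; };
> >
> > int main(void)
> > {
> >         long divider = 47742;            /* illustrative PELT divider */
> >         long gse_weight = 1024;          /* hypothetical se_weight(gse) */
> >         struct avg gse = { .load_sum = 30000, .load_avg = 643 /* ~30000*1024/47742 */ };
> >         long prop_runnable_sum = -4095;  /* unweighted load_sum removed at dequeue */
> >
> >         /* Mirrors the update_tg_cfs_load steps shown above. */
> >         long runnable_sum = prop_runnable_sum + gse.load_sum;
> >         long load_sum  = gse_weight * runnable_sum;
> >         long load_avg  = load_sum / divider;
> >         long delta_avg = load_avg - gse.load_avg;
> >         long delta_sum = load_sum - gse_weight * gse.load_sum;
> >
> >         /* delta_avg / delta_sum are what add_positive() applies to the parent. */
> >         printf("delta_sum=%ld delta_avg=%ld\n", delta_sum, delta_avg);
> >         return 0;
> > }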
> >
> >
> >
> >
> > >
> > > Then, why is it a problem only for load and not util or runnable ?
> > >
> >
> > I broke this question into three parts and derived the related
> > formulas step by step.
> >
> >
> > Q1: Why doesn’t util_avg have the same problem?
> > =================================================
> > A1: Because the formulas of util_avg / util_sum are exactly the same
> > for both se and cfs_rq. Neither involves weight, and the units are
> > consistent, so direct addition and subtraction are valid.
> >
> > The formulas for a sched_entity (tse) are:
> >
> >              util_sum
> > util_avg = ------------
> >               divider
> >
> >
> >     decay(history) + contrib(running) * 1024
> >   = ----------------------------------------
> >                   divider
> >
> >
> >     (decay(history) + contrib(running)) * 1024
> >  ~= ------------------------------------------
> >                   divider
> >
> >
> >     decay(history) + contrib(running)
> >   = --------------------------------- * 1024
> >                   divider
> >
> >
> > Where:
> > util_sum = decay(history) + contrib(running) * 1024
> > util_avg < 1024
> > decay(history): represents geometric decay of the historical contribution (time)
> > contrib(running): contribution added when the task is in running state
> >
> >
> > For cfs_rq, the formulas of util_avg / util_sum are:
> >
> >              util_sum
> > util_avg = ------------
> >               divider
> >
> >
> >     decay(history) + contrib(running) * 1024
> >   = ----------------------------------------
> >                   divider
> >
> >
> >     (decay(history) + contrib(running)) * 1024
> >  ~= ------------------------------------------
> >                       divider
> >
> >
> >     decay(history) + contrib(running)
> >   = --------------------------------- * 1024
> >                  divider
> >
> >
> > Where:
> > util_sum = decay(history) + contrib(running) * 1024
> > util_avg < 1024
> > decay(history): represents geometric decay of the historical contribution (time)
> > contrib(running): contribution added when the task is in running state
> >
> >
> > Therefore, se and cfs_rq share the same units for util_avg / util_sum,
> > which makes direct addition and subtraction valid. This is also why
> > update_tg_cfs_util performs direct subtraction when updating:
> >
> > update_tg_cfs_util
> >   |-- /* Calculate the delta between gse and gcfs_rq directly by subtraction. */
> >   |   long delta_avg = gcfs_rq->avg.util_avg - se->avg.util_avg;
> >   |
> >   |-- /* Set new sched_entity's utilization */
> >   |   se->avg.util_avg = gcfs_rq->avg.util_avg;
> >   |   new_sum = se->avg.util_avg * divider;
> >   |   delta_sum = (long)new_sum - (long)se->avg.util_sum;
> >   |   se->avg.util_sum = new_sum;
> >   |
> >   |-- /* Update parent cfs_rq utilization */
> >   |   add_positive(&cfs_rq->avg.util_avg, delta_avg);
> >   |   add_positive(&cfs_rq->avg.util_sum, delta_sum);
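> >
> > As a quick sanity check with made-up numbers: if gcfs_rq->avg.util_avg
> > is 300 and se->avg.util_avg is 200, then delta_avg = 100, and with an
> > illustrative divider of 47742 the new util_sum becomes 300 * 47742.
> > Both deltas are then applied to the parent with add_positive(), which
> > is only legitimate because se and cfs_rq express util_avg / util_sum
> > in the same (unweighted) unit.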
> >
> >
> > Q2: Why doesn’t runnable_avg have the same problem?
> > ===================================================
> > A2: Similarly, the runnable_avg / runnable_sum of se and cfs_rq also
> > do not include weight.
> >
> > The calculation formulas for tse's runnable_avg and runnable_sum are as follows:
> >
> >                  runnable_sum
> > runnable_avg = ----------------
> >                    divider
> >
> >
> >     decay(history) + contrib(running + runnable) * 1024
> >   = ---------------------------------------------------
> >                         divider
> >
> >
> >     (decay(history) + contrib(running + runnable)) * 1024
> >  ~= -----------------------------------------------------
> >                          divider
> >
> >
> >     decay(history) + contrib(running + runnable)
> >   = -------------------------------------------- * 1024
> >                       divider
> >
> >
> > Where:
> > runnable_sum = decay(history) + contrib(running + runnable) * 1024
> > runnable_avg < 1024
> > decay(history): represents geometric decay of the historical contribution (time)
> > contrib(running + runnable): contribution added when the task is in
> > running and runnable states
> >
> >
> > The calculation formulas for cfs_rq's runnable_avg and runnable_sum
> > are as follows:
> >
> >                  runnable_sum
> > runnable_avg = ----------------
> >                    divider
> >
> >
> >     decay(history) + contrib(running + runnable) * cfs_rq->h_nr_running * 1024
> >   = --------------------------------------------------------------------------
> >                                      divider
> >
> >
> >   (decay(history) + contrib(running + runnable)) * cfs_rq->h_nr_running * 1024
> >  ~= --------------------------------------------------------------------------
> >                                    divider
> >
> >
> >     decay(history) + contrib(running + runnable)
> >   = -------------------------------------------- * cfs_rq->h_nr_running * 1024
> >                       divider
> >
> >
> > Where:
> > runnable_sum = decay(history) + contrib(running + runnable) * cfs_rq->h_nr_running * 1024
> > runnable_avg < cfs_rq->h_nr_running * 1024
> > decay(history): represents geometric decay of the historical contribution (time)
> > contrib(running + runnable): contribution added when the task is in
> > running and runnable states
> >
> >
> > The runnable statistic of cfs_rq scales with h_nr_running, the number
> > of runnable tasks, and can be regarded as the accumulated runnable
> > contribution of all its sched entities. Therefore, in
> > update_tg_cfs_runnable the update can also be done directly by
> > subtraction.
> >
> > update_tg_cfs_runnable
> >   |-- /* Calculate the delta directly by subtraction. */
> >   |   long delta_avg = gcfs_rq->avg.runnable_avg - gse->avg.runnable_avg;
> >   |
> >   |-- /* Set new sched_entity's runnable */
> >   |   gse->avg.runnable_avg = gcfs_rq->avg.runnable_avg;
> >   |   new_sum = gse->avg.runnable_avg * divider;
> >   |   delta_sum = (long)new_sum - (long)gse->avg.runnable_sum;
> >   |   gse->avg.runnable_sum = new_sum;
> >   |
> >   |-- /* Update parent cfs_rq runnable */
> >   |   add_positive(&cfs_rq->avg.runnable_avg, delta_avg);
> >   |   add_positive(&cfs_rq->avg.runnable_sum, delta_sum);
> >
> >
> > Q3: Why does load_avg have this problem?
> > ========================================
> > A3: The key difference is that cfs_rq’s load_avg and load_sum include
> > weight information, while a task’s load_sum does not. Moreover, we
> > cannot reconstruct load_sum from a task’s load_avg.
> >
> > For a task (tse), the formulas are:
> >
> >              load_sum
> > load_avg = ------------ * se_weight(se)
> >               divider
> >
> >
> >    decay(history) + contrib(running + runnable)
> >  = -------------------------------------------- * se_weight(se)
> >                      divider
> >
> >
> > Where:
> > load_sum = decay(history) + contrib(running + runnable)
> > load_avg < se_weight(se)
> > decay(history): represents geometric decay of the historical contribution (time)
> > contrib(running + runnable): contribution added when the task is in
> > running and runnable states
> >
> >
> > For a cfs_rq, the formulas are:
> >
> >              load_sum
> > load_avg = ------------
> >               divider
> >
> >
> >     decay(history) + contrib(running + runnable) * cfs_rq->load.weight
> >   = ------------------------------------------------------------------
> >                                   divider
> >
> >
> >     (decay(history) + contrib(running + runnable)) * cfs_rq->load.weight
> >  ~= --------------------------------------------------------------------
> >                                  divider
> >
> >
> >     decay(history) + contrib(running + runnable)
> >   = -------------------------------------------- * cfs_rq->load.weight
> >                       divider
> >
> >
> > Where:
> > load_sum = decay(history) + contrib(running + runnable) * cfs_rq->load.weight
> > load_avg < cfs_rq->load.weight
> > decay(history): represents geometric decay of the historical contribution (time)
> > contrib(running + runnable): contribution added when the task is in
> > running and runnable states
> >
> >
> > From these formulas, we can see that tse’s load_sum does not include
> > weight, while cfs_rq’s does. Their units differ, so they cannot be
> > directly added or subtracted. In addition, weight is a "historical
> > variable" that changes over time, which means load_sum cannot be
> > derived from se’s load_avg.
> >
> > To propagate load_sum changes from a child cfs_rq, the upstream
> > implementation uses runnable_avg to APPROXIMATE load_sum. The
> > derivation is as follows.
> >
> > For tse:
> >
> >                  runnable_sum
> > runnable_avg = ----------------
> >                    divider
> >
> >     decay(history) + contrib(running + runnable) * 1024
> >   = ---------------------------------------------------
> >                      divider
> >
> >      decay(history) + contrib(running + runnable)
> >  ~= --------------------------------------------- * 1024
> >                      divider
> >
> >
> > Thus:
> >
> > decay(history) + contrib(running + runnable)
> >
> >
> >      runnable_avg * divider
> >  ~= -----------------------
> >              1024
> >
> >
> >  ~= (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT        (1)
> >
> >
> > Equation (1) represents the contribution when a task is in running or
> > runnable state. The kernel refers to this as the "unweighted version
> > of removed_load", which is essentially time with exponential decay
> > applied.
> >
> > It is worth noting that equation (1) happens to be equal to tse’s
> > load_sum. Therefore, for tse we have:
> >
> > load_sum ~= (runnable_avg * divider) >> SCHED_CAPACITY_SHIFT     (2)
> >
> > However, equation (2) itself is only an approximation, and the
> > right-shift operation introduces further truncation error.
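> >
> > With purely illustrative numbers: if a task's true load_sum at dequeue
> > is 4095 and the divider is 47742, then runnable_avg = (4095 * 1024) /
> > 47742 = 87, and (87 * 47742) >> 10 = 4056. Both error sources together
> > thus leave about 39 units of the task's contribution behind.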
> >
> > My idea is that since the task’s load_sum already exists when it
> > dequeues, it is more reasonable to record this value directly. By
> > doing so, we can avoid both the approximation and the shift-induced
> > error. This is the motivation behind introducing removed.load_sum in
> > my patch.
> >
> >
> >
> >
> > > Also we don't want to track both load_sum and load_avg, only one is
> > > enough and by the above it is load_avg
> > >
> >
> > My view is consistent with yours, but I personally lean towards
> > keeping load_sum and deprecating load_avg, since load_avg does not
> > seem to be accurate.
> >
> > That said, this is only my personal perspective and may not be
> > comprehensive. I would like to further discuss the feasibility of this
> > approach with you.
> >
> > Thanks.
> > hupu
