Message-ID: <CAKfTPtAhKJ0+4=a27XNbatwYt=4zEnhQNe0NbS3i4WixyoAnbQ@mail.gmail.com>
Date: Tue, 10 Feb 2026 21:52:29 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: K Prateek Nayak <kprateek.nayak@....com>, mingo@...nel.org, juri.lelli@...hat.com,
dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
mgorman@...e.de, vschneid@...hat.com, linux-kernel@...r.kernel.org,
wangtao554@...wei.com, quzicheng@...wei.com, wuyun.abel@...edance.com,
dsmythies@...us.net
Subject: Re: [PATCH 0/4] sched: Various reweight_entity() fixes

On Mon, 9 Feb 2026 at 16:47, Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Wed, Feb 04, 2026 at 03:45:58PM +0530, K Prateek Nayak wrote:
>
> > # Overflow on enqueue
> >
> > <...>-102371 [255] ... : __enqueue_entity: Overflowed cfs_rq:
> > <...>-102371 [255] ... : dump_h_overflow_cfs_rq: cfs_rq: depth(0) weight(90894772) nr_queued(2) sum_w_vruntime(0) sum_weight(0) zero_vruntime(701164930256050) sum_shift(0) avg_vruntime(701809615900788)
> > <...>-102371 [255] ... : dump_h_overflow_entity: se: weight(3508) vruntime(701809615900788) slice(2800000) deadline(701810568648095) curr?(1) task?(1) <-------- cfs_rq->curr
> > <...>-102371 [255] ... : __enqueue_entity: Overflowed se:
> > <...>-102371 [255] ... : dump_h_overflow_entity: se: weight(90891264) vruntime(701808975077099) slice(2800000) deadline(701808975109401) curr?(0) task?(0) <-------- new se
>
> So I spent quite some time trying to reproduce the splat, but alas.
>
> That said, I did spot something 'funny' in the above, note that
> zero_vruntime and avg_vruntime/curr->vruntime are significantly apart.
> That is not something that should happen. zero_vruntime is supposed to
> closely track avg_vruntime.
>
> That led me to hypothesise that there is a problem tracking
> zero_vruntime when there is but a single runnable task, and sure
> enough, I could reproduce that, albeit not at such a scale as to lead to
> such problems (probably too much noise on my machine).
>
> I ended up with the below; and I've already pushed out a fresh
> queue/sched/core. Could you please test again?
>
> ---
> Subject: sched/fair: Fix zero_vruntime tracking
> From: Peter Zijlstra <peterz@...radead.org>
> Date: Mon Feb 9 15:28:16 CET 2026
>
> It turns out that zero_vruntime tracking is broken when there is but a single
> task running. Current update paths are through __{en,de}queue_entity(), and
> when there is but a single task, pick_next_task() will always return that one
> task, and put_prev_set_next_task() will end up in neither function.
>
> This can cause entity_key() to grow indefinitely large and cause overflows,
> leading to much pain and suffering.
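
[ Aside, to make the failure mode concrete: a standalone toy model -- not
  the kernel code, names and numbers made up -- of how entity_key() drifts
  when the same task stays on the CPU and nothing pulls zero_vruntime
  along. ]

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t vruntime = 0, zero_vruntime = 0;
    const uint64_t delta = 1000000;      /* pretend ~1ms of vruntime per tick */

    for (int tick = 0; tick < 5; tick++) {
        vruntime += delta;               /* the lone task keeps running */
        /*
         * zero_vruntime was only updated from __{en,de}queue_entity(),
         * and neither is called while the one task stays on the CPU,
         * so the key below grows without bound.
         */
        printf("tick %d: entity_key = %lld\n",
               tick, (long long)(vruntime - zero_vruntime));
    }
    return 0;
}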
>
> Furthermore, doing update_zero_vruntime() from __{de,en}queue_entity(), which
> are called from {set_next,put_prev}_entity() has problems because:
>
> - set_next_entity() calls __dequeue_entity() before it does cfs_rq->curr = se.
> This means the avg_vruntime() will see the removal but not current, missing
> the entity for accounting.
>
> - put_prev_entity() calls __enqueue_entity() before it does cfs_rq->curr =
> NULL. This means the avg_vruntime() will see the addition *and* current,
> leading to double accounting.
>
> Both cases are incorrect.
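
[ Again only for illustration, a standalone toy model -- not the kernel
  code; the kernel works on keys relative to zero_vruntime and cached sums,
  absolute values are used here just to keep it short. It shows why the
  ordering matters: the average is taken over the tree entities plus
  cfs_rq->curr, so sampling it between the __{de,en}queue and the curr
  update either misses the entity or counts it twice. ]

#include <stdio.h>

struct entity { long weight; long long vruntime; int in_tree; };

static struct entity *curr;              /* models cfs_rq->curr */
static struct entity *ents[8];
static int nr;

/* weighted average: tree entities (kernel: sum_w_vruntime) plus curr on top */
static long long avg(void)
{
    long long sum = 0;
    long w = 0;

    for (int i = 0; i < nr; i++) {
        if (ents[i]->in_tree) {
            sum += ents[i]->weight * ents[i]->vruntime;
            w   += ents[i]->weight;
        }
    }
    if (curr) {                          /* curr is kept out of the tree */
        sum += curr->weight * curr->vruntime;
        w   += curr->weight;
    }
    return w ? sum / w : 0;
}

int main(void)
{
    struct entity a = { 1024, 1000, 1 }, b = { 1024, 3000, 1 };

    ents[nr++] = &a;
    ents[nr++] = &b;

    /* set_next_entity(a): __dequeue_entity(a) runs before curr = a */
    a.in_tree = 0;
    printf("mid set_next_entity: avg = %lld (a missed)\n", avg());
    curr = &a;
    printf("consistent state:    avg = %lld\n", avg());

    /* put_prev_entity(a): __enqueue_entity(a) runs before curr = NULL */
    a.in_tree = 1;
    printf("mid put_prev_entity: avg = %lld (a counted twice)\n", avg());
    curr = NULL;
    printf("consistent state:    avg = %lld\n", avg());
    return 0;
}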
>
> Noting that avg_vruntime is already called on each {en,de}queue, remove the
> explicit avg_vruntime() calls (which removes an extra 64bit division for each
> {en,de}queue) and have avg_vruntime() update zero_vruntime itself.
>
> Additionally, have the tick call avg_vruntime() -- discarding the result, but
> for the side-effect of updating zero_vruntime.
>
> While there, optimize avg_vruntime() by noting that the average of one value is
> rather trivial to compute.
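
[ For completeness, a toy continuation of the single-task model above --
  not the kernel code -- of what the tick-driven update buys: with one
  runnable entity the "average" is just its vruntime, so zero_vruntime is
  pulled onto it instead of drifting, which matches the POST output below
  where zero_vruntime equals the task's vruntime. ]

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t vruntime = 0, zero_vruntime = 0;
    const uint64_t delta = 1000000;      /* pretend ~1ms of vruntime per tick */

    for (int tick = 0; tick < 5; tick++) {
        vruntime += delta;               /* the lone task keeps running */
        /*
         * entity_tick() now calls avg_vruntime(); with a single element
         * the average is that element, so zero_vruntime is pulled to it
         * and entity_key() stays bounded (here: 0).
         */
        zero_vruntime += (vruntime - zero_vruntime);
        printf("tick %d: entity_key = %lld\n",
               tick, (long long)(vruntime - zero_vruntime));
    }
    return 0;
}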
>
> Test case:
> # taskset -c -p 1 $$
> # taskset -c 2 bash -c 'while :; do :; done&'
> # cat /sys/kernel/debug/sched/debug | awk '/^cpu#/ {P=0} /^cpu#2,/ {P=1} {if (P) print $0}' | grep -e zero_vruntime -e "^>"
>
> PRE:
> .zero_vruntime : 31316.407903
> >R bash 487 50787.345112 E 50789.145972 2.800000 50780.298364 16 120 0.000000 0.000000 0.000000 /
> .zero_vruntime : 382548.253179
> >R bash 487 427275.204288 E 427276.003584 2.800000 427268.157540 23 120 0.000000 0.000000 0.000000 /
>
> POST:
> .zero_vruntime : 17259.709467
> >R bash 526 17259.709467 E 17262.509467 2.800000 16915.031624 9 120 0.000000 0.000000 0.000000 /
> .zero_vruntime : 18702.723356
> >R bash 526 18702.723356 E 18705.523356 2.800000 18358.045513 9 120 0.000000 0.000000 0.000000 /
>
> Fixes: 79f3f9bedd14 ("sched/eevdf: Fix min_vruntime vs avg_vruntime")
> Reported-by: K Prateek Nayak <kprateek.nayak@....com>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>

Hi Peter,

This patch w/ the patchset on top of tip/sched/core creates regressions
for hackbench (tbench doesn't seem to be impacted) on my dragonboard
rb5. All hackbench tests are regressing. Some results below:
hackbench 1 group process socket
0.579(+/-1.3%) vs 0.697(+/-2.7%) : -20%
hackbench 8 group process socket
0.650(+/-1%) vs 2.361(+/-8.8%) : -263%
hackbench 1 group thread pipe
0.885(+/-1.6%) vs 0.969(+/-1.4%) : -10%
hackbench 8 group thread pipe
0.689(+/-3.2%) vs 1.916(+/-6.7%) : -178%

The regression starts with patch 3. I haven't found the root cause and
will look more deeply tomorrow.
> ---
> kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 55 insertions(+), 26 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -589,6 +589,21 @@ static inline bool entity_before(const s
> return vruntime_cmp(a->deadline, "<", b->deadline);
> }
>
> +/*
> + * Per avg_vruntime() below, cfs_rq::zero_vruntime is only slightly stale
> + * and this value should be no more than two lag bounds. Which puts it in the
> + * general order of:
> + *
> + * (slice + TICK_NSEC) << NICE_0_LOAD_SHIFT
> + *
> + * which is around 44 bits in size (on 64bit); that is 20 for
> + * NICE_0_LOAD_SHIFT, another 20 for NSEC_PER_MSEC and then a handful for
> + * however many msec the actual slice+tick ends up being.
> + *
> + * (disregarding the actual divide-by-weight part makes for the worst case
> + * weight of 2, which nicely cancels vs the fuzz in zero_vruntime not actually
> + * being the zero-lag point).
> + */
> static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> return vruntime_op(se->vruntime, "-", cfs_rq->zero_vruntime);
> @@ -676,39 +691,60 @@ sum_w_vruntime_sub(struct cfs_rq *cfs_rq
> }
>
> static inline
> -void sum_w_vruntime_update(struct cfs_rq *cfs_rq, s64 delta)
> +void update_zero_vruntime(struct cfs_rq *cfs_rq, s64 delta)
> {
> /*
> * v' = v + d ==> sum_w_vruntime' = sum_runtime - d*sum_weight
> */
> cfs_rq->sum_w_vruntime -= cfs_rq->sum_weight * delta;
> + cfs_rq->zero_vruntime += delta;
> }
>
> /*
> - * Specifically: avg_runtime() + 0 must result in entity_eligible() := true
> + * Specifically: avg_vruntime() + 0 must result in entity_eligible() := true
> * For this to be so, the result of this function must have a left bias.
> + *
> + * Called in:
> + * - place_entity() -- before enqueue
> + * - update_entity_lag() -- before dequeue
> + * - entity_tick()
> + *
> + * This means it is one entry 'behind' but that puts it close enough to where
> + * the bound on entity_key() is at most two lag bounds.
> */
> u64 avg_vruntime(struct cfs_rq *cfs_rq)
> {
> struct sched_entity *curr = cfs_rq->curr;
> - s64 avg = cfs_rq->sum_w_vruntime;
> - long load = cfs_rq->sum_weight;
> + s64 delta = 0, runtime = cfs_rq->sum_w_vruntime;
> + long weight = cfs_rq->sum_weight;
>
> - if (curr && curr->on_rq) {
> - unsigned long weight = scale_load_down(curr->load.weight);
> + if (curr && !curr->on_rq)
> + curr = NULL;
>
> - avg += entity_key(cfs_rq, curr) * weight;
> - load += weight;
> - }
> + if (weight) {
> + if (curr) {
> + unsigned long w = scale_load_down(curr->load.weight);
> +
> + runtime += entity_key(cfs_rq, curr) * w;
> + weight += w;
> + }
>
> - if (load) {
> /* sign flips effective floor / ceiling */
> - if (avg < 0)
> - avg -= (load - 1);
> - avg = div_s64(avg, load);
> + if (runtime < 0)
> + runtime -= (weight - 1);
> +
> + delta = div_s64(runtime, weight);
> +
> + } else if (curr) {
> + /*
> + * When there is but one element, it is the average.
> + */
> + delta = curr->vruntime - cfs_rq->zero_vruntime;
> }
>
> - return cfs_rq->zero_vruntime + avg;
> + update_zero_vruntime(cfs_rq, delta);
> +
> + return cfs_rq->zero_vruntime;
> }
>
> /*
> @@ -777,16 +813,6 @@ int entity_eligible(struct cfs_rq *cfs_r
> return vruntime_eligible(cfs_rq, se->vruntime);
> }
>
> -static void update_zero_vruntime(struct cfs_rq *cfs_rq)
> -{
> - u64 vruntime = avg_vruntime(cfs_rq);
> - s64 delta = vruntime_op(vruntime, "-", cfs_rq->zero_vruntime);
> -
> - sum_w_vruntime_update(cfs_rq, delta);
> -
> - cfs_rq->zero_vruntime = vruntime;
> -}
> -
> static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
> {
> struct sched_entity *root = __pick_root_entity(cfs_rq);
> @@ -856,7 +882,6 @@ RB_DECLARE_CALLBACKS(static, min_vruntim
> static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> sum_w_vruntime_add(cfs_rq, se);
> - update_zero_vruntime(cfs_rq);
> se->min_vruntime = se->vruntime;
> se->min_slice = se->slice;
> rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
> @@ -868,7 +893,6 @@ static void __dequeue_entity(struct cfs_
> rb_erase_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
> &min_vruntime_cb);
> sum_w_vruntime_sub(cfs_rq, se);
> - update_zero_vruntime(cfs_rq);
> }
>
> struct sched_entity *__pick_root_entity(struct cfs_rq *cfs_rq)
> @@ -5524,6 +5548,11 @@ entity_tick(struct cfs_rq *cfs_rq, struc
> update_load_avg(cfs_rq, curr, UPDATE_TG);
> update_cfs_group(curr);
>
> + /*
> + * Pulls along cfs_rq::zero_vruntime.
> + */
> + avg_vruntime(cfs_rq);
> +
> #ifdef CONFIG_SCHED_HRTICK
> /*
> * queued ticks are scheduled to match the slice, so don't bother