Message-ID: <9d407717-f861-4f4d-96c5-805c2dbbd225@huawei.com>
Date: Mon, 12 Jan 2026 12:03:37 +0800
From: Zicheng Qu <quzicheng@...wei.com>
To: K Prateek Nayak <kprateek.nayak@....com>, <mingo@...hat.com>,
	<peterz@...radead.org>, <juri.lelli@...hat.com>,
	<vincent.guittot@...aro.org>, <dietmar.eggemann@....com>,
	<rostedt@...dmis.org>, <bsegall@...gle.com>, <mgorman@...e.de>,
	<vschneid@...hat.com>, <linux-kernel@...r.kernel.org>
CC: <tanghui20@...wei.com>, <quzicheng@...wei.com>
Subject: Re: [PATCH] sched/fair: Fix vruntime drift by preventing double lag
 scaling during reweight

On 1/9/2026 6:21 PM, K Prateek Nayak wrote:

> Hello Zicheng,
>
> On 1/9/2026 2:10 PM, Zicheng Qu wrote:
>> Hi Prateek,
>>
>> On 1/9/2026 12:50 PM, K Prateek Nayak wrote:
>>> If I'm not mistaken, the problem is that we'll see "curr->on_rq" and
>>> then do:
>>>
>>>       if (curr && curr->on_rq)
>>>           load += scale_load_down(curr->load.weight);
>>>
>>>       lag *= load + scale_load_down(se->load.weight);
>>>
>>>
>>> which shouldn't be the case, since we would be accounting "se" twice
>>> when it is also the "curr": avg_vruntime() has already accounted for
>>> it (because "curr->on_rq"), and then we do everything twice for "se".
>> Thanks for the analysis. I agree your concern is reasonable, but I
>> think the issue here is slightly different from "accounting se twice";
>> it is rather a semantic mismatch in how place_entity() is used.
>>
>> place_entity() is meant to compensate lag for entities being inserted
>> into the runqueue, accounting for the effect of a new entity on the
>> weighted average vruntime. That assumption holds when an se is joining
>> the rq. However, when se == cfs_rq->curr, the entity never left the
>> runqueue and avg_vruntime() has not changed, so applying enqueue-style
>> lag scaling is not appropriate.
> I believe the intention is to discount the contribution of the
> task and then re-account it again after the reweight. I don't think
> se being the "curr" makes it any different, except for the fact that
> its vruntime and load contribution isn't reflected in the sum and
> is added in by avg_vruntime().
>
>>> I'm wondering if instead of adding a flag, we can do:
>> Yes, I totally agree that adding a new flag is unnecessary. We
>> can handle this directly in place_entity() by skipping lag scaling
>> when `se == cfs_rq->curr`, for example:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index da46c3164537..1b279bf43f38 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5123,6 +5123,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>
>>                  lag = se->vlag;
>>
>> +              /*
>> +               * place_entity() compensates lag for entities being inserted into the
>> +               * runqueue. When se == cfs_rq->curr, the entity never left the rq and
>> +               * avg_vruntime() did not change, so enqueue-style lag scaling does not
>> +               * apply.
>> +               */
>> +              if (se == cfs_rq->curr)
>> +                      goto skip_lag_scale;
> This affects the place_entity() from enqueue_task() where se is dequeued
> (se->on_rq == 0) but it is still the curr - can happen when rq drops
> lock for newidle balance and a concurrent wakeup is queuing the task.
>
> You need to check for "se == cfs_rq->curr && se->on_rq" here and then
> this bit should be good.
>
> Let me stare more at the avg_vruntime() since "curr->on_rq" would add
> its vruntime too in that calculation.
>
>> +
>>                  /*
>>                   * If we want to place a task and preserve lag, we have to
>>                   * consider the effect of the new entity on the weighted
>> @@ -5185,6 +5194,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>                  lag = div_s64(lag, load);
>>          }
>>
>> +skip_lag_scale:
>>          se->vruntime = vruntime - lag;
>>
>>          if (se->rel_deadline) {
>>
>> Best regards,
>> Zicheng

Hi Prateek,

Thanks for the careful review and for pointing out the concurrent
enqueue case where se can still be curr while se->on_rq == 0. You're
right that skipping lag scaling purely based on se == cfs_rq->curr would
incorrectly affect the enqueue_task() path under newidle balance, and
the condition needs to be refined.

I agree that the correct distinction is whether the entity actually left
the runqueue. Updating the condition to:

     se == cfs_rq->curr && se->on_rq

accurately limits the skip to the reweight case, while preserving the
intended enqueue semantics when the entity was dequeued and reinserted.

On the avg_vruntime() / invariance point, the reasoning I'm relying on
is that a "pure" dequeue → enqueue cycle (with no execution progress and
no other state changes) should not change an entity's relative position
or the avg_vruntime() of the cfs_rq.

Let V denote avg_vruntime() of the cfs_rq and let l denote se->vlag,
where l = V - v (v is se->vruntime). Consider what happens if an entity
is removed and then reinserted without vlag compensating for the change
in V.

   * Case A: l > 0 (the entity is lagging; it "pulls" the weighted
     average backwards)
     - Before dequeue: v1 = V1 - l, with l > 0
     - Dequeue removes this lagging entity, so the weighted average of
       the remaining entities advances: V2 > V1
     - If we re-enqueue with the lag value l unchanged (i.e.
       place_entity() sets v3 = V2 - l):
         v3 = V2 - l = V2 - (V1 - v1) = v1 + (V2 - V1)
         Since V2 > V1, we get v3 > v1
     - The entity's vruntime has artificially advanced despite no
       execution
     - After reinsertion the new average V3 satisfies V1 < V3 < V2
       (we have reintroduced the lagging entity), but the entity's
       relative position has shifted forward
     - The enqueue-style lag scaling compensates by scaling l up front
       so that v3 = v1 and V3 = V1 hold, preserving the entity's fair
       position (a small numeric sketch follows below)

   * Case B: l < 0 is symmetric; the same reasoning applies with the
     signs flipped
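
To sanity-check Case A numerically, here is a small userspace toy model
(plain doubles rather than the kernel's fixed-point arithmetic, and no
min_vruntime offset): it removes a lagging entity from the average,
re-places it with the lag inflated by (load + w) / load, which is the
scaling place_entity() applies, and confirms that both the entity's
vruntime and the weighted average come back to their pre-dequeue values.
This is only an illustrative sketch, not kernel code.

#include <stdio.h>

struct ent { double w, v; };

/* simplified stand-in for avg_vruntime(): plain weighted average */
static double avg_v(const struct ent *e, int n)
{
	double sum = 0.0, W = 0.0;
	for (int i = 0; i < n; i++) {
		sum += e[i].w * e[i].v;
		W += e[i].w;
	}
	return sum / W;
}

int main(void)
{
	/* three queued entities; rq[0] is lagging (vruntime below average) */
	struct ent rq[3] = { { 1024, 90.0 }, { 2048, 110.0 }, { 1024, 105.0 } };

	double v1 = rq[0].v;
	double V1 = avg_v(rq, 3);
	double l  = V1 - v1;			/* vlag = V - v, positive here */

	/* dequeue rq[0]: average of the remaining entities advances */
	double V2   = avg_v(rq + 1, 2);
	double load = rq[1].w + rq[2].w;	/* weight sum without rq[0] */

	/* enqueue-style scaling: lag *= (load + w); lag /= load */
	double lag = l * (load + rq[0].w) / load;
	rq[0].v = V2 - lag;			/* place_entity(): v = V' - lag */

	double V3 = avg_v(rq, 3);		/* average after reinsertion */

	printf("V1=%.3f V2=%.3f V3=%.3f\n", V1, V2, V3);
	printf("v1=%.3f v3=%.3f  lag before=%.3f after=%.3f\n",
	       v1, rq[0].v, l, V3 - rq[0].v);
	return 0;
}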

This is distinct from the reweight case: with `se == cfs_rq->curr &&
se->on_rq`, the entity never actually leaves the cfs_rq, so V does not
change "due to its absence". In that situation, applying the same
enqueue-style lag scaling modifies curr's vruntime without any
dequeue/enqueue event to compensate for, introducing artificial drift.

This drift has a direct fairness impact under EEVDF. If curr->vlag > 0,
the extra scaling increases its positive lag and biases subsequent picks
in its favor; if curr->vlag < 0, the lag becomes more negative and the
task becomes less likely to be picked. If an se's weight is temporarily
adjusted and later restored, multiple place_entity() invocations can
accumulate this drift. Even ignoring the "if (curr && curr->on_rq)" load
adjustment, the `lag *= load + scale_load_down(se->load.weight)` step
alone is enough to move curr's vruntime in the running case.
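
For contrast, here is the same toy model applied to an entity that stays
accounted in the average, which is the situation for an on_rq curr during
reweight. This is only an illustrative sketch under the simplifying
assumptions that vlag is recomputed as V - v before each pass and that the
weights do not change; it is not the real reweight path. It shows that each
application of the enqueue-style scaling moves the entity's vruntime even
though nothing was dequeued, and that repeated applications compound the
drift.

#include <stdio.h>

struct ent { double w, v; };

/* simplified stand-in for avg_vruntime(): plain weighted average */
static double avg_v(const struct ent *e, int n)
{
	double sum = 0.0, W = 0.0;
	for (int i = 0; i < n; i++) {
		sum += e[i].w * e[i].v;
		W += e[i].w;
	}
	return sum / W;
}

int main(void)
{
	/* rq[0] plays the role of curr and stays included in the average */
	struct ent rq[3] = { { 1024, 90.0 }, { 2048, 110.0 }, { 1024, 105.0 } };
	double load = rq[0].w + rq[1].w + rq[2].w;	/* load including curr */

	for (int pass = 1; pass <= 4; pass++) {
		double V   = avg_v(rq, 3);
		double lag = V - rq[0].v;	/* positive: curr is lagging */

		/* the scaling place_entity() would apply, though nothing was dequeued */
		lag = lag * (load + rq[0].w) / load;
		rq[0].v = V - lag;

		printf("pass %d: v=%.4f vlag=%.4f\n",
		       pass, rq[0].v, avg_v(rq, 3) - rq[0].v);
	}
	return 0;
}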

I'll update the patch accordingly with the refined condition and an
updated comment if you think this reasoning is sound.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..51482186fd31 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5123,6 +5123,15 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)

                 lag = se->vlag;

+               /*
+                * place_entity() compensates lag for entities being inserted into
+                * the runqueue. When se == cfs_rq->curr and it is still on_rq, the
+                * entity never left the rq and avg_vruntime() did not change, so
+                * enqueue-style lag scaling does not apply.
+                */
+               if (se == cfs_rq->curr && se->on_rq)
+                       goto skip_lag_scale;
+
                 /*
                  * If we want to place a task and preserve lag, we have to
                  * consider the effect of the new entity on the weighted
@@ -5185,6 +5194,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
                 lag = div_s64(lag, load);
         }

+skip_lag_scale:
         se->vruntime = vruntime - lag;

         if (se->rel_deadline) {

Thanks again for the careful review.

Best regards,
Zicheng

