Message-ID: <84787b6343fd4ac595a185645ced1de5@honor.com>
Date: Tue, 2 Dec 2025 12:43:15 +0000
From: wangtao <tao.wangtao@...or.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: "mingo@...hat.com" <mingo@...hat.com>, "juri.lelli@...hat.com"
<juri.lelli@...hat.com>, "vincent.guittot@...aro.org"
<vincent.guittot@...aro.org>, "dietmar.eggemann@....com"
<dietmar.eggemann@....com>, "rostedt@...dmis.org" <rostedt@...dmis.org>,
"bsegall@...gle.com" <bsegall@...gle.com>, "mgorman@...e.de"
<mgorman@...e.de>, "vschneid@...hat.com" <vschneid@...hat.com>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, liulu 00013167
<liulu.liu@...or.com>, "wangbintian(BintianWang)" <bintian.wang@...or.com>,
wangzicheng <wangzicheng@...or.com>
Subject: RE: [PATCH] sched: fair: make V move forward only
> -----Original Message-----
> From: Peter Zijlstra <peterz@...radead.org>
> Sent: Friday, November 28, 2025 5:29 PM
> To: wangtao <tao.wangtao@...or.com>
> Cc: mingo@...hat.com; juri.lelli@...hat.com; vincent.guittot@...aro.org;
> dietmar.eggemann@....com; rostedt@...dmis.org; bsegall@...gle.com;
> mgorman@...e.de; vschneid@...hat.com; linux-kernel@...r.kernel.org;
> liulu.liu@...or.com; bintian.wang@...or.com
> Subject: Re: [PATCH] sched: fair: make V move forward only
>
> On Fri, Nov 28, 2025 at 04:11:18PM +0800, wangtao wrote:
> > V is the weighted average of entities. Adding tasks with positive lag
> > or removing tasks with negative lag may cause V to move backward. This
> > will result in unfair task scheduling,
>
> Have you actually read the paper? Why do you think this breaks fairness?
>
> > causing previously eligible tasks to become ineligible, shorter
> > runtimes, and more task switches.
>
> None of that is a fairness issue. Those are issues related to when, rather than
> how much time is given.
>
Maybe my earlier explanation was unclear, so here is a detailed example.
The run queue is empty. V is V_0. We add three tasks A, X, B in order.
lag_A = 0, lag_X > 0, lag_B = 0.
To simplify, assume the time gaps between the three additions are so small
that we can treat them as zero.
Add task A: v_A = V_0. After adding A, V does not change, V_A = V_0.
Add task X: v_X = V_A - lag_X / w_X < V_0. After adding X, V_X < V_0.
Add task B: v_B = V_X < V_0 = v_A. After adding B, V_B = V_X < V_0.
X runs first. After X runs, because v_B < v_A, we also get
deadline_B < deadline_A. So the later-added task B runs before A, which
is not fair.
But in reality the three additions have time gaps. Suppose A runs for a
short time before adding X. Let V_A' be V at that moment, and let v_A' be
A's vruntime. We have v_A' = V_A' > V_0.
v_X = V_A' - lag_X / w_X < V_A'. After putting X into the queue, V becomes
smaller: V_X < V_A'. Since v_A' > V_X, A becomes ineligible. Even if A is
still within its protected slice (protect_slice), it is switched out for X
immediately, increasing context switches.
Here is a trace where several tasks sleep for a while and then run briefly:
v_30981 = 16015784959115 = V, schedule 30981
<idle>-0 (...) place_entity ... pid 30981 vruntime 16015784959115 vlag 0 lag 0
sched_switch ... next_pid=30981 ...
After adding 30983, since vlag_30983 > 0:
v_30983 = 16015785028217 < V_30981' = v_30981' = 16015785080365
place_entity ... pid 30983 vruntime 16015785028217 vlag 26074 lag 52148
After adding 30982, even though vlag_30982 < 0, because vlag_30983 is
large: v_30982 = 16015785061209 < V_30981' = 16015785080365
place_entity ... pid 30982 vruntime 16015785061209 vlag -670 lag -3793
Switch to 30983:
sched_switch ... next_pid=30983 ...
Then task 30982 with negative vlag runs before 30981:
sched_switch ... next_pid=30982 ...
sched_switch ... next_pid=30981 ...
sched_switch ... next_pid=0 ...
> > Making V move forward only resolves such issues and simplifies the
> > code for adding tasks with positive lag.
>
> It breaks a metric ton of math. Which you don't provide updates for.
>
> Yes, the paper is light on dynamic behaviour, but please don't disregard the
> math like this. Either stay inside the constraints laid out, or provide coherent
> alternatives. Notably EEVDF is in the same class of scheduling functions as
> WF2Q and both provide better lag bounds than the simpler WFQ class of
> schedulers.
>
> The 'zero-lag point is the weighted average of the entities' is a fairly core
> tenet of EEVDF. Mucking with this *will* mess with the lag bounds.
>
You are right. The reasoning should be based on math. I will submit a new
patch based on the derivation below.
When deriving V, we previously assumed \Sum lag_i = 0, but in many cases
\Sum lag_i is not zero.
The simplest case is an empty run queue to which we add a task with nonzero
lag; then \Sum lag_i is clearly nonzero.
Our goal is \Sum lag_i = 0. So when task i finishes running and is removed
from the queue, its lag is:
lag_i = S_i - s_i = w_i * (V - v_i)
After some time, when task i is added back, the other tasks' v have changed
and so has V. Let A be the reference point against which the stored lags
are preserved:
\Sum lag_i = \Sum w_i * (A - v_i)
Solving for A from the v_i and lag_i:
A = (\Sum (w_i * v_i + lag_i)) / \Sum w_i = (\Sum (w_i * v_i + lag_i)) / W
Use A to compute v for the newly added task k. After adding k:
A' = (\Sum (w_i * v_i + lag_i) + w_k * v_k + lag_k) / (W + w_k)
To preserve lag_k, set v_k = A' - lag_k / w_k, then:
A' = (\Sum (w_i * v_i + lag_i) + w_k * A') / (W + w_k)
A' * (W + w_k) = \Sum (w_i * v_i + lag_i) + w_k * A'
A' = \Sum (w_i * v_i + lag_i) / W = A
This shows that adding task k does not change A.
Similarly, removing task k does not change A.
It is easy to get:
A = V + (\Sum lag_i) / W
With this placement, \Sum lag_i stays close to 0 as tasks are added and
removed.
The roles of V and A are:
1) Use V to judge whether a task is eligible.
2) Use V to compute lag when a task is removed.
3) Use A to compute v when a task is added.
> The delayed dequeue feature tries to address some of these concerns by
> keeping non-eligible (negative lag) tasks on the runqueue until such time
> that they become eligible (approximated by getting picked again) at which
> point they get removed (and any positive lag gets truncated, as if they were
> removed at zero-lag). As a consequence you will have much less removal of
> negative lag, additionally such tasks will be eligible the moment they come
> back.
>
> Also, there is the small matter that your patch simply does not apply.
Thanks,
Tao