linux-kernel - RE: [PATCH] sched: fair: make V move forward only

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <56e5a5e79bc84d5a924835f83f16f162@honor.com>
Date: Fri, 5 Dec 2025 08:00:42 +0000
From: wangtao <tao.wangtao@...or.com>
To: Peter Zijlstra <peterz@...radead.org>
CC: "mingo@...hat.com" <mingo@...hat.com>, "juri.lelli@...hat.com"
	<juri.lelli@...hat.com>, "vincent.guittot@...aro.org"
	<vincent.guittot@...aro.org>, "dietmar.eggemann@....com"
	<dietmar.eggemann@....com>, "rostedt@...dmis.org" <rostedt@...dmis.org>,
	"bsegall@...gle.com" <bsegall@...gle.com>, "mgorman@...e.de"
	<mgorman@...e.de>, "vschneid@...hat.com" <vschneid@...hat.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, liulu 00013167
	<liulu.liu@...or.com>, "wangbintian(BintianWang)" <bintian.wang@...or.com>,
	wangzicheng <wangzicheng@...or.com>
Subject: RE: [PATCH] sched: fair: make V move forward only

> 
> Not unclear, still not sure how its a fairness issue. Yes it increases context
> switches, but fairness is about the amount of time distributed, not about
> when time is given.
> 
> It is entirely reasonable for a task that wakes up to run now if it has positive
> lag.
> 
Tasks with larger lag should run first. If a task with smaller lag runs
earlier, it is unfair in those time slices, or you can say it is not reasonable.

It may be easier to explain this by separating
inqueue_lag (lag), join_lag (jlag), and leave_lag (llag).
For all tasks in the queue, \Sum lag_i is 0, but \Sum jlag_i is not always 0.

Suppose the current V = V0, all weights are 1, and we add four tasks
with jlag values 0, 10, 80, and -12. Considering preserve-lag handling
with vlag_i = (W + w_i) * vlag_i' / W:

-----------------------------
event | jlag | v      | W | V
-----------------------------
add T1 |   0 | V0     | 1 | V0
add T2 |  20 | V0-20  | 2 | V0-10
add T3 | 120 | V0-130 | 3 | V0-50
add T4 | -16 | V0-34  | 4 | V0-46

Because V becomes smaller after adding T2 and T3, even though
lag_T4 < 0, we still have v_T3 < v_T4 < v_T2 < v_T1.
So the schedule order is T3, T4, T2, T1.
A similar issue exists even without preserve-lag.
If tasks are added in order T2, T3, T1, T4, the schedule order becomes
T3, T1, T4, T2, which shows instability.

> 
> Which is creating a new clock that absorbs a non-zero lag sum?
> 
> I don't think there's a bound on the difference between A and V. You can
> extract unbounded lag from the system. We recently had someone
> showcase exactly that, they managed to wrap V backwards far enough to
> make the old min_vruntime thing wrap the s64 space and things went
> sideways real fast.
>

Earlier I did not clearly separate jlag/lag/llag. Intuitively one may think
  sum_jlag = \Sum jlag_j - \Sum llag_l.
But \Sum llag_l may approach 0, while \Sum jlag_j may not have a clear
boundary. For example, if all jlag>0 go to cfs_rq0 and all jlag<0 go to
cfs_rq1, then \Sum jlag_j of cfs_rq0 is unbounded.
 
>From deriving the leave process, we find that jlag_l and llag_l are not always equal.

We define A to stay unchanged when tasks join or leave, and we have:
  v = V - vlag = A - vjlag
  J = A - V
  J * W = sum_jlag = (\Sum jlag_j - \Sum jlag_l)

1) A task leaves and later re-joins. We know its previous llag.
   Set jlag_j = llag. Then v_j = A - vjlag_j keeps A unchanged:
   A_j = (A * W + v_j * w_j + vjlag_j * w_j) / (W + w_j) = A

2) For any task i running in the queue, v_i and V change.
   Then vlag_i = V - v_i also changes, but vjlag_i does not change.

3) When a task leaves, from v_l we get vllag_l = V - v_l
   and vjlag_l = A - v_l.
   vjlag_l = vllag_l + A - V = vllag_l + J
   Because J may not be 0, vjlag_l and vllag_l are not always equal.

Now we analyze the boundary of sum_jlag:

When a task is added:
  sum_jlag_j = sum_jlag + jlag_j
  Here jlag_j is the llag from the last leave.
  Its upper bound is the same as the lag limit q.
  Adding n tasks means sum_jlag does not exceed q * n.

When a task leaves:
  sum_jlag_l = sum_jlag - jlag_l = sum_jlag - vjlag_l * w_l
  sum_jlag_l = sum_jlag - (vllag_l + J) * w_l
  sum_jlag_l = sum_jlag - llag_l - sum_jlag * w_l / W
  sum_jlag_l = sum_jlag * (W - w_l) / W - llag_l
This means sum_jlag is reduced proportionally by weight.
When all tasks leave, sum_jlag becomes the last llag.

If n tasks are added and m tasks leave, the upper bound of sum_jlag
is still no more than q * (n - m).

Using v_j = A - vjlag_j makes scheduling more stable when tasks join
or leave. It is equal to preserve vlag = vjlag - J.

Using the same example with jlag values 0, 10, 80, -12:

------------------------------------------
event | jlag | v       | W | V        | J
------------------------------------------
add T1 |  0  | V0      | 1 | V0       | 0
add T2 | 10  | V0-10   | 2 | V0-5     | 5
add T3 | 80  | V0-80   | 3 | V0-30    | 30
add T4 | -12 | V0+12   | 4 | V0-19.5  | 19.5

As long as the four tasks are added at the same time, no matter the order,
the schedule order will be T3, T2, T1, T4.

Thank,
Tao