Message-ID: <20251226001731.3730586-1-quzicheng@huawei.com>
Date: Fri, 26 Dec 2025 00:17:31 +0000
From: Zicheng Qu <quzicheng@...wei.com>
To: <mingo@...hat.com>, <peterz@...radead.org>, <juri.lelli@...hat.com>,
<vincent.guittot@...aro.org>, <dietmar.eggemann@....com>,
<rostedt@...dmis.org>, <bsegall@...gle.com>, <mgorman@...e.de>,
<vschneid@...hat.com>, <linux-kernel@...r.kernel.org>
CC: <tanghui20@...wei.com>, <quzicheng@...wei.com>
Subject: [PATCH] sched/fair: Fix vruntime drift by preventing double lag scaling during reweight
In reweight_entity(), when reweighting the currently running entity (se ==
cfs_rq->curr), the entity stays in its runqueue context and does not go
through a full dequeue/enqueue cycle. As a result, avg_vruntime() remains
constant throughout the reweight operation.
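For context, the relevant flow in reweight_entity() is roughly the
following (an abridged sketch based on my reading of fair.c, not the
exact source; details may differ between versions):

	bool curr = cfs_rq->curr == se;

	if (se->on_rq) {
		/* commit outstanding execution time, compute current lag */
		update_curr(cfs_rq);
		update_entity_lag(cfs_rq, se);
		if (!curr)
			__dequeue_entity(cfs_rq, se);	/* curr is not in the rbtree */
		update_load_sub(&cfs_rq->load, se->load.weight);
	}
	dequeue_load_avg(cfs_rq, se);

	/* rescale vlag for the new weight, then install the new weight */
	se->vlag = div_s64(se->vlag * se->load.weight, weight);
	update_load_set(&se->load, weight);

	enqueue_load_avg(cfs_rq, se);
	if (se->on_rq) {
		place_entity(cfs_rq, se, 0);	/* scales vlag a second time */
		update_load_add(&cfs_rq->load, se->load.weight);
		if (!curr)
			__enqueue_entity(cfs_rq, se);
	}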
However, the current implementation calls place_entity(..., 0) at the end
of reweight_entity(). Under EEVDF, place_entity() is designed for entities
entering the runqueue: it rescales the virtual lag (vlag) to compensate for
the change the incoming entity causes in the weighted average vruntime (V),
using the formula:
vlag' = vlag * (W + w_i) / W
where 'W' is the current aggregate weight (including
cfs_rq->curr->load.weight) and 'w_i' is the weight of the entity being
enqueued (in this case, se is cfs_rq->curr itself).
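The scaling itself sits in place_entity() and looks roughly like this
(abridged sketch, surrounding context omitted; names as in fair.c):

	load = cfs_rq->avg_load;
	if (curr && curr->on_rq)
		load += scale_load_down(curr->load.weight);	/* W, incl. curr */

	lag *= load + scale_load_down(se->load.weight);		/* * (W + w_i) */
	if (WARN_ON_ONCE(!load))
		load = 1;
	lag = div_s64(lag, load);				/* / W */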
This results in a double scaling of the lag for a running entity:
1. reweight_entity() already rescales se->vlag based on the new weight
   ratio.
2. place_entity() then mistakenly applies the (W + w_i)/W scaling on top,
   treating the reweight as a fresh enqueue into a new total weight
   pool.
This incorrectly inflates the magnitude of the entity's vlag during the
reweight: a positive vlag is amplified (an unearned boost), while a
negative vlag is pushed further negative (an extra penalty that suppresses
the entity).
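As a purely illustrative example (numbers are made up): with W = 2048 and
w_i = 1024, the spurious factor is (W + w_i)/W = 1.5. A vlag that
reweight_entity() has already correctly rescaled to +600000 ns comes out
of place_entity() as +900000 ns, and a vlag of -600000 ns comes out as
-900000 ns.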
In environments with frequent cgroup throttle/unthrottle operations, this
math error manifests as vruntime drift.
A hung task was observed, as shown below:
crash> runq -c 0 -g
CPU 0
CURRENT: PID: 330440 TASK: ffff00004cd61540 COMMAND: "stress-ng"
ROOT_TASK_GROUP: ffff8001025fa4c0 RT_RQ: ffff0000fff42500
[no tasks queued]
ROOT_TASK_GROUP: ffff8001025fa4c0 CFS_RQ: ffff0000fff422c0
TASK_GROUP: ffff0000c130fc00 CFS_RQ: ffff00009125a400 <test_cg> cfs_bandwidth: period=100000000, quota=18446744073709551615, gse: 0xffff000091258c00, vruntime=127285708384434, deadline=127285714880550, vlag=11721467, weight=338965, my_q=ffff00009125a400, cfs_rq: avg_vruntime=0, zero_vruntime=2029704519792, avg_load=0, nr_running=1
TASK_GROUP: ffff0000d7cc8800 CFS_RQ: ffff0000c8f86800 <test_test329274_1> cfs_bandwidth: period=14000000, quota=14000000, gse: 0xffff0000c8f86400, vruntime=2034894470719, deadline=2034898697770, vlag=0, weight=215291, my_q=ffff0000c8f86800, cfs_rq: avg_vruntime=-422528991, zero_vruntime=8444226681954, avg_load=54, nr_running=19
[110] PID: 330440 TASK: ffff00004cd61540 COMMAND: "stress-ng" [CURRENT] vruntime=8444367524951, deadline=8444932411139, vlag=8444932411139, weight=3072, last_arrival=4002964107010, last_queued=0, exec_start=3872860294100, sum_exec_runtime=22252021900
...
[110] PID: 330291 TASK: ffff0000c02c9540 COMMAND: "stress-ng" vruntime=8444229273009, deadline=8444946073008, vlag=-2701415, weight=3072, last_arrival=4002964076840, last_queued=4002964550990, exec_start=3872859839290, sum_exec_runtime=22310951770
[100] PID: 97 TASK: ffff0000c2432a00 COMMAND: "kworker/0:1H" vruntime=127285720095197, deadline=127285720119423, vlag=48453, weight=90891264, last_arrival=3846600432710, last_queued=3846600721010, exec_start=3743307237970, sum_exec_runtime=413405210
[120] PID: 15 TASK: ffff0000c0368080 COMMAND: "ksoftirqd/0" vruntime=127285722433404, deadline=127285724533404, vlag=0, weight=1048576, last_arrival=3506755665780, last_queued=3506852159390, exec_start=3461615726670, sum_exec_runtime=16341041340
[120] PID: 50173 TASK: ffff0000741d8080 COMMAND: "kworker/0:0" vruntime=127285722960040, deadline=127285725060040, vlag=-414755, weight=1048576, last_arrival=3506828139580, last_queued=3506972354700, exec_start=3461676584440, sum_exec_runtime=84414080
[120] PID: 58662 TASK: ffff000091180080 COMMAND: "kworker/0:2" vruntime=127285723428168, deadline=127285725528168, vlag=3049158, weight=1048576, last_arrival=3505689085070, last_queued=3506848131990, exec_start=3460592328510, sum_exec_runtime=89193000
TASK 1 (systemd) is waiting for cgroup_mutex.
TASK 329296 (sh) holds cgroup_mutex and is waiting for cpus_read_lock.
TASK 50173 (kworker/0:0) holds cpus_read_lock but fails to get scheduled.
test_cg and TASK 97 may have suppressed TASK 50173, preventing it from
being scheduled for a long time; it therefore could not release its locks
in a timely manner, ultimately causing the hung task.
Fix this by adding an ENQUEUE_REWEIGHT_CURR flag and skipping the vlag
rescaling in place_entity() when reweighting the currently running entity.
For non-current entities, the existing logic is kept, since their
dequeue/enqueue does change avg_vruntime().
Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
Signed-off-by: Zicheng Qu <quzicheng@...wei.com>
---
kernel/sched/fair.c | 11 ++++++++++-
kernel/sched/sched.h | 1 +
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..3be42729049e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3787,7 +3787,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 	enqueue_load_avg(cfs_rq, se);
 	if (se->on_rq) {
-		place_entity(cfs_rq, se, 0);
+		place_entity(cfs_rq, se, curr ? ENQUEUE_REWEIGHT_CURR : 0);
 		update_load_add(&cfs_rq->load, se->load.weight);
 		if (!curr)
 			__enqueue_entity(cfs_rq, se);
@@ -5123,6 +5123,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		lag = se->vlag;
+		/*
+		 * ENQUEUE_REWEIGHT_CURR:
+		 * current running se (cfs_rq->curr) should skip vlag recalculation,
+		 * because avg_vruntime(...) hasn't changed.
+		 */
+		if (flags & ENQUEUE_REWEIGHT_CURR)
+			goto skip_lag_scale;
+
 		/*
 		 * If we want to place a task and preserve lag, we have to
 		 * consider the effect of the new entity on the weighted
@@ -5185,6 +5193,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		lag = div_s64(lag, load);
 	}
+skip_lag_scale:
 	se->vruntime = vruntime - lag;
 	if (se->rel_deadline) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d30cca6870f5..e3a43f94dd2f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2412,6 +2412,7 @@ extern const u32 sched_prio_to_wmult[40];
#define ENQUEUE_MIGRATED 0x00040000
#define ENQUEUE_INITIAL 0x00080000
#define ENQUEUE_RQ_SELECTED 0x00100000
+#define ENQUEUE_REWEIGHT_CURR 0x00200000
#define RETRY_TASK ((void *)-1UL)
--
2.34.1