Message-ID: <78c71eda-fa38-4a6b-be80-f74ec32751d9@huawei.com>
Date: Thu, 8 Jan 2026 14:27:44 +0800
From: Zicheng Qu <quzicheng@...wei.com>
To: <mingo@...hat.com>, <peterz@...radead.org>, <juri.lelli@...hat.com>,
<vincent.guittot@...aro.org>, <dietmar.eggemann@....com>,
<rostedt@...dmis.org>, <bsegall@...gle.com>, <mgorman@...e.de>,
<vschneid@...hat.com>, <linux-kernel@...r.kernel.org>
CC: <tanghui20@...wei.com>, <quzicheng@...wei.com>
Subject: Re: [PATCH] sched/fair: Fix vruntime drift by preventing double lag
scaling during reweight
Hi,
Just a gentle ping.
I can reproduce the same issue on mainline 6.19.0-rc4 as well.
The observed behavior matches the problem described in the patch below.
Sharing the mainline dmesg in case it is useful.
[ 1217.519433] INFO: task systemd:1 blocked for more than 606 seconds.
[ 1217.526904] Not tainted 6.19.0-rc4-qzc-test-hungtask-reweight_entity+ #5
[ 1217.535242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1217.544177] task:systemd state:D stack:0 pid:1 tgid:1 ppid:0 task_flags:0x400100 flags:0x00000000
[ 1217.556312] Call trace:
[ 1217.559665] __switch_to+0xdc/0x108 (T)
[ 1217.564401] __schedule+0x288/0x650
[ 1217.568786] schedule+0x30/0xa8
[ 1217.572821] schedule_preempt_disabled+0x18/0x30
[ 1217.578324] __mutex_lock.constprop.0+0x2fc/0xc20
[ 1217.583906] __mutex_lock_slowpath+0x1c/0x30
[ 1217.589054] mutex_lock+0x50/0x68
[ 1217.593247] cgroup_kn_lock_live+0x60/0x158
[ 1217.598302] cgroup_mkdir+0x44/0x218
[ 1217.602744] kernfs_iop_mkdir+0x6c/0xc8
[ 1217.607444] vfs_mkdir+0x218/0x318
[ 1217.611711] do_mkdirat+0x198/0x200
[ 1217.616056] __arm64_sys_mkdirat+0x38/0x58
[ 1217.621007] invoke_syscall+0x50/0x120
[ 1217.625610] el0_svc_common.constprop.0+0x48/0xf0
[ 1217.631157] do_el0_svc+0x24/0x38
[ 1217.635365] el0_svc+0x34/0x170
[ 1217.639348] el0t_64_sync_handler+0xa0/0xe8
[ 1217.644371] el0t_64_sync+0x190/0x198
[ 1217.649244] INFO: task systemd:1 is blocked on a mutex likely owned by task cgexec:105632.
[ 1217.658479] INFO: task kworker/0:1:11 blocked for more than 606 seconds.
[ 1217.666204] Not tainted 6.19.0-rc4-qzc-test-hungtask-reweight_entity+ #5
[ 1217.674429] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1217.683211] task:kworker/0:1 state:D stack:0 pid:11 tgid:11 ppid:2 task_flags:0x4208060 flags:0x00000010
[ 1217.695285] Workqueue: events vmstat_shepherd
[ 1217.700480] Call trace:
[ 1217.703767] __switch_to+0xdc/0x108 (T)
[ 1217.708438] __schedule+0x288/0x650
[ 1217.712768] schedule+0x30/0xa8
[ 1217.716750] percpu_rwsem_wait+0xdc/0x208
[ 1217.721592] __percpu_down_read+0x64/0x110
Thanks,
Zicheng
On 12/26/2025 8:17 AM, Zicheng Qu wrote:
> In reweight_entity(), when reweighting the currently running entity
> (se == cfs_rq->curr), the entity stays in the runqueue context without
> going through a full dequeue/enqueue cycle. This means avg_vruntime()
> remains constant across the reweight operation.
>
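
To save reviewers a lookup, here is a condensed, paraphrased sketch of
the reweight_entity() flow for the curr case (not a verbatim copy of
fair.c); it shows that the running entity is neither dequeued from nor
re-enqueued onto the RB tree, and that its vlag is already rescaled
once in this path:

    static void reweight_entity(struct cfs_rq *cfs_rq,
                                struct sched_entity *se,
                                unsigned long weight)
    {
            bool curr = cfs_rq->curr == se;

            if (se->on_rq) {
                    update_curr(cfs_rq);
                    update_entity_lag(cfs_rq, se);
                    if (!curr)                  /* curr is not dequeued */
                            __dequeue_entity(cfs_rq, se);
                    update_load_sub(&cfs_rq->load, se->load.weight);
            }
            dequeue_load_avg(cfs_rq, se);

            /* first rescale: adjust vlag for the weight change */
            se->vlag = div_s64(se->vlag * se->load.weight, weight);

            update_load_set(&se->load, weight);
            /* ... */
            enqueue_load_avg(cfs_rq, se);
            if (se->on_rq) {
                    /* second rescale happens inside place_entity() */
                    place_entity(cfs_rq, se, 0);
                    update_load_add(&cfs_rq->load, se->load.weight);
                    if (!curr)                  /* nor re-enqueued */
                            __enqueue_entity(cfs_rq, se);
            }
    }
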
> However, the current implementation calls place_entity(..., 0) at the
> end of reweight_entity(). Under EEVDF, place_entity() is designed to
> handle entities entering the runqueue and calculates the virtual lag
> (vlag) to account for the change in the weighted average vruntime (V)
> using the formula:
>
> vlag' = vlag * (W + w_i) / W
>
> Where 'W' is the current aggregate weight (including
> cfs_rq->curr->load.weight) and 'w_i' is the weight of the entity being
> enqueued (here, se is cfs_rq->curr itself).
>
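
For reference, the scaling in place_entity() that implements the
formula above looks roughly like this (paraphrased from fair.c, not a
verbatim quote):

    lag = se->vlag;

    /* W: avg_load of the queued entities, plus curr if it is on_rq */
    load = cfs_rq->avg_load;
    if (curr && curr->on_rq)
            load += scale_load_down(curr->load.weight);

    /* vlag' = vlag * (W + w_i) / W */
    lag *= load + scale_load_down(se->load.weight);
    lag = div_s64(lag, load);
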
> This results in a "double scaling" for the running entity:
> 1. reweight_entity() already rescales se->vlag for the new weight
>    ratio.
> 2. place_entity() then mistakenly applies the (W + w_i)/W scaling
>    again, treating the reweight as a fresh enqueue into a larger total
>    weight pool.
>
> This can incorrectly amplify the entity's vlag (if positive) or
> suppress it (if negative) during the reweight.
>
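
To make the double scaling concrete, here is a small user-space
arithmetic sketch; the numbers are purely illustrative and not taken
from the dump below:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            /*
             * A running entity with +4ms of positive lag is reweighted
             * from weight 2048 down to 1024. W is the aggregate weight
             * seen by place_entity(), already including the new weight.
             */
            int64_t vlag = 4000000;
            int64_t w_old = 2048, w_new = 1024, W = 4096;

            /* 1. reweight_entity() rescales vlag for the new weight */
            vlag = vlag * w_old / w_new;                    /* 8000000 */

            /*
             * 2. place_entity() scales by (W + w_i) / W on top of that,
             *    even though V did not change for the running entity.
             */
            int64_t twice = vlag * (W + w_new) / W;         /* 10000000 */

            printf("after reweight rescale:   %lld\n", (long long)vlag);
            printf("after extra place scale:  %lld\n", (long long)twice);
            return 0;
    }

The extra 25% inflation is exactly the spurious (W + w_i)/W factor; with
a negative lag the same factor pushes it further negative, suppressing
the entity more than intended.
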
> In environments with frequent cgroup throttle/unthrottle operations,
> this scaling error manifests as vruntime drift.
>
> A hung task was observed, as shown below:
> crash> runq -c 0 -g
> CPU 0
> CURRENT: PID: 330440 TASK: ffff00004cd61540 COMMAND: "stress-ng"
> ROOT_TASK_GROUP: ffff8001025fa4c0 RT_RQ: ffff0000fff42500
> [no tasks queued]
> ROOT_TASK_GROUP: ffff8001025fa4c0 CFS_RQ: ffff0000fff422c0
> TASK_GROUP: ffff0000c130fc00 CFS_RQ: ffff00009125a400 <test_cg> cfs_bandwidth: period=100000000, quota=18446744073709551615, gse: 0xffff000091258c00, vruntime=127285708384434, deadline=127285714880550, vlag=11721467, weight=338965, my_q=ffff00009125a400, cfs_rq: avg_vruntime=0, zero_vruntime=2029704519792, avg_load=0, nr_running=1
> TASK_GROUP: ffff0000d7cc8800 CFS_RQ: ffff0000c8f86800 <test_test329274_1> cfs_bandwidth: period=14000000, quota=14000000, gse: 0xffff0000c8f86400, vruntime=2034894470719, deadline=2034898697770, vlag=0, weight=215291, my_q=ffff0000c8f86800, cfs_rq: avg_vruntime=-422528991, zero_vruntime=8444226681954, avg_load=54, nr_running=19
> [110] PID: 330440 TASK: ffff00004cd61540 COMMAND: "stress-ng" [CURRENT] vruntime=8444367524951, deadline=8444932411139, vlag=8444932411139, weight=3072, last_arrival=4002964107010, last_queued=0, exec_start=3872860294100, sum_exec_runtime=22252021900
> ...
> [110] PID: 330291 TASK: ffff0000c02c9540 COMMAND: "stress-ng" vruntime=8444229273009, deadline=8444946073008, vlag=-2701415, weight=3072, last_arrival=4002964076840, last_queued=4002964550990, exec_start=3872859839290, sum_exec_runtime=22310951770
> [100] PID: 97 TASK: ffff0000c2432a00 COMMAND: "kworker/0:1H" vruntime=127285720095197, deadline=127285720119423, vlag=48453, weight=90891264, last_arrival=3846600432710, last_queued=3846600721010, exec_start=3743307237970, sum_exec_runtime=413405210
> [120] PID: 15 TASK: ffff0000c0368080 COMMAND: "ksoftirqd/0" vruntime=127285722433404, deadline=127285724533404, vlag=0, weight=1048576, last_arrival=3506755665780, last_queued=3506852159390, exec_start=3461615726670, sum_exec_runtime=16341041340
> [120] PID: 50173 TASK: ffff0000741d8080 COMMAND: "kworker/0:0" vruntime=127285722960040, deadline=127285725060040, vlag=-414755, weight=1048576, last_arrival=3506828139580, last_queued=3506972354700, exec_start=3461676584440, sum_exec_runtime=84414080
> [120] PID: 58662 TASK: ffff000091180080 COMMAND: "kworker/0:2" vruntime=127285723428168, deadline=127285725528168, vlag=3049158, weight=1048576, last_arrival=3505689085070, last_queued=3506848131990, exec_start=3460592328510, sum_exec_runtime=89193000
>
> TASK 1 (systemd) is waiting for cgroup_mutex.
> TASK 329296 (sh) holds cgroup_mutex and is waiting for cpus_read_lock.
> TASK 50173 (kworker/0:0) holds cpus_read_lock but fails to get
> scheduled.
> test_cg and TASK 97 may have suppressed TASK 50173, preventing it from
> being scheduled for a long time, so it cannot release its locks in a
> timely manner, which ultimately results in the hung task.
>
> Fix this by adding an ENQUEUE_REWEIGHT_CURR flag and skipping the vlag
> rescaling in place_entity() when reweighting the currently running
> entity. For non-current entities, the existing logic is kept, since
> their dequeue/enqueue cycle does change avg_vruntime().
>
> Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag")
> Signed-off-by: Zicheng Qu <quzicheng@...wei.com>
> ---
> kernel/sched/fair.c | 11 ++++++++++-
> kernel/sched/sched.h | 1 +
> 2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da46c3164537..3be42729049e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3787,7 +3787,7 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>
> enqueue_load_avg(cfs_rq, se);
> if (se->on_rq) {
> - place_entity(cfs_rq, se, 0);
> + place_entity(cfs_rq, se, curr ? ENQUEUE_REWEIGHT_CURR : 0);
> update_load_add(&cfs_rq->load, se->load.weight);
> if (!curr)
> __enqueue_entity(cfs_rq, se);
> @@ -5123,6 +5123,14 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>
> lag = se->vlag;
>
> + /*
> + * ENQUEUE_REWEIGHT_CURR:
> + * current running se (cfs_rq->curr) should skip vlag recalculation,
> + * because avg_vruntime(...) hasn't changed.
> + */
> + if (flags & ENQUEUE_REWEIGHT_CURR)
> + goto skip_lag_scale;
> +
> /*
> * If we want to place a task and preserve lag, we have to
> * consider the effect of the new entity on the weighted
> @@ -5185,6 +5193,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> lag = div_s64(lag, load);
> }
>
> +skip_lag_scale:
> se->vruntime = vruntime - lag;
>
> if (se->rel_deadline) {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index d30cca6870f5..e3a43f94dd2f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2412,6 +2412,7 @@ extern const u32 sched_prio_to_wmult[40];
> #define ENQUEUE_MIGRATED 0x00040000
> #define ENQUEUE_INITIAL 0x00080000
> #define ENQUEUE_RQ_SELECTED 0x00100000
> +#define ENQUEUE_REWEIGHT_CURR 0x00200000
>
> #define RETRY_TASK ((void *)-1UL)
>