Message-ID: <20260106075239.279072-1-kprateek.nayak@amd.com>
Date: Tue, 6 Jan 2026 07:52:39 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
<vincent.guittot@...aro.org>, Sebastian Andrzej Siewior
<bigeasy@...utronix.de>, Clark Williams <clrkwllms@...nel.org>,
<linux-kernel@...r.kernel.org>, <linux-rt-devel@...ts.linux.dev>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, Tejun Heo
<tj@...nel.org>, "Gautham R. Shenoy" <gautham.shenoy@....com>, "K Prateek
Nayak" <kprateek.nayak@....com>
Subject: [RFC PATCH] sched/core: Stash task priority after dequeue and put_prev_task() in sched_change_begin()
When running the amd-pstate driver on a PREEMPT_RT kernel on a shared
memory system (Zen3 and prior), the WARN_ON_ONCE() in rq_pin_lock()
triggered with the following splat:
------------[ cut here ]------------
WARNING: kernel/sched/sched.h:1807 at __schedule+0x122/0x17c0, CPU#8: swapper/0/1
Modules linked in:
CPU: 8 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.19.0-rc1-rt-amd-pstate+ #153 PREEMPT_{RT,(full)}
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:__schedule+0x122/0x17c0
Code: 3...
RSP: 0018:ffffd2f8800e7a50 EFLAGS: 00010082
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000005
RDX: ffff89f2fd41d1e0 RSI: 0000000000000000 RDI: ffff89f2fd432480
RBP: ffffd2f8800e7af8 R08: 0000000000000643 R09: 000000037328de2f
R10: 0000000373168f59 R11: 000000037328de2f R12: 0000000000000001
R13: ffff89f2fd432480 R14: 0000000000000008 R15: ffff89b4d9072810
FS: 0000000000000000(0000) GS:ffff89f34fee5000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000807dc4a001 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
<TASK>
? srso_alias_return_thunk+0x5/0xfbef5
? psi_group_change+0x1ff/0x460
? srso_alias_return_thunk+0x5/0xfbef5
preempt_schedule+0x41/0x60
preempt_schedule_thunk+0x16/0x30
try_to_wake_up+0x341/0x7c0
autoremove_wake_function+0x12/0x40
__wake_up_common+0x78/0xa0
__wake_up+0x31/0x50
send_pcc_cmd+0x133/0x310
cppc_set_reg_val+0x10e/0x220
? srso_alias_return_thunk+0x5/0xfbef5
? amd_pstate_init_boost_support+0x33/0xb0
amd_pstate_cpu_init+0x159/0x270
? srso_alias_return_thunk+0x5/0xfbef5
cpufreq_online+0x6b0/0xd90
? rtlock_slowlock_locked+0xce1/0xd30
cpufreq_add_dev+0xa9/0xd0
subsys_interface_register+0x10b/0x120
? srso_alias_return_thunk+0x5/0xfbef5
? __pfx_amd_pstate_init+0x10/0x10
cpufreq_register_driver+0x1a7/0x370
amd_pstate_register_driver.part.0+0x2a/0xa0
amd_pstate_init+0xe3/0x3a0
? __pfx_amd_pstate_init+0x10/0x10
do_one_initcall+0x47/0x310
kernel_init_freeable+0x33c/0x500
? __pfx_kernel_init+0x10/0x10
kernel_init+0x1b/0x1f0
? __pfx_kernel_init+0x10/0x10
ret_from_fork+0x222/0x280
? __pfx_kernel_init+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
---[ end trace 0000000000000000 ]---
Inspecting the set of events that led to the warning being triggered
showed the following:
systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin!
systemd-1 [008] dN.31 ...: sched_change_begin: Begin!
systemd-1 [008] dN.31 ...: sched_change_begin: Before dequeue_task()!
systemd-1 [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH
systemd-1 [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH
systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217
systemd-1 [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047
systemd-1 [008] dN.31 ...: sched_change_begin: Before put_prev_task()!
systemd-1 [008] dN.31 ...: sched_change_end: Before enqueue_task()!
systemd-1 [008] dN.31 ...: sched_change_end: Before put_prev_task()!
systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047
systemd-1 [008] dN.31 ...: prio_changed_dl: Queuing balance callback!
systemd-1 [008] dN.31 ...: sched_change_end: End!
systemd-1 [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end!
systemd-1 [008] dN.21 ...: __schedule: Woops! Balance callback found!
1. sched_change_begin(), invoked via guard(sched_change) in
   do_set_cpus_allowed(), stashes the priority, which for a deadline
   task is "p->dl.deadline".
2. The dequeue of the deadline task replenishes it, which moves the
   deadline.
3. The task is enqueued back when the guard's scope ends and, since no
   *_CLASS flag is set, sched_change_end() calls
   dl_sched_class->prio_changed(), which compares the deadlines.
4. Since the deadline was moved on dequeue, prio_changed_dl() sees the
   current value differ from the stashed value and queues a balance
   pull callback.
5. do_set_cpus_allowed() finishes and drops the rq_lock without calling
   do_balance_callbacks().
6. Grabbing the rq_lock at the subsequent __schedule() triggers the
   warning since the balance pull callback was never executed before
   the lock was dropped.
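
The six steps above can be sketched as a small userspace model. This is
only an illustration under stated assumptions, not kernel code: the
toy_* types and helpers are hypothetical stand-ins, the replenish is
reduced to adding a fixed "refill" to the deadline, and the comparison
in toy_prio_changed() is simplified to a plain inequality.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins; these are NOT the kernel's task_struct or rq. */
struct toy_task {
	uint64_t deadline;	/* plays the role of p->dl.deadline */
	uint64_t refill;	/* hypothetical amount a replenish adds */
};

struct toy_rq {
	bool balance_cb_queued;	/* models a pending balance callback */
};

/* Step 2: dequeueing the task replenishes it and moves the deadline. */
static void toy_dequeue(struct toy_task *p)
{
	assert(p->refill > 0);	/* the model assumes a replenish always moves it */
	p->deadline += p->refill;
}

/* Steps 3-4: queue a callback when the deadline differs from the stash. */
static void toy_prio_changed(struct toy_rq *rq, const struct toy_task *p,
			     uint64_t old_deadline)
{
	if (p->deadline != old_deadline)
		rq->balance_cb_queued = true;
}

/*
 * Models an affinity change with the stash taken before the dequeue.
 * Returns true if a callback is still pending when the lock would be
 * dropped (steps 5-6), i.e. the condition the warning complains about.
 */
static bool toy_change_stash_first(struct toy_task *p)
{
	struct toy_rq rq = { .balance_cb_queued = false };
	uint64_t stashed = p->deadline;		/* step 1 */

	toy_dequeue(p);				/* step 2 */
	/* guard body: only the affinity changes, never the deadline */
	toy_prio_changed(&rq, p, stashed);	/* steps 3-4 */

	return rq.balance_cb_queued;
}
```

Seeding the model with the deadline from the trace (14815760217) and a
refill of 1199830 reproduces the 14815760217 -> 14816960047 transition,
and toy_change_stash_first() reports a pending callback even though the
guard body never touched the deadline.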
Since the dequeue of a deadline task can push its deadline, stash the
task priority towards the end of sched_change_begin(), after the
dequeue and put_prev_task().
Any modification to the priority within the sched_change guard's scope
is still detected, since sched_change_end() supplies the priority
stashed at the end of the constructor's execution as the old priority
to sched_class->prio_changed().
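
The effect of moving the stash can be sketched with a small userspace
model. It is an illustration only, not the kernel implementation: the
toy_* names are hypothetical, the replenish is modelled as adding a
fixed "refill", and "delta" stands for any change made to the deadline
inside the guard's scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins; these are NOT the kernel's task_struct or rq. */
struct toy_task {
	uint64_t deadline;	/* plays the role of p->dl.deadline */
	uint64_t refill;	/* hypothetical amount a replenish adds */
};

struct toy_rq {
	bool balance_cb_queued;	/* models a pending balance callback */
};

/* Dequeueing the task replenishes it and moves the deadline. */
static void toy_dequeue(struct toy_task *p)
{
	assert(p->refill > 0);	/* the model assumes a replenish always moves it */
	p->deadline += p->refill;
}

/* Queue a callback when the deadline differs from the stashed value. */
static void toy_prio_changed(struct toy_rq *rq, const struct toy_task *p,
			     uint64_t old_deadline)
{
	if (p->deadline != old_deadline)
		rq->balance_cb_queued = true;
}

/*
 * Models the change pattern with the stash taken after the dequeue.
 * 'delta' is added to the deadline inside the guard's scope; pass 0 for
 * a change (such as an affinity update) that leaves the deadline alone.
 * Returns true if toy_prio_changed() queued a callback.
 */
static bool toy_change_stash_last(struct toy_task *p, uint64_t delta)
{
	struct toy_rq rq = { .balance_cb_queued = false };
	uint64_t stashed;

	toy_dequeue(p);
	stashed = p->deadline;	/* stash after the dequeue moved the deadline */
	p->deadline += delta;	/* guard body: optional genuine change */
	toy_prio_changed(&rq, p, stashed);

	return rq.balance_cb_queued;
}
```

In this model a call with delta == 0 queues nothing, because the
replenish happens before the stash, while a non-zero delta is still
reported as a change.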
Fixes: 6455ad5346c9c ("sched: Move sched_class::prio_changed() into the change pattern")
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
Since I'm not too familiar with the deadline bits, I've marked this as
RFC for now. If you require any data from my setup, please do let me
know.
Patches are based on:
git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
at commit 6ab7973f2540 ("sched/fair: Fix sched_avg fold").
To run with amd-pstate on PREEMPT_RT, you'll first need the patches from
https://lore.kernel.org/lkml/20260106073608.278644-1-kprateek.nayak@amd.com/
Most of the testing was done on top of Rafael's tree (v6.19.0-rc4 based)
with the above series where the issue was first seen.
---
kernel/sched/core.c | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b17d8e3cb55..ce05957e8055 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10791,20 +10791,19 @@ struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int
.running = task_current_donor(rq, p),
};
- if (!(flags & DEQUEUE_CLASS)) {
- if (p->sched_class->get_prio)
- ctx->prio = p->sched_class->get_prio(rq, p);
- else
- ctx->prio = p->prio;
- }
-
if (ctx->queued)
dequeue_task(rq, p, flags);
if (ctx->running)
put_prev_task(rq, p);
- if ((flags & DEQUEUE_CLASS) && p->sched_class->switched_from)
+ if (!(flags & DEQUEUE_CLASS)) {
+ if (p->sched_class->get_prio)
+ ctx->prio = p->sched_class->get_prio(rq, p);
+ else
+ ctx->prio = p->prio;
+ } else if (p->sched_class->switched_from) {
p->sched_class->switched_from(rq, p);
+ }
return ctx;
}
base-commit: 6ab7973f254071faf20fe5fcc502a3fe9ca14a47
--
2.34.1