Message-ID: <20260106075239.279072-1-kprateek.nayak@amd.com>
Date: Tue, 6 Jan 2026 07:52:39 +0000
From: K Prateek Nayak <kprateek.nayak@....com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Sebastian Andrzej Siewior
	<bigeasy@...utronix.de>, Clark Williams <clrkwllms@...nel.org>,
	<linux-kernel@...r.kernel.org>, <linux-rt-devel@...ts.linux.dev>
CC: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
	<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
	<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, Tejun Heo
	<tj@...nel.org>, "Gautham R. Shenoy" <gautham.shenoy@....com>, "K Prateek
 Nayak" <kprateek.nayak@....com>
Subject: [RFC PATCH] sched/core: Stash task priority after dequeue and put_prev_task() in sched_change_begin()

When running the amd-pstate driver on a PREEMPT_RT kernel on a shared
memory system (Zen3 and prior), the following splat was observed,
triggered by the WARN_ON_ONCE() in rq_pin_lock():

    ------------[ cut here ]------------
    WARNING: kernel/sched/sched.h:1807 at __schedule+0x122/0x17c0, CPU#8: swapper/0/1
    Modules linked in:
    CPU: 8 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.19.0-rc1-rt-amd-pstate+ #153 PREEMPT_{RT,(full)}
    Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
    RIP: 0010:__schedule+0x122/0x17c0
    Code: 3...
    RSP: 0018:ffffd2f8800e7a50 EFLAGS: 00010082
    RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000005
    RDX: ffff89f2fd41d1e0 RSI: 0000000000000000 RDI: ffff89f2fd432480
    RBP: ffffd2f8800e7af8 R08: 0000000000000643 R09: 000000037328de2f
    R10: 0000000373168f59 R11: 000000037328de2f R12: 0000000000000001
    R13: ffff89f2fd432480 R14: 0000000000000008 R15: ffff89b4d9072810
    FS:  0000000000000000(0000) GS:ffff89f34fee5000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 000000807dc4a001 CR4: 0000000000f70ef0
    PKRU: 55555554
    Call Trace:
     <TASK>
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? psi_group_change+0x1ff/0x460
     ? srso_alias_return_thunk+0x5/0xfbef5
     preempt_schedule+0x41/0x60
     preempt_schedule_thunk+0x16/0x30
     try_to_wake_up+0x341/0x7c0
     autoremove_wake_function+0x12/0x40
     __wake_up_common+0x78/0xa0
     __wake_up+0x31/0x50
     send_pcc_cmd+0x133/0x310
     cppc_set_reg_val+0x10e/0x220
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? amd_pstate_init_boost_support+0x33/0xb0
     amd_pstate_cpu_init+0x159/0x270
     ? srso_alias_return_thunk+0x5/0xfbef5
     cpufreq_online+0x6b0/0xd90
     ? rtlock_slowlock_locked+0xce1/0xd30
     cpufreq_add_dev+0xa9/0xd0
     subsys_interface_register+0x10b/0x120
     ? srso_alias_return_thunk+0x5/0xfbef5
     ? __pfx_amd_pstate_init+0x10/0x10
     cpufreq_register_driver+0x1a7/0x370
     amd_pstate_register_driver.part.0+0x2a/0xa0
     amd_pstate_init+0xe3/0x3a0
     ? __pfx_amd_pstate_init+0x10/0x10
     do_one_initcall+0x47/0x310
     kernel_init_freeable+0x33c/0x500
     ? __pfx_kernel_init+0x10/0x10
     kernel_init+0x1b/0x1f0
     ? __pfx_kernel_init+0x10/0x10
     ret_from_fork+0x222/0x280
     ? __pfx_kernel_init+0x10/0x10
     ret_from_fork_asm+0x1a/0x30
     </TASK>
    ---[ end trace 0000000000000000 ]---

Inspecting the set of events that led to the warning being triggered
showed the following:

    systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed begin!

    systemd-1  [008] dN.31 ...: sched_change_begin: Begin!
    systemd-1  [008] dN.31 ...: sched_change_begin: Before dequeue_task()!
    systemd-1  [008] dN.31 ...: update_curr_dl_se: update_curr_dl_se: ENQUEUE_REPLENISH
    systemd-1  [008] dN.31 ...: enqueue_dl_entity: enqueue_dl_entity: ENQUEUE_REPLENISH
    systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish before: 14815760217
    systemd-1  [008] dN.31 ...: replenish_dl_entity: Replenish after: 14816960047
    systemd-1  [008] dN.31 ...: sched_change_begin: Before put_prev_task()!

    systemd-1  [008] dN.31 ...: sched_change_end: Before enqueue_task()!
    systemd-1  [008] dN.31 ...: sched_change_end: Before put_prev_task()!
    systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing pull task on prio change: 14815760217 -> 14816960047
    systemd-1  [008] dN.31 ...: prio_changed_dl: Queuing balance callback!
    systemd-1  [008] dN.31 ...: sched_change_end: End!

    systemd-1  [008] dN.31 ...: do_set_cpus_allowed: set_cpus_allowed end!
    systemd-1  [008] dN.21 ...: __schedule: Woops! Balance callback found!

1. sched_change_begin() from guard(sched_change) in
   do_set_cpus_allowed() stashes the priority which, for a deadline
   task, is "p->dl.deadline".
2. The dequeue of the deadline task replenishes the deadline.
3. The task is enqueued back after the guard's scope ends and, since no
   *_CLASS flag is set, sched_change_end() calls
   dl_sched_class->prio_changed() which compares the deadline.
4. Since the deadline was moved on dequeue, prio_changed_dl() sees the
   value differ from the stashed value and queues a balance pull
   callback.
5. do_set_cpus_allowed() finishes and drops the rq_lock without calling
   do_balance_callbacks().
6. Grabbing the rq_lock in the subsequent __schedule() triggers the
   warning since the balance pull callback was never executed before
   the lock was dropped.

Since dequeuing a deadline task can push its deadline, stash the task
priority at the end of sched_change_begin(), after dequeue_task() and
put_prev_task().

Modifications to the priority within the sched_change guard's scope are
still detected, since sched_change_end() supplies the priority stashed
at the end of the constructor's execution as the old priority to
sched_class->prio_changed().

Fixes: 6455ad5346c9c ("sched: Move sched_class::prio_changed() into the change pattern")
Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
---
Since I'm not too familiar with the deadline bits, I've marked this as
RFC for now. If you require any data from my setup, please do let me
know.

This patch is based on:

  git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

at commit 6ab7973f2540 ("sched/fair: Fix sched_avg fold").

To run with amd-pstate on PREEMPT_RT, you'll first need the patches from
https://lore.kernel.org/lkml/20260106073608.278644-1-kprateek.nayak@amd.com/
Most of the testing was done on top of Rafael's tree (v6.19.0-rc4 based)
with the above series where the issue was first seen.
---
 kernel/sched/core.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b17d8e3cb55..ce05957e8055 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -10791,20 +10791,19 @@ struct sched_change_ctx *sched_change_begin(struct task_struct *p, unsigned int
 		.running = task_current_donor(rq, p),
 	};
 
-	if (!(flags & DEQUEUE_CLASS)) {
-		if (p->sched_class->get_prio)
-			ctx->prio = p->sched_class->get_prio(rq, p);
-		else
-			ctx->prio = p->prio;
-	}
-
 	if (ctx->queued)
 		dequeue_task(rq, p, flags);
 	if (ctx->running)
 		put_prev_task(rq, p);
 
-	if ((flags & DEQUEUE_CLASS) && p->sched_class->switched_from)
+	if (!(flags & DEQUEUE_CLASS)) {
+		if (p->sched_class->get_prio)
+			ctx->prio = p->sched_class->get_prio(rq, p);
+		else
+			ctx->prio = p->prio;
+	} else if (p->sched_class->switched_from) {
 		p->sched_class->switched_from(rq, p);
+	}
 
 	return ctx;
 }

base-commit: 6ab7973f254071faf20fe5fcc502a3fe9ca14a47
-- 
2.34.1

