Message-ID: <20260108083019.63532-2-phasta@kernel.org>
Date: Thu, 8 Jan 2026 09:30:20 +0100
From: Philipp Stanner <phasta@...nel.org>
To: Matthew Brost <matthew.brost@...el.com>,
Danilo Krummrich <dakr@...nel.org>,
Philipp Stanner <phasta@...nel.org>,
Christian König <ckoenig.leichtzumerken@...il.com>
Cc: dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Subject: [PATCH] drm/sched: Remove racy hack from drm_sched_fini()
drm_sched_fini() contained a hack to work around a race in amdgpu.
According to AMD, the hack is no longer necessary. In case there were
still undetected users, commit 975ca62a014c ("drm/sched: Add warning
for removing hack in drm_sched_fini()") added a warning one release
cycle ago.

Thus, the hack can now be safely removed.

Remove the hack.
Signed-off-by: Philipp Stanner <phasta@...nel.org>
---
As hinted at in the commit message, I want to cozily queue this one up
for the next merge window, since we have been printing that warning
since the last merge window already.

If someone has concerns, I'm also happy to delay this patch for a few
more releases.
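
For reference, below is a minimal sketch (the driver struct and
function names are made up for illustration) of the teardown ordering
that drivers must enforce themselves now that the hack is gone: all
entities have to be torn down before their scheduler.

	/*
	 * Hypothetical driver teardown. drm_sched_entity_fini() removes
	 * the entity from its runqueue and marks it stopped, so once all
	 * entities are gone, drm_sched_fini() no longer has anything to
	 * clean up behind the driver's back.
	 */
	static void mydrv_device_fini(struct mydrv_device *mdev)
	{
		/* Tear down entities first ... */
		drm_sched_entity_fini(&mdev->entity);

		/* ... only then tear down the scheduler itself. */
		drm_sched_fini(&mdev->sched);
	}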
---
drivers/gpu/drm/scheduler/sched_main.c | 38 +-------------------------
1 file changed, 1 insertion(+), 37 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 1d4f1b822e7b..381c1694a12e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1416,48 +1416,12 @@ static void drm_sched_cancel_remaining_jobs(struct drm_gpu_scheduler *sched)
*/
void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
- struct drm_sched_entity *s_entity;
int i;

drm_sched_wqueue_stop(sched);

- for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
- struct drm_sched_rq *rq = sched->sched_rq[i];
-
- spin_lock(&rq->lock);
- list_for_each_entry(s_entity, &rq->entities, list) {
- /*
- * Prevents reinsertion and marks job_queue as idle,
- * it will be removed from the rq in drm_sched_entity_fini()
- * eventually
- *
- * FIXME:
- * This lacks the proper spin_lock(&s_entity->lock) and
- * is, therefore, a race condition. Most notably, it
- * can race with drm_sched_entity_push_job(). The lock
- * cannot be taken here, however, because this would
- * lead to lock inversion -> deadlock.
- *
- * The best solution probably is to enforce the life
- * time rule of all entities having to be torn down
- * before their scheduler. Then, however, locking could
- * be dropped alltogether from this function.
- *
- * For now, this remains a potential race in all
- * drivers that keep entities alive for longer than
- * the scheduler.
- *
- * The READ_ONCE() is there to make the lockless read
- * (warning about the lockless write below) slightly
- * less broken...
- */
- if (!READ_ONCE(s_entity->stopped))
- dev_warn(sched->dev, "Tearing down scheduler with active entities!\n");
- s_entity->stopped = true;
- }
- spin_unlock(&rq->lock);
+ for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++)
kfree(sched->sched_rq[i]);
- }

/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
wake_up_all(&sched->job_scheduled);
--
2.49.0