[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20250731093008.45267-2-phasta@kernel.org>
Date: Thu, 31 Jul 2025 11:30:09 +0200
From: Philipp Stanner <phasta@...nel.org>
To: Matthew Brost <matthew.brost@...el.com>,
Danilo Krummrich <dakr@...nel.org>,
Philipp Stanner <phasta@...nel.org>,
Christian König <ckoenig.leichtzumerken@...il.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>,
David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>
Cc: dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org,
James Flowers <bold.zone2373@...tmail.com>
Subject: [PATCH] drm/sched: Document race condition in drm_sched_fini()
In drm_sched_fini() all entities are marked as stopped - without taking
the appropriate lock, because that would deadlock. That means that
drm_sched_fini() and drm_sched_entity_push_job() can race against each
other.
This should most likely be fixed by establishing the rule that all
entities associated with a scheduler must be torn down first. Then,
however, the locking should be removed from drm_sched_fini() alltogether
with an appropriate comment.
Reported-by: James Flowers <bold.zone2373@...tmail.com>
Link: https://lore.kernel.org/dri-devel/20250720235748.2798-1-bold.zone2373@fastmail.com/
Signed-off-by: Philipp Stanner <phasta@...nel.org>
---
drivers/gpu/drm/scheduler/sched_main.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5a550fd76bf0..738aefed727c 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1424,6 +1424,22 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
* Prevents reinsertion and marks job_queue as idle,
* it will be removed from the rq in drm_sched_entity_fini()
* eventually
+ *
+ * FIXME:
+ * This lacks the proper spin_lock(&s_entity->lock) and
+ * is, therefore, a race condition. Most notably, it
+ * can race with drm_sched_entity_push_job(). The lock
+ * cannot be taken here, however, because this would
+ * lead to lock inversion -> deadlock.
+ *
+ * The best solution probably is to enforce the life
+ * time rule of all entities having to be torn down
+ * before their scheduler. Then, however, locking could
+ * be dropped alltogether from this function.
+ *
+ * For now, this remains a potential race in all
+ * drivers that keep entities aliver for longer than
+ * the scheduler.
*/
s_entity->stopped = true;
spin_unlock(&rq->lock);
--
2.49.0
Powered by blists - more mailing lists