Message-ID: <411190d4-92d7-4e95-acac-b39afa438c0f@ursulin.net>
Date: Fri, 31 Oct 2025 11:50:15 +0000
From: Tvrtko Ursulin <tursulin@...ulin.net>
To: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@....com>,
Matthew Brost <matthew.brost@...el.com>, Danilo Krummrich <dakr@...nel.org>,
Philipp Stanner <phasta@...nel.org>,
Christian König <ckoenig.leichtzumerken@...il.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
Sumit Semwal <sumit.semwal@...aro.org>
Cc: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>,
Christian König <christian.koenig@....com>,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-media@...r.kernel.org, linaro-mm-sig@...ts.linaro.org
Subject: Re: [PATCH v2] drm/sched: Fix deadlock in drm_sched_entity_kill_jobs_cb
On 31/10/2025 09:07, Pierre-Eric Pelloux-Prayer wrote:
> The Mesa issue referenced below pointed out a possible deadlock:
>
> [ 1231.611031] Possible interrupt unsafe locking scenario:
>
> [ 1231.611033]        CPU0                    CPU1
> [ 1231.611034]        ----                    ----
> [ 1231.611035]   lock(&xa->xa_lock#17);
> [ 1231.611038]                                local_irq_disable();
> [ 1231.611039]                                lock(&fence->lock);
> [ 1231.611041]                                lock(&xa->xa_lock#17);
> [ 1231.611044]   <Interrupt>
> [ 1231.611045]     lock(&fence->lock);
> [ 1231.611047]
>                   *** DEADLOCK ***
>
> In this example, CPU0 would be any function accessing job->dependencies
> through the xa_* helpers without disabling interrupts (e.g.
> drm_sched_job_add_dependency, drm_sched_entity_kill_jobs_cb).
>
> CPU1 is executing drm_sched_entity_kill_jobs_cb as a fence signalling
> callback, and therefore in interrupt context. It will deadlock when
> trying to grab the xa_lock, which is already held by CPU0.
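
For illustration, a minimal stand-alone sketch of the pattern lockdep is
warning about; deps, add_dep() and kill_cb() are invented names standing in
for job->dependencies and the scheduler paths mentioned above:

#include <linux/xarray.h>
#include <linux/dma-fence.h>
#include <linux/gfp.h>

static DEFINE_XARRAY_ALLOC(deps);	/* stands in for job->dependencies */

/* Process context, interrupts enabled: xa_alloc() takes xa_lock
 * internally without disabling interrupts, roughly what adding a
 * dependency does.
 */
static int add_dep(struct dma_fence *fence)
{
	u32 id;

	return xa_alloc(&deps, &id, fence, xa_limit_32b, GFP_KERNEL);
}

/* Fence signalling callback: may run in hard irq context with the
 * fence's lock held, and takes the same xa_lock through an xa_* call
 * (xa_erase() here as a stand-in).
 */
static void kill_cb(struct dma_fence *f, struct dma_fence_cb *cb)
{
	xa_erase(&deps, 0);
}
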
>
> Replacing all xa_* usages with their xa_*_irq counterparts would fix
> this issue, but Christian pointed out another one: dma_fence_signal
> takes fence.lock and so does dma_fence_add_callback.
>
> dma_fence_signal() // locks f1.lock
> -> drm_sched_entity_kill_jobs_cb()
> -> foreach dependencies
> -> dma_fence_add_callback() // locks f2.lock
>
> This will deadlock if f1 and f2 share the same spinlock.
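
For illustration, a minimal sketch (all names invented) of how two fences can,
in principle, end up sharing one spinlock: a driver may pass a single
per-device or per-context lock to dma_fence_init() for every fence it creates.

#include <linux/dma-fence.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(shared_fence_lock);

static const char *example_get_driver_name(struct dma_fence *f)
{
	return "example";
}

static const char *example_get_timeline_name(struct dma_fence *f)
{
	return "example";
}

static const struct dma_fence_ops example_fence_ops = {
	.get_driver_name = example_get_driver_name,
	.get_timeline_name = example_get_timeline_name,
};

static struct dma_fence f1, f2;

static void example_init(void)
{
	/* Different fence contexts, same lock. */
	dma_fence_init(&f1, &example_fence_ops, &shared_fence_lock,
		       dma_fence_context_alloc(1), 1);
	dma_fence_init(&f2, &example_fence_ops, &shared_fence_lock,
		       dma_fence_context_alloc(1), 1);

	/*
	 * dma_fence_signal(&f1) takes shared_fence_lock and invokes f1's
	 * callbacks while holding it; if a callback then calls
	 * dma_fence_add_callback(&f2, ...), which also takes
	 * shared_fence_lock, the spinlock is acquired recursively.
	 */
}
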
Is it possible to hit this case?

Same lock means same execution timeline, which should mean the dependency
should have been squashed in drm_sched_job_add_dependency() (roughly
sketched below), no?

Or would sharing the lock, but not sharing the entity->fence_context, be
considered legal? It would be surprising at least.

Also, would anyone have time to add a kunit test? ;)
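
For reference, the same-context squash I mean works roughly like this; a
simplified sketch from memory, not a verbatim copy of
drm_sched_job_add_dependency(), and add_dependency_sketch is an invented name:

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>
#include <linux/xarray.h>

/* A new dependency replaces an existing one from the same fence context
 * if it is later on that timeline, otherwise it is dropped; only
 * otherwise is a new xarray slot allocated.
 */
static int add_dependency_sketch(struct drm_sched_job *job,
				 struct dma_fence *fence)
{
	struct dma_fence *entry;
	unsigned long index;
	u32 id = 0;

	xa_for_each(&job->dependencies, index, entry) {
		if (entry->context != fence->context)
			continue;

		if (dma_fence_is_later(fence, entry)) {
			dma_fence_put(entry);
			xa_store(&job->dependencies, index, fence, GFP_KERNEL);
		} else {
			dma_fence_put(fence);
		}
		return 0;
	}

	return xa_alloc(&job->dependencies, &id, fence, xa_limit_32b,
			GFP_KERNEL);
}
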
Regards,
Tvrtko
> To fix both issues, the code that iterates over the dependencies and
> re-arms the callback is moved to drm_sched_entity_kill_jobs_work(),
> which runs in process context.
>
> Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908
> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@...il.com>
> Suggested-by: Christian König <christian.koenig@....com>
> Reviewed-by: Christian König <christian.koenig@....com>
> Signed-off-by: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@....com>
> ---
> drivers/gpu/drm/scheduler/sched_entity.c | 34 +++++++++++++-----------
> 1 file changed, 19 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index c8e949f4a568..fe174a4857be 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -173,26 +173,15 @@ int drm_sched_entity_error(struct drm_sched_entity *entity)
>  }
>  EXPORT_SYMBOL(drm_sched_entity_error);
>
> +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> +					  struct dma_fence_cb *cb);
> +
>  static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>  {
>  	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
> -
> -	drm_sched_fence_scheduled(job->s_fence, NULL);
> -	drm_sched_fence_finished(job->s_fence, -ESRCH);
> -	WARN_ON(job->s_fence->parent);
> -	job->sched->ops->free_job(job);
> -}
> -
> -/* Signal the scheduler finished fence when the entity in question is killed. */
> -static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> -					  struct dma_fence_cb *cb)
> -{
> -	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
> -						 finish_cb);
> +	struct dma_fence *f;
>  	unsigned long index;
>
> -	dma_fence_put(f);
> -
>  	/* Wait for all dependencies to avoid data corruptions */
>  	xa_for_each(&job->dependencies, index, f) {
>  		struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
> @@ -220,6 +209,21 @@ static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
>  		dma_fence_put(f);
>  	}
>
> +	drm_sched_fence_scheduled(job->s_fence, NULL);
> +	drm_sched_fence_finished(job->s_fence, -ESRCH);
> +	WARN_ON(job->s_fence->parent);
> +	job->sched->ops->free_job(job);
> +}
> +
> +/* Signal the scheduler finished fence when the entity in question is killed. */
> +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> +					  struct dma_fence_cb *cb)
> +{
> +	struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
> +						 finish_cb);
> +
> +	dma_fence_put(f);
> +
>  	INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
>  	schedule_work(&job->work);
>  }