linux-kernel - Re: [PATCH v1] drm/sched: fix deadlock in drm_sched_entity_kill_jobs

Message-ID: <c51ea5a408ca6d404074be1df219077457ea76f6.camel@mailbox.org>
Date: Thu, 30 Oct 2025 13:26:10 +0100
From: Philipp Stanner <phasta@...lbox.org>
To: Pierre-Eric Pelloux-Prayer <pierre-eric@...sy.net>, phasta@...nel.org, 
 Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@....com>, Matthew
 Brost <matthew.brost@...el.com>, Danilo Krummrich <dakr@...nel.org>,
 Christian König <ckoenig.leichtzumerken@...il.com>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>, Maxime Ripard
 <mripard@...nel.org>,  Thomas Zimmermann <tzimmermann@...e.de>, David
 Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>, Sumit Semwal
 <sumit.semwal@...aro.org>
Cc: Christian König <christian.koenig@....com>, 
	dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org, 
	linux-media@...r.kernel.org, linaro-mm-sig@...ts.linaro.org
Subject: Re: [PATCH v1] drm/sched: fix deadlock in
 drm_sched_entity_kill_jobs_cb

On Thu, 2025-10-30 at 13:06 +0100, Pierre-Eric Pelloux-Prayer wrote:
> 
> 
> Le 30/10/2025 à 12:17, Philipp Stanner a écrit :
> > On Wed, 2025-10-29 at 10:11 +0100, Pierre-Eric Pelloux-Prayer wrote:
> > > https://gitlab.freedesktop.org/mesa/mesa/-/issues/13908 pointed out
> > 
> > This link should be moved to the tag section at the bottom at a Closes:
> > tag. Optionally a Reported-by:, too.
> 
> The bug report is about a different issue. The potential deadlock being fixed by 
> this patch was discovered while investigating it.
> I'll add a Reported-by tag though.
> 
> > 
> > > a possible deadlock:
> > > 
> > > [ 1231.611031]  Possible interrupt unsafe locking scenario:
> > > 
> > > [ 1231.611033]        CPU0                    CPU1
> > > [ 1231.611034]        ----                    ----
> > > [ 1231.611035]   lock(&xa->xa_lock#17);
> > > [ 1231.611038]                                local_irq_disable();
> > > [ 1231.611039]                                lock(&fence->lock);
> > > [ 1231.611041]                                lock(&xa->xa_lock#17);
> > > [ 1231.611044]   <Interrupt>
> > > [ 1231.611045]     lock(&fence->lock);
> > > [ 1231.611047]
> > >                  *** DEADLOCK ***
> > > 
> > 
> > The commit message is lacking an explanation as to _how_ and _when_ the
> > deadlock comes to be. That's a prerequisite for understanding why the
> > below is the proper fix and solution.
> 
> I copy-pasted a small chunk of the full deadlock analysis report included in the 
> ticket because it's 300+ lines long. Copying the full log isn't useful IMO, but 
> I can add more context.

The full log wouldn't be useful, but a human-written explanation like the
one you give below would be.

> 
> The problem is that a thread (CPU0 above) can lock the job's dependencies 
> xa_array without disabling the interrupts.

Which drm_sched function would that be?

> If a fence signals while CPU0 holds this lock and drm_sched_entity_kill_jobs_cb 
> is called, it will try to grab the xa_array lock which is not possible because 
> CPU0 holds it already.

You mean an *interrupt* signals the fence? Shouldn't interrupt issues
be solved with spin_lock_irqsave()? Or can't we have that because
it's the xarray taking its lock internally?

You don't have to explain that in this mail thread; a v2 detailing that
would be sufficient.
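
For anyone following along, the rule lockdep is complaining about in the
report quoted above can be modelled in a few lines of plain userspace C.
This is only a toy sketch, not kernel code and not the proposed fix: the
struct and both functions (toy_lock, toy_unsafe_dependency,
toy_report_scenario) are hypothetical names made up for illustration. It
assumes, as the splat shows, that fence->lock is taken from interrupt
context and that the callback takes the xarray lock while holding it.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model of lockdep's irq-safety bookkeeping; not kernel code.
 * Every name in this block is hypothetical. */
struct toy_lock {
	bool used_in_hardirq;    /* ever acquired from hard-IRQ context */
	bool held_with_irqs_on;  /* ever acquired with interrupts enabled */
};

/* Taking 'inner' while holding 'outer' is unsafe if 'outer' is also taken
 * from an interrupt handler and 'inner' can be held with interrupts enabled:
 * the handler can interrupt the holder of 'inner' and spin on 'outer', which
 * another CPU holds while it waits for 'inner'. */
static bool toy_unsafe_dependency(const struct toy_lock *outer,
				  const struct toy_lock *inner)
{
	assert(outer && inner);
	return outer->used_in_hardirq && inner->held_with_irqs_on;
}

/* Scenario from the report: fence->lock is taken in IRQ context and the
 * callback takes xa_lock under it. The parameter says whether the
 * process-context user of xa_lock disables interrupts first. */
static bool toy_report_scenario(bool xa_user_disables_irqs)
{
	struct toy_lock fence_lock = { .used_in_hardirq = true };
	struct toy_lock xa_lock = { .held_with_irqs_on = !xa_user_disables_irqs };

	return toy_unsafe_dependency(&fence_lock, &xa_lock);
}
```

With these assumptions, toy_report_scenario(false) flags the CPU0/CPU1
pattern from the report, and toy_report_scenario(true) does not, because
the interrupt can no longer arrive while the xarray lock is held.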

> 
> 
> > 
> > The issue seems to be that you cannot perform certain tasks from within
> > that work item?

[…]

> > 
> > > +static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> > > +					  struct dma_fence_cb *cb);
> > > +
> > >   static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
> > >   {
> > >   	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
> > > -
> > > -	drm_sched_fence_scheduled(job->s_fence, NULL);
> > > -	drm_sched_fence_finished(job->s_fence, -ESRCH);
> > > -	WARN_ON(job->s_fence->parent);
> > > -	job->sched->ops->free_job(job);
> > 
> > Can free_job() really not be called from within work item context?
> 
> It's still called from drm_sched_entity_kill_jobs_work but the diff is slightly 
> confusing.

OK, probably my bad. But just asking, do you use
git format-patch --histogram
?

histogram often produces better diffs than default.


P.