Message-ID: <aFGFCc7eiZJM8RKM@pollux>
Date: Tue, 17 Jun 2025 17:08:57 +0200
From: Danilo Krummrich <dakr@...nel.org>
To: Philipp Stanner <phasta@...nel.org>,
	Matthew Brost <matthew.brost@...el.com>,
	Christian König <ckoenig.leichtzumerken@...il.com>,
	David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
	Sumit Semwal <sumit.semwal@...aro.org>,
	dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
	linux-media@...r.kernel.org
Subject: Re: [PATCH v2] drm/sched: Clarify scenarios for separate workqueues

On Tue, Jun 17, 2025 at 04:25:09PM +0200, Simona Vetter wrote:
> On Tue, Jun 17, 2025 at 04:10:40PM +0200, Danilo Krummrich wrote:
> > On Tue, Jun 17, 2025 at 03:51:33PM +0200, Simona Vetter wrote:
> > > On Thu, Jun 12, 2025 at 04:49:54PM +0200, Philipp Stanner wrote:
> > > > + * NOTE that sharing &struct drm_sched_init_args.submit_wq with the driver
> > > > + * theoretically can deadlock. It must be guaranteed that submit_wq never has
> > > > + * more than max_active - 1 active tasks, or if max_active tasks are reached at
> > > > + * least one of them does not execute operations that may block on dma_fences
> > > > + * that potentially make progress through this scheduler instance. Otherwise,
> > > > + * it is possible that all max_active tasks end up waiting on a dma_fence (that
> > > > + * can only make progress through this scheduler instance), while the
> > > > + * scheduler's queued work waits for at least one of the max_active tasks to
> > > > + * finish. Thus, this can result in a deadlock.
> > > 
> > > Uh if you have an ordered wq you deadlock with just one misuse. I'd just
> > > explain that the wq must provide sufficient forward-progress guarantees
> > > for the scheduler, specifically that it's on the dma_fence signalling
> > > critical path and leave the concrete examples for people to figure out
> > > when they design a specific locking scheme.
> > 
> > This isn't a concrete example, is it? It's exactly what you say in slightly
> > different words, with the addition of highlighting the impact of the workqueue's
> > max_active configuration.
> > 
> > I think that's relevant, because N - 1 active tasks can be on the dma_fence
> > signalling critical path without issues.
> > 
> > We could change
> > 
> > 	"if max_active tasks are reached at least one of them must not execute
> > 	 operations that may block on dma_fences that potentially make progress
> > 	 through this scheduler instance"
> > 
> > to 
> > 
> > 	"if max_active tasks are reached at least one of them must not be on the
> > 	 dma_fence signalling critical path"
> > 
> > which is a bit more to the point I think.
> 
> My point was more to state that the wq must be suitable for the scheduler
> jobs as the general issue, and then specifically also highlight the
> dma_fence concurrency issue.

Sure, there are more guarantees the driver has to uphold, but this is one of
them, so I think we should explain it.
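
To illustrate the direct case with a purely hypothetical driver snippet (all
names below are made up, this is just a sketch of the scenario the text
describes): an ordered workqueue is shared with the scheduler via
&struct drm_sched_init_args.submit_wq, and the driver queues its own work on
it that blocks on a fence of this scheduler.

	#include <linux/dma-fence.h>
	#include <linux/workqueue.h>

	struct my_driver {
		struct workqueue_struct *shared_wq; /* also passed as submit_wq */
		struct work_struct drv_work;
		struct dma_fence *job_fence; /* fence of a job on this scheduler */
	};

	static void my_drv_work_fn(struct work_struct *w)
	{
		struct my_driver *drv = container_of(w, struct my_driver, drv_work);

		/*
		 * BAD: this work occupies the only slot of the ordered wq and
		 * blocks on a fence that can only signal once the scheduler's
		 * own work -- queued on the same wq -- has run. That work
		 * never runs, hence deadlock.
		 */
		dma_fence_wait(drv->job_fence, false);
	}

	static int my_driver_setup(struct my_driver *drv)
	{
		drv->shared_wq = alloc_ordered_workqueue("my-shared-wq", 0);
		if (!drv->shared_wq)
			return -ENOMEM;

		/* drv->shared_wq is also passed as drm_sched_init_args.submit_wq */

		INIT_WORK(&drv->drv_work, my_drv_work_fn);
		queue_work(drv->shared_wq, &drv->drv_work);

		return 0;
	}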

> But it's not the only one; you can have driver locks and other fun involved
> here too.

Yeah, but it boils down to the same issue, e.g. if a driver takes a lock in
active work, and this lock is taken elsewhere for activities that violate the
dma_fence signalling critical path.

All the cases I have in mind boil down to that we potentially, either directly
or indirectly (through some synchronization primitive), wait for things we are
not allowed to wait for in the dma_fence signalling critical path.
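
A hypothetical sketch of the indirect (lock) case, annotated with the existing
dma_fence_begin_signalling()/dma_fence_end_signalling() lockdep helpers (again,
all driver names are made up):

	#include <linux/dma-fence.h>
	#include <linux/mutex.h>
	#include <linux/workqueue.h>

	struct my_driver {
		struct mutex lock;
		struct work_struct drv_work;
	};

	/* Runs on the shared submit_wq, i.e. on the signalling critical path. */
	static void my_drv_work_fn(struct work_struct *w)
	{
		struct my_driver *drv = container_of(w, struct my_driver, drv_work);
		bool cookie = dma_fence_begin_signalling();

		mutex_lock(&drv->lock);
		/* ... */
		mutex_unlock(&drv->lock);

		dma_fence_end_signalling(cookie);
	}

	/* Some other driver path, e.g. an ioctl. */
	static void my_drv_other_path(struct my_driver *drv, struct dma_fence *fence)
	{
		mutex_lock(&drv->lock);
		/*
		 * BAD: waiting on a fence (or doing anything else that is
		 * forbidden on the signalling critical path) while holding a
		 * lock the submit_wq work also takes. The wait can only
		 * finish once the scheduler makes progress, which requires
		 * that work to run.
		 */
		dma_fence_wait(fence, false);
		mutex_unlock(&drv->lock);
	}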

Or do you mean something different?

> Also since all the paragraphs above talk about ordered wq as the example
> where specifying your own wq makes sense, it's a bit confusing to now
> suddenly only talk about the concurrent wq case without again mentioning
> that the ordered wq case is really limited.

I mean, it talks about both cases in a generic way, i.e. if you plug
max_active == 1 into the text, it covers the ordered case.

Or do you mean to say that we should *only* allow ordered workqueues to be
shared with the driver?
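
For the concurrent case, the rule spelled out with hypothetical numbers (just
to make the max_active argument concrete, nothing here is prescriptive):

	static struct workqueue_struct *my_driver_create_submit_wq(void)
	{
		/*
		 * Hypothetical: with max_active == 4, at most 3 driver work
		 * items may be on the dma_fence signalling critical path
		 * (i.e. block on fences that make progress through this
		 * scheduler); at least one slot must remain for work that
		 * does not, so the scheduler's own work can eventually run.
		 * With an ordered wq (max_active == 1) that means the driver
		 * must not queue any such work at all.
		 */
		return alloc_workqueue("my-shared-submit-wq", 0, 4);
	}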
