Message-ID: <aFFy5aG1eOeMU44S@phenom.ffwll.local>
Date: Tue, 17 Jun 2025 15:51:33 +0200
From: Simona Vetter <simona.vetter@...ll.ch>
To: Philipp Stanner <phasta@...nel.org>
Cc: Matthew Brost <matthew.brost@...el.com>,
Danilo Krummrich <dakr@...nel.org>,
Christian König <ckoenig.leichtzumerken@...il.com>,
David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
Sumit Semwal <sumit.semwal@...aro.org>,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-media@...r.kernel.org
Subject: Re: [PATCH v2] drm/sched: Clarify scenarios for separate workqueues
On Thu, Jun 12, 2025 at 04:49:54PM +0200, Philipp Stanner wrote:
> struct drm_sched_init_args provides the possibility of letting the
> scheduler use user-controlled workqueues, instead of the scheduler
> creating its own workqueues. It's currently not documented who would
> want to use that.
>
> Not sharing the submit_wq between driver and scheduler has the advantage
> that no negative interference between them can occur (e.g., MMU notifier
> callbacks waiting for fences to get signaled).
>
> Add a new docu section for concurrency in drm/sched.
>
> Discourage the use of driver-provided workqueues in the documentation.
> Document when using them can make sense. Warn about pitfalls.
>
> Co-authored-by: Danilo Krummrich <dakr@...nel.org>
> Signed-off-by: Philipp Stanner <phasta@...nel.org>
> ---
> Changes in v2:
> - Add new docu section for concurrency in the scheduler. (Sima)
> - Document what an ordered workqueue passed to the scheduler can be
> useful for. (Christian, Sima)
> - Warn more detailed about potential deadlocks. (Danilo)
> ---
> include/drm/gpu_scheduler.h | 54 ++++++++++++++++++++++++++++++++++---
> 1 file changed, 51 insertions(+), 3 deletions(-)
>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 81dcbfc8c223..00c528e62485 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -21,6 +21,49 @@
> *
> */
>
> +/**
> + * DOC: Concurrency in drm/sched
You need to explicitly pull this into the .rst files somewhere suitable,
otherwise it won't show up in the docs. With that, please also check that
the hyperlinks all work.
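E.g. something like the below next to the existing gpu_scheduler.h pulls
(I haven't checked which .rst file fits best, so treat the location as a
placeholder):

  .. kernel-doc:: include/drm/gpu_scheduler.h
     :doc: Concurrency in drm/sched
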
> + *
> + * The DRM GPU Scheduler interacts with drivers through the callbacks defined in
> + * &struct drm_sched_backend_ops. These callbacks can be invoked by the scheduler
> + * at any point in time, unless explicitly stated otherwise in the callback
> + * documentation.
> + *
> + * For most use cases, passing the recommended default parameters in &struct
> + * drm_sched_init_args is sufficient. There are some special circumstances,
> + * though:
> + *
> + * For timeout handling, it makes a lot of sense for drivers with HARDWARE
> + * scheduling to have the timeouts (e.g., for different hardware rings) occur
> + * sequentially, for example to allow for detecting which job timed out
> + * first and, therefore, caused the hang. This determines which &struct
> + * drm_sched_entity has to be killed and which entities' jobs must be
> + * resubmitted after a GPU reset.
> + *
> + * This can be achieved by passing an ordered workqueue in &struct
> + * drm_sched_init_args.timeout_wq. Also take a look at the documentation of
> + * &struct drm_sched_backend_ops.timedout_job.
So the main use-case for this is when the reset domain is bigger than a
single engine, and you don't want to deal with the mess of inventing your
own synchronization primitives.
I'm not exactly clear what hardware scheduling scenario you're thinking of
here that needs an ordered timeout queue that's shared across instances.
Note that an ordered timeout queue for a single scheduler is fairly
pointless, because we only ever have one pending timeout handler (at least
I thought that's how it works).
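Just to illustrate the reset domain case, very roughly (all the mydrv_*
names and structs below are made up, and the args values are placeholders):

	static int mydrv_init_scheds(struct mydrv_device *mydrv)
	{
		struct workqueue_struct *reset_wq;
		unsigned int i;
		int ret;

		/*
		 * All rings share one reset domain, so one ordered wq
		 * serializes the timeout handlers of all scheduler
		 * instances. Whichever handler runs first can own the
		 * full reset without extra locking against the others.
		 */
		reset_wq = alloc_ordered_workqueue("mydrv-reset", 0);
		if (!reset_wq)
			return -ENOMEM;
		mydrv->reset_wq = reset_wq;

		for (i = 0; i < MYDRV_NUM_RINGS; i++) {
			const struct drm_sched_init_args args = {
				.ops = &mydrv_sched_ops,
				.submit_wq = NULL, /* scheduler allocates its own */
				.num_rqs = DRM_SCHED_PRIORITY_COUNT,
				.credit_limit = 64,
				.hang_limit = 0,
				.timeout = msecs_to_jiffies(500),
				.timeout_wq = reset_wq, /* shared across rings */
				.name = "mydrv-ring",
				.dev = mydrv->drm.dev,
			};

			ret = drm_sched_init(&mydrv->rings[i].sched, &args);
			if (ret)
				return ret;
		}

		return 0;
	}
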
> + * Furthermore, for drivers with hardware that supports FIRMWARE scheduling,
I'm not sure you need a firmware scheduler to make this useful, just that
firmware scheduler submit rings tend to be one case where this is a good
locking design. I'd put this at the end as a concrete example instead and
explain why (you already have a single submit ringbuffer, so parallelism in
the kernel serves no purpose aside from making locking more complex).
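Concretely, for a firmware ring I'd expect the init path to look roughly
like this (mydrv_* is again made up, and the remaining args fields are
filled in as in the sketch further up):

	/* One scheduler instance and one ordered wq per firmware ring. */
	ring->wq = alloc_ordered_workqueue("mydrv-ring%u", 0, ring->id);
	if (!ring->wq)
		return -ENOMEM;

	args.submit_wq = ring->wq;
	args.timeout_wq = ring->wq;
	ret = drm_sched_init(&ring->sched, &args);
	if (ret)
		return ret;

	/*
	 * Driver work that touches the same per-ring state goes onto the
	 * same ordered wq, so it is serialized against run_job, free_job
	 * and timedout_job without taking any lock.
	 */
	INIT_WORK(&ring->preempt_work, mydrv_preempt_worker);
	queue_work(ring->wq, &ring->preempt_work);
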
> + * the driver design can be simplified a bit by providing one ordered workqueue
> + * for both &struct drm_sched_init_args.timeout_wq and
> + * &struct drm_sched_init_args.submit_wq. The reason is that firmware
> + * schedulers should always have one scheduler instance per firmware runqueue
> + * and one entity per scheduler instance. If that scheduler instance then
> + * shares one ordered workqueue with the driver, locking can be very
> + * lightweight or dropped altogether.
> + *
> + * NOTE that sharing &struct drm_sched_init_args.submit_wq with the driver
> + * theoretically can deadlock. It must be guaranteed that submit_wq never has
> + * more than max_active - 1 active tasks, or if max_active tasks are reached at
> + * least one of them does not execute operations that may block on dma_fences
> + * that potentially make progress through this scheduler instance. Otherwise,
> + * it is possible that all max_active tasks end up waiting on a dma_fence (that
> + * can only make progress through this scheduler instance), while the
> + * scheduler's queued work waits for at least one of the max_active tasks to
> + * finish. Thus, this can result in a deadlock.
Uh, if you have an ordered wq you deadlock with just one misuse. I'd just
explain that the wq must provide sufficient forward-progress guarantees
for the scheduler, specifically that it's on the dma_fence signalling
critical path, and leave the concrete examples for people to figure out
when they design a specific locking scheme.
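I.e. with an ordered wq shared as submit_wq the classic way to blow up
looks like this (hypothetical mydrv_* names again, with ring->job_fence
standing in for a fence that only signals through this scheduler
instance):

	static void mydrv_bad_worker(struct work_struct *work)
	{
		struct mydrv_ring *ring =
			container_of(work, struct mydrv_ring, bad_work);

		/*
		 * This fence only signals once the scheduler's run_job
		 * work has executed on the same ordered wq. That work is
		 * queued behind this one, so the wait never finishes:
		 * deadlock, and a dma_fence signalling critical path
		 * violation.
		 */
		dma_fence_wait(ring->job_fence, false);
	}

	/* ...queued elsewhere on the wq that was passed as submit_wq: */
	queue_work(ring->wq, &ring->bad_work);
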
> + */
> +
> #ifndef _DRM_GPU_SCHEDULER_H_
> #define _DRM_GPU_SCHEDULER_H_
>
> @@ -499,7 +542,7 @@ struct drm_sched_backend_ops {
> * timeout handlers of different schedulers. One way to achieve this
> * synchronization is to create an ordered workqueue (using
> * alloc_ordered_workqueue()) at the driver level, and pass this queue
> - * as drm_sched_init()'s @timeout_wq parameter. This will guarantee
> + * in &struct drm_sched_init_args.timeout_wq. This will guarantee
> * that timeout handlers are executed sequentially.
> *
> * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
> @@ -590,14 +633,19 @@ struct drm_gpu_scheduler {
> *
> * @ops: backend operations provided by the driver
> * @submit_wq: workqueue to use for submission. If NULL, an ordered wq is
> - * allocated and used.
> + * allocated and used. It is recommended to pass NULL unless there
> + * is a good reason not to. For details, see &DOC: Concurrency in
> + * drm/sched.
Does this really link?
> * @num_rqs: Number of run-queues. This may be at most DRM_SCHED_PRIORITY_COUNT,
> * as there's usually one run-queue per priority, but may be less.
> * @credit_limit: the number of credits this scheduler can hold from all jobs
> * @hang_limit: number of times to allow a job to hang before dropping it.
> * This mechanism is DEPRECATED. Set it to 0.
> * @timeout: timeout value in jiffies for submitted jobs.
> - * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is used.
> + * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
> + * used. An ordered workqueue could be passed to achieve timeout
> + * synchronization. See &DOC: Concurrency in drm/sched and &struct
Same question about linking ...
Might just put a note in the text section below about concurrency design
questions instead of trying to squeeze these in here.
Cheers, Sima
> + * drm_sched_backend_ops.timedout_job.
> * @score: score atomic shared with other schedulers. May be NULL.
> * @name: name (typically the driver's name). Used for debugging
> * @dev: associated device. Used for debugging
> --
> 2.49.0
>
--
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch