Message-ID: <f064a8c305bd2f2c0684251d3cd2470699c28d5e.camel@redhat.com>
Date: Thu, 24 Jul 2025 17:07:11 +0200
From: Philipp Stanner <pstanner@...hat.com>
To: Philipp Stanner <phasta@...nel.org>, Maarten Lankhorst
 <maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>, 
 Thomas Zimmermann <tzimmermann@...e.de>, David Airlie <airlied@...il.com>,
 Simona Vetter <simona@...ll.ch>, Jonathan Corbet <corbet@....net>, Matthew
 Brost <matthew.brost@...el.com>, Danilo Krummrich <dakr@...nel.org>,
 Christian König <ckoenig.leichtzumerken@...il.com>,
 Sumit Semwal <sumit.semwal@...aro.org>
Cc: dri-devel@...ts.freedesktop.org, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org, linux-media@...r.kernel.org, Christian
 König
	 <christian.koenig@....com>
Subject: Re: [PATCH] drm/sched: Extend and update documentation

A few comments and example sketches from my side to open up room for
discussion:

On Thu, 2025-07-24 at 16:01 +0200, Philipp Stanner wrote:
> From: Philipp Stanner <pstanner@...hat.com>
> 
> The various objects used by the GPU scheduler and their memory
> lifetimes are currently not fully documented.
> 
> Add documentation describing the scheduler's objects. Improve the
> general documentation at a few other places.
> 
> Co-developed-by: Christian König <christian.koenig@....com>
> Signed-off-by: Christian König <christian.koenig@....com>
> Signed-off-by: Philipp Stanner <pstanner@...hat.com>
> ---
> The first draft of this documentation was posted by Christian in late 2023 IIRC.
> 
> This is an updated version. Please review.
> 
> @Christian: As we agreed months (a year?) ago I kept your
> Signed-off-by. Just tell me if there's any issue or sth.
> ---
>  Documentation/gpu/drm-mm.rst           |  36 ++++
>  drivers/gpu/drm/scheduler/sched_main.c | 228 ++++++++++++++++++++++---
>  include/drm/gpu_scheduler.h            |   5 +-
>  3 files changed, 238 insertions(+), 31 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index d55751cad67c..95ee95fd987a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -556,12 +556,48 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Overview
>  
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler and Run Queue Objects
> +
>  Flow Control
>  ------------
>  
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Flow Control
>  
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Error and Timeout handling
> +
>  Scheduler Function References
>  -----------------------------
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 5a550fd76bf0..2e7bc1e74186 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,48 +24,220 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> + * The GPU scheduler is shared infrastructure intended to help drivers manage
> + * command submission to their hardware.
>   *
> - * The organisation of the scheduler is the following:
> + * To do so, it offers a set of scheduling facilities that interact with the
> + * driver through callbacks which the latter can register.
>   *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - *    the hardware.
> + * In particular, the scheduler takes care of:
> + *   - Ordering command submissions
> + *   - Signaling dma_fences, e.g., for finished commands
> + *   - Taking dependencies between command submissions into account
> + *   - Handling timeouts for command submissions
>   *
> - * The jobs in an entity are always scheduled in the order in which they were pushed.
> + * All callbacks the driver needs to implement are restricted by dma_fence
> + * signaling rules to guarantee deadlock-free forward progress. This especially
> + * means that for normal operation no memory can be allocated in a callback.
> + * All memory which is needed for pushing the job to the hardware must be
> + * allocated before arming a job. It also means that no locks can be taken
> + * under which memory might be allocated.
>   *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * Optional memory, for example for device core dumping or debugging, *must* be
> + * allocated with GFP_NOWAIT and appropriate error handling if that allocation
> + * fails. GFP_ATOMIC should only be used if absolutely necessary since dipping
> + * into the special atomic reserves is usually not justified for a GPU driver.
> + *
> + * Note especially the following about the scheduler's historic background,
> + * which led to the double role it plays today:
> + *
> + * In classic setups ("hardware scheduling"), N entities share one scheduler,
> + * and the scheduler decides which entity's job to pick and move to the
> + * hardware ring next (that is: the actual "scheduling").
> + *
> + * Many (especially newer) GPUs, however, can have an almost arbitrary number
> + * of hardware rings and it's a firmware scheduler which actually decides which
> + * job will run next. In such setups, the GPU scheduler is still used (e.g., in
> + * Nouveau) but does not "schedule" jobs in the classical sense anymore. It
> + * merely serves to queue and dequeue jobs and resolve dependencies. In such a
> + * scenario, it is recommended to have one scheduler per entity.
> + */
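
A side thought, not necessarily for this patch: the allocation rules
above might be easier to digest with a tiny sketch showing what
"optional memory" looks like in practice. Everything prefixed my_* below
is made up, it's only meant to illustrate the idea:

	static void my_job_record_debug(struct my_job *job)
	{
		struct my_debug_snapshot *snap;

		/*
		 * Optional debug data: GFP_NOWAIT only, and it's fine to
		 * skip it entirely if the allocation fails. GFP_KERNEL is
		 * forbidden here, we're in dma_fence signaling context.
		 */
		snap = kzalloc(sizeof(*snap), GFP_NOWAIT);
		if (!snap)
			return;

		my_fill_debug_snapshot(job, snap);
		job->debug_snapshot = snap;
	}
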
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object (&struct drm_sched_job) contains submission dependencies
> + * in the form of &struct dma_fence objects. Drivers can also implement an
> + * optional prepare_job callback which returns additional dependencies as
> + * dma_fence objects. It's important to note that this callback can't allocate
> + * memory or grab locks under which memory is allocated.
> + *
> + * Drivers should use this as a base class for an object which contains the
> + * necessary state to push the command submission to the hardware.
> + *
> + * The lifetime of the job object needs to last at least from submitting it to
> + * the scheduler (through drm_sched_job_arm()) until the scheduler has invoked
> + * &struct drm_sched_backend_ops.free_job and, thereby, has indicated that it
> + * does not need the job anymore. Drivers can of course keep their job object
> + * alive for longer than that, but that's outside of the scope of the scheduler
> + * component.
> + *
> + * Job initialization is split into two stages:
> + *   1. drm_sched_job_init() which serves for basic preparation of a job.
> + *      This step has no irreversible consequences; its effects can be
> + *      reverted through drm_sched_job_cleanup().
> + *   2. drm_sched_job_arm() which irrevocably arms a job for execution. This
> + *      initializes the job's fences, and the job has to be submitted with
> + *      drm_sched_entity_push_job(). Once drm_sched_job_arm() has been called,
> + *      the job structure has to be valid until the scheduler has invoked
> + *      drm_sched_backend_ops.free_job().
> + *
> + * It's important to note that after arming a job drivers must follow the
> + * dma_fence rules and can't allocate memory or take locks under which
> + * memory is allocated.
> + */
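
Related thought: a small example of the two-stage init might help
readers see where the allocations and the point of no return sit. The
my_* types and helpers below are invented and the signatures are from
memory, so please treat it as a sketch only:

	static int my_submit(struct my_context *ctx, struct my_cmdbuf *cmds)
	{
		struct my_job *job;
		int ret;

		/* GFP_KERNEL is still fine here, nothing is armed yet. */
		job = kzalloc(sizeof(*job), GFP_KERNEL);
		if (!job)
			return -ENOMEM;

		ret = drm_sched_job_init(&job->base, &ctx->entity, 1, ctx);
		if (ret)
			goto err_free;

		/*
		 * Everything needed for pushing the job to the hardware,
		 * including its dependencies, must be set up before arming.
		 */
		ret = my_job_prepare(job, cmds);
		if (ret)
			goto err_cleanup;

		drm_sched_job_arm(&job->base);		/* point of no return */
		drm_sched_entity_push_job(&job->base);

		return 0;

	err_cleanup:
		drm_sched_job_cleanup(&job->base);	/* reverts _init() */
	err_free:
		kfree(job);
		return ret;
	}
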
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object (&struct drm_sched_entity) is a container for jobs which
> + * should execute sequentially. Drivers should create an entity for each
> + * individual context they maintain for command submissions which can run in
> + * parallel.
> + *
> + * The lifetime of the entity *should not* exceed the lifetime of the
> + * userspace process it was created for and drivers should call the
> + * drm_sched_entity_flush() function from their file_operations.flush()
> + * callback. It is possible that an entity object is not alive anymore
> + * while jobs previously fetched from it are still running on the hardware.
> + *
> + * This is done because all results of a command submission should become
> + * visible externally even after a process exits. This is normal POSIX
> + * behavior for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain executable
> + * shaders enabling processes to evade their termination by offloading work to
> + * the GPU. So when a process is terminated with a SIGKILL the entity object
> + * makes sure that jobs are freed without running them while still maintaining
> + * correct sequential order for signaling fences.
> + *
> + * All entities associated with a scheduler have to be torn down before that
> + * scheduler.
> + */
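
Possibly a two-liner showing the flush() hook would help here as well.
Again just a sketch with invented my_* names; the timeout value is
arbitrary:

	static int my_flush(struct file *file, fl_owner_t id)
	{
		struct drm_file *file_priv = file->private_data;
		struct my_file_priv *fpriv = file_priv->driver_priv;

		/*
		 * Wait (with a timeout) for queued jobs to be picked up. On
		 * SIGKILL the entity is killed instead and remaining jobs
		 * are only signaled, not run.
		 */
		drm_sched_entity_flush(&fpriv->entity, msecs_to_jiffies(5000));

		return 0;
	}
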
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a dma_fence provided by the driver through
> + * &struct drm_sched_backend_ops.run_job. The driver signals this fence once the
> + * hardware has completed the associated job.
> + *
> + * Drivers need to make sure that the normal dma_fence semantics are followed
> + * for this object. It's important to note that the memory for this object can
> + * *not* be allocated in &struct drm_sched_backend_ops.run_job since that would
> + * violate the requirements for the dma_fence implementation. The scheduler
> + * maintains a timeout handler which triggers if this fence doesn't signal
> + * within a configurable amount of time.
> + *
> + * The lifetime of this object follows dma_fence refcounting rules. The
> + * scheduler takes ownership of the reference returned by the driver and
> + * drops it when it's not needed any more.
> + *
> + * See &struct drm_sched_backend_ops.run_job for precise refcounting rules.
> + */
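
It might be worth showing that the hardware fence is allocated before
arming and that run_job() merely hands out a reference. Along these
lines (my_* invented, sketch only):

	static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
	{
		struct my_job *job = to_my_job(sched_job);

		/*
		 * job->hw_fence was allocated and initialized when the job
		 * was created, before drm_sched_job_arm(). No allocations
		 * here.
		 */
		my_ring_emit(job);

		/* The scheduler takes ownership of this reference. */
		return dma_fence_get(job->hw_fence);
	}
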
> +
> +/**
> + * DOC: Scheduler Fence Object
> + *
> + * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
> + * time from pushing the job into the scheduler until the hardware has finished
> + * processing it. It is managed by the scheduler. The implementation provides
> + * dma_fence interfaces for signaling both scheduling of a command submission
> + * as well as finishing of processing.
> + *
> + * The lifetime of this object also follows normal dma_fence refcounting rules.
> + */

The relic I'm most unsure about is this documentation for the scheduler
fence. I know that some drivers are accessing the s_fence, but I strongly
suspect that this is a) unnecessary and b) dangerous.

But the original draft from Christian hinted at that. So, @Christian,
this would be an opportunity to discuss this matter.

Otherwise I'd drop this documentation section in v2. What users don't
know, they cannot misuse.

> +
> +/**
> + * DOC: Scheduler and Run Queue Objects
> + *
> + * The scheduler object itself (&struct drm_gpu_scheduler) does the actual
> + * scheduling: it picks the next entity to run a job from and pushes that job
> + * onto the hardware. Both FIFO and RR selection algorithms are supported, with
> + * FIFO being the default and the recommended one.
> + *
> + * The lifetime of the scheduler is managed by the driver using it. Before
> + * destroying the scheduler the driver must ensure that all hardware processing
> + * involving this scheduler object has finished by calling, for example,
> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
> + * since this doesn't guarantee that all callback processing has finished.
> + *
> + * The run queue object (&struct drm_sched_rq) is a container for entities of a
> + * certain priority level. This object is internally managed by the scheduler
> + * and drivers must not touch it directly. The lifetime of a run queue is bound
> + * to the scheduler's lifetime.
> + *
> + * All entities associated with a scheduler must be torn down before it. Drivers
> + * should implement &struct drm_sched_backend_ops.cancel_job to prevent pending
> + * jobs (those that were pulled from an entity into the scheduler, but have not
> + * been completed by the hardware yet) from leaking.
>   */
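
For the teardown ordering a short sketch might save driver authors some
head-scratching. drm_sched_entity_destroy() and drm_sched_fini() are the
real interfaces, the rest is made up:

	static void my_device_fini(struct my_device *mydev)
	{
		struct my_context *ctx;

		/* 1. Entities first -- they must not outlive the scheduler. */
		list_for_each_entry(ctx, &mydev->contexts, node)
			drm_sched_entity_destroy(&ctx->entity);

		/*
		 * 2. Then the scheduler itself. With a cancel_job callback
		 *    implemented, still-pending jobs get canceled and freed
		 *    here instead of leaking.
		 */
		drm_sched_fini(&mydev->sched);
	}
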
>  
>  /**
>   * DOC: Flow Control
>   *
>   * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
> - * in which the jobs fetched from scheduler entities are executed.
> + * at which jobs fetched from scheduler entities are executed.
>   *
> - * In this context the &drm_gpu_scheduler keeps track of a driver specified
> - * credit limit representing the capacity of this scheduler and a credit count;
> - * every &drm_sched_job carries a driver specified number of credits.
> + * In this context the &struct drm_gpu_scheduler keeps track of a driver
> + * specified credit limit representing the capacity of this scheduler and a
> + * credit count; every &struct drm_sched_job carries a driver-specified number
> + * of credits.
>   *
> - * Once a job is executed (but not yet finished), the job's credits contribute
> - * to the scheduler's credit count until the job is finished. If by executing
> - * one more job the scheduler's credit count would exceed the scheduler's
> - * credit limit, the job won't be executed. Instead, the scheduler will wait
> - * until the credit count has decreased enough to not overflow its credit limit.
> - * This implies waiting for previously executed jobs.
> + * Once a job is being executed, the job's credits contribute to the
> + * scheduler's credit count until the job is finished. If by executing one more
> + * job the scheduler's credit count would exceed the scheduler's credit limit,
> + * the job won't be executed. Instead, the scheduler will wait until the credit
> + * count has decreased enough to not overflow its credit limit. This implies
> + * waiting for previously executed jobs.
>   */
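
Maybe a small numeric example wouldn't hurt here: with a credit_limit of
10 and a 4-credit job in flight, an 8-credit job is held back until the
first one finishes and its 4 credits are returned. The per-job number is
what drivers pass via the credits argument of drm_sched_job_init().
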
>  
> +/**
> + * DOC: Error and Timeout handling
> + *
> + * Errors are signaled by using dma_fence_set_error() on the hardware fence
> + * object before signaling it with dma_fence_signal(). Errors are then bubbled
> + * up from the hardware fence to the scheduler fence.
> + *
> + * The entity allows querying the error of the last run submission using the
> + * drm_sched_entity_error() function. This can be used to cancel queued
> + * submissions in &struct drm_sched_backend_ops.run_job, as well as to prevent
> + * pushing further ones into the entity in the driver's submission function.
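
If we want, the error flow could also be illustrated with a few lines.
Sketch only, my_* is invented:

	/* Hardware completion path, e.g. from the IRQ handler: */
	static void my_job_done(struct my_job *job, bool hw_error)
	{
		if (hw_error)
			dma_fence_set_error(job->hw_fence, -EIO);

		dma_fence_signal(job->hw_fence);
	}

	/* Driver's submission path: refuse new work on a broken entity. */
	static int my_submit_precheck(struct my_context *ctx)
	{
		return drm_sched_entity_error(&ctx->entity);
	}
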
> + *
> + * When the hardware fence doesn't signal within a configurable amount of time,
> + * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
> + * then follow the procedure described in that callback's documentation.
> + *
> + * (TODO: The timeout handler should probably switch to using the hardware
> + * fence as parameter instead of the job. Otherwise the handling will always
> + * race between timing out and signaling the fence).

This TODO can probably be removed, too. The recently merged
DRM_GPU_SCHED_STAT_NO_HANG has solved this issue.


P.

> + *
> + * The scheduler also used to provide functionality for re-submitting jobs,
> + * thereby replacing the hardware fence during reset handling. This
> + * functionality is now deprecated: it has proven to be fundamentally racy
> + * and incompatible with the dma_fence rules, and shouldn't be used in new code.
> + *
> + * Additionally, there is the function drm_sched_increase_karma() which tries
> + * to find the entity which submitted a job and increases its 'karma' atomic
> + * variable to prevent resubmitting jobs from this entity. This has quite some
> + * overhead and resubmitting jobs is now marked as deprecated. Thus, using this
> + * function is discouraged.
> + *
> + * Drivers can still recreate the GPU state in case it should be lost during
> + * timeout handling *if* they can guarantee that forward progress will be made
> + * and this doesn't cause another timeout. But this is strongly hardware
> + * specific and out of the scope of the general GPU scheduler.
> + */
>  #include <linux/export.h>
>  #include <linux/wait.h>
>  #include <linux/sched.h>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 323a505e6e6a..0f0687b7ae9c 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -458,8 +458,8 @@ struct drm_sched_backend_ops {
>  	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>  
>  	/**
> -	 * @timedout_job: Called when a job has taken too long to execute,
> -	 * to trigger GPU recovery.
> +	 * @timedout_job: Called when a hardware fence didn't signal within a
> +	 * configurable amount of time. Triggers GPU recovery.
>  	 *
>  	 * @sched_job: The job that has timed out
>  	 *
> @@ -506,7 +506,6 @@ struct drm_sched_backend_ops {
>  	 * that timeout handlers are executed sequentially.
>  	 *
>  	 * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
> -	 *
>  	 */
>  	enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
>  

