Message-ID: <20250724140121.70873-2-phasta@kernel.org>
Date: Thu, 24 Jul 2025 16:01:22 +0200
From: Philipp Stanner <phasta@...nel.org>
To: Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
	Maxime Ripard <mripard@...nel.org>,
	Thomas Zimmermann <tzimmermann@...e.de>,
	David Airlie <airlied@...il.com>,
	Simona Vetter <simona@...ll.ch>,
	Jonathan Corbet <corbet@....net>,
	Matthew Brost <matthew.brost@...el.com>,
	Danilo Krummrich <dakr@...nel.org>,
	Philipp Stanner <phasta@...nel.org>,
	Christian König <ckoenig.leichtzumerken@...il.com>,
	Sumit Semwal <sumit.semwal@...aro.org>
Cc: dri-devel@...ts.freedesktop.org,
	linux-doc@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	linux-media@...r.kernel.org,
	Philipp Stanner <pstanner@...hat.com>,
	Christian König <christian.koenig@....com>
Subject: [PATCH] drm/sched: Extend and update documentation

From: Philipp Stanner <pstanner@...hat.com>

The various objects used by the GPU scheduler and their memory lifetimes
are currently not fully documented.

Add documentation describing the scheduler's objects. Improve the
general documentation in a few other places.

Co-developed-by: Christian König <christian.koenig@....com>
Signed-off-by: Christian König <christian.koenig@....com>
Signed-off-by: Philipp Stanner <pstanner@...hat.com>
---
The first draft for this docu was posted by Christian in late 2023 IIRC.

This is an updated version. Please review.

@Christian: As we agreed months (a year?) ago, I kept your Signed-off-by.
Just tell me if there's any issue or something.
---
 Documentation/gpu/drm-mm.rst           |  36 ++++
 drivers/gpu/drm/scheduler/sched_main.c | 228 ++++++++++++++++++++++---
 include/drm/gpu_scheduler.h            |   5 +-
 3 files changed, 238 insertions(+), 31 deletions(-)

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index d55751cad67c..95ee95fd987a 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -556,12 +556,48 @@ Overview
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
    :doc: Overview
 
+Job Object
+----------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Job Object
+
+Entity Object
+-------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Entity Object
+
+Hardware Fence Object
+---------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Hardware Fence Object
+
+Scheduler Fence Object
+----------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Scheduler Fence Object
+
+Scheduler and Run Queue Objects
+-------------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Scheduler and Run Queue Objects
+
 Flow Control
 ------------
 
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
    :doc: Flow Control
 
+Error and Timeout handling
+--------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Error and Timeout handling
+
 Scheduler Function References
 -----------------------------
 
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 5a550fd76bf0..2e7bc1e74186 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -24,48 +24,220 @@
 /**
  * DOC: Overview
  *
- * The GPU scheduler provides entities which allow userspace to push jobs
- * into software queues which are then scheduled on a hardware run queue.
- * The software queues have a priority among them. The scheduler selects the entities
- * from the run queue using a FIFO. The scheduler provides dependency handling
- * features among jobs. The driver is supposed to provide callback functions for
- * backend operations to the scheduler like submitting a job to hardware run queue,
- * returning the dependencies of a job etc.
+ * The GPU scheduler is shared infrastructure intended to help drivers manage
+ * command submission to their hardware.
  *
- * The organisation of the scheduler is the following:
+ * To do so, it offers a set of scheduling facilities that interact with the
+ * driver through callbacks which the latter can register.
  *
- * 1. Each hw run queue has one scheduler
- * 2. Each scheduler has multiple run queues with different priorities
- *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
- * 3. Each scheduler run queue has a queue of entities to schedule
- * 4. Entities themselves maintain a queue of jobs that will be scheduled on
- *    the hardware.
+ * In particular, the scheduler takes care of:
+ *   - Ordering command submissions
+ *   - Signalling dma_fences, e.g., for finished commands
+ *   - Taking dependencies between command submissions into account
+ *   - Handling timeouts for command submissions
  *
- * The jobs in an entity are always scheduled in the order in which they were pushed.
+ * All callbacks the driver needs to implement are restricted by dma_fence
+ * signaling rules to guarantee deadlock free forward progress. This especially
+ * means that for normal operation no memory can be allocated in a callback.
+ * All memory which is needed for pushing the job to the hardware must be
+ * allocated before arming a job. It also means that no locks can be taken
+ * under which memory might be allocated.
  *
- * Note that once a job was taken from the entities queue and pushed to the
- * hardware, i.e. the pending queue, the entity must not be referenced anymore
- * through the jobs entity pointer.
+ * Optional memory, for example for device core dumping or debugging, *must* be
+ * allocated with GFP_NOWAIT and appropriate error handling if that allocation
+ * fails. GFP_ATOMIC should only be used if absolutely necessary since dipping
+ * into the special atomic reserves is usually not justified for a GPU driver.
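+ *
+ * A minimal sketch of such an optional allocation; struct my_debug_info and
+ * my_fill_debug_info() are hypothetical driver code, not scheduler API::
+ *
+ *        // Inside a fence signaling critical section, e.g., run_job().
+ *        struct my_debug_info *info = kzalloc(sizeof(*info), GFP_NOWAIT);
+ *
+ *        if (info)
+ *                my_fill_debug_info(info, job);
+ *        // If the allocation failed, just skip the optional debug data.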
+ *
+ * Note especially the following about the scheduler's historic background that
+ * led to the double role it plays today:
+ *
+ * In classic setups ("hardware scheduling"), N entities share one scheduler,
+ * and the scheduler decides which job to pick from which entity and to move it
+ * to the hardware ring next (that is: "scheduling").
+ *
+ * Many (especially newer) GPUs, however, can have an almost arbitrary number
+ * of hardware rings and it's a firmware scheduler which actually decides which
+ * job will run next. In such setups, the GPU scheduler is still used (e.g., in
+ * Nouveau) but does not "schedule" jobs in the classical sense anymore. It
+ * merely serves to queue and dequeue jobs and resolve dependencies. In such a
+ * scenario, it is recommended to have one scheduler per entity.
+ */
+
+/**
+ * DOC: Job Object
+ *
+ * The base job object (&struct drm_sched_job) contains submission dependencies
+ * in the form of &struct dma_fence objects. Drivers can also implement an
+ * optional prepare_job callback which returns additional dependencies as
+ * dma_fence objects. It's important to note that this callback can't allocate
+ * memory or grab locks under which memory is allocated.
+ *
+ * Drivers should use this as base class for an object which contains the
+ * necessary state to push the command submission to the hardware.
+ *
+ * The lifetime of the job object needs to last at least from submitting it to
+ * the scheduler (through drm_sched_job_arm()) until the scheduler has invoked
+ * &struct drm_sched_backend_ops.free_job and, thereby, has indicated that it
+ * does not need the job anymore. Drivers can of course keep their job object
+ * alive for longer than that, but that's outside of the scope of the scheduler
+ * component.
+ *
+ * Job initialization is split into two stages:
+ *   1. drm_sched_job_init(), which performs the basic preparation of a job.
+ *      Its effects can be fully reverted through drm_sched_job_cleanup(), so
+ *      drivers don't have to be mindful of this function's consequences.
+ *   2. drm_sched_job_arm(), which irrevocably arms a job for execution. This
+ *      initializes the job's fences, and the job then has to be submitted with
+ *      drm_sched_entity_push_job(). Once drm_sched_job_arm() has been called,
+ *      the job structure has to stay valid until the scheduler has invoked
+ *      &struct drm_sched_backend_ops.free_job() (see the sketch below).
+ *
+ * It's important to note that after arming a job drivers must follow the
+ * dma_fence rules and can't easily allocate memory or take locks under which
+ * memory is allocated.
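+ *
+ * Putting the two stages together, a minimal submission path could look like
+ * the following sketch. struct my_job (embedding &struct drm_sched_job as
+ * "base"), my_job_prepare(), entity and owner are hypothetical driver-side
+ * names assumed to be at hand::
+ *
+ *        struct dma_fence *out_fence;
+ *        struct my_job *job;
+ *        int ret;
+ *
+ *        job = kzalloc(sizeof(*job), GFP_KERNEL);
+ *        if (!job)
+ *                return -ENOMEM;
+ *
+ *        ret = drm_sched_job_init(&job->base, entity, 1, owner);
+ *        if (ret) {
+ *                kfree(job);
+ *                return ret;
+ *        }
+ *
+ *        // Allocate everything needed for pushing the job to the hardware
+ *        // *before* arming, while allocating memory is still allowed.
+ *        ret = my_job_prepare(job);
+ *        if (ret) {
+ *                drm_sched_job_cleanup(&job->base);
+ *                kfree(job);
+ *                return ret;
+ *        }
+ *
+ *        drm_sched_job_arm(&job->base);
+ *        // Take any references needed later, e.g., to the finished fence,
+ *        // before pushing; afterwards the scheduler owns the job.
+ *        out_fence = dma_fence_get(&job->base.s_fence->finished);
+ *        drm_sched_entity_push_job(&job->base);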
+ */
+
+/**
+ * DOC: Entity Object
+ *
+ * The entity object (&struct drm_sched_entity) is a container for jobs which
+ * should execute sequentially. Drivers should create an entity for each
+ * individual context they maintain for command submissions which can run in
+ * parallel.
+ *
+ * The lifetime of the entity *should not* exceed the lifetime of the
+ * userspace process it was created for and drivers should call the
+ * drm_sched_entity_flush() function from their file_operations.flush()
+ * callback. It is possible that an entity object is not alive anymore
+ * while jobs previously fetched from it are still running on the hardware.
+ *
+ * This is done because all results of a command submission should become
+ * visible externally even after a process exits. This is normal POSIX
+ * behavior for I/O operations.
+ *
+ * The problem with this approach is that GPU submissions contain executable
+ * shaders enabling processes to evade their termination by offloading work to
+ * the GPU. So when a process is terminated with SIGKILL, the entity object
+ * makes sure that jobs are freed without running them while still maintaining
+ * correct sequential order for signaling fences.
+ *
+ * All entities associated with a scheduler have to be torn down before that
+ * scheduler.
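+ *
+ * A minimal sketch of the flush callback mentioned above; struct my_file_priv
+ * and to_my_file_priv() are hypothetical driver code, and the timeout is an
+ * arbitrary choice::
+ *
+ *        static int my_driver_flush(struct file *f, fl_owner_t id)
+ *        {
+ *                struct my_file_priv *fpriv = to_my_file_priv(f);
+ *
+ *                // Give queued jobs a chance to be picked up so that their
+ *                // results stay visible even though the process exits.
+ *                drm_sched_entity_flush(&fpriv->entity, msecs_to_jiffies(3000));
+ *
+ *                return 0;
+ *        }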
+ */
+
+/**
+ * DOC: Hardware Fence Object
+ *
+ * The hardware fence object is a dma_fence provided by the driver through
+ * &struct drm_sched_backend_ops.run_job. The driver signals this fence once the
+ * hardware has completed the associated job.
+ *
+ * Drivers need to make sure that the normal dma_fence semantics are followed
+ * for this object. It's important to note that the memory for this object can
+ * *not* be allocated in &struct drm_sched_backend_ops.run_job since that would
+ * violate the requirements for the dma_fence implementation. The scheduler
+ * maintains a timeout handler which triggers if this fence doesn't signal
+ * within a configurable amount of time.
+ *
+ * The lifetime of this object follows dma_fence refcounting rules. The
+ * scheduler takes ownership of the reference returned by the driver and
+ * drops it when it's not needed any more.
+ *
+ * See &struct drm_sched_backend_ops.run_job for precise refcounting rules.
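+ *
+ * A rough sketch of a run_job() implementation returning such a fence; the
+ * my_* identifiers are hypothetical driver code::
+ *
+ *        static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
+ *        {
+ *                struct my_job *job = to_my_job(sched_job);
+ *
+ *                // The hardware fence was allocated and initialized before
+ *                // the job was armed; no memory may be allocated here.
+ *                my_ring_submit(job);
+ *
+ *                // The scheduler takes ownership of this reference.
+ *                return dma_fence_get(job->hw_fence);
+ *        }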
+ */
+
+/**
+ * DOC: Scheduler Fence Object
+ *
+ * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
+ * time from pushing the job into the scheduler until the hardware has finished
+ * processing it. It is managed by the scheduler. The implementation provides
+ * dma_fence interfaces for signaling both scheduling of a command submission
+ * as well as finishing of processing.
+ *
+ * The lifetime of this object also follows normal dma_fence refcounting rules.
+ */
+
+/**
+ * DOC: Scheduler and Run Queue Objects
+ *
+ * The scheduler object itself (&struct drm_gpu_scheduler) does the actual
+ * scheduling: it picks the next entity to run a job from and pushes that job
+ * onto the hardware. Both FIFO and RR selection algorithms are supported, with
+ * FIFO being the default and the recommended one.
+ *
+ * The lifetime of the scheduler is managed by the driver using it. Before
+ * destroying the scheduler the driver must ensure that all hardware processing
+ * involving this scheduler object has finished, for example by calling
+ * disable_irq(). It is *not* sufficient to wait for the hardware fence here
+ * since this doesn't guarantee that all callback processing has finished.
+ *
+ * The run queue object (&struct drm_sched_rq) is a container for entities of a
+ * certain priority level. This object is internally managed by the scheduler
+ * and drivers must not touch it directly. The lifetime of a run queue is bound
+ * to the scheduler's lifetime.
+ *
+ * All entities associated with a scheduler must be torn down before it. Drivers
+ * should implement &struct drm_sched_backend_ops.cancel_job to prevent pending
+ * jobs (those that were pulled from an entity into the scheduler, but have not
+ * been completed by the hardware yet) from leaking.
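+ *
+ * A teardown sketch respecting the lifetime rules above; the my_* names are
+ * hypothetical driver code::
+ *
+ *        // 1. Tear down all entities associated with the scheduler.
+ *        drm_sched_entity_destroy(&fpriv->entity);
+ *
+ *        // 2. Make sure the hardware and all callback processing involving
+ *        //    this scheduler have finished, e.g., by disabling interrupts.
+ *        my_stop_completion_irqs(mydev);
+ *
+ *        // 3. Tear down the scheduler itself. Pending jobs are cancelled
+ *        //    through the cancel_job() callback, if implemented.
+ *        drm_sched_fini(&mydev->sched);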
  */
 
 /**
  * DOC: Flow Control
  *
  * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
- * in which the jobs fetched from scheduler entities are executed.
+ * at which jobs fetched from scheduler entities are executed.
  *
- * In this context the &drm_gpu_scheduler keeps track of a driver specified
- * credit limit representing the capacity of this scheduler and a credit count;
- * every &drm_sched_job carries a driver specified number of credits.
+ * In this context the &struct drm_gpu_scheduler keeps track of a driver
+ * specified credit limit representing the capacity of this scheduler and a
+ * credit count; every &struct drm_sched_job carries a driver-specified number
+ * of credits.
  *
- * Once a job is executed (but not yet finished), the job's credits contribute
- * to the scheduler's credit count until the job is finished. If by executing
- * one more job the scheduler's credit count would exceed the scheduler's
- * credit limit, the job won't be executed. Instead, the scheduler will wait
- * until the credit count has decreased enough to not overflow its credit limit.
- * This implies waiting for previously executed jobs.
+ * Once a job is picked up for execution, its credits contribute to the
+ * scheduler's credit count until the job is finished. If by executing one more
+ * job the scheduler's credit count would exceed the scheduler's credit limit,
+ * the job won't be executed. Instead, the scheduler will wait until the credit
+ * count has decreased enough to not overflow its credit limit. This implies
+ * waiting for previously executed jobs.
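+ *
+ * For example, a driver whose jobs occupy a varying number of ring buffer
+ * units could account for that at job initialization; my_job_ring_slots() is
+ * a hypothetical helper::
+ *
+ *        // This job only runs once the credits of all jobs in flight plus
+ *        // my_job_ring_slots(job) fit into the scheduler's credit limit.
+ *        ret = drm_sched_job_init(&job->base, entity,
+ *                                 my_job_ring_slots(job), owner);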
  */
 
+/**
+ * DOC: Error and Timeout handling
+ *
+ * Errors are signaled by using dma_fence_set_error() on the hardware fence
+ * object before signaling it with dma_fence_signal(). Errors are then bubbled
+ * up from the hardware fence to the scheduler fence.
+ *
+ * The entity allows querying the error of its last run submission through the
+ * drm_sched_entity_error() function. This can be used to cancel queued
+ * submissions in &struct drm_sched_backend_ops.run_job, as well as to prevent
+ * pushing further ones into the entity in the driver's submission function.
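+ *
+ * A sketch of both the signaling and the querying side, with hypothetical
+ * my_* driver code::
+ *
+ *        // Completion path: set a negative errno before signaling.
+ *        if (hw_error)
+ *                dma_fence_set_error(job->hw_fence, -EIO);
+ *        dma_fence_signal(job->hw_fence);
+ *
+ *        // Submission path: refuse new jobs on a broken entity.
+ *        ret = drm_sched_entity_error(entity);
+ *        if (ret)
+ *                return ret;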
+ *
+ * When the hardware fence doesn't signal within a configurable amount of time
+ * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
+ * then follow the procedure described in that callback's documentation.
+ *
+ * (TODO: The timeout handler should probably switch to using the hardware
+ * fence as its parameter instead of the job. Otherwise the handling will
+ * always race between timing out and signaling the fence.)
+ *
+ * The scheduler also used to provide functionality for re-submitting jobs
+ * and thereby replacing the hardware fence during reset handling. This
+ * functionality is now deprecated: it has proven to be fundamentally racy,
+ * is not compatible with the dma_fence rules and shouldn't be used in new
+ * code.
+ *
+ * Additionally, there is the function drm_sched_increase_karma(), which tries
+ * to find the entity which submitted a job and increases its 'karma' atomic
+ * variable to prevent re-submitting jobs from this entity. This has quite some
+ * overhead, and since job re-submission is now deprecated, using this function
+ * is discouraged.
+ *
+ * Drivers can still recreate the GPU state in case it got lost during timeout
+ * handling, *if* they can guarantee that forward progress will be made and
+ * this doesn't cause another timeout. But this is highly hardware-specific and
+ * out of the scope of the general GPU scheduler.
+ */
 #include <linux/export.h>
 #include <linux/wait.h>
 #include <linux/sched.h>
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 323a505e6e6a..0f0687b7ae9c 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -458,8 +458,8 @@ struct drm_sched_backend_ops {
 	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
 
 	/**
-	 * @timedout_job: Called when a job has taken too long to execute,
-	 * to trigger GPU recovery.
+	 * @timedout_job: Called when a hardware fence didn't signal within a
+	 * configurable amount of time. Triggers GPU recovery.
 	 *
 	 * @sched_job: The job that has timed out
 	 *
@@ -506,7 +506,6 @@ struct drm_sched_backend_ops {
 	 * that timeout handlers are executed sequentially.
 	 *
 	 * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
-	 *
 	 */
 	enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
 
-- 
2.49.0

