Message-ID: <bd2fda6dc370432e5478c891514bce5d8a0d8b76.camel@redhat.com>
Date: Fri, 24 Oct 2025 11:12:40 +0200
From: Philipp Stanner <pstanner@...hat.com>
To: Matthew Brost <matthew.brost@...el.com>, Philipp Stanner
<phasta@...nel.org>
Cc: Danilo Krummrich <dakr@...nel.org>, Christian König
<ckoenig.leichtzumerken@...il.com>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, Tvrtko Ursulin
<tvrtko.ursulin@...lia.com>, dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, linux-media@...r.kernel.org, Christian
König <christian.koenig@....com>
Subject: Re: [PATCH v2] drm/sched: Extend and update documentation
On Thu, 2025-10-23 at 14:27 -0700, Matthew Brost wrote:
> On Thu, Oct 23, 2025 at 04:38:04PM +0200, Philipp Stanner wrote:
> > From: Philipp Stanner <pstanner@...hat.com>
> >
> > The various objects and their memory lifetime used by the GPU scheduler
> > are currently not fully documented.
> >
> > Add documentation describing the scheduler's objects. Improve the
> > general documentation at a few other places.
> >
> > Co-developed-by: Christian König <christian.koenig@....com>
> > Signed-off-by: Christian König <christian.koenig@....com>
> > Signed-off-by: Philipp Stanner <pstanner@...hat.com>
>
> This is great documentation—thanks for writing it up. It clarifies
> several bugs I've noticed in multiple NPU drivers over the past year
> related to reclaim, making it clear that these behaviors are not
> allowed.
>
Good to hear that we're on the same page here.
> Also, check out [1], which enforces some of the reclaim rules
> outlined here.
I'll take a look, hopefully next week.
>
> A couple of minor nits below, but I'm inclined to RB.
>
> [1] https://patchwork.freedesktop.org/series/156284/
>
> > ---
> > Changes in v2:
> > - Fix some typos. (Bagas Sanjaya)
> > ---
> > Documentation/gpu/drm-mm.rst | 36 ++++
> > drivers/gpu/drm/scheduler/sched_main.c | 229 ++++++++++++++++++++++---
> > include/drm/gpu_scheduler.h | 5 +-
> > 3 files changed, 239 insertions(+), 31 deletions(-)
> >
> > diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> > index d55751cad67c..95ee95fd987a 100644
> > --- a/Documentation/gpu/drm-mm.rst
> > +++ b/Documentation/gpu/drm-mm.rst
> > @@ -556,12 +556,48 @@ Overview
> > .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> > :doc: Overview
> >
> >
[…]
> >
> > +
> > +/**
> > + * DOC: Entity Object
> > + *
> > + * The entity object (&struct drm_sched_entity) is a container for jobs which
> > + * should execute sequentially. Drivers should create an entity for each
> > + * individual context they maintain for command submissions which can run in
> > + * parallel.
> > + *
> > + * The lifetime of the entity *should not* exceed the lifetime of the
> > + * userspace process it was created for and drivers should call the
> > + * drm_sched_entity_flush() function from their file_operations.flush()
> > + * callback. It is possible that an entity object is not alive anymore
> > + * while jobs previously fetched from it are still running on the hardware.
>
> This is not how Xe has implemented it. Instead, on FD close (or queue
> close, for that matter), we trigger our entity through a GPU command and
> immediately timeout all jobs,
>
How can Xe "timeout all jobs"? Are you interfering with the scheduler's
timeout work item?
Anyway, the doc just says "It is possible", not that it's typical. So
the doc is fine here, agreed?
> signaling them accordingly. Technically,
> the object entity can outlive the FD, but the effects of the FD close
> are externally visible. The complete teardown process can outlive the FD
> as well, since it involves multiple ping-pong steps with our firmware. I
> believe this approach is also valid.
>
> > + *
> > + * This is done because all results of a command submission should become
> > + * visible externally even after a process exits. This is normal POSIX
> > + * behavior for I/O operations.
> > + *
> > + * The problem with this approach is that GPU submissions contain executable
> > + * shaders enabling processes to evade their termination by offloading work to
> > + * the GPU. So when a process is terminated with SIGKILL, the entity object
> > + * makes sure that jobs are freed without running them while still maintaining
> > + * correct sequential order for signaling fences.
> > + *
> > + * All entities associated with a scheduler have to be torn down before that
> > + * scheduler.
> > + */
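(Side note for readers: the flush rule above typically boils down to something
like the following in a driver. Just a sketch; "my_file_priv" and friends are
made-up names, not real API.)

static int my_driver_flush(struct file *file, fl_owner_t id)
{
        struct drm_file *file_priv = file->private_data;
        struct my_file_priv *fpriv = file_priv->driver_priv;

        /*
         * Wait (with a timeout) for the entity's queued jobs so that the
         * results of this process' submissions stay externally visible.
         */
        drm_sched_entity_flush(&fpriv->entity, msecs_to_jiffies(5000));

        return 0;
}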
> > +
> > +/**
> > + * DOC: Hardware Fence Object
> > + *
> > + * The hardware fence object is a dma_fence provided by the driver through
> > + * &struct drm_sched_backend_ops.run_job. The driver signals this fence once the
> > + * hardware has completed the associated job.
> > + *
> > + * Drivers need to make sure that the normal dma_fence semantics are followed
> > + * for this object. It's important to note that the memory for this object can
> > + * *not* be allocated in &struct drm_sched_backend_ops.run_job since that would
> > + * violate the requirements for the dma_fence implementation. The scheduler
> > + * maintains a timeout handler which triggers if this fence doesn't signal
> > + * within a configurable amount of time.
> > + *
> > + * The lifetime of this object follows dma_fence refcounting rules. The
> > + * scheduler takes ownership of the reference returned by the driver and
> > + * drops it when it's not needed any more.
> > + *
> > + * See &struct drm_sched_backend_ops.run_job for precise refcounting rules.
> > + */
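(Since the allocation rule above keeps tripping up drivers, an illustration:
the fence is allocated when the job is created, run_job() only hands out a
reference. "my_job", "my_ring_submit()" etc. are made up.)

struct my_job {
        struct drm_sched_job base;
        struct dma_fence *hw_fence;     /* allocated at job creation time */
};

static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
        struct my_job *job = container_of(sched_job, struct my_job, base);

        /*
         * No fence allocation here. Kick off the hardware and return a
         * reference which the scheduler takes ownership of and drops once
         * it doesn't need the fence anymore.
         */
        my_ring_submit(job);

        return dma_fence_get(job->hw_fence);
}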
> > +
> > +/**
> > + * DOC: Scheduler Fence Object
> > + *
> > + * The scheduler fence object (&struct drm_sched_fence) encapsulates the whole
> > + * time from pushing the job into the scheduler until the hardware has finished
> > + * processing it. It is managed by the scheduler. The implementation provides
> > + * dma_fence interfaces for signaling both scheduling of a command submission
> > + * as well as finishing of processing. &struct drm_sched_fence.finished is the
> > + * fence typically used to synchronize userspace, e.g., in a &struct drm_syncobj.
>
> Is it ever not the finished fence? Early in Xe, I originally used the
> hardware fence, and it caused all sorts of trouble. One reason the DRM
> scheduler is a well-designed system is that the finished fence can be
> installed universally.
Agreed, we can say that it's always the finished fence.
(you should never call the GPU scheduler a "well-designed system",
though xD)
>
> I also suggest we add a wrapper to access the finished fence, so drivers
> don’t need to dig deep into scheduler objects to retrieve it.
Yes, ACK. I actually once had a patch somewhere which adds such a
wrapper and uses it in Nouveau. I can dig it up, but feel free to
provide one if you have some cycles.
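Roughly something like this, I'd imagine (name and exact semantics obviously
up for debate, this is just a sketch of a hypothetical helper):

/* Hypothetical wrapper, not in the tree yet. */
static inline struct dma_fence *
drm_sched_job_finished_fence(struct drm_sched_job *job)
{
        /* Hand out a new reference to the finished fence. */
        return dma_fence_get(&job->s_fence->finished);
}

Drivers would then install the returned fence into their syncobjs / dma_resvs
instead of poking at job->s_fence directly.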
>
> > + *
> > + * The lifetime of this object also follows normal dma_fence refcounting rules.
> > + */
> > +
> > +/**
> > + * DOC: Scheduler and Run Queue Objects
> > + *
> > + * The scheduler object itself (&struct drm_gpu_scheduler) does the actual
> > + * scheduling: it picks the next entity to run a job from and pushes that job
> > + * onto the hardware. Both FIFO and RR selection algorithms are supported, with
> > + * FIFO being the default and the recommended one.
> > + *
> > + * The lifetime of the scheduler is managed by the driver using it. Before
> > + * destroying the scheduler the driver must ensure that all hardware processing
> > + * involving this scheduler object has finished by calling, for example,
> > + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
> > + * since this doesn't guarantee that all callback processing has finished.
> > + *
> > + * The run queue object (&struct drm_sched_rq) is a container for entities of a
> > + * certain priority level. This object is internally managed by the scheduler
> > + * and drivers must not touch it directly. The lifetime of a run queue is bound
> > + * to the scheduler's lifetime.
> > + *
> > + * All entities associated with a scheduler must be torn down before it. Drivers
> > + * should implement &struct drm_sched_backend_ops.cancel_job to prevent pending
> > + * jobs (those that were pulled from an entity into the scheduler, but have not
> > + * been completed by the hardware yet) from leaking.
> > */
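(To illustrate the teardown order described above for readers; sketch only,
the driver bits are made up.)

static void my_device_fini(struct my_device *mdev)
{
        /* Ensure no interrupts / fence callbacks can run anymore. */
        disable_irq(mdev->irq);

        /* Entities first ... */
        drm_sched_entity_destroy(&mdev->entity);

        /*
         * ... then the scheduler. Pending jobs get cleaned up through
         * drm_sched_backend_ops.cancel_job.
         */
        drm_sched_fini(&mdev->sched);
}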
> >
> > /**
> > * DOC: Flow Control
> > *
> > * The DRM GPU scheduler provides a flow control mechanism to regulate the rate
> > - * in which the jobs fetched from scheduler entities are executed.
> > + * at which jobs fetched from scheduler entities are executed.
> > *
> > - * In this context the &drm_gpu_scheduler keeps track of a driver specified
> > - * credit limit representing the capacity of this scheduler and a credit count;
> > - * every &drm_sched_job carries a driver specified number of credits.
> > + * In this context the &struct drm_gpu_scheduler keeps track of a driver
> > + * specified credit limit representing the capacity of this scheduler and a
> > + * credit count; every &struct drm_sched_job carries a driver-specified number
> > + * of credits.
> > *
> > - * Once a job is executed (but not yet finished), the job's credits contribute
> > - * to the scheduler's credit count until the job is finished. If by executing
> > - * one more job the scheduler's credit count would exceed the scheduler's
> > - * credit limit, the job won't be executed. Instead, the scheduler will wait
> > - * until the credit count has decreased enough to not overflow its credit limit.
> > - * This implies waiting for previously executed jobs.
> > + * Once a job is being executed, the job's credits contribute to the
> > + * scheduler's credit count until the job is finished. If by executing one more
> > + * job the scheduler's credit count would exceed the scheduler's credit limit,
> > + * the job won't be executed. Instead, the scheduler will wait until the credit
> > + * count has decreased enough to not overflow its credit limit. This implies
> > + * waiting for previously executed jobs.
> > */
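(For readers: the flow control rule above boils down to roughly the following
check; simplified illustration, not the actual scheduler code.)

static bool example_job_fits(u32 credit_limit, u32 credit_count,
                             u32 job_credits)
{
        /*
         * Only pick the job if running it doesn't exceed the scheduler's
         * capacity; otherwise wait for in-flight jobs to finish first.
         */
        return credit_count + job_credits <= credit_limit;
}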
> >
> > +/**
> > + * DOC: Error and Timeout handling
> > + *
>
> Should this section mention DRM_GPU_SCHED_STAT_NO_HANG usage?
Yes, it should.
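Something along these lines could serve as an example in the text (sketch,
assuming the current enum names; "my_job" and the helpers are made up):

static enum drm_gpu_sched_stat my_timedout_job(struct drm_sched_job *sched_job)
{
        struct my_job *job = container_of(sched_job, struct my_job, base);

        /*
         * False positive, e.g. a long-running but legitimate job: tell the
         * scheduler that there was no actual hang so it re-arms the timeout.
         */
        if (dma_fence_is_signaled(job->hw_fence) ||
            my_hw_is_making_progress(job))
                return DRM_GPU_SCHED_STAT_NO_HANG;

        my_hw_reset(job);
        dma_fence_set_error(job->hw_fence, -ETIMEDOUT);
        dma_fence_signal(job->hw_fence);

        return DRM_GPU_SCHED_STAT_RESET;
}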
>
> > + * Errors are signaled by using dma_fence_set_error() on the hardware fence
> > + * object before signaling it with dma_fence_signal(). Errors are then bubbled
> > + * up from the hardware fence to the scheduler fence.
> > + *
> > + * The entity allows querying the error of the last run submission using the
> > + * drm_sched_entity_error() function, which can be used to cancel queued
> > + * submissions in &struct drm_sched_backend_ops.run_job as well as to prevent
> > + * pushing further ones into the entity in the driver's submission function.
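(Reusing the my_job sketch from further up, the run_job() side of this
typically looks roughly like the following; returning an ERR_PTR makes the
scheduler propagate the error to the finished fence.)

static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
        struct my_job *job = container_of(sched_job, struct my_job, base);
        int err = drm_sched_entity_error(sched_job->entity);

        /*
         * A previous submission on this entity failed: cancel this job
         * instead of running it and propagate the error.
         */
        if (err)
                return ERR_PTR(err);

        my_ring_submit(job);

        return dma_fence_get(job->hw_fence);
}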
> > + *
> > + * When the hardware fence doesn't signal within a configurable amount of time
> > + * &struct drm_sched_backend_ops.timedout_job gets invoked. The driver should
> > + * then follow the procedure described in that callback's documentation.
> > + *
> > + * (TODO: The timeout handler should probably switch to using the hardware
> > + * fence as parameter instead of the job. Otherwise the handling will always
> > + * race between timing out and signaling the fence).
> > + *
>
> This might be a bit invasive, and honestly, I'm not sure if the
> subsystem can handle it. I'd leave out this TODO for now. We can't
> predict the future, so it's probably best not to try in the
> documentation.
Yes. I think that TODO is actually superfluous now that we have NO_HANG.
P.
>
> Matt
>
> > + * The scheduler also used to provide functionality for re-submitting jobs,
> > + * thereby replacing the hardware fence during reset handling. This
> > + * functionality is now deprecated; it has proven to be fundamentally racy,
> > + * is incompatible with the dma_fence rules and shouldn't be used in new code.
> > + *
> > + * Additionally, there is the function drm_sched_increase_karma() which tries
> > + * to find the entity which submitted a job and increases its 'karma' atomic
> > + * variable to prevent resubmitting jobs from this entity. This has quite some
> > + * overhead and resubmitting jobs is now marked as deprecated. Thus, using this
> > + * function is discouraged.
> > + *
> > + * Drivers can still recreate the GPU state should it be lost during timeout
> > + * handling *if* they can guarantee that forward progress will be made and
> > + * this doesn't cause another timeout. But this is highly hardware specific
> > + * and out of the scope of the general GPU scheduler.
> > + */
> > #include <linux/export.h>
> > #include <linux/wait.h>
> > #include <linux/sched.h>
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index fb88301b3c45..a97351435d9e 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -458,8 +458,8 @@ struct drm_sched_backend_ops {
> > struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
> >
> > /**
> > - * @timedout_job: Called when a job has taken too long to execute,
> > - * to trigger GPU recovery.
> > + * @timedout_job: Called when a hardware fence didn't signal within a
> > + * configurable amount of time. Triggers GPU recovery.
> > *
> > * @sched_job: The job that has timed out
> > *
> > @@ -506,7 +506,6 @@ struct drm_sched_backend_ops {
> > * that timeout handlers are executed sequentially.
> > *
> > * Return: The scheduler's status, defined by &enum drm_gpu_sched_stat
> > - *
> > */
> > enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
> >
> > --
> > 2.49.0
> >