Message-ID: <81065d3cdf24c3d972e46b39eff1b744c93c7ccc.camel@mailbox.org>
Date: Fri, 07 Mar 2025 19:17:18 +0100
From: Philipp Stanner <phasta@...lbox.org>
To: Maíra Canal <mcanal@...lia.com>, Philipp Stanner
 <phasta@...nel.org>, Matthew Brost <matthew.brost@...el.com>, Danilo
 Krummrich <dakr@...nel.org>, Christian König
 <ckoenig.leichtzumerken@...il.com>, Maarten Lankhorst
 <maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>, 
 Thomas Zimmermann <tzimmermann@...e.de>, David Airlie <airlied@...il.com>,
 Simona Vetter <simona@...ll.ch>, Sumit Semwal <sumit.semwal@...aro.org>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v7 1/3] drm/sched: Adjust outdated docu for run_job()

On Fri, 2025-03-07 at 15:09 -0300, Maíra Canal wrote:
> Hi Philipp,
> 
> On 05/03/25 10:05, Philipp Stanner wrote:
> > The documentation for drm_sched_backend_ops.run_job() mentions a
> > certain
> > function called drm_sched_job_recovery(). This function does not
> > exist.
> > What's actually meant is drm_sched_resubmit_jobs(), which is by now
> > also
> > deprecated.
> > 
> > Furthermore, the scheduler expects to "inherit" a reference on the
> > fence
> > from the run_job() callback. This, so far, is also not documented.
> > 
> > Remove the mention of the removed function.
> > 
> > Discourage the behavior of drm_sched_backend_ops.run_job() being
> > called
> > multiple times for the same job.
> > 
> > Document the necessity of incrementing the refcount in run_job().
> > 
> > Signed-off-by: Philipp Stanner <phasta@...nel.org>
> > ---
> >   include/drm/gpu_scheduler.h | 34 ++++++++++++++++++++++++++++++--
> > --
> >   1 file changed, 30 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/drm/gpu_scheduler.h
> > b/include/drm/gpu_scheduler.h
> > index 50928a7ae98e..6381baae8024 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -410,10 +410,36 @@ struct drm_sched_backend_ops {
> >   					 struct drm_sched_entity
> > *s_entity);
> >   
> >   	/**
> > -         * @run_job: Called to execute the job once all of the
> > dependencies
> > -         * have been resolved.  This may be called multiple times,
> > if
> > -	 * timedout_job() has happened and
> > drm_sched_job_recovery()
> > -	 * decides to try it again.
> > +	 * @run_job: Called to execute the job once all of the
> > dependencies
> > +	 * have been resolved.
> > +	 *
> > +	 * @sched_job: the job to run
> > +	 *
> > +	 * The deprecated drm_sched_resubmit_jobs() (called by
> > &struct
> > +	 * drm_sched_backend_ops.timedout_job) can invoke this
> > again with the
> > +	 * same parameters. Using this is discouraged because it
> > violates
> > +	 * dma_fence rules, notably dma_fence_init() has to be
> > called on
> > +	 * already initialized fences for a second time. Moreover,
> > this is
> > +	 * dangerous because attempts to allocate memory might
> > deadlock with
> > +	 * memory management code waiting for the reset to
> > complete.
> 
> Thanks for adding this paragraph!

You're welcome.


>  Also, thanks Christian for providing
> this explanation in v5. It really helped clarify the reasoning behind
> deprecating drm_sched_resubmit_jobs().

I thought a bit more about it over the last few days and think that you
are right: we definitely have to tell drivers with a hardware scheduler
how they can achieve that without using drm_sched_resubmit_jobs().

Unfortunately, I discovered that this is quite complicated and
certainly difficult to do right.

So I'd only feel comfortable writing more documentation about that once
we get more input from Christian, or someone else who has a hardware
scheduler, about how they're currently doing it.


Cheers
P.

> 
> Best Regards,
> - Maíra
> 
> > +	 *
> > +	 * TODO: Document what drivers should do / use instead.
> > +	 *
> > +	 * This method is called in a workqueue context - either
> > from the
> > +	 * submit_wq the driver passed through drm_sched_init(),
> > or, if the
> > +	 * driver passed NULL, a separate, ordered workqueue the
> > scheduler
> > +	 * allocated.
> > +	 *
> > +	 * Note that the scheduler expects to 'inherit' its own
> > reference to
> > +	 * this fence from the callback. It does not invoke an
> > extra
> > +	 * dma_fence_get() on it. Consequently, this callback must
> > take a
> > +	 * reference for the scheduler, and additional ones for
> > the driver's
> > +	 * respective needs.
> > +	 *
> > +	 * Return:
> > +	 * * On success: dma_fence the driver must signal once the
> > hardware has
> > +	 * completed the job ("hardware fence").
> > +	 * * On failure: NULL or an ERR_PTR.
> >   	 */
> >   	struct dma_fence *(*run_job)(struct drm_sched_job
> > *sched_job);
> >   
>
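
To illustrate the reference-counting rule in the quoted kernel-doc (the
scheduler "inherits" one reference from run_job() and does not call
dma_fence_get() itself), here is a minimal userspace sketch. It is not
kernel code: struct toy_fence, struct toy_job and all toy_*() helpers are
hypothetical stand-ins that only model the refcount, and the assumption
that the driver keeps its own reference in the job is for illustration
only.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct dma_fence; only the refcount is modelled. */
struct toy_fence {
	int refcount;
};

/* Hypothetical stand-in for a driver job that keeps its own fence reference. */
struct toy_job {
	struct toy_fence *hw_fence;
};

/* Create a fence holding one reference, owned by the creator (the driver). */
static struct toy_fence *toy_fence_create(void)
{
	struct toy_fence *f = calloc(1, sizeof(*f));

	if (f)
		f->refcount = 1;
	return f;
}

/* Take an additional reference (models the role of dma_fence_get()). */
static struct toy_fence *toy_fence_get(struct toy_fence *f)
{
	f->refcount++;
	return f;
}

/* Drop a reference, free on the last one, and return the remaining count. */
static int toy_fence_put(struct toy_fence *f)
{
	int left;

	assert(f->refcount > 0);
	left = --f->refcount;
	if (left == 0)
		free(f);
	return left;
}

/*
 * Models the run_job() callback: the driver keeps its own reference in
 * job->hw_fence and takes one more, which the caller inherits through
 * the return value.
 */
static struct toy_fence *toy_run_job(struct toy_job *job)
{
	job->hw_fence = toy_fence_create();
	if (!job->hw_fence)
		return NULL;
	return toy_fence_get(job->hw_fence);
}

/*
 * Models the scheduler side: it takes no extra reference of its own and
 * simply drops the inherited one when it is done with the fence.
 * Returns the refcount left afterwards, or -1 if run_job failed.
 */
static int toy_scheduler_run(struct toy_job *job)
{
	struct toy_fence *fence = toy_run_job(job);

	if (!fence)
		return -1;
	return toy_fence_put(fence);
}
```

In this model, if toy_run_job() returned job->hw_fence without the extra
toy_fence_get(), the scheduler's put would free a fence the driver still
points to.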