[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <46f22193d960c0a0960c2ceaa525e9ff57fc09b6.camel@redhat.com>
Date: Fri, 20 Dec 2024 15:11:34 +0100
From: Philipp Stanner <pstanner@...hat.com>
To: Christian König <christian.koenig@....com>, Danilo
Krummrich <dakr@...nel.org>
Cc: Philipp Stanner <phasta@...nel.org>, Luben Tuikov <ltuikov89@...il.com>,
Matthew Brost <matthew.brost@...el.com>, Maarten Lankhorst
<maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, Sumit Semwal <sumit.semwal@...aro.org>,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-media@...r.kernel.org, linaro-mm-sig@...ts.linaro.org, Tvrtko
Ursulin <tursulin@...ulin.net>, Andrey Grodzovsky
<andrey.grodzovsky@....com>
Subject: Re: [PATCH] drm/sched: Document run_job() refcount hazard
On Fri, 2024-12-20 at 14:25 +0100, Christian König wrote:
> Am 20.12.24 um 14:18 schrieb Danilo Krummrich:
> > On Fri, Dec 20, 2024 at 01:53:34PM +0100, Christian König wrote:
> > > Am 20.12.24 um 13:45 schrieb Philipp Stanner:
> > > > From: Philipp Stanner <pstanner@...hat.com>
> > > >
> > > > drm_sched_backend_ops.run_job() returns a dma_fence for the
> > > > scheduler.
> > > > That fence is signalled by the driver once the hardware
> > > > completed the
> > > > associated job. The scheduler does not increment the reference
> > > > count on
> > > > that fence, but implicitly expects to inherit this fence from
> > > > run_job().
> > > >
> > > > This is relatively subtle and prone to misunderstandings.
> > > >
> > > > This implies that, to keep a reference for itself, a driver
> > > > needs to
> > > > call dma_fence_get() in addition to dma_fence_init() in that
> > > > callback.
> > > >
> > > > It's further complicated by the fact that the scheduler even
> > > > decrements
> > > > the refcount in drm_sched_run_job_work() since it created a new
> > > > reference in drm_sched_fence_scheduled(). It does, however,
> > > > still use
> > > > its pointer to the fence after calling dma_fence_put() - which
> > > > is safe
> > > > because of the aforementioned new reference, but actually still
> > > > violates
> > > > the refcounting rules.
> > > >
> > > > Improve the explanatory comment for that decrement.
> > > >
> > > > Move the call to dma_fence_put() to the position behind the
> > > > last usage
> > > > of the fence.
> > > >
> > > > Document the necessity to increment the reference count in
> > > > drm_sched_backend_ops.run_job().
> > > >
> > > > Cc: Christian König <christian.koenig@....com>
> > > > Cc: Tvrtko Ursulin <tursulin@...ulin.net>
> > > > Cc: Andrey Grodzovsky <andrey.grodzovsky@....com>
> > > > Signed-off-by: Philipp Stanner <pstanner@...hat.com>
> > > > ---
> > > > drivers/gpu/drm/scheduler/sched_main.c | 10 +++++++---
> > > > include/drm/gpu_scheduler.h | 20
> > > > ++++++++++++++++----
> > > > 2 files changed, 23 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > index 7ce25281c74c..d6f8df39d848 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > + *
> > > > + * @sched_job: the job to run
> > > > + *
> > > > + * Returns: dma_fence the driver must signal once the
> > > > hardware has
> > > > + * completed the job ("hardware fence").
> > > > + *
> > > > + * Note that the scheduler expects to 'inherit' its
> > > > own reference to
> > > > + * this fence from the callback. It does not invoke an
> > > > extra
> > > > + * dma_fence_get() on it. Consequently, this callback
> > > > must return a
> > > > + * fence whose refcount is at least 2: One for the
> > > > scheduler's
> > > > + * reference returned here, another one for the
> > > > reference kept by the
> > > > + * driver.
> > > Well the driver actually doesn't need any extra reference. The
> > > scheduler
> > > just needs to guarantee that this reference isn't dropped before
> > > it is
> > > signaled.
> > I think he means the reference the driver's fence context has to
> > have in order
> > to signal that thing eventually.
>
> Yeah, but this is usually a weak reference. IIRC most drivers don't
> increment the reference count for the reference they keep to signal a
> fence.
>
> It's expected that the consumers of the dma_fence keep the fence
> alive
> at least until it is signaled.
So are you saying that the driver having an extra reference (without
having obtained it with dma_fence_get()) is not an issue because the
driver is the one who will signal the fence [and then be done with it]?
> That's why we have this nice warning in
> dma_fence_release().
>
> On the other hand I completely agree it would be more defensive if
> drivers increment the reference count for the reference they keep for
> signaling.
>
> So if we want to document that the fence reference count should at
> least
> be 2 we somehow need to enforce this with a warning for example.
We could – but I'm not sure whether it really needs to be "enforced",
especially if it were only to be a minor issue, as you seem to hint at
above.
Document it is the minimum IMO
P.
>
> Regards,
> Christian.
>
>
>
> >
> > > Regards,
> > > Christian.
> > >
> > > > */
> > > > struct dma_fence *(*run_job)(struct drm_sched_job
> > > > *sched_job);
>
Powered by blists - more mailing lists