[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <585fa052-4eff-940e-b307-2415c315686a@amd.com>
Date: Wed, 8 Mar 2023 11:03:34 +0100
From: Christian König <christian.koenig@....com>
To: Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Asahi Lina <lina@...hilina.net>,
Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>,
David Airlie <airlied@...il.com>,
Daniel Vetter <daniel@...ll.ch>,
Miguel Ojeda <ojeda@...nel.org>,
Alex Gaynor <alex.gaynor@...il.com>,
Wedson Almeida Filho <wedsonaf@...il.com>,
Boqun Feng <boqun.feng@...il.com>, Gary Guo <gary@...yguo.net>,
Björn Roy Baron <bjorn3_gh@...tonmail.com>,
Sumit Semwal <sumit.semwal@...aro.org>,
Luben Tuikov <luben.tuikov@....com>,
Jarkko Sakkinen <jarkko@...nel.org>,
Dave Hansen <dave.hansen@...ux.intel.com>
Cc: Alyssa Rosenzweig <alyssa@...enzweig.io>,
Karol Herbst <kherbst@...hat.com>,
Ella Stanforth <ella@...unix.org>,
Faith Ekstrand <faith.ekstrand@...labora.com>,
Mary <mary@...y.zone>, linux-kernel@...r.kernel.org,
dri-devel@...ts.freedesktop.org, rust-for-linux@...r.kernel.org,
linux-media@...r.kernel.org, linaro-mm-sig@...ts.linaro.org,
linux-sgx@...r.kernel.org, asahi@...ts.linux.dev
Subject: Re: [PATCH RFC 11/18] drm/scheduler: Clean up jobs when the scheduler
is torn down
Am 08.03.23 um 10:57 schrieb Maarten Lankhorst:
>
> On 2023-03-07 15:25, Asahi Lina wrote:
>> drm_sched_fini() currently leaves any pending jobs dangling, which
>> causes segfaults and other badness when job completion fences are
>> signaled after the scheduler is torn down.
>>
>> Explicitly detach all jobs from their completion callbacks and free
>> them. This makes it possible to write a sensible safe abstraction for
>> drm_sched, without having to externally duplicate the tracking of
>> in-flight jobs.
>>
>> This shouldn't regress any existing drivers, since calling
>> drm_sched_fini() with any pending jobs is broken and this change should
>> be a no-op if there are no pending jobs.
>>
>> Signed-off-by: Asahi Lina <lina@...hilina.net>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 27
>> +++++++++++++++++++++++++--
>> 1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 5c0add2c7546..0aab1e0aebdd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1119,10 +1119,33 @@ EXPORT_SYMBOL(drm_sched_init);
>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>> {
>> struct drm_sched_entity *s_entity;
>> + struct drm_sched_job *s_job, *tmp;
>> int i;
>> - if (sched->thread)
>> - kthread_stop(sched->thread);
>> + if (!sched->thread)
>> + return;
>> +
>> + /*
>> + * Stop the scheduler, detaching all jobs from their hardware
>> callbacks
>> + * and cleaning up complete jobs.
>> + */
>> + drm_sched_stop(sched, NULL);
>> +
>> + /*
>> + * Iterate through the pending job list and free all jobs.
>> + * This assumes the driver has either guaranteed jobs are
>> already stopped, or that
>> + * otherwise it is responsible for keeping any necessary data
>> structures for
>> + * in-progress jobs alive even when the free_job() callback is
>> called early (e.g. by
>> + * putting them in its own queue or doing its own refcounting).
>> + */
>> + list_for_each_entry_safe(s_job, tmp, &sched->pending_list, list) {
>> + spin_lock(&sched->job_list_lock);
>> + list_del_init(&s_job->list);
>> + spin_unlock(&sched->job_list_lock);
>> + sched->ops->free_job(s_job);
>> + }
>
> I would stop the kthread first, then delete all jobs without spinlock
> since nothing else can race against sched_fini?
>
> If you do need the spinlock, It would need to guard
> list_for_each_entry too.
Well this case here actually should not happen in the first place.
Jobs depend on their device, so as long as there are jobs there should
also be a reference to the scheduler.
What could be is that you have allocated a scheduler instance
dynamically, but even then you should first tear down all entities and
then the scheduler.
Regards,
Christian.
>
>> +
>> + kthread_stop(sched->thread);
>> for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >=
>> DRM_SCHED_PRIORITY_MIN; i--) {
>> struct drm_sched_rq *rq = &sched->sched_rq[i];
>>
Powered by blists - more mailing lists