Message-ID: <18cd6b1f-8872-4a16-9ceb-50fd1ecfea39@igalia.com>
Date: Mon, 16 Jun 2025 10:27:16 +0100
From: Tvrtko Ursulin <tvrtko.ursulin@...lia.com>
To: phasta@...nel.org, Lyude Paul <lyude@...hat.com>,
Danilo Krummrich <dakr@...nel.org>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, Matthew Brost <matthew.brost@...el.com>,
Christian König <ckoenig.leichtzumerken@...il.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
Sumit Semwal <sumit.semwal@...aro.org>,
Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@....com>
Cc: dri-devel@...ts.freedesktop.org, nouveau@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, linux-media@...r.kernel.org
Subject: Re: [RFC PATCH 1/6] drm/sched: Avoid memory leaks with cancel_job() callback

On 12/06/2025 15:20, Philipp Stanner wrote:
> On Thu, 2025-06-12 at 15:17 +0100, Tvrtko Ursulin wrote:
>>
>> On 03/06/2025 10:31, Philipp Stanner wrote:
>>> Since its inception, the GPU scheduler can leak memory if the driver
>>> calls drm_sched_fini() while there are still jobs in flight.
>>>
>>> The simplest way to solve this in a backwards compatible manner is by
>>> adding a new callback, drm_sched_backend_ops.cancel_job(), which
>>> instructs the driver to signal the hardware fence associated with the
>>> job. Afterwards, the scheduler can safely use the established free_job()
>>> callback for freeing the job.
>>>
>>> Implement the new backend_ops callback cancel_job().
>>>
>>> Suggested-by: Tvrtko Ursulin <tvrtko.ursulin@...lia.com>
>>
>> Please just add the link to the patch here (it is only in the cover
>> letter):
>>
>> Link: https://lore.kernel.org/dri-devel/20250418113211.69956-1-tvrtko.ursulin@igalia.com/
>
> That I can do, sure
Cool, with that, for this patch:
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@...lia.com>
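As an aside, for anyone wiring this up in a driver, the contract boils down
to something like the sketch below. This is only an illustration: foo_job,
to_foo_job() and the embedded hw_fence are made-up names, not taken from this
series; the essential parts are setting -ECANCELED on the hardware fence
before signalling it, and not freeing the job from the callback.

static void foo_cancel_job(struct drm_sched_job *sched_job)
{
	/* Hypothetical downcast to the driver's own job structure. */
	struct foo_job *job = to_foo_job(sched_job);

	/*
	 * Signal the hardware fence with -ECANCELED, unless it already
	 * signalled on its own. The job must not be freed here; the
	 * scheduler calls free_job() for that afterwards.
	 */
	if (!dma_fence_is_signaled(&job->hw_fence)) {
		dma_fence_set_error(&job->hw_fence, -ECANCELED);
		dma_fence_signal(&job->hw_fence);
	}
}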
>> And you probably want to take the unit test modifications from the same
>> patch too. You could put them in the same patch or in a separate one.
>
> Necessary adjustments for the unit tests are already implemented and
> are waiting for review separately, since this can be done independently
> of this entire series:
>
> https://lore.kernel.org/dri-devel/20250605134154.191764-2-phasta@kernel.org/
For me it would make the most sense to fold that into patch 2/6 of this
series; I don't see it making sense as a standalone change. So if you could
repost the series with it integrated, I will give it a spin and can review
at least that patch.
Regards,
Tvrtko
>
> Thx
> P.
>
>>
>> Regards,
>>
>> Tvrtko
>>
>>> Signed-off-by: Philipp Stanner <phasta@...nel.org>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 34 ++++++++++++++++----------
>>>  include/drm/gpu_scheduler.h            |  9 +++++++
>>>  2 files changed, 30 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index d20726d7adf0..3f14f1e151fa 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1352,6 +1352,18 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
>>> }
>>> EXPORT_SYMBOL(drm_sched_init);
>>>
>>> +static void drm_sched_kill_remaining_jobs(struct drm_gpu_scheduler *sched)
>>> +{
>>> +	struct drm_sched_job *job, *tmp;
>>> +
>>> +	/* All other accessors are stopped. No locking necessary. */
>>> +	list_for_each_entry_safe_reverse(job, tmp, &sched->pending_list, list) {
>>> +		sched->ops->cancel_job(job);
>>> +		list_del(&job->list);
>>> +		sched->ops->free_job(job);
>>> +	}
>>> +}
>>> +
>>> /**
>>> * drm_sched_fini - Destroy a gpu scheduler
>>> *
>>> @@ -1359,19 +1371,11 @@ EXPORT_SYMBOL(drm_sched_init);
>>> *
>>> * Tears down and cleans up the scheduler.
>>> *
>>> - * This stops submission of new jobs to the hardware through
>>> - * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
>>> - * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
>>> - * There is no solution for this currently. Thus, it is up to the driver to make
>>> - * sure that:
>>> - *
>>> - * a) drm_sched_fini() is only called after for all submitted jobs
>>> - *    drm_sched_backend_ops.free_job() has been called or that
>>> - * b) the jobs for which drm_sched_backend_ops.free_job() has not been called
>>> - *    after drm_sched_fini() ran are freed manually.
>>> - *
>>> - * FIXME: Take care of the above problem and prevent this function from leaking
>>> - * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
>>> + * This stops submission of new jobs to the hardware through &struct
>>> + * drm_sched_backend_ops.run_job. If &struct drm_sched_backend_ops.cancel_job
>>> + * is implemented, all jobs will be canceled through it and afterwards cleaned
>>> + * up through &struct drm_sched_backend_ops.free_job. If cancel_job is not
>>> + * implemented, memory could leak.
>>> */
>>> void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>> {
>>> @@ -1401,6 +1405,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>> 	/* Confirm no work left behind accessing device structures */
>>> cancel_delayed_work_sync(&sched->work_tdr);
>>>
>>> +	/* Avoid memory leaks if supported by the driver. */
>>> +	if (sched->ops->cancel_job)
>>> +		drm_sched_kill_remaining_jobs(sched);
>>> +
>>> 	if (sched->own_submit_wq)
>>> 		destroy_workqueue(sched->submit_wq);
>>> 	sched->ready = false;
>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>>> index e62a7214e052..81dcbfc8c223 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -512,6 +512,15 @@ struct drm_sched_backend_ops {
>>> 	 * and it's time to clean it up.
>>> 	 */
>>> 	void (*free_job)(struct drm_sched_job *sched_job);
>>> +
>>> +	/**
>>> +	 * @cancel_job: Used by the scheduler to guarantee remaining jobs' fences
>>> +	 * get signaled in drm_sched_fini().
>>> +	 *
>>> +	 * Drivers need to signal the passed job's hardware fence with
>>> +	 * -ECANCELED in this callback. They must not free the job.
>>> +	 */
>>> +	void (*cancel_job)(struct drm_sched_job *sched_job);
>>> };
>>>
>>> /**
>>
>
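P.S. For completeness, hooking the new callback up is then just one more
initializer in the driver's ops table (again only a sketch, reusing the same
made-up foo_* names from above):

static const struct drm_sched_backend_ops foo_sched_ops = {
	.run_job	= foo_run_job,
	.timedout_job	= foo_timedout_job,
	.free_job	= foo_free_job,
	.cancel_job	= foo_cancel_job,
};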