[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <88f892f9-e99a-4813-830f-b3d30496ba3c@igalia.com>
Date: Wed, 23 Apr 2025 08:34:08 +0100
From: Tvrtko Ursulin <tvrtko.ursulin@...lia.com>
To: Danilo Krummrich <dakr@...nel.org>, phasta@...nel.org
Cc: Lyude Paul <lyude@...hat.com>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, Matthew Brost <matthew.brost@...el.com>,
Christian König <ckoenig.leichtzumerken@...il.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
dri-devel@...ts.freedesktop.org, nouveau@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
On 22/04/2025 15:52, Danilo Krummrich wrote:
> On Tue, Apr 22, 2025 at 04:16:48PM +0200, Philipp Stanner wrote:
>> On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote:
>>> On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
>>
>>>> Sorry I don't see the argument for the claim it is relying on the
>>>> internals
>>>> with the re-positioned drm_sched_fini call. In that case it is
>>>> fully
>>>> compliant with:
>>>>
>>>> /**
>>>> * drm_sched_fini - Destroy a gpu scheduler
>>>> *
>>>> * @sched: scheduler instance
>>>> *
>>>> * Tears down and cleans up the scheduler.
>>>> *
>>>> * This stops submission of new jobs to the hardware through
>>>> * drm_sched_backend_ops.run_job(). Consequently,
>>>> drm_sched_backend_ops.free_job()
>>>> * will not be called for all jobs still in
>>>> drm_gpu_scheduler.pending_list.
>>>> * There is no solution for this currently. Thus, it is up to the
>>>> driver to
>>>> make
>>>> * sure that:
>>>> *
>>>> * a) drm_sched_fini() is only called after for all submitted jobs
>>>> * drm_sched_backend_ops.free_job() has been called or that
>>>> * b) the jobs for which drm_sched_backend_ops.free_job() has not
>>>> been
>>>> called
>>>> *
>>>> * FIXME: Take care of the above problem and prevent this function
>>>> from
>>>> leaking
>>>> * the jobs in drm_gpu_scheduler.pending_list under any
>>>> circumstances.
>>>>
>>>> ^^^ recommended solution b).
>>>
>>> This has been introduced recently with commit baf4afc58314
>>> ("drm/sched: Improve
>>> teardown documentation") and I do not agree with this. The scheduler
>>> should
>>> *not* make any promises about implementation details to enable
>>> drivers to abuse
>>> their knowledge about component internals.
>>>
>>> This makes the problem *worse* as it encourages drivers to rely on
>>> implementation details, making maintainability of the scheduler even
>>> worse.
>>>
>>> For instance, what if I change the scheduler implementation, such
>>> that for every
>>> entry in the pending_list the scheduler allocates another internal
>>> object for
>>> ${something}? Then drivers would already fall apart leaking those
>>> internal
>>> objects.
>>>
>>> Now, obviously that's pretty unlikely, but I assume you get the idea.
>>>
>>> The b) paragraph in drm_sched_fini() should be removed for the given
>>> reasons.
>>>
>>> AFAICS, since the introduction of this commit, driver implementations
>>> haven't
>>> changed in this regard, hence we should be good.
>>>
>>> So, for me this doesn't change the fact that every driver
>>> implementation that
>>> just stops the scheduler at an arbitrary point of time and tries to
>>> clean things
>>> up manually relying on knowledge about component internals is broken.
>>
>> To elaborate on that, this documentation has been written so that we at
>> least have *some* documentation about the problem, instead of just
>> letting new drivers run into the knife.
>>
>> The commit explicitly introduced the FIXME, marking those two hacky
>> workarounds as undesirable.
>>
>> But back then we couldn't fix the problem quickly, so it was either
>> document the issue at least a bit, or leave it completely undocumented.
>
> Agreed, but b) really sounds like an invitation (or even justification) for
> doing the wrong thing, let's removed it.
IMO it is better to leave it. Regardless of whether it was added because
some driver is actually operating like that, it does describe a
_currently_ workable option to avoid memory leaks. Once a better method
is there, ie. FIXME is addressed, then it can be removed or replaced.
Regards,
Tvrtko
Powered by blists - more mailing lists