[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <37a58172-c2fa-4942-bafa-d96be85cd26f@igalia.com>
Date: Mon, 10 Mar 2025 09:30:30 +0000
From: Tvrtko Ursulin <tvrtko.ursulin@...lia.com>
To: phasta@...nel.org, Christian König
<ckoenig.leichtzumerken@...il.com>, matthew.brost@...el.com,
dakr@...nel.org, dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org
Cc: Christian König <christian.koenig@....com>
Subject: Re: [PATCH] drm/sched: revert "drm_sched_job_cleanup(): correct false
doc"
On 10/03/2025 08:41, Philipp Stanner wrote:
> On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
>> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
>>
>> Sorry for the delayed response, I only stumbled over this now while
>> going
>> over old mails and then re-thinking my reviewed by for this change.
>>
>> The function drm_sched_job_arm() is indeed the point of no return.
>
> So would you say that the comment in the function's body,
> "drm_sched_job_arm() has been called" actually means / should mean "job
> had been submitted with drm_sched_entity_push_job()"?
>
>> The
>> background is that it is nearly impossible for the driver to
>> correctly
>> retract the fence and signal it in the order enforced by the
>> dma_fence
>> framework.
>>
>> The code in drm_sched_job_cleanup() is for the purpose to cleanup
>> after
>> the job was armed through drm_sched_job_arm() *and* processed by the
>> scheduler.
>>
>> The correct approach for error handling in this situation is to set
>> the
>> error on the fences and then push to the entity anyway.
>
> You expect the driver to set an error on scheduled and finished fence?
> I think this would be very likely to explode. AFAICS the scheduler code
> has no awareness for anyone else having touched those fences.
>
> If at all, if this is really a problem, we should tell the driver that
> it must enforce that there will be no failure between
> drm_sched_job_arm() and drm_sched_job_entity_push_job(). That's how
> Nouveau does it.
On top of Philipp's questions - I just want to raise that
amdgpu_cs_submit then also needs explaining. I always take amdgpu as
kind of almost reference, since that is where scheduler originated from.
And in there there is definitely the path of drm_sched_job_cleanup()
called after drm_sched_job_arm() when adding gang dependencies fails.
Regards,
Tvrtko
>
> Let's set our understanding straight before reverting. What
> drm_sched_job_arm() does is:
>
> 1. Pick scheduler, rq and priority for the job
> 2. Atomically increment the job_id_count and assign to job
> 3. Call drm_sched_fence_init() and therefore:
> 4. Set the fence's scheduler
> 5. Set the fences seq nr atomically
> 6. Initialize scheduled and finished fence
>
> What of the above precisely is the problem?
>
> You say the driver cannot "correctly retract the fence and signal it in
> the order enforced by the dma_fence framework". You mean that the
> finished_fences have to be signalled in an order only the scheduler
> knows? I assume you're referring to set dependencies?
>
> P.
>
>> We can certainly
>> improve the documentation, but removing the warning is clearly not a
>> good
>> idea.
>>
>> Signed-off-by: Christian König <christian.koenig@....com>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>> 1 file changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 53e6aec37b46..4d4219fbe49d 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>> * Cleans up the resources allocated with drm_sched_job_init().
>> *
>> * Drivers should call this from their error unwind code if @job is
>> aborted
>> - * before it was submitted to an entity with
>> drm_sched_entity_push_job().
>> + * before drm_sched_job_arm() is called.
>> *
>> - * Since calling drm_sched_job_arm() causes the job's fences to be
>> initialized,
>> - * it is up to the driver to ensure that fences that were exposed to
>> external
>> - * parties get signaled. drm_sched_job_cleanup() does not ensure
>> this.
>> - *
>> - * This function must also be called in &struct
>> drm_sched_backend_ops.free_job
>> + * After that point of no return @job is committed to be executed by
>> the
>> + * scheduler, and this function should be called from the
>> + * &drm_sched_backend_ops.free_job callback.
>> */
>> void drm_sched_job_cleanup(struct drm_sched_job *job)
>> {
>> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job
>> *job)
>> /* drm_sched_job_arm() has been called */
>> dma_fence_put(&job->s_fence->finished);
>> } else {
>> - /* aborted job before arming */
>> + /* aborted job before committing to run it */
>> drm_sched_fence_free(job->s_fence);
>> }
>>
>
Powered by blists - more mailing lists