lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <cfef8bd7-f335-4796-9d4f-93197bb3fc2d@igalia.com>
Date: Mon, 24 Feb 2025 10:29:26 -0300
From: Maíra Canal <mcanal@...lia.com>
To: phasta@...nel.org, Matthew Brost <matthew.brost@...el.com>,
 Danilo Krummrich <dakr@...nel.org>,
 Christian König <ckoenig.leichtzumerken@...il.com>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
 Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 Tvrtko Ursulin <tvrtko.ursulin@...lia.com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 2/3] drm/sched: Adjust outdated docu for run_job()

Hi Philipp,

On 20/02/25 12:28, Philipp Stanner wrote:
> On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
>> Hi Philipp,
>>
>> On 20/02/25 08:28, Philipp Stanner wrote:
>>> The documentation for drm_sched_backend_ops.run_job() mentions a
>>> certain
>>> function called drm_sched_job_recovery(). This function does not
>>> exist.
>>> What's actually meant is drm_sched_resubmit_jobs(), which is by now
>>> also
>>> deprecated.
>>>
>>> Remove the mention of the removed function.
>>>
>>> Discourage the behavior of drm_sched_backend_ops.run_job() being
>>> called
>>> multiple times for the same job.
>>
>> It looks odd to me that this patch removes lines that were added in
>> patch 1/3. Maybe you could change the patchset order and place this
>> one
>> as the first.
>>
>>>
>>> Signed-off-by: Philipp Stanner <phasta@...nel.org>
>>> ---
>>>    include/drm/gpu_scheduler.h | 19 +++++++++++++------
>>>    1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/include/drm/gpu_scheduler.h
>>> b/include/drm/gpu_scheduler.h
>>> index 916279b5aa00..29e5bda91806 100644
>>> --- a/include/drm/gpu_scheduler.h
>>> +++ b/include/drm/gpu_scheduler.h
>>> @@ -421,20 +421,27 @@ struct drm_sched_backend_ops {
>>>    
>>>    	/**
>>>    	 * @run_job: Called to execute the job once all of the
>>> dependencies
>>> -	 * have been resolved. This may be called multiple times,
>>> if
>>> -	 * timedout_job() has happened and
>>> drm_sched_job_recovery() decides to
>>> -	 * try it again.
>>> +	 * have been resolved.
>>> +	 *
>>> +	 * The deprecated drm_sched_resubmit_jobs() (called from
>>> +	 * drm_sched_backend_ops.timedout_job()) can invoke this
>>> again with the
>>
>> I think it would be "@timedout_job".
> 
> Not sure, isn't referencing in docstrings done with '&'?

`timedout_job` is a member of the same struct, so I believe it should be
@. But, I'm no kernel-doc expert, it's just my understanding of [1]. If
we don't use @, it should be at least
"&drm_sched_backend_ops.timedout_job".

[1] https://docs.kernel.org/doc-guide/kernel-doc.html

> 
>>
>>> +	 * same parameters. Using this is discouraged because it,
>>> presumably,
>>> +	 * violates dma_fence rules.
>>
>> I believe it would be "struct dma_fence".
> 
> Well, in this case strictly speaking not IMO, because it's about the
> rules of the "DMA Fence Subsystem", not about the struct itself.
> 
> I'd just keep it that way or call it "dma fence"
> 
>>
>>> +	 *
>>> +	 * TODO: Document which fence rules above.
>>>    	 *
>>>    	 * @sched_job: the job to run
>>>    	 *
>>> -	 * Returns: dma_fence the driver must signal once the
>>> hardware has
>>> -	 *	completed the job ("hardware fence").
>>> -	 *
>>>    	 * Note that the scheduler expects to 'inherit' its own
>>> reference to
>>>    	 * this fence from the callback. It does not invoke an
>>> extra
>>>    	 * dma_fence_get() on it. Consequently, this callback must
>>> take a
>>>    	 * reference for the scheduler, and additional ones for
>>> the driver's
>>>    	 * respective needs.
>>
>> Would it be possible to add a comment that `run_job()` must check if
>> `s_fence->finished.error` is different than 0? If you increase the
>> karma
>> of a job and don't check for `s_fence->finished.error`, you might run
>> a
>> cancelled job.
> 
> s_fence->finished is only signaled and its error set once the hardware
> fence got signaled; or when the entity is killed.

If you have a timeout, increase the karma of that job with
`drm_sched_increase_karma()` and call `drm_sched_resubmit_jobs()`, the
latter will flag an error in the dma fence. If you don't check for it in
`run_job()`, you will run the guilty job again.

I'm still talking about `drm_sched_resubmit_jobs()`, because I'm
currently fixing an issue in V3D with the GPU reset and we still use
`drm_sched_resubmit_jobs()`. I read the documentation of `run_job()` and
`timeout_job()` and the information I commented here (which was crucial
to fix the bug) wasn't available there.

`drm_sched_resubmit_jobs()` was deprecated in 2022, but Xe introduced a
new use in 2023, for example. The commit that deprecated it just
mentions AMD's case, but do we know if the function works as expected
for the other users? For V3D, it does. Also, we need to make it clear 
which are the dma fence requirements that the functions violates.

If we shouldn't use `drm_sched_resubmit_jobs()`, would it be possible to
provide a common interface for job resubmission?

Best Regards,
- Maíra

> 
> In any case, signaling "finished" will cause the job to be prevented
> from being executed (again), and will never reach run_job() in the
> first place.
> 
> Correct me if I am mistaken.
> 
> Or are you suggesting that there is a race?
> 
> 
> P.
> 
>>
>>> +	 *
>>> +	 * Return:
>>> +	 * * On success: dma_fence the driver must signal once the
>>> hardware has
>>> +	 * completed the job ("hardware fence").
>>
>> A suggestion: "the fence that the driver must signal once the
>> hardware
>> has completed the job".
>>
>> Best Regards,
>> - Maíra
>>
>>> +	 * * On failure: NULL or an ERR_PTR.
>>>    	 */
>>>    	struct dma_fence *(*run_job)(struct drm_sched_job
>>> *sched_job);
>>>    
>>
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ