lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6b9b27a7-2ccd-4f16-aa36-05877b5e8f7c@igalia.com>
Date: Mon, 10 Mar 2025 12:27:42 +0000
From: Tvrtko Ursulin <tvrtko.ursulin@...lia.com>
To: phasta@...nel.org, Christian König
 <ckoenig.leichtzumerken@...il.com>, matthew.brost@...el.com,
 dakr@...nel.org, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org
Cc: Christian König <christian.koenig@....com>
Subject: Re: [PATCH] drm/sched: revert "drm_sched_job_cleanup(): correct false
 doc"


On 10/03/2025 12:11, Philipp Stanner wrote:
> On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
>> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
> 
> OK, your arguments with fence ordering are strong. Please update the
> commit message according to our discussion:

Could that argument please be explained in more concrete terms?

Are we talking here about skipping one seqno has potential to cause a 
problem, or there is more to it?

Because if it is just skipping I don't immediately see that breaks the 
monotonic/unique seqno ordering.

Only if we are worried about some code somewhere making assumptions  "if 
N got completed, that means N-1 got completed too". That generally isn't 
anything new and can happen with GPU resets, albeit in the latter case 
the fence error is I think always set.

Regards,

Tvrtko

>> Sorry for the delayed response, I only stumbled over this now while
>> going
>> over old mails and then re-thinking my reviewed by for this change.
> 
> Your RB hadn't even been applied (I merged before you gave it), so you
> can remove this first paragraph from the commit message
> 
>>
>> The function drm_sched_job_arm() is indeed the point of no return.
>> The
>> background is that it is nearly impossible for the driver to
>> correctly
>> retract the fence and signal it in the order enforced by the
>> dma_fence
>> framework.
>>
>> The code in drm_sched_job_cleanup() is for the purpose to cleanup
>> after
>> the job was armed through drm_sched_job_arm() *and* processed by the
>> scheduler.
>>
>> The correct approach for error handling in this situation is to set
>> the
>> error on the fences and then push to the entity anyway. We can
>> certainly
>> improve the documentation, but removing the warning is clearly not a
>> good
>> idea.
> 
> This last paragraph, as per our discussion, seems invalid. We shouldn't
> have that in the commit log, so that it won't give later hackers
> browsing it wrong ideas and we end up with someone actually mengling
> with those fences.
> 
> Thx
> P.
> 
>>
>> Signed-off-by: Christian König <christian.koenig@....com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>>   1 file changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 53e6aec37b46..4d4219fbe49d 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>    * Cleans up the resources allocated with drm_sched_job_init().
>>    *
>>    * Drivers should call this from their error unwind code if @job is
>> aborted
>> - * before it was submitted to an entity with
>> drm_sched_entity_push_job().
>> + * before drm_sched_job_arm() is called.
>>    *
>> - * Since calling drm_sched_job_arm() causes the job's fences to be
>> initialized,
>> - * it is up to the driver to ensure that fences that were exposed to
>> external
>> - * parties get signaled. drm_sched_job_cleanup() does not ensure
>> this.
>> - *
>> - * This function must also be called in &struct
>> drm_sched_backend_ops.free_job
>> + * After that point of no return @job is committed to be executed by
>> the
>> + * scheduler, and this function should be called from the
>> + * &drm_sched_backend_ops.free_job callback.
>>    */
>>   void drm_sched_job_cleanup(struct drm_sched_job *job)
>>   {
>> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job
>> *job)
>>   		/* drm_sched_job_arm() has been called */
>>   		dma_fence_put(&job->s_fence->finished);
>>   	} else {
>> -		/* aborted job before arming */
>> +		/* aborted job before committing to run it */
>>   		drm_sched_fence_free(job->s_fence);
>>   	}
>>   
> 


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ