Message-ID: <5beff5e2-8589-28cf-40b4-9e6b9e45b307@amd.com>
Date: Thu, 29 Sep 2022 16:57:55 +0200
From: Christian König <christian.koenig@....com>
To: Steven Price <steven.price@....com>,
Arvind Yadav <Arvind.Yadav@....com>, andrey.grodzovsky@....com,
shashank.sharma@....com, amaranath.somalapuram@....com,
Arunpravin.PaneerSelvam@....com, sumit.semwal@...aro.org,
gustavo@...ovan.org, airlied@...ux.ie, daniel@...ll.ch,
linux-media@...r.kernel.org, dri-devel@...ts.freedesktop.org,
linaro-mm-sig@...ts.linaro.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 5/6] drm/sched: Use parent fence instead of finished
On 29.09.22 16:53, Steven Price wrote:
> On 14/09/2022 17:43, Arvind Yadav wrote:
>> Use the parent fence instead of the finished fence
>> to get the job status. This change avoids a GPU
>> scheduler timeout error which can cause a GPU reset.
> I'm able to reproduce crashes on Panfrost and I believe this commit is
> the cause. Specifically it's possible for job->s_fence->parent to be NULL.
>
> The underlying issue seems to involve drm_sched_resubmit_jobs_ext() - if
> the run_jobs() callback returns an error it will set s_fence->parent to
> NULL after signalling s_fence->finished:
>
>> fence = sched->ops->run_job(s_job);
>> i++;
>>
>> if (IS_ERR_OR_NULL(fence)) {
>> if (IS_ERR(fence))
>> dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
>>
>> s_job->s_fence->parent = NULL;
> I don't understand the reasoning behind this change, but it doesn't seem
> right to be using the parent fence when we have code which can be
> setting that pointer to NULL.
>
> Since I don't understand the reasoning my only suggestion is to revert
> this patch (and potentially the dependent patch "dma-buf: Check status
> of enable-signaling bit on debug"?).
>
> Can anyone suggest a better fix?
Well, first of all please absolutely don't use
drm_sched_resubmit_jobs_ext()!
It was an extremely bad idea in amdgpu to approach GPU reset by
re-submitting jobs, and it was an even worse idea to push this into the
scheduler.
The design of dma_fence is that you submit work once and *only* once and
then get a result for that submission. If re-submission is desirable, it
should be done in userspace or at least at higher levels.
Apart from that: yes, a NULL check is missing here, but that should be
trivial to fix.
Thanks,
Christian.
>
> Thanks,
>
> Steve
>
>> Signed-off-by: Arvind Yadav <Arvind.Yadav@....com>
>> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@....com>
>> ---
>>
>> changes in v1,v2 - Enable signaling for finished fence in sched_main()
>> is removed
>>
>> ---
>> drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index e0ab14e0fb6b..2ac28ad11432 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>> job = list_first_entry_or_null(&sched->pending_list,
>> struct drm_sched_job, list);
>>
>> - if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>> + if (job && dma_fence_is_signaled(job->s_fence->parent)) {
>> /* remove job from pending_list */
>> list_del_init(&job->list);
>>
>> @@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>>
>> if (next) {
>> next->s_fence->scheduled.timestamp =
>> - job->s_fence->finished.timestamp;
>> + job->s_fence->parent->timestamp;
>> /* start TO timer for next job */
>> drm_sched_start_timeout(sched);
>> }