lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e613a225-dabf-7e87-2624-a3244df8a877@arm.com>
Date:   Thu, 29 Sep 2022 15:53:28 +0100
From:   Steven Price <steven.price@....com>
To:     Arvind Yadav <Arvind.Yadav@....com>, Christian.Koenig@....com,
        andrey.grodzovsky@....com, shashank.sharma@....com,
        amaranath.somalapuram@....com, Arunpravin.PaneerSelvam@....com,
        sumit.semwal@...aro.org, gustavo@...ovan.org, airlied@...ux.ie,
        daniel@...ll.ch, linux-media@...r.kernel.org,
        dri-devel@...ts.freedesktop.org, linaro-mm-sig@...ts.linaro.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4 5/6] drm/sched: Use parent fence instead of finished

On 14/09/2022 17:43, Arvind Yadav wrote:
> Using the parent fence instead of the finished fence
> to get the job status. This change is to avoid GPU
> scheduler timeout error which can cause GPU reset.

I'm able to reproduce crashes on Panfrost and I believe this commit is
the cause. Specifically it's possible for job->s_fence->parent to be NULL.

The underlying issue seems to involve drm_sched_resubmit_jobs_ext() - if
the run_jobs() callback returns an error it will set s_fence->parent to
NULL after signalling s_fence->finished:

> 		fence = sched->ops->run_job(s_job);
> 		i++;
> 
> 		if (IS_ERR_OR_NULL(fence)) {
> 			if (IS_ERR(fence))
> 				dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
> 
> 			s_job->s_fence->parent = NULL;

I don't understand the reasoning behind this change, but it doesn't seem
right to be using the parent fence when we have code which can be
setting that pointer to NULL.

Since I don't understand the reasoning my only suggestion is to revert
this patch (and potentially the dependent patch "dma-buf: Check status
of enable-signaling bit on debug"?).

Can anyone suggest a better fix?

Thanks,

Steve

> Signed-off-by: Arvind Yadav <Arvind.Yadav@....com>
> Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@....com>
> ---
> 
> changes in v1,v2 - Enable signaling for finished fence in sche_main()
> is removed
> 
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index e0ab14e0fb6b..2ac28ad11432 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -829,7 +829,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>  	job = list_first_entry_or_null(&sched->pending_list,
>  				       struct drm_sched_job, list);
>  
> -	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
> +	if (job && dma_fence_is_signaled(job->s_fence->parent)) {
>  		/* remove job from pending_list */
>  		list_del_init(&job->list);
>  
> @@ -841,7 +841,7 @@ drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>  
>  		if (next) {
>  			next->s_fence->scheduled.timestamp =
> -				job->s_fence->finished.timestamp;
> +				job->s_fence->parent->timestamp;
>  			/* start TO timer for next job */
>  			drm_sched_start_timeout(sched);
>  		}

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ