[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <H8O1JR.4SM4JR87O02A2@att.net>
Date: Fri, 30 Sep 2022 18:38:41 -0400
From: Steven J Abner <pheonix.sja@....net>
To: Zhang Boyang <zhangboyang.id@...il.com>
Cc: Andrey Grodzovsky <andrey.grodzovsky@....com>,
Christian König <christian.koenig@....com>,
linux-kernel@...r.kernel.org, amd-gfx@...ts.freedesktop.org,
"David C . Rankin" <drankinatty@...denlinkmail.com>
Subject: Re: [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in
amdgpu_device_fini_sw()
I had done more delving into this also, thankfully was not forgotten.
Additional info to solve blackout, was going to contact AMD:
The problem as far as I could trace occurs in amdgpu_psp.c in function
psp_cmd_submit_buf().
Normal counter 'timeout' seems to use a max of about 400 with
usleep_range(5, 80).
Under normal operation 'psp->fence_buf' will equal 'index' and drop
from loop. I could not track what
fills the kernel buffer objects 'virtual' pointer's reference
(psp->fence_buf). I assume that
it's to a firmware file that on read is returning an error, and getting
stuck in a loop lock.
The error condition I found occurs when 'psp->fence_buf' does not equal
'index',
breaking from loop with 'timeout' == 0.
Blackout seemed to be about 80% of reboots, but found that in
reconfiguration of kernel,
on 5.18.19, with CONFIG_ATA_ACPI=y drops blackouts to about 30% of
reboots. I have now avoided
all blackouts with addition of CONFIG_SATA_PMP=y (at least a week
free?).
This is pure guess but, maybe the ARM PSP processor is dependent on
libata procedures, one
of which causes a lock up some of the time.
Steve
On Fri, Sep 30, 2022 at 21:41, Zhang Boyang <zhangboyang.id@...il.com>
wrote:
> After amdgpu_device_init() failed, adev->reset_domain may be NULL.
> Thus
> subsequent call to amdgpu_device_fini_sw() may result in
> null-ptr-deref.
>
> This patch fixes the problem by adding a NULL pointer check around the
> code of releasing adev->reset_domain in amdgpu_device_fini_sw().
>
> Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be
> refcounted.")
>
> Signed-off-by: Zhang Boyang <zhangboyang.id@...il.com>
> Link:
> https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/
> Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index be7aff2d4a57..204daad06b32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct
> amdgpu_device *adev)
> if (adev->mman.discovery_bin)
> amdgpu_discovery_fini(adev);
>
> - amdgpu_reset_put_reset_domain(adev->reset_domain);
> - adev->reset_domain = NULL;
> + if (adev->reset_domain) {
> + amdgpu_reset_put_reset_domain(adev->reset_domain);
> + adev->reset_domain = NULL;
> + }
>
> kfree(adev->pci_state);
>
> --
> 2.30.2
>
Powered by blists - more mailing lists