linux-kernel - Re: [PATCH next,v2] kernel: Add 1 ms delay to init handler to fix s3 resume hang


Message-ID: <ca4bd694-4685-a76c-25ae-65627c36d142@amd.com>
Date:   Tue, 29 Mar 2022 08:20:24 +0200
From:   Christian König <christian.koenig@....com>
To:     Zhenneng Li <lizhenneng@...inos.cn>,
        Alex Deucher <alexander.deucher@....com>
Cc:     Pan Xinhui <Xinhui.Pan@....com>, David Airlie <airlied@...ux.ie>,
        Daniel Vetter <daniel@...ll.ch>,
        Sumit Semwal <sumit.semwal@...aro.org>,
        Andrey Grodzovsky <andrey.grodzovsky@....com>,
        Evan Quan <evan.quan@....com>,
        Guchun Chen <guchun.chen@....com>,
        Jack Zhang <Jack.Zhang1@....com>,
        Lijo Lazar <lijo.lazar@....com>,
        Kevin Wang <kevin1.wang@....com>,
        amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
        linux-kernel@...r.kernel.org, linux-media@...r.kernel.org,
        linaro-mm-sig@...ts.linaro.org
Subject: Re: [PATCH next,v2] kernel: Add 1 ms delay to init handler to fix s3
 resume hang

Am 29.03.22 um 05:05 schrieb Zhenneng Li:
> This is a workaround for an S3 resume hang on the R7 340 (amdgpu).
> When we test S3 with an R7 340 on an arm64 platform, the graphics card hangs.
> The error messages are as follows:
> Mar  4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [    1.599374][ 7] [  T291] amdgpu 0000:02:00.0: fb0: amdgpudrmfb frame buffer device
> Mar  4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [    1.612869][ 7] [  T291] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR* late_init of IP block <si_dpm> failed -22
> Mar  4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [    1.623392][ 7] [  T291] amdgpu 0000:02:00.0: amdgpu_device_ip_late_init failed
> Mar  4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [    1.630696][ 7] [  T291] amdgpu 0000:02:00.0: Fatal error during GPU init
> Mar  4 01:14:11 greatwall-GW-XXXXXX-XXX kernel: [    1.637477][ 7] [  T291] [drm] amdgpu: finishing device.
>
> On the following hardware:
> lspci -nn -s 05:00.0
> 05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Oland [Radeon HD 8570 / R7 240/340 / Radeon 520 OEM] [1002:6611] (rev 87)

Well, that's rather funny and certainly a NAK. To recap, you are adding a
delay to a delayed work handler. In other words, you could simply schedule
the work handler later in the first place :)
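
To make the point above concrete, here is a small userspace model (not
kernel code) of a delayed work item. It is only a sketch: the struct, the
function names and the 2000 ms queue delay are assumptions made up for
illustration, not taken from the driver. It shows that a delay at the top
of the handler moves the start of the real work by exactly as much as
queueing the work with a longer delay would.

```c
/*
 * Userspace model only, not kernel code. Every name below is hypothetical
 * and exists purely for illustration.
 */

/* One delayed work item, with all times in milliseconds. */
struct delayed_work_model {
	unsigned long queued_at_ms;     /* when the work was queued */
	unsigned long queue_delay_ms;   /* delay requested at queue time */
	unsigned long handler_delay_ms; /* delay at the top of the handler */
};

/* Time at which the handler's real work (here: the IB ring tests) starts. */
static unsigned long body_start_ms(const struct delayed_work_model *w)
{
	return w->queued_at_ms + w->queue_delay_ms + w->handler_delay_ms;
}

/* Returns 1 if both variants start their real work at the same time. */
static int same_start(const struct delayed_work_model *a,
		      const struct delayed_work_model *b)
{
	return body_start_ms(a) == body_start_ms(b);
}
```

With { 0, 2000, 1 } (assumed queue delay plus the 1 ms from the patch) and
{ 0, 2001, 0 } (longer queue delay, no delay in the handler), same_start()
returns 1. The only difference the model leaves out is that mdelay()
busy-waits inside the worker, while a longer queue delay does not.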

But that is not the reason for the NAK. The more obvious
problem is that we seem to have a race between the DPM code kicking in
to save power after driver load and the asynchronous test of whether
userspace command submission works.

Adding the delay here works around that for the IB submission, but there 
can be other things going on in parallel which can fail as well.

Please open a bug report instead.

Regards,
Christian.

>
> Signed-off-by: Zhenneng Li <lizhenneng@...inos.cn>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 3987ecb24ef4..1eced991b5b2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2903,6 +2903,8 @@ static void amdgpu_device_delayed_init_work_handler(struct work_struct *work)
>   		container_of(work, struct amdgpu_device, delayed_init_work.work);
>   	int r;
>   
> +	mdelay(1);
> +
>   	r = amdgpu_ib_ring_tests(adev);
>   	if (r)
>   		DRM_ERROR("ib ring test failed (%d).\n", r);