Message-ID: <59774c28-a0ef-d4f2-e920-503857bce1cf@igalia.com>
Date: Wed, 3 May 2023 16:14:11 -0300
From: André Almeida <andrealmeid@...lia.com>
To: Timur Kristóf <timur.kristof@...il.com>,
Felix Kuehling <felix.kuehling@....com>
Cc: Alex Deucher <alexdeucher@...il.com>,
"Pelloux-Prayer, Pierre-Eric" <pierre-eric.pelloux-prayer@....com>,
Marek Olšák <maraeo@...il.com>,
michel.daenzer@...lbox.org,
dri-devel <dri-devel@...ts.freedesktop.org>,
Christian König <ckoenig.leichtzumerken@...il.com>,
linux-kernel@...r.kernel.org,
Samuel Pitoiset <samuel.pitoiset@...il.com>,
amd-gfx list <amd-gfx@...ts.freedesktop.org>,
kernel-dev@...lia.com,
"Deucher, Alexander" <alexander.deucher@....com>,
Christian König <christian.koenig@....com>
Subject: Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
On 03/05/2023 14:43, Timur Kristóf wrote:
> Hi Felix,
>
> On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
>> That's the worst-case scenario where you're debugging HW or FW issues.
>> Those should be pretty rare post-bringup. But are there hangs caused by
>> user mode driver or application bugs that are easier to debug and
>> probably don't even require a GPU reset?
>
> There are many GPU hangs that gamers experience while playing. We have
> dozens of open bug reports against RADV about GPU hangs on various GPU
> generations. These usually fall into two categories:
>
> 1. When the hang always happens at the same point in a game. These are
> painful to debug but manageable.
> 2. "Random" hangs that happen to users over the course of playing a
> game for several hours. It is absolute hell to try to even reproduce
> let alone diagnose these issues, and this is what we would like to
> improve.
>
> For these hard-to-diagnose problems, it is already a challenge to
> determine whether the problem is in the kernel (e.g. setting wrong
> voltages / frequencies) or in userspace (e.g. missing some
> synchronization), or even a game bug that we need to work around.
>
>> For example most VM faults can
>> be handled without hanging the GPU. Similarly, a shader in an endless
>> loop should not require a full GPU reset.
>
> This is actually not the case. AFAIK André's test case was an app that
> had an infinite loop in a shader.
>
This is the test app if anyone wants to try it out:
https://github.com/andrealmeid/vulkan-triangle-v1. Just compile and run.

The kernel calls amdgpu_ring_soft_recovery() when I run my example, but
I'm not sure what a soft recovery means here, or whether it's a full GPU
reset or not.

But if we can at least trust the CP registers to dump information for
soft resets, it would be some improvement over the current state, I think.
>>
>> It's more complicated for graphics because of the more complex pipeline
>> and the lack of CWSR. But it should still be possible to do some
>> debugging without JTAG if the problem is in SW and not HW or FW. It's
>> probably worth improving that debuggability without getting hung up on
>> the worst case.
>
> I agree, and we welcome any constructive suggestion to improve the
> situation. It seems like our idea doesn't work if the kernel can't give
> us the information we need.
>
> How do we move forward?
>
> Best regards,
> Timur
>