linux-kernel - Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <59774c28-a0ef-d4f2-e920-503857bce1cf@igalia.com>
Date:   Wed, 3 May 2023 16:14:11 -0300
From:   André Almeida <andrealmeid@...lia.com>
To:     Timur Kristóf <timur.kristof@...il.com>,
        Felix Kuehling <felix.kuehling@....com>
Cc:     Alex Deucher <alexdeucher@...il.com>,
        "Pelloux-Prayer, Pierre-Eric" <pierre-eric.pelloux-prayer@....com>,
        Marek Olšák <maraeo@...il.com>,
        michel.daenzer@...lbox.org,
        dri-devel <dri-devel@...ts.freedesktop.org>,
        Christian König <ckoenig.leichtzumerken@...il.com>,
        linux-kernel@...r.kernel.org,
        Samuel Pitoiset <samuel.pitoiset@...il.com>,
        amd-gfx list <amd-gfx@...ts.freedesktop.org>,
        kernel-dev@...lia.com,
        "Deucher, Alexander" <alexander.deucher@....com>,
        Christian König <christian.koenig@....com>
Subject: Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

Em 03/05/2023 14:43, Timur Kristóf escreveu:
> Hi Felix,
> 
> On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
>> That's the worst-case scenario where you're debugging HW or FW
>> issues.
>> Those should be pretty rare post-bringup. But are there hangs caused
>> by
>> user mode driver or application bugs that are easier to debug and
>> probably don't even require a GPU reset?
> 
> There are many GPU hangs that gamers experience while playing. We have
> dozens of open bug reports against RADV about GPU hangs on various GPU
> generations. These usually fall into two categories:
> 
> 1. When the hang always happens at the same point in a game. These are
> painful to debug but manageable.
> 2. "Random" hangs that happen to users over the course of playing a
> game for several hours. It is absolute hell to try to even reproduce
> let alone diagnose these issues, and this is what we would like to
> improve.
> 
> For these hard-to-diagnose problems, it is already a challenge to
> determine whether the problem is the kernel (eg. setting wrong voltages
> / frequencies) or userspace (eg. missing some synchronization), can be
> even a game bug that we need to work around.
> 
>> For example most VM faults can
>> be handled without hanging the GPU. Similarly, a shader in an endless
>> loop should not require a full GPU reset.
> 
> This is actually not the case, AFAIK André's test case was an app that
> had an infinite loop in a shader.
> 

This is the test app if anyone want to try out: 
https://github.com/andrealmeid/vulkan-triangle-v1. Just compile and run.

The kernel calls amdgpu_ring_soft_recovery() when I run my example, but 
I'm not sure what a soft recovery means here and if it's a full GPU 
reset or not.

But if we can at least trust the CP registers to dump information for 
soft resets, it would be some improvement from the current state I think

>>
>> It's more complicated for graphics because of the more complex
>> pipeline
>> and the lack of CWSR. But it should still be possible to do some
>> debugging without JTAG if the problem is in SW and not HW or FW. It's
>> probably worth improving that debugability without getting hung-up on
>> the worst case.
> 
> I agree, and we welcome any constructive suggestion to improve the
> situation. It seems like our idea doesn't work if the kernel can't give
> us the information we need.
> 
> How do we move forward?
> 
> Best regards,
> Timur
>