
Message-ID: <967a044bc2723cc24ab914506c0164db08923c59.camel@gmail.com>
Date:   Wed, 03 May 2023 19:43:12 +0200
From:   Timur Kristóf <timur.kristof@...il.com>
To:     Felix Kuehling <felix.kuehling@....com>,
        Christian König 
        <ckoenig.leichtzumerken@...il.com>,
        Alex Deucher <alexdeucher@...il.com>
Cc:     "Pelloux-Prayer, Pierre-Eric" <pierre-eric.pelloux-prayer@....com>,
        André Almeida <andrealmeid@...lia.com>,
        Marek Olšák <maraeo@...il.com>,
        michel.daenzer@...lbox.org,
        dri-devel <dri-devel@...ts.freedesktop.org>,
        linux-kernel@...r.kernel.org,
        Samuel Pitoiset <samuel.pitoiset@...il.com>,
        amd-gfx list <amd-gfx@...ts.freedesktop.org>,
        kernel-dev@...lia.com,
        "Deucher, Alexander" <alexander.deucher@....com>,
        Christian König <christian.koenig@....com>
Subject: Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

Hi Felix,

On Wed, 2023-05-03 at 11:08 -0400, Felix Kuehling wrote:
> That's the worst-case scenario where you're debugging HW or FW issues.
> Those should be pretty rare post-bringup. But are there hangs caused by
> user mode driver or application bugs that are easier to debug and
> probably don't even require a GPU reset?

There are many GPU hangs that gamers experience while playing. We have
dozens of open bug reports against RADV about GPU hangs on various GPU
generations. These usually fall into two categories:

1. Hangs that always happen at the same point in a game. These are
painful to debug but manageable.
2. "Random" hangs that happen to users over the course of playing a
game for several hours. It is absolute hell to even reproduce these
issues, let alone diagnose them, and this is what we would like to
improve.

For these hard-to-diagnose problems, it is already a challenge to
determine whether the problem is in the kernel (e.g. setting wrong
voltages / frequencies) or in userspace (e.g. missing some
synchronization). It can even be a game bug that we need to work around.

> For example most VM faults can be handled without hanging the GPU.
> Similarly, a shader in an endless loop should not require a full GPU
> reset.

This is actually not the case. AFAIK André's test case was an app that
had an infinite loop in a shader.

> 
> It's more complicated for graphics because of the more complex pipeline
> and the lack of CWSR. But it should still be possible to do some
> debugging without JTAG if the problem is in SW and not HW or FW. It's
> probably worth improving that debugability without getting hung-up on
> the worst case.

I agree, and we welcome any constructive suggestion to improve the
situation. It seems like our idea doesn't work if the kernel can't give
us the information we need.

How do we move forward?

Best regards,
Timur