linux-kernel - Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY

Message-ID: <31380dad-1206-5f3c-ab7d-1f448c6a7cb3@amd.com>
Date:   Tue, 2 May 2023 09:48:41 +0200
From:   Christian König <christian.koenig@....com>
To:     André Almeida <andrealmeid@...lia.com>,
        dri-devel@...ts.freedesktop.org, amd-gfx@...ts.freedesktop.org,
        linux-kernel@...r.kernel.org
Cc:     kernel-dev@...lia.com, alexander.deucher@....com,
        pierre-eric.pelloux-prayer@....com,
        'Marek Olšák' <maraeo@...il.com>,
        Samuel Pitoiset <samuel.pitoiset@...il.com>,
        Bas Nieuwenhuizen <bas@...nieuwenhuizen.nl>,
        Timur Kristóf <timur.kristof@...il.com>,
        michel.daenzer@...lbox.org
Subject: Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl

Well, first of all, don't expose the VMID to userspace.

The UMD doesn't know (and shouldn't know) which VMID is used for a 
submission since this is dynamically assigned and can change at any time.

For debugging there is an interface to use a reserved VMID for your 
debugged process, which allows associating logs, tracepoints and hw 
dumps with the work executed by this specific process.

Then we already have a feedback mechanism in the form of the error 
number in the fence. What we still need is an IOCTL to query that.
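
As a rough illustration of what such a query could report, here is a small user-space model of a fence that carries an error number. Everything named mock_* is hypothetical and is not the kernel's struct dma_fence or any amdgpu interface. The return convention (0 while pending, 1 when signaled without error, a negative errno when completed in error) is borrowed from the kernel's dma_fence_get_status(), and -ECANCELED is only an example value.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model of a fence; NOT the kernel's struct dma_fence. */
struct mock_fence {
	bool signaled;	/* has the submission completed, successfully or not? */
	int error;	/* 0 on success, negative errno if it completed in error */
};

/* Record an error. As with real fences, this must happen before signaling. */
static void mock_fence_set_error(struct mock_fence *f, int error)
{
	assert(!f->signaled);
	assert(error < 0);
	f->error = error;
}

/* Mark the fence as completed. */
static void mock_fence_signal(struct mock_fence *f)
{
	f->signaled = true;
}

/*
 * What a query could hand back to the UMD:
 * 0 = still pending, 1 = completed without error, < 0 = completed in error.
 */
static int mock_fence_get_status(const struct mock_fence *f)
{
	if (!f->signaled)
		return 0;
	return f->error ? f->error : 1;
}
```

With this model, a submission that was cancelled after a reset reports -ECANCELED once signaled, while an unaffected one reports 1, so no VMID or ring information has to cross the uAPI boundary.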

Regarding how far processing inside the IB was when the issue was 
detected, intermediate debug fences are much more reliable than asking 
the kernel for that.
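
The idea of intermediate debug fences can be sketched in plain C as follows. This is a simulation under stated assumptions, not real command-buffer code: the mock_* types, the three command kinds and the single breadcrumb word are all made up for illustration. The UMD interleaves marker writes with the real work, and after a hang the last marker value found in memory bounds how far execution got.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command kinds for a simulated IB. */
enum mock_cmd_type {
	MOCK_CMD_WORK,		/* some draw or dispatch */
	MOCK_CMD_MARKER,	/* debug fence: store a value to memory */
	MOCK_CMD_HANG		/* simulated point where the engine stops */
};

struct mock_cmd {
	enum mock_cmd_type type;
	uint32_t marker;	/* only meaningful for MOCK_CMD_MARKER */
};

/*
 * Simulated execution: walk the commands until the end or a hang.
 * Each marker command stores its value to *breadcrumb.
 * Returns the number of commands that completed.
 */
static size_t mock_execute(const struct mock_cmd *cmds, size_t n,
			   volatile uint32_t *breadcrumb)
{
	size_t i;

	assert(cmds != NULL || n == 0);
	assert(breadcrumb != NULL);

	for (i = 0; i < n; i++) {
		if (cmds[i].type == MOCK_CMD_HANG)
			break;
		if (cmds[i].type == MOCK_CMD_MARKER)
			*breadcrumb = cmds[i].marker;
	}
	return i;
}
```

If the simulated stream hangs between markers 2 and 3, the breadcrumb reads 2 afterwards, so the UMD knows the fault lies in the work it emitted between those two markers without asking the kernel.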

Regards,
Christian.

Am 01.05.23 um 20:57 schrieb André Almeida:
> Currently the UMD doesn't have much information on what went wrong during a GPU reset. To
> help with that, this patch proposes a new IOCTL that can be used to query
> information about the resources that caused the hang.
>
> The goal of this RFC is to gather feedback about this interface. The mesa part
> can be found at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22785
>
> The current implementation is racy, meaning that if two resets happen (even on
> different rings), the app will get the last reset information available, rather
> than the one it is looking for. Maybe this can be fixed with a ring_id
> parameter to query the information for a specific ring, but this also requires
> an interface to tell the UMD which ring caused it.
>
> I know that devcoredump is also used for this kind of information, but I believe
> that using an IOCTL is better for interfacing Mesa + Linux than parsing
> a file whose contents are subject to change.
>
> André Almeida (1):
>    drm/amdgpu: Add interface to dump guilty IB on GPU hang
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h      |  3 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c  |  3 ++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  3 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c  |  7 ++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  1 +
>   drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c   | 29 ++++++++++++++++++++++++
>   include/uapi/drm/amdgpu_drm.h            |  7 ++++++
>   7 files changed, 52 insertions(+), 1 deletion(-)
>