Message-ID: <CADnq5_NXj4W44F_etRQ7HWdVTnf5zARCM3Y_o3EiwWiHj8QMpA@mail.gmail.com>
Date: Mon, 1 May 2023 15:24:46 -0400
From: Alex Deucher <alexdeucher@...il.com>
To: André Almeida <andrealmeid@...lia.com>
Cc: dri-devel@...ts.freedesktop.org, amd-gfx@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, pierre-eric.pelloux-prayer@....com,
Marek Olšák <maraeo@...il.com>,
Timur Kristóf <timur.kristof@...il.com>,
michel.daenzer@...lbox.org,
Samuel Pitoiset <samuel.pitoiset@...il.com>,
kernel-dev@...lia.com, Bas Nieuwenhuizen <bas@...nieuwenhuizen.nl>,
alexander.deucher@....com, christian.koenig@....com
Subject: Re: [RFC PATCH 0/1] Add AMDGPU_INFO_GUILTY_APP ioctl
On Mon, May 1, 2023 at 2:58 PM André Almeida <andrealmeid@...lia.com> wrote:
>
> Currently the UMD doesn't have much information about what went wrong during a
> GPU reset. To help with that, this patch proposes a new IOCTL that can be used
> to query information about the resources that caused the hang.
If we went with the IOCTL, we'd want to limit this to the guilty process.
>
> The goal of this RFC is to gather feedback about this interface. The mesa part
> can be found at https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/22785
>
> The current implementation is racy, meaning that if two resets happen (even on
> different rings), the app will get the last reset information available, rather
> than the one it is looking for. Maybe this can be fixed with a ring_id
> parameter to query the information for a specific ring, but this also requires
> an interface to tell the UMD which ring caused it.
I think you'd want engine type or something like that so mesa knows
how to interpret the IB info. You could store the most recent info in
the fd priv for the guilty app. E.g., see what I did for tracking GPU
page fault info:
https://gitlab.freedesktop.org/agd5f/linux/-/commits/gpu_fault_info_ioctl
>
> I know that devcoredump is also used for this kind of information, but I believe
> that using an IOCTL is better for interfacing Mesa with Linux than parsing
> a file whose contents are subject to change.
Can you elaborate a bit on that? Isn't the whole point of devcoredump
to store this sort of information?
Alex
>
> André Almeida (1):
> drm/amdgpu: Add interface to dump guilty IB on GPU hang
>
> drivers/gpu/drm/amd/amdgpu/amdgpu.h | 3 +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 3 +++
> drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 7 ++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h | 1 +
> drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 29 ++++++++++++++++++++++++
> include/uapi/drm/amdgpu_drm.h | 7 ++++++
> 7 files changed, 52 insertions(+), 1 deletion(-)
>
> --
> 2.40.1
>