linux-kernel - Re: [PATCH v2] drm/amdgpu: add ring timeout information in devcoredump

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <85027e94-9741-4bf8-ba5c-2937fc27d2d3@amd.com>
Date: Tue, 5 Mar 2024 14:20:35 +0100
From: Christian König <christian.koenig@....com>
To: "Khatri, Sunil" <sukhatri@....com>, Sunil Khatri <sunil.khatri@....com>,
 Alex Deucher <alexander.deucher@....com>,
 Shashank Sharma <shashank.sharma@....com>
Cc: amd-gfx@...ts.freedesktop.org, Pan@...-sunil-navi33.amd.com,
 Xinhui <Xinhui.Pan@....com>, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] drm/amdgpu: add ring timeout information in
 devcoredump

Am 05.03.24 um 14:17 schrieb Khatri, Sunil:
>
> On 3/5/2024 6:40 PM, Christian König wrote:
>> Am 05.03.24 um 12:58 schrieb Sunil Khatri:
>>> Add ring timeout related information in the amdgpu
>>> devcoredump file for debugging purposes.
>>>
>>> During the gpu recovery process the registered call
>>> is triggered and add the debug information in data
>>> file created by devcoredump framework under the
>>> directory /sys/class/devcoredump/devcdx/
>>>
>>> Signed-off-by: Sunil Khatri <sunil.khatri@....com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 15 +++++++++++++++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h |  2 ++
>>>   2 files changed, 17 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> index a59364e9b6ed..aa7fed59a0d5 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> @@ -196,6 +196,13 @@ amdgpu_devcoredump_read(char *buffer, loff_t 
>>> offset, size_t count,
>>>                  coredump->reset_task_info.process_name,
>>>                  coredump->reset_task_info.pid);
>>>   +    if (coredump->ring_timeout) {
>>> +        drm_printf(&p, "\nRing timed out details\n");
>>> +        drm_printf(&p, "IP Type: %d Ring Name: %s \n",
>>> +                coredump->ring->funcs->type,
>>> +                coredump->ring->name);
>>> +    }
>>> +
>>>       if (coredump->reset_vram_lost)
>>>           drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>>       if (coredump->adev->reset_info.num_regs) {
>>> @@ -220,6 +227,8 @@ void amdgpu_coredump(struct amdgpu_device *adev, 
>>> bool vram_lost,
>>>   {
>>>       struct amdgpu_coredump_info *coredump;
>>>       struct drm_device *dev = adev_to_drm(adev);
>>> +    struct amdgpu_job *job = reset_context->job;
>>> +    struct drm_sched_job *s_job;
>>>         coredump = kzalloc(sizeof(*coredump), GFP_NOWAIT);
>>>   @@ -228,6 +237,12 @@ void amdgpu_coredump(struct amdgpu_device 
>>> *adev, bool vram_lost,
>>>           return;
>>>       }
>>>   +    if (job) {
>>> +        s_job = &job->base;
>>> +        coredump->ring = to_amdgpu_ring(s_job->sched);
>>> +        coredump->ring_timeout = TRUE;
>>> +    }
>>> +
>>>       coredump->reset_vram_lost = vram_lost;
>>>         if (reset_context->job && reset_context->job->vm) {
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>> index 19899f6b9b2b..6d67001a1057 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
>>> @@ -97,6 +97,8 @@ struct amdgpu_coredump_info {
>>>       struct amdgpu_task_info         reset_task_info;
>>>       struct timespec64               reset_time;
>>>       bool                            reset_vram_lost;
>>> +    struct amdgpu_ring          *ring;
>>> +    bool                            ring_timeout;
>>
>> I think you can drop ring_timeout, just having ring as optional 
>> information should be enough.
>>
>> Apart from that looks pretty good I think.
>>
> - GPU reset could happen due to two possibilities atleast: 1. via 
> sysfs cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover there is no 
> timeout or page fault here. In this case we need information if 
> ringtimeout happened or not else it will try to print empty 
> information in devcoredump. Same goes for pagefault also in that case 
> also we need to see if recovery ran due to pagefault and then only add 
> that information.
>
> So to cover all use cases i added this parameter.

Yeah, but you don't need that parameter.

If a GPU reset is triggered by a ring timeout we note the ring causing 
this, otherwise the ring is simply NULL.

There is no need for a separate variable to note that a timeout happened.

Regards,
Christian.

>
>
> Thanks
> Sunil
>
>> Regards,
>> Christian.
>>
>>>   };
>>>   #endif
>>