Date: Tue, 21 May 2024 13:04:09 +0200
From: Christian König <ckoenig.leichtzumerken@...il.com>
To: Alex Deucher <alexdeucher@...il.com>,
 Christian König <christian.koenig@....com>
Cc: Tim Van Patten <timvp@...omium.org>, LKML <linux-kernel@...r.kernel.org>,
 alexander.deucher@....com, prathyushi.nangia@....com,
 Tim Van Patten <timvp@...gle.com>, Daniel Vetter <daniel@...ll.ch>,
 David Airlie <airlied@...il.com>, Felix Kuehling <Felix.Kuehling@....com>,
 Ikshwaku Chauhan <ikshwaku.chauhan@....com>, Le Ma <le.ma@....com>,
 Lijo Lazar <lijo.lazar@....com>,
 Mario Limonciello <mario.limonciello@....com>,
 "Pan, Xinhui" <Xinhui.Pan@....com>, "Shaoyun.liu" <Shaoyun.liu@....com>,
 Shiwu Zhang <shiwu.zhang@....com>,
 Srinivasan Shanmugam <srinivasan.shanmugam@....com>,
 amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

On 17.05.24 at 17:46, Alex Deucher wrote:
> On Fri, May 17, 2024 at 2:35 AM Christian König
> <christian.koenig@....com> wrote:
> >> On 16.05.24 at 19:57, Tim Van Patten wrote:
>>> From: Tim Van Patten <timvp@...gle.com>
>>>
>>> The following commit updated gmc->noretry from 0 to 1 for GC HW IP
>>> 9.3.0:
>>>
>>>       commit 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
>>>
>>> This causes the device to hang when a page fault occurs, until the
>>> device is rebooted. Instead, revert back to gmc->noretry=0 so the device
>>> is still responsive.
>> Wait a second. Why does the device hang on a page fault? That shouldn't
>> happen independent of noretry.
>>
>> So that strongly sounds like this is just hiding a bug elsewhere.
> Fair enough, but this is also the only gfx9 APU which defaults to
> noretry=1, all of the rest are dGPUs.  I'd argue it should align with
> the other GFX9 APUs or they should all enable noretry=1.

Completely agree.

The problem is that while the hardware should theoretically be able to
handle recoverable page faults, this feature is never tested on APUs
because our hw engineering assumes that they don't have to support the
use case. That's also the reason why we physically don't have the
second IH ring on APUs.

I strongly suggest that, instead of doing this for each hw generation
individually, we just disallow enabling retry on APUs.
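
To make the suggestion concrete, here is a minimal sketch of one possible reading of it: a single APU check that takes priority over both the per-generation default and any explicit setting. Everything in it is hypothetical and for illustration only. `struct sketch_dev`, its fields, `sketch_effective_noretry()` and the -1/0/1 parameter convention are assumptions, not the real amdgpu code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device description; not the real struct amdgpu_device. */
struct sketch_dev {
	bool is_apu;          /* true for APUs, false for dGPUs */
	bool per_gen_default; /* what a per-generation list would have chosen */
	int noretry_param;    /* assumed convention: -1 = auto, 0 = retry on, 1 = retry off */
};

/* Returns the effective noretry value (1 means retry is disabled). */
static int sketch_effective_noretry(const struct sketch_dev *dev)
{
	assert(dev != NULL);

	/* One rule for every APU generation: retry can never be enabled. */
	if (dev->is_apu)
		return 1;

	/* dGPUs: an explicit setting wins, otherwise use the per-generation default. */
	if (dev->noretry_param == -1)
		return dev->per_gen_default ? 1 : 0;
	return dev->noretry_param ? 1 : 0;
}
```

With a rule like this, an APU would end up with noretry=1 even if retry was explicitly requested, so no per-generation entries would be needed for APUs at all.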

Alternatively we could start testing it on hw and sw side and try to fix 
all the bugs.

Regards,
Christian.

>
> Alex
>
>> Regards,
>> Christian.
>>
>>> Fixes: 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
>>> Signed-off-by: Tim Van Patten <timvp@...gle.com>
>>> ---
>>>
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 1 -
>>>    1 file changed, 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> index be4629cdac049..bff54a20835f1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> @@ -876,7 +876,6 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
>>>        struct amdgpu_gmc *gmc = &adev->gmc;
>>>        uint32_t gc_ver = amdgpu_ip_version(adev, GC_HWIP, 0);
>>>        bool noretry_default = (gc_ver == IP_VERSION(9, 0, 1) ||
>>> -                             gc_ver == IP_VERSION(9, 3, 0) ||
>>>                                gc_ver == IP_VERSION(9, 4, 0) ||
>>>                                gc_ver == IP_VERSION(9, 4, 1) ||
>>>                                gc_ver == IP_VERSION(9, 4, 2) ||
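
For reference, the effect of the quoted hunk on the default can be modelled as a simple list lookup. This is only a sketch under stated assumptions: `SKETCH_IP_VERSION()` and its bit packing are a hypothetical stand-in for the kernel's `IP_VERSION()` macro, and the list holds only the entries visible in the hunk above (the real condition continues past the quoted lines).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's IP_VERSION(); the packing is an assumption. */
#define SKETCH_IP_VERSION(maj, min, rev) \
	(((uint32_t)(maj) << 16) | ((uint32_t)(min) << 8) | (uint32_t)(rev))

/* Only the versions visible in the quoted hunk, with 9.3.0 removed by the patch. */
static const uint32_t sketch_noretry_versions[] = {
	SKETCH_IP_VERSION(9, 0, 1),
	SKETCH_IP_VERSION(9, 4, 0),
	SKETCH_IP_VERSION(9, 4, 1),
	SKETCH_IP_VERSION(9, 4, 2),
};

/* Returns true when noretry would default to 1 for the given GC version. */
static bool sketch_noretry_default(uint32_t gc_ver)
{
	size_t n = sizeof(sketch_noretry_versions) / sizeof(sketch_noretry_versions[0]);

	for (size_t i = 0; i < n; i++) {
		if (sketch_noretry_versions[i] == gc_ver)
			return true;
	}
	return false;
}
```

In this model, GC 9.3.0 falls through to a default of noretry=0 after the patch, while the listed versions keep noretry=1.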

