Date: Fri, 17 May 2024 09:31:25 -0600
From: Tim Van Patten <timvp@...omium.org>
To: Christian König <christian.koenig@....com>
Cc: LKML <linux-kernel@...r.kernel.org>, alexander.deucher@....com, 
	prathyushi.nangia@....com, Tim Van Patten <timvp@...gle.com>, 
	Daniel Vetter <daniel@...ll.ch>, David Airlie <airlied@...il.com>, 
	Felix Kuehling <Felix.Kuehling@....com>, Ikshwaku Chauhan <ikshwaku.chauhan@....com>, Le Ma <le.ma@....com>, 
	Lijo Lazar <lijo.lazar@....com>, Mario Limonciello <mario.limonciello@....com>, 
	"Pan, Xinhui" <Xinhui.Pan@....com>, "Shaoyun.liu" <Shaoyun.liu@....com>, 
	Shiwu Zhang <shiwu.zhang@....com>, Srinivasan Shanmugam <srinivasan.shanmugam@....com>, 
	amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org, 
	chris.kuruts@....com
Subject: Re: [PATCH] drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

On Fri, May 17, 2024 at 12:35 AM Christian König
<christian.koenig@....com> wrote:
>
> Am 16.05.24 um 19:57 schrieb Tim Van Patten:
> > From: Tim Van Patten <timvp@...gle.com>
> >
> > The following commit updated gmc->noretry from 0 to 1 for GC HW IP
> > 9.3.0:
> >
> >      commit 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
> >
> > This causes the device to hang when a page fault occurs, until the
> > device is rebooted. Instead, revert back to gmc->noretry=0 so the device
> > is still responsive.
>
> Wait a second. Why does the device hang on a page fault? That shouldn't
> happen independent of noretry.

No idea, but hopefully someone within AMD can help answer that.

I'm not an expert in this area; I was just able to bisect to the CL
causing the change in behavior. There are other reports of people
bisecting to the same CL, so this issue appears to extend beyond
ChromeOS:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/9728#note_2063879

> So that strongly sounds like this is just hiding a bug elsewhere.

That's entirely possible, bringing the number of real issues up to (at
least) two:
1. Why the page faults are occurring to begin with.
  - Any video of size 66x2158 seems to trigger the issue.
2. Why the page faults result in the device hanging with gmc->noretry=1.

I've asked prathyushi.nangia@amd (chris.kuruts@amd may be helping as
well) to look into the page faults further, since they can't hang the
device if they don't exist. She should be able to provide any further
details if you're interested, but please feel free to reach out to me
as well if you have any other questions.
