Message-ID: <CAH1Ww+QiAyfQL_bf1u=zLiT=ayKFWA0Fr2n5sBHUxfpzxcPbrg@mail.gmail.com>
Date: Thu, 11 Mar 2021 14:48:52 +0100
From: Daniel Gomez <daniel@...c.com>
To: Alex Deucher <alexdeucher@...il.com>
Cc: Alex Deucher <alexander.deucher@....com>,
Christian König <christian.koenig@....com>,
David Airlie <airlied@...ux.ie>,
Daniel Vetter <daniel@...ll.ch>,
Sumit Semwal <sumit.semwal@...aro.org>,
Hawking Zhang <Hawking.Zhang@....com>,
Huang Rui <ray.huang@....com>, Nirmoy Das <nirmoy.das@....com>,
Dennis Li <Dennis.Li@....com>, Monk Liu <Monk.Liu@....com>,
Yintian Tao <yttao@....com>, Guchun Chen <guchun.chen@....com>,
Evan Quan <evan.quan@....com>,
amd-gfx list <amd-gfx@...ts.freedesktop.org>,
Mailing list - DRI developers
<dri-devel@...ts.freedesktop.org>,
LKML <linux-kernel@...r.kernel.org>,
linux-media <linux-media@...r.kernel.org>,
"moderated list:DMA BUFFER SHARING FRAMEWORK"
<linaro-mm-sig@...ts.linaro.org>, Alex Desnoyers <alex@...c.com>
Subject: Re: [PATCH]] drm/amdgpu/gfx9: add gfxoff quirk
On Thu, 11 Mar 2021 at 10:09, Daniel Gomez <daniel@...c.com> wrote:
>
> On Wed, 10 Mar 2021 at 18:06, Alex Deucher <alexdeucher@...il.com> wrote:
> >
> > On Wed, Mar 10, 2021 at 11:37 AM Daniel Gomez <daniel@...c.com> wrote:
> > >
> > > Disabling GFXOFF via the quirk list fixes a hardware lockup in
> > > Ryzen V1605B, RAVEN 0x1002:0x15DD rev 0x83.
> > >
> > > Signed-off-by: Daniel Gomez <daniel@...c.com>
> > > ---
> > >
> > > This patch is a continuation of the work here:
> > > https://lkml.org/lkml/2021/2/3/122 where a hardware lockup was discussed and
> > > > a dma_fence deadlock was provoked as a side effect. To reproduce the issue
> > > please refer to the above link.
> > >
> > > The hardware lockup was introduced in 5.6-rc1 for our particular revision as it
> > > wasn't part of the new blacklist. Before that, in kernel v5.5, this hardware was
> > > > working fine without any hardware lockup because GFXOFF was actually disabled
> > > > by the if condition for the CHIP_RAVEN case. So this patch adds the 'Radeon
> > > > Vega Mobile Series [1002:15dd] (rev 83)' to the blacklist to disable GFXOFF.
> > >
> > > > But besides the fix, I'd like to ask where this revision comes from. Is it
> > > an ASIC revision or is it hardcoded in the VBIOS from our vendor? From what I
> > > can see, it comes from the ASIC and I wonder if somehow we can get an APU in the
> > > future, 'not blacklisted', with the same problem. Then, should this table only
> > > filter for the vendor and device and not the revision? Do you know if there are
> > > any revisions for the 1002:15dd validated, tested and functional?
> >
> > The pci revision id (RID) is used to specify the specific SKU within a
> > family. GFXOFF is supposed to be working on all raven variants. It
> > was tested and functional on all reference platforms and any OEM
> > platforms that launched with Linux support. There are a lot of
> > dependencies on sbios in the early raven variants (0x15dd), so it's
> > likely more of a specific platform issue, but there is not a good way
> > to detect this so we use the DID/SSID/RID as a proxy. The newer raven
> > variants (0x15d8) have much better GFXOFF support since they all
> > shipped with newer firmware and sbios.
>
> We took one of the first reference platform boards to design our
> custom board based on the V1605B, and I assume it has one of the early
> 'unstable' raven variants with RID 0x83. Also, as the OEM we are in control
> of the BIOS (provided by Insyde), but I wasn't sure about the RID, so thanks
> for the clarification. Is there anything we can do with the BIOS to have
> GFXOFF enabled and 'stable' for this particular revision? Otherwise we'd
> need to add the 0x83 RID to the table. Also, there is an extra ']' in the
> patch subject. Sorry for that. Would you need a new patch with the ']'
> removed in case you accept it?
>
> Good to hear that the newer raven versions have better GFXOFF support.
Adding Alex Desnoyers to the loop, as he is responsible for the
electronics/hardware and the BIOS, so he can provide more information
about this.
I've now run a test on the reference platform (dibbler) with the
latest BIOS available, and the hw lockup can also be reproduced with
the same steps.
For reference, I'm using mainline kernel 5.12-rc2.
[ 5.938544] [drm] initializing kernel modesetting (RAVEN
0x1002:0x15DD 0x1002:0x15DD 0xC1).
[ 5.939942] amdgpu: ATOM BIOS: 113-RAVEN-11
As in the previous cases, the clocks go to 100% usage when the hang occurs.
However, when the GPU hangs, dmesg shows the following:
[ 1568.279847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx
timeout, signaled seq=188, emitted seq=191
[ 1568.434084] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process
information: process Xorg pid 311 thread Xorg:cs0 pid 312
[ 1568.507000] amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
[ 1628.491882] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1628.491882] rcu: 3-...!: (665 ticks this GP)
idle=f9a/1/0x4000000000000000 softirq=188533/188533 fqs=15
[ 1628.491882] rcu: rcu_sched kthread timer wakeup didn't happen for
58497 jiffies! g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 1628.491882] rcu: Possible timer handling issue on cpu=2
timer-softirq=55225
[ 1628.491882] rcu: rcu_sched kthread starved for 58500 jiffies!
g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=2
[ 1628.491882] rcu: Unless rcu_sched kthread gets sufficient CPU
time, OOM is now expected behavior.
[ 1628.491882] rcu: RCU grace-period kthread stack dump:
[ 1628.491882] rcu: Stack dump where RCU GP kthread last ran:
[ 1808.518445] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1808.518445] rcu: 3-...!: (2643 ticks this GP)
idle=f9a/1/0x4000000000000000 softirq=188533/188533 fqs=15
[ 1808.518445] rcu: rcu_sched kthread starved for 238526 jiffies!
g726761 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=2
[ 1808.518445] rcu: Unless rcu_sched kthread gets sufficient CPU
time, OOM is now expected behavior.
[ 1808.518445] rcu: RCU grace-period kthread stack dump:
[ 1808.518445] rcu: Stack dump where RCU GP kthread last ran:
>
> Daniel
>
> >
> > Alex
> >
> >
> > >
> > > Logs:
> > > [ 27.708348] [drm] initializing kernel modesetting (RAVEN
> > > 0x1002:0x15DD 0x1002:0x15DD 0x83).
> > > [ 27.789156] amdgpu: ATOM BIOS: 113-RAVEN-115
> > >
> > > Thanks in advance,
> > > Daniel
> > >
> > > drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > index 65db88bb6cbc..319d4b99aec8 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> > > @@ -1243,6 +1243,8 @@ static const struct amdgpu_gfxoff_quirk amdgpu_gfxoff_quirk_list[] = {
> > > { 0x1002, 0x15dd, 0x103c, 0x83e7, 0xd3 },
> > > /* GFXOFF is unstable on C6 parts with a VBIOS 113-RAVEN-114 */
> > > { 0x1002, 0x15dd, 0x1002, 0x15dd, 0xc6 },
> > > + /* GFXOFF provokes a hw lockup on 83 parts with a VBIOS 113-RAVEN-115 */
> > > + { 0x1002, 0x15dd, 0x1002, 0x15dd, 0x83 },
> > > { 0, 0, 0, 0, 0 },
> > > };
> > >
> > > --
> > > 2.30.1
> > >
> > > _______________________________________________
> > > dri-devel mailing list
> > > dri-devel@...ts.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/dri-devel