lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACu1E7HROtx1Zgyy0EJuHj_HWE8Nd6OtFnxTcrDrHP+2HA5o6A@mail.gmail.com>
Date: Wed, 17 Jul 2024 10:40:26 +0100
From: Connor Abbott <cwabbott0@...il.com>
To: Vladimir Lypak <vladimir.lypak@...il.com>
Cc: Rob Clark <robdclark@...il.com>, Sean Paul <sean@...rly.run>, 
	Konrad Dybcio <konrad.dybcio@...aro.org>, Abhinav Kumar <quic_abhinavk@...cinc.com>, 
	Dmitry Baryshkov <dmitry.baryshkov@...aro.org>, 
	Marijn Suijten <marijn.suijten@...ainline.org>, David Airlie <airlied@...il.com>, 
	Daniel Vetter <daniel@...ll.ch>, Jordan Crouse <jordan@...micpenguin.net>, 
	linux-arm-msm@...r.kernel.org, dri-devel@...ts.freedesktop.org, 
	freedreno@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/4] fixes for Adreno A5Xx preemption

On Thu, Jul 11, 2024 at 11:10 AM Vladimir Lypak
<vladimir.lypak@...il.com> wrote:
>
> There are several issues with preemption on Adreno A5XX GPUs which
> render system unusable if more than one priority level is used. Those
> issues include persistent GPU faults and hangs, full UI lockups with
> idling GPU.
>
> ---
> Vladimir Lypak (4):
>   drm/msm/a5xx: disable preemption in submits by default
>   drm/msm/a5xx: properly clear preemption records on resume
>   drm/msm/a5xx: fix races in preemption evaluation stage
>   drm/msm/a5xx: workaround early ring-buffer emptiness check
>
>  drivers/gpu/drm/msm/adreno/a5xx_gpu.c     | 18 ++++++++++----
>  drivers/gpu/drm/msm/adreno/a5xx_gpu.h     | 12 ++++++---
>  drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 30 ++++++++++++++++++++---
>  3 files changed, 47 insertions(+), 13 deletions(-)
> ---
> base-commit: 523b23f0bee3014a7a752c9bb9f5c54f0eddae88
> --
> 2.45.2
>

Hi Vladimir,

While looking at preemption on a7xx, where the overall logic is pretty
much the same, and I've been seeing the same "soft lockups". However
even after porting patch 3, it turns out that's not enough because
there's a different race. The sequence of events is something like
this:

1. Medium-prio app A submits to ring 2.
2. Ring 0 retires its last job and we start a preemption to ring 2.
3. High-prio app B submits to ring 0. It sees the preemption from step
2 ongoing and doesn't write the WTPR register or try to preempt.
4. The preemption finishes and we update WPTR.
5. App A's submit retires. We try to preempt, but the submit and
ring->cur write from step 3 happened on a different CPU and the write
hasn't landed yet so we skip it.

It's a bit tricky because write reordering is involved, but this seems
to be what's happening - everything except my speculation about the
delayed write to ring->cur being the problem comes straight from a
trace of this happening.

Rob suggested on IRC that we make retire handling happen on the same
workqueue as submissions, so that preempt_trigger is always
serialized, which IIUC would also make patch 3 unnecessary. What do
you think?

Best regards,

Connor

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ