Message-ID:
<CAGwozwFmfBrnZBO6JRZPnPyHLrKycdnoMRtOkK+KpwkdQ4Fw=w@mail.gmail.com>
Date: Mon, 25 Aug 2025 16:01:00 +0200
From: Antheas Kapenekakis <lkml@...heas.dev>
To: Alex Deucher <alexdeucher@...il.com>
Cc: amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, Alex Deucher <alexander.deucher@....com>,
Christian König <christian.koenig@....com>,
David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
Harry Wentland <harry.wentland@....com>,
Rodrigo Siqueira <siqueira@...lia.com>,
Mario Limonciello <mario.limonciello@....com>, Peyton Lee <peytolee@....com>,
Lang Yu <lang.yu@....com>
Subject: Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix
hang on Strix Halo
On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <lkml@...heas.dev> wrote:
>
> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@...il.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@...heas.dev> wrote:
> > >
> > > On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> > > suspend resumes result in a soft lock around 1 second after the screen
> > > turns on (it freezes). This happens due to power gating VPE when it is
> > > not used, which happens 1 second after inactivity.
> > >
> > > Specifically, the VPE gating after resume is as follows: an initial
> > > ungate, followed by a gate in the resume process. Then,
> > > amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> > > to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> > > > causes an ungate. After that test, vpe_idle_work_handler is scheduled
> > > with VPE_IDLE_TIMEOUT (1s).
> > >
> > > When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> > > SMU to hang and partially freezes half of the GPU IPs, with the thread
> > > that called the command being stuck processing it.
> > >
> > > Specifically, after that SMU command tries to run, we get the following:
> > >
> > > snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> > > ...
> > > xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> > > ...
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> > > [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> > > [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> > > [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> > > thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> > > thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> > > amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> > > amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> > > amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> > >
> > > > This is in addition to, e.g., kwin errors in journalctl. 0000:c4:00.0 is
> > > > the GPU. Interestingly, 0000:c4:00.6, which is another HDA block,
> > > > 0000:c4:00.5, a PCI controller, and 0000:c4:00.2 resume normally.
> > > > 0x00000032 is the PowerDownVpe(50) command, which is the common failure
> > > > point in all failed resumes.
> > >
> > > On a normal resume, we should get the following power gates:
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> > > amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> > >
> > > To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> > > reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> > > > time of 12s sleep, 8s resume. The suspected reason is that when VPE
> > > > has been used, it needs a bit of time before it can be gated, and the
> > > > previous 1s delay was borderline, which is not enough for Strix Halo.
> > > When the VPE is not used, such as on resume, gating it instantly does
> > > not seem to cause issues.
> >
> > This doesn't make much sense. The VPE idle timeout is arbitrary. The
> > VPE idle work handler checks to see if the block is idle before it
> > power gates the block. If it's not idle, then the delayed work is
> > rescheduled so changing the timing should not make a difference. We
> > are not powering down VPE while it still has active jobs. It sounds
> > like there is some race condition somewhere else.
>
> On resume, the vpe is ungated and gated instantly, which does not
> cause any crashes, then the delayed work is scheduled to run two
> seconds later. Then, the tests run and finish, which start the gate
> timer. After the timer lapses and the kernel tries to gate VPE, it
> crashes. I logged all SMU commands and there is no difference between
> the ones in a crash and not, other than the fact that the VPE gate
> command failed, which becomes apparent when the next command runs. I
> will also note that until the idle timer lapses, the system is responsive.
>
> Since the VPE is ungated to run the tests, I assume that in my setup
> it is not used close to resume.
I should also add that I forced a kernel panic and dumped all CPU
backtraces in multiple logs. After the soft lock, CPUs were either
parked in the scheduler, powered off, or stuck executing an SMU
command issued by, e.g., a userspace usage sensor graph. So it is not a
deadlock.
Antheas
> Antheas
>
> > Alex
> >
> > >
> > > Fixes: 5f82a0c90cca ("drm/amdgpu/vpe: enable vpe dpm")
> > > Signed-off-by: Antheas Kapenekakis <lkml@...heas.dev>
> > > ---
> > > drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > > index 121ee17b522b..24f09e457352 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vpe.c
> > > @@ -34,8 +34,8 @@
> > > /* VPE CSA resides in the 4th page of CSA */
> > > #define AMDGPU_CSA_VPE_OFFSET (4096 * 3)
> > >
> > > -/* 1 second timeout */
> > > -#define VPE_IDLE_TIMEOUT msecs_to_jiffies(1000)
> > > +/* 2 second timeout */
> > > +#define VPE_IDLE_TIMEOUT msecs_to_jiffies(2000)
> > >
> > > #define VPE_MAX_DPM_LEVEL 4
> > > #define FIXED1_8_BITS_PER_FRACTIONAL_PART 8
> > >
> > > base-commit: c17b750b3ad9f45f2b6f7e6f7f4679844244f0b9
> > > --
> > > 2.50.1
> > >
> > >
> >
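
For anyone comparing their own logs against the ones quoted above, here is a
minimal userspace sketch of how the stuck command can be decoded. It is not
driver code. It assumes, as the reading in this thread implies, that the
value printed after SMN_C2PMSG_66 is the index of the pending SMU message.
The ID-to-name table contains only the pairs visible in the "smu send
message" lines quoted in this thread, and the helper names smu_msg_name()
and parse_pending_msg() are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Message IDs copied from the "smu send message" log lines in this thread. */
struct smu_msg {
	unsigned int id;
	const char *name;
};

static const struct smu_msg smu_msgs[] = {
	{  4, "PowerDownVcn1"  },
	{  5, "PowerUpVcn1"    },
	{  6, "PowerDownVcn0"  },
	{  7, "PowerUpVcn0"    },
	{ 33, "PowerDownJpeg0" },
	{ 34, "PowerUpJpeg0"   },
	{ 38, "PowerDownJpeg1" },
	{ 39, "PowerUpJpeg1"   },
	{ 50, "PowerDownVpe"   },
};

/* Return the message name for a known ID, or NULL if it is not in the table. */
static const char *smu_msg_name(unsigned int id)
{
	size_t i;

	for (i = 0; i < sizeof(smu_msgs) / sizeof(smu_msgs[0]); i++)
		if (smu_msgs[i].id == id)
			return smu_msgs[i].name;
	return NULL;
}

/*
 * Extract the hex value that follows "SMN_C2PMSG_66:" in a log line.
 * Returns 0 and stores the value in *id on success, -1 if the field is
 * missing or not a number.
 */
static int parse_pending_msg(const char *line, unsigned int *id)
{
	static const char key[] = "SMN_C2PMSG_66:";
	const char *p = strstr(line, key);
	char *end;
	unsigned long v;

	if (!p)
		return -1;
	p += sizeof(key) - 1;
	v = strtoul(p, &end, 16);	/* base 16 also accepts the "0x" prefix */
	if (end == p)
		return -1;
	*id = (unsigned int)v;
	return 0;
}
```

Feeding it one of the "I'm not done with your previous command" lines from
the failing resume yields ID 50, which the table maps to PowerDownVpe. IDs
that do not appear in the quoted logs come back as NULL rather than a guess.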