linux-kernel - Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE

Message-ID: <CAGwozwHdQu0K-dgnh72P=ms-ory2bZr-6rtCtWM2QP0u8NqXng@mail.gmail.com>
Date: Mon, 25 Aug 2025 23:00:11 +0200
From: Antheas Kapenekakis <lkml@...heas.dev>
To: Mario Limonciello <superm1@...nel.org>
Cc: Alex Deucher <alexdeucher@...il.com>, amd-gfx@...ts.freedesktop.org,
	dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
	Alex Deucher <alexander.deucher@....com>,
 Christian König <christian.koenig@....com>,
	David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
	Harry Wentland <harry.wentland@....com>,
 Rodrigo Siqueira <siqueira@...lia.com>,
	Peyton Lee <peytolee@....com>, Lang Yu <lang.yu@....com>
Subject: Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix
 hang on Strix Halo

On Mon, 25 Aug 2025 at 18:41, Mario Limonciello <superm1@...nel.org> wrote:
>
> On 8/25/2025 9:01 AM, Antheas Kapenekakis wrote:
> > On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <lkml@...heas.dev> wrote:
> >>
> >> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@...il.com> wrote:
> >>>
> >>> On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@...heas.dev> wrote:
> >>>>
> >>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
> >>>> suspend resumes result in a soft lock around 1 second after the screen
> >>>> turns on (it freezes). This happens due to power gating VPE when it is
> >>>> not used, which happens 1 second after inactivity.
> >>>>
> >>>> Specifically, the VPE gating after resume is as follows: an initial
> >>>> ungate, followed by a gate in the resume process. Then,
> >>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
> >>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
> >>>> causes an ungate. After that test, vpe_idle_work_handler is scheduled
> >>>> with VPE_IDLE_TIMEOUT (1s).
> >>>>
> >>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
> >>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
> >>>> that called the command being stuck processing it.
> >>>>
> >>>> Specifically, after that SMU command tries to run, we get the following:
> >>>>
> >>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
> >>>> ...
> >>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
> >>>> ...
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
> >>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
> >>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
> >>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
> >>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
> >>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
> >>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
> >>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
> >>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
> >>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
> >>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
> >>>>
> >>>> This is in addition to, e.g., kwin errors in journalctl. 0000:c4:00.0 is
> >>>> the GPU. Interestingly, 0000:c4:00.6, which is another HDA block,
> >>>> 0000:c4:00.5, a PCI controller, and 0000:c4:00.2 resume normally.
> >>>> 0x00000032 is the PowerDownVpe(50) command, which is the common failure
> >>>> point in all failed resumes.
> >>>>
> >>>> On a normal resume, we should get the following power gates:
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
> >>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
> >>>>
> >>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
> >>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
> >>>> time of 12s sleep, 8s resume. The suspected reason is that when VPE is
> >>>> used, it needs a bit of time before it can be gated, and the previous
> >>>> 1s delay was borderline, which is not enough for Strix Halo. When the
> >>>> VPE is not used, such as on resume, gating it instantly does not seem
> >>>> to cause issues.
> >>>
> >>> This doesn't make much sense.  The VPE idle timeout is arbitrary.  The
> >>> VPE idle work handler checks to see if the block is idle before it
> >>> power gates the block. If it's not idle, then the delayed work is
> >>> rescheduled, so changing the timing should not make a difference.  We
> >>> are not powering down VPE while it still has active jobs.  It sounds
> >>> like there is some race condition somewhere else.
> >>
> >> On resume, the vpe is ungated and gated instantly, which does not
> >> cause any crashes, then the delayed work is scheduled to run two
> >> seconds later. Then, the tests run and finish, which start the gate
> >> timer. After the timer lapses and the kernel tries to gate VPE, it
> >> crashes. I logged all SMU commands and there is no difference between
> >> the ones in a crash and not, other than the fact that the VPE gate
> >> command failed, which becomes apparent when the next command runs. I
> >> will also note that until the idle timer lapses, the system is responsive.
> >>
> >> Since the VPE is ungated to run the tests, I assume that in my setup
> >> it is not used close to resume.
> >
> > I should also add that I forced a kernel panic and dumped all CPU
> > backtraces in multiple logs. After the soft lock, CPUs were either
> > parked in the scheduler, powered off, or stuck executing an SMU
> > command issued by, e.g., a userspace usage sensor graph. So it is not
> > a deadlock.
> >
>
> Can you please confirm if you are on the absolute latest linux-firmware
> when you reproduced this issue?

I was on the latest at the time, built from source. I think it was
commit 08ee93ff8ffa. It seems there was an update today, though.


> Can you please share the debugfs output for amdgpu_firmware_info?

Here is the information from it:
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 35, firmware version: 0x0000001f
PFP feature version: 35, firmware version: 0x0000002c
CE feature version: 0, firmware version: 0x00000000
RLC feature version: 1, firmware version: 0x11530505
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
RLCP feature version: 1, firmware version: 0x11530505
RLCV feature version: 0, firmware version: 0x00000000
MEC feature version: 35, firmware version: 0x0000001f
IMU feature version: 0, firmware version: 0x0b352300
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 553648366, firmware version: 0x210000ee
TA XGMI feature version: 0x00000000, firmware version: 0x00000000
TA RAS feature version: 0x00000000, firmware version: 0x00000000
TA HDCP feature version: 0x00000000, firmware version: 0x17000044
TA DTM feature version: 0x00000000, firmware version: 0x12000018
TA RAP feature version: 0x00000000, firmware version: 0x00000000
TA SECUREDISPLAY feature version: 0x00000000, firmware version: 0x00000000
SMC feature version: 0, program: 0, firmware version: 0x00647000 (100.112.0)
SDMA0 feature version: 60, firmware version: 0x0000000e
VCN feature version: 0, firmware version: 0x0911800b
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x09002600
TOC feature version: 0, firmware version: 0x0000000b
MES_KIQ feature version: 6, firmware version: 0x0000006c
MES feature version: 1, firmware version: 0x0000007c
VPE feature version: 60, firmware version: 0x00000016
VBIOS version: 113-STRXLGEN-001
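
As a sanity check on the SMC line above, here is a minimal Python sketch of
how the raw value maps to the dotted version. It assumes one byte per field
(program, major, minor, debug), which is consistent with the printed
0x00647000 (100.112.0). decode_smc_version is a hypothetical helper name
used for illustration, not a kernel function.

```python
def decode_smc_version(version: int) -> tuple[int, int, int, int]:
    """Split a 32-bit SMC firmware version into (program, major, minor, debug).

    Assumption: each field occupies one byte, most significant byte first.
    """
    program = (version >> 24) & 0xFF
    major = (version >> 16) & 0xFF
    minor = (version >> 8) & 0xFF
    debug = version & 0xFF
    return program, major, minor, debug


# Value taken from the SMC line in the firmware info listing.
program, major, minor, debug = decode_smc_version(0x00647000)
print(f"program: {program}, firmware version: {major}.{minor}.{debug}")
```

Under that assumption the value decodes to program 0 and version 100.112.0,
matching what debugfs printed.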

I see there was an update today, though.

Antheas
>