linux-kernel - Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE

Message-ID: <425162fe-aeb7-4ff5-9a84-e7f6da20225e@kernel.org>
Date: Mon, 25 Aug 2025 11:41:30 -0500
From: Mario Limonciello <superm1@...nel.org>
To: Antheas Kapenekakis <lkml@...heas.dev>,
 Alex Deucher <alexdeucher@...il.com>
Cc: amd-gfx@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org, Alex Deucher <alexander.deucher@....com>,
 Christian König <christian.koenig@....com>,
 David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
 Harry Wentland <harry.wentland@....com>,
 Rodrigo Siqueira <siqueira@...lia.com>, Peyton Lee <peytolee@....com>,
 Lang Yu <lang.yu@....com>
Subject: Re: [PATCH v1 1/2] drm/amdgpu/vpe: increase VPE_IDLE_TIMEOUT to fix
 hang on Strix Halo

On 8/25/2025 9:01 AM, Antheas Kapenekakis wrote:
> On Mon, 25 Aug 2025 at 15:33, Antheas Kapenekakis <lkml@...heas.dev> wrote:
>>
>> On Mon, 25 Aug 2025 at 15:20, Alex Deucher <alexdeucher@...il.com> wrote:
>>>
>>> On Mon, Aug 25, 2025 at 3:13 AM Antheas Kapenekakis <lkml@...heas.dev> wrote:
>>>>
>>>> On the Asus Z13 2025, which uses a Strix Halo platform, around 8% of the
>>>> suspend resumes result in a soft lock around 1 second after the screen
>>>> turns on (it freezes). This happens due to power gating VPE when it is
>>>> not used, which happens 1 second after inactivity.
>>>>
>>>> Specifically, the VPE gating after resume is as follows: an initial
>>>> ungate, followed by a gate in the resume process. Then,
>>>> amdgpu_device_delayed_init_work_handler with a delay of 2s is scheduled
>>>> to run tests, one of which is testing VPE in vpe_ring_test_ib. This
>>>> causes an ungate. After that test, vpe_idle_work_handler is scheduled
>>>> with VPE_IDLE_TIMEOUT (1s).
>>>>
>>>> When vpe_idle_work_handler runs and tries to gate VPE, it causes the
>>>> SMU to hang and partially freezes half of the GPU IPs, with the thread
>>>> that called the command being stuck processing it.
>>>>
>>>> Specifically, after that SMU command tries to run, we get the following:
>>>>
>>>> snd_hda_intel 0000:c4:00.1: Refused to change power state from D0 to D3hot
>>>> ...
>>>> xhci_hcd 0000:c4:00.4: Refused to change power state from D0 to D3hot
>>>> ...
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VPE!
>>>> [drm:vpe_set_powergating_state [amdgpu]] *ERROR* Dpm disable vpe failed, ret = -62.
>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:93:crtc-0] flip_done timed out
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate JPEG!
>>>> [drm:jpeg_v4_0_5_set_powergating_state [amdgpu]] *ERROR* Dpm disable jpeg failed, ret = -62.
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 0!
>>>> [drm:vcn_v4_0_5_stop [amdgpu]] *ERROR* Dpm disable uvd failed, ret = -62.
>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 1 from 0xd3
>>>> thunderbolt 0000:c6:00.5: 0: timeout reading config space 2 from 0x5
>>>> thunderbolt 0000:c6:00.5: Refused to change power state from D0 to D3hot
>>>> amdgpu 0000:c4:00.0: [drm] *ERROR* [CRTC:97:crtc-1] flip_done timed out
>>>> amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000
>>>> amdgpu 0000:c4:00.0: amdgpu: Failed to power gate VCN instance 1!
>>>>
>>>> This is in addition to, e.g., kwin errors in journalctl. 0000:c4:00.0 is
>>>> the GPU. Interestingly, 0000:c4:00.6, which is another HDA block,
>>>> 0000:c4:00.5, a PCI controller, and 0000:c4:00.2 resume normally.
>>>> 0x00000032 is the
>>>> PowerDownVpe(50) command which is the common failure point in all
>>>> failed resumes.
>>>>
>>>> On a normal resume, we should get the following power gates:
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVpe(50) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg0(33) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownJpeg1(38) param: 0x00010000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn1(4) param: 0x00010000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerDownVcn0(6) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn0(7) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpVcn1(5) param: 0x00010000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg0(34) param: 0x00000000, resp: 0x00000001
>>>> amdgpu 0000:c4:00.0: amdgpu: smu send message: PowerUpJpeg1(39) param: 0x00010000, resp: 0x00000001
>>>>
>>>> To fix this, increase VPE_IDLE_TIMEOUT to 2 seconds. This increases
>>>> reliability from 4-25 suspends to 200+ (tested) suspends with a cycle
>>>> time of 12s sleep, 8s resume. The suspected reason is that when VPE
>>>> is used, it needs a bit of time before it can be gated, and the
>>>> previous 1s delay was borderline, which is not enough for Strix Halo.
>>>> When the VPE is not used, such as on resume, gating it instantly does
>>>> not seem to cause issues.
>>>
>>> This doesn't make much sense.  The VPE idle timeout is arbitrary.  The
>>> VPE idle work handler checks to see if the block is idle before it
>>> power gates the block. If it's not idle, then the delayed work is
>>> rescheduled so changing the timing should not make a difference.  We
>>> are not powering down VPE while it still has active jobs.  It sounds
>>> like there is some race condition somewhere else.
>>
>> On resume, the vpe is ungated and gated instantly, which does not
>> cause any crashes, then the delayed work is scheduled to run two
>> seconds later. Then, the tests run and finish, which start the gate
>> timer. After the timer lapses and the kernel tries to gate VPE, it
>> crashes. I logged all SMU commands and there is no difference between
>> the ones in a crash and not, other than the fact the VPE gate command
>> failed, which becomes apparent when the next command runs. I will also
>> note that until the idle timer lapses, the system is responsive.
>>
>> Since the VPE is ungated to run the tests, I assume that in my setup
>> it is not used close to resume.
> 
> I should also add that I forced a kernel panic and dumped all CPU
> backtraces in multiple logs. After the softlock, CPUs were either
> parked in the scheduler, powered off, or stuck executing an SMU
> command by e.g., a userspace usage sensor graph. So it is not a
> deadlock.
> 

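For reference, the stuck command in the quoted log can be decoded
mechanically. The following is a minimal sketch, assuming (as the report
states) that the SMN_C2PMSG_66 value in the "I'm not done with your
previous command" line is the pending SMU message ID. The ID-to-name table
holds only the messages that appear in the quoted "normal resume" log; the
function and table names are hypothetical and not part of amdgpu.

```python
import re

# Message IDs and names exactly as they appear in the quoted
# "smu send message: Name(id)" lines. This is not a complete SMU table.
SMU_MESSAGES = {
    50: "PowerDownVpe",
    33: "PowerDownJpeg0",
    38: "PowerDownJpeg1",
    4: "PowerDownVcn1",
    6: "PowerDownVcn0",
    7: "PowerUpVcn0",
    5: "PowerUpVcn1",
    34: "PowerUpJpeg0",
    39: "PowerUpJpeg1",
}

# Assumption: SMN_C2PMSG_66 carries the ID of the command still pending.
PENDING_RE = re.compile(r"SMN_C2PMSG_66:0x([0-9a-fA-F]+)")


def decode_pending_message(line):
    """Return (message_id, name) for a 'not done' log line, or None."""
    match = PENDING_RE.search(line)
    if match is None:
        return None
    msg_id = int(match.group(1), 16)  # hex register value -> decimal ID
    return msg_id, SMU_MESSAGES.get(msg_id, "unknown")


line = (
    "amdgpu 0000:c4:00.0: amdgpu: SMU: I'm not done with your previous "
    "command: SMN_C2PMSG_66:0x00000032 SMN_C2PMSG_82:0x00000000"
)
print(decode_pending_message(line))  # (50, 'PowerDownVpe')
```

Every "not done" line in the quoted failure log carries the same value,
which matches the observation that PowerDownVpe is the common failure
point.
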
Can you please confirm if you are on the absolute latest linux-firmware 
when you reproduced this issue?

Can you please share the debugfs output for amdgpu_firmware_info?
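
A minimal sketch for collecting that output, assuming debugfs is mounted
at /sys/kernel/debug and the script runs as root. The DRM node name under
dri/ varies between systems, so the helper (a hypothetical name) globs for
the file rather than hard-coding a node.

```python
from pathlib import Path


def read_firmware_info(debugfs_root="/sys/kernel/debug/dri"):
    """Return {dri_node_name: file_text} for each amdgpu_firmware_info found.

    Assumption: debugfs is mounted at the default location. Pass a
    different root if it is mounted elsewhere.
    """
    found = {}
    for path in sorted(Path(debugfs_root).glob("*/amdgpu_firmware_info")):
        found[path.parent.name] = path.read_text()
    return found


# Typical use (as root):
#   for node, text in read_firmware_info().items():
#       print(f"== dri/{node} ==")
#       print(text)
```

An empty result usually means debugfs is not mounted, the script lacks
permission to read it, or no amdgpu device is bound.
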