Message-ID: <72408fd5-bd8d-4f86-9856-b3b7858f0b9b@amd.com>
Date: Wed, 25 Jun 2025 20:10:28 -0400
From: Felix Kuehling <felix.kuehling@....com>
To: Alex Deucher <alexdeucher@...il.com>, Johl Brown <johlbrown@...il.com>,
Harish Kasiviswanathan <Harish.Kasiviswanathan@....com>,
"Yang, Philip" <Philip.Yang@....com>, "Kim, Jonathan" <Jonathan.Kim@....com>
Cc: amd-gfx@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
amd-gfx-owner@...ts.freedesktop.org
Subject: Re: [REGRESSION] RX-580 (gfx803) GPU hangs since ~v6.14.1 – “scheduler comp_1.1.1 is not ready” / ROCm 5.7-6.4+ broken
I couldn't find a dmesg attached to the linked bug reports. I was going to look for a kernel oops from calling an uninitialized function pointer. Your patch addresses just that.
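For anyone who wants to pre-screen a saved dmesg before attaching it, here is a minimal sketch of the kind of check described above. It assumes the log has been captured to a text file. The two amdgpu patterns are taken from the messages quoted in the report below; the oops markers are common kernel oops strings and are only an assumption about what such a crash would print. The function name scan_dmesg is made up for illustration.

```python
import re

# Patterns for the messages quoted in the report, plus generic oops markers.
# The oops markers are an assumption; a real crash may print something else.
PATTERNS = {
    "sched_not_ready": re.compile(r"scheduler (\S+) is not ready"),
    "ring_timeout": re.compile(r"ring (\S+) timeout"),
    "oops": re.compile(r"BUG: kernel NULL pointer dereference|Oops:"),
}


def scan_dmesg(text):
    """Return a dict mapping each pattern name to the matching log lines."""
    hits = {name: [] for name in PATTERNS}
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python scan_dmesg.py dmesg.txt   (file name is hypothetical)
    with open(sys.argv[1], errors="replace") as handle:
        for name, lines in scan_dmesg(handle.read()).items():
            print(f"{name}: {len(lines)} line(s)")
            for entry in lines[:5]:
                print("   ", entry)
```

An empty "oops" list alongside many "ring_timeout" lines would suggest a hang without an oops, though a full, unfiltered dmesg is still the more useful attachment.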
I'm not sure how “drm/amdkfd: Improve signal event slow path” is implicated. I don't see anything in that patch that would break specifically on gfx v803.
Regards,
Felix
On 2025-06-25 18:21, Alex Deucher wrote:
> Adding folks from the KFD team to take a look. Thank you for
> bisecting. Does the attached patch fix it?
>
> Thanks,
>
> Alex
>
> On Wed, Jun 25, 2025 at 12:33 AM Johl Brown <johlbrown@...il.com> wrote:
>> Good Afternoon and best wishes!
>> This is my first attempt at upstreaming an issue after dailying arch for a full year now :)
>> Please forgive me, a lot of this is pushing my comfort zone, but preventing needless e-waste is important to me personally :) with this in mind, I will save your eyeballs and let you know I did use gpt to help compile the below, but I have proofread it several times (which means you can't be mad :p ).
>>
>>
>> https://github.com/ROCm/ROCm/issues/4965
>> https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779
>>
>>
>> Hello Kernel, AMD GPU, & ROCm maintainers,
>>
>> TL;DR: My Polaris (RX-580, gfx803) freezes under compute load on kernels from roughly v6.14 onward. Before 6.15, this was not the case for ROCm 6.4.0 on gfx803 cards.
>>
>> The issue has been successfully mitigated with an older version of ROCm under kernel 6.16-rc2 by reverting two specific commits:
>>
>> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
>>
>> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset for gfx 9.4+”, 2025-03-06)
>>
>> Reverting both commits on top of v6.16-rc3 restores full stability and allows ROCm 5.7 workloads (e.g., Stable-Diffusion, faster-whisper) to run. Instability is usually obvious immediately, e.g. models fail to initialise. No errors are reported other than in the host dmesg, and there is no segfault, which is the usual failure mode under previous kernels.
>>
>> ________________________________
>>
>> Problem Description
>>
>> A number of users report GPU hangs when initialising compute loads, specifically with ROCm 5.7+ workloads. This issue appears to be a regression, as it was not present in earlier kernel versions.
>>
>> System Information:
>>
>> OS: Arch Linux
>>
>> CPU: Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
>>
>> GPU: AMD Radeon RX 580 Series (gfx803)
>>
>> ROCm Version: Runtime Version: 1.1, Runtime Ext Version: 1.7 (as per rocminfo --support)
>>
>> ________________________________
>>
>> Affected Kernels and Regression Details
>>
>> The problem consistently occurs on v6.14.1-rc1 and newer kernels.
>>
>> Last known good: v6.11
>>
>> First known bad: v6.12
>>
>> The regression has been bisected to the following two commits, as reverting them resolves the issue:
>>
>> de84484c6f8b (“drm/amdkfd: Improve signal event slow path”, 2024-12-19)
>>
>> bac38ca057fe (“drm/amdkfd: implement per queue sdma reset …”, 2025-03-06)
>>
>> Both patches touch amdkfd queue reset paths and are first included in the exact releases where the regression appears.
>>
>> Here's a summary of kernel results:
>>
>> Kernel                              | Result | Note
>> ----------------------------------- | ------ | ----
>> 6.13.y (LTS)                        | OK     |
>> 6.14.0                              | OK     | Baseline - my last working kernel, though I am not exactly sure which subver
>> 6.14.1-rc1                          | BAD    | First hang
>> 6.15-rc1                            | BAD    | Hang
>> 6.15.8                              | BAD    | Hang
>> 6.16-rc3                            | BAD    | Hang
>> 6.16-rc3 – revert de84484 + bac38ca | OK     | Full stability restored, ROCm workloads run for hours.
>>
>> ________________________________
>>
>> Reproduction Steps
>>
>> Boot the system with a kernel version exhibiting the issue (e.g., v6.14.1-rc1 or newer without the reverts).
>>
>> Run a ROCm workload that creates several compute queues, for example:
>>
>> python stable-diffusion.py
>>
>> faster-whisper --model medium ...
>>
>> Upon model initialization, an immediate driver crash occurs. This is visible on the host machine via dmesg logs.
>>
>> Observed Error Messages (dmesg):
>>
>> [drm] scheduler comp_1.1.1 is not ready, skipping
>> [drm:sched_job_timedout] ERROR ring comp_1.1.1 timeout
>> [message continues ad infinitum while the system otherwise keeps functioning]
>>
>> This is followed by a hard GPU reset (visible in logs, no visual artifacts), which reliably leads to a full system lockup. Python or Docker processes become unkillable, requiring a manual reboot. Over time, the desktop slowly loses interactivity.
>>
>> ________________________________
>>
>> Bisect Details
>>
>> I previously attempted a git bisect (limited to drivers/gpu/drm/amd) between v6.12 and v6.15-rc1, which identified some further potentially problematic commits. However, I ran into difficulties because my /boot/ partition is undersized. In the interim, it seems a user on the gfx803 compatibility repo discovered the below regarding ROCm 5.7:
>>
>> de84484c6f8b07ad0850d6c4 bad
>> bac38ca057fef2c8c024fe9e bad
>>
>> Cherry-picking reverts of both commits on top of v6.16-rc3 restores normal behavior; leaving either patch in place reproduces the hang.
>>
>> ________________________________
>>
>> Relevant Log Excerpts
>>
>> (Full dmesg logs can be attached separately if needed)
>>
>> [drm] scheduler comp_1.1.1 is not ready, skipping
>> [ 97.602622] amdgpu 0000:08:00.0: amdgpu: ring comp_1.1.1 timeout, signaled seq=123456 emitted seq=123459
>> [ 97.602630] amdgpu 0000:08:00.0: amdgpu: GPU recover succeeded, reset domain time = 2ms
>>
>> ________________________________
>> References:
>>
>> It's back: Log spam: [drm] scheduler comp_1.0.2 is not ready, skipping ... (https://bbs.archlinux.org/viewtopic.php?id=302729)
>>
>> Observations about HSA and KFD backends in TinyGrad · GitHub (https://gist.github.com/fxkamd/ffd02d66a2863e444ec208ea4f3adc48)
>>
>> AMD RX580 system freeze on maximum VRAM speed (https://discussion.fedoraproject.org/t/amd-rx580-system-freeze-on-maximum-vram-speed/136639)
>>
>> LKML: Linus Torvalds: Re: [git pull] drm fixes for 6.15-rc1 (https://lkml.org/lkml/2025/4/5/394)
>>
>> Commits · torvalds/linux - GitHub (Link for commit de84484) (https://github.com/torvalds/linux/commits?before=805ba04cb7ccfc7d72e834ebd796e043142156ba+6335)
>>
>> Commits · torvalds/linux - GitHub (Link for commit bac38ca) (https://github.com/torvalds/linux/commits?before=5bc1018675ec28a8a60d83b378d8c3991faa5a27+7980)
>>
>> ROCm-For-RX580/README.md at main - GitHub (https://github.com/woodrex83/ROCm-For-RX580/blob/main/README.md)
>>
>> ROCm 4.6.0 for gfx803 - GitHub (https://github.com/robertrosenbusch/gfx803_rocm/issues/35#issuecomment-2996884779)
>>
>> Compatibility matrices — Use ROCm on Radeon GPUs - AMD (https://rocm.docs.amd.com/projects/radeon/en/latest/docs/compatibility.html)
>>
>>
>> ________________________________
>>
>> Why this matters
>>
>> Although gfx803 is End-of-Life (EOL) for official ROCm support, large user communities (Stable-Diffusion, Whisper, Tinygrad) still depend on it. Community builds (e.g., github.com/robertrosenbusch/gfx803_rocm/) demonstrate that ROCm 6.4+ and RX-580 are fully functional on a number of relatively recent kernels. This regression significantly impacts the usability of these cards for compute workloads.
>>
>> ________________________________
>>
>> Proposed Next Steps
>>
>> I suggest the following for further investigation:
>>
>> Review the interaction between the new KFD signal-event slow-path and legacy GPUs that may lack valid event IDs.
>>
>> Confirm whether hqd_sdma_get_doorbell() logic (added in bac38ca) returns stale doorbells on gfx803, potentially causing false positives.
>>
>> Consider back-outs for 6.15-stable / 6.16-rc while a proper fix is developed.
>>
>> Please let me know if you require any further diagnostics or testing. I can easily rebuild kernels and provide annotated traces.
>>
>> Please find my working document: https://chatgpt.com/share/6854bef2-c69c-8002-a243-a06c67a2c066
>>
>> Thanks for your time!
>>
>> Best regards, big love,
>>
>> Johl Brown
>>
>> johlbrown@...il.com