Message-ID: <8640445b-a868-4c1f-a32b-449bbffa2553@gmail.com>
Date: Wed, 9 Jul 2025 21:02:17 -0400
From: Alex Huang <huangalex409@...il.com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: Kenneth Feng <kenneth.feng@....com>,
Alex Deucher <alexander.deucher@....com>,
Christian König <christian.koenig@....com>,
amd-gfx@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-pci@...r.kernel.org
Subject: Re: BUG: ASPM issues with Radeon Pro WX3100
On 2025-07-08 19:07, Bjorn Helgaas wrote:
> On Thu, Jul 03, 2025 at 12:09:20AM -0400, Alex Huang wrote:
>> Hi,
>>
>> Recently, I dug up a Radeon Pro WX3100 and when booting, got a black screen
>> with some complaints of No EDID read and then a `Fatal error during GPU
>> init`. With windows booting fine and an MSI Kombustor run turning out just
>> fine, I would say hardware failure highly unlikely. The logs seem unrelated
>> (although I have attached them anyways), lspci -vvxxx output for the device
>> is also at the end of the email. Also here is lspci -vvxxx for the upstream
>> PCI bridge attached to the GPU.
>>
>> A bisect reveals the offending commit is 0064b0ce85bb ("drm/amd/pm: enable
>> ASPM by default"). The simple fix appears to be setting `amdgpu.aspm=0` in
>> kernel boot parameters. This seemingly is a case of something in the Lenovo
>> ideacentre (specifically the ideacentre 510A-15ARR I found this bug on)
>> incorrectly reporting ASPM availability. I'd think this is a PCI driver
>> issue, but I am by no means an expert here. If this ends up on the wrong
>> mailing list, please do let me know.
>
> The messages below show that you're running v5.12.0-rc7+, but
> 0064b0ce85bb didn't appear until v5.14. Obviously it was reproducible
> if you could bisect it, but I'm confused about where you observed the
> problem.
Prior to v5.14, the bug can be reproduced by booting with amdgpu.aspm=1, which has essentially the same effect as the change itself.
>
> The newer log you posted at
> https://lore.kernel.org/r/e03b119d-4a27-45a0-8058-3ac7fbee23c7@gmail.com
> is from v6.16.0-rc4+, which is great because it's a current kernel,
> but the issue there looks much different (an oops in
> drm_gem_object_handle_put_unlocked()) and doesn't seem like a PCI
> issue at all.
Sorry, I had assumed you wanted the output from the PCI debug patch, so I set amdgpu.aspm=0 to make the logs easier to collect.
I've attached a log where the bug can be seen, although it's just amdgpu complaining and then falling over.
>
> If you can reproduce a PCI issue in v6.16, I'd love to look at it, but
> right now I don't see anything I can help with.
Annoyingly, PCI doesn't complain about this issue at all. It quietly reports that ASPM is available (even when that is not the case), and amdgpu relies on that to try to configure ASPM for the graphics card.
Peeking at the return value of amdgpu_device_should_use_aspm shows that pcie_aspm_enabled returns true even though ASPM is explicitly set to the "disable" mode in the BIOS.
This leads me to believe ASPM is being incorrectly detected as enabled.
My reasons for this conclusion can be summarized as follows:
- pcie_aspm_enabled returns true even if ASPM is disabled in the BIOS.
- amdgpu crashes with a non-obvious error and a lot of warnings whenever it tries to configure ASPM.
- Putting the WX3100 into another machine let it boot just fine, and ASPM was in fact configured correctly there.
- https://lore.kernel.org/lkml/CADnq5_PmxGxrJG5uZkkFXQ1YbJbDZTvAqb2oYqdCE=NtqBojqw@mail.gmail.com/ mentions "It's more of an issue with whether the underlying platform supports ASPM or not"
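As a rough userspace cross-check of the first point, the ASPM bits can be read straight out of a device's config space. The following is only a minimal Python sketch, not kernel code, and it does not reproduce the logic of pcie_aspm_enabled. It walks the capability list in a raw config-space dump, finds the PCI Express capability, and decodes the "ASPM Support" field of Link Capabilities (bits 11:10) and the "ASPM Control" field of Link Control (bits 1:0). The function names find_pcie_cap and aspm_state are made up for illustration.

```python
import struct

PCI_STATUS = 0x06            # Status register offset
PCI_STATUS_CAP_LIST = 0x10   # Status bit: capability list present
PCI_CAPABILITY_LIST = 0x34   # offset of the first capability pointer
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_LNKCAP = 0x0C        # Link Capabilities, relative to the capability
PCI_EXP_LNKCTL = 0x10        # Link Control, relative to the capability

ASPM_NAMES = ("none", "L0s", "L1", "L0s L1")


def find_pcie_cap(cfg: bytes) -> int:
    """Hypothetical helper: offset of the PCIe capability, or -1 if absent."""
    if len(cfg) < 0x40:
        return -1
    status = struct.unpack_from("<H", cfg, PCI_STATUS)[0]
    if not status & PCI_STATUS_CAP_LIST:
        return -1
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    hops = 0
    # Bound the walk so a corrupt (looping) list cannot hang us.
    while pos >= 0x40 and pos + 2 <= len(cfg) and hops < 48:
        if cfg[pos] == PCI_CAP_ID_EXP:
            return pos
        pos = cfg[pos + 1] & 0xFC
        hops += 1
    return -1


def aspm_state(cfg: bytes):
    """Return (supported, enabled) ASPM states as strings, or None."""
    cap = find_pcie_cap(cfg)
    if cap < 0 or cap + PCI_EXP_LNKCTL + 2 > len(cfg):
        return None
    lnkcap = struct.unpack_from("<I", cfg, cap + PCI_EXP_LNKCAP)[0]
    lnkctl = struct.unpack_from("<H", cfg, cap + PCI_EXP_LNKCTL)[0]
    supported = (lnkcap >> 10) & 0x3   # ASPM Support field
    enabled = lnkctl & 0x3             # ASPM Control field
    return ASPM_NAMES[supported], ASPM_NAMES[enabled]
```

The input would be the bytes of /sys/bus/pci/devices/<BDF>/config read as root (unprivileged reads only return the first 64 bytes), once for the GPU and once for the upstream bridge. If both ends show ASPM enabled in Link Control while the BIOS setting says "disable", that would at least show the setting is not reflected in the hardware registers.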
It's possible I'm barking up the wrong tree here, as I'm not familiar with this part of the kernel. If this turns out to actually be an amdgpu problem, please let me know.
Regards,
Alex H.
>
>> I also did try enabling/disabling ASPM on the BIOS side to no avail.
>>
>> The bug appears to be systematically existent for many other cards I ended
>> up plugging into the device (thus conclusion as PCI driver issue). And does
>> appear to have an attempt to fix specifically for amdgpu
>> (20220408154447.3519453-1-richard.gong@....com) but that never went
>> upstream.
>>
>> I could try fixing this bug if it indeed is a PCI driver bug.
>>
>> Thanks,
>> Alex H
>>
>> PS This is my first message around here, please be nice to me :)
>>
>>
View attachment "journalctl.txt" of type "text/plain" (162942 bytes)