linux-kernel - Re: BUG: ASPM issues with Radeon Pro WX3100

Message-ID: <20250710154926.GA2250118@bhelgaas>
Date: Thu, 10 Jul 2025 10:49:26 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: Alex Huang <huangalex409@...il.com>
Cc: Kenneth Feng <kenneth.feng@....com>,
	Alex Deucher <alexander.deucher@....com>,
	Christian König <christian.koenig@....com>,
	amd-gfx@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
	linux-pci@...r.kernel.org
Subject: Re: BUG: ASPM issues with Radeon Pro WX3100

On Wed, Jul 09, 2025 at 09:02:17PM -0400, Alex Huang wrote:
> On 2025-07-08 19:07, Bjorn Helgaas wrote:
> > On Thu, Jul 03, 2025 at 12:09:20AM -0400, Alex Huang wrote:
> >> Recently, I dug up a Radeon Pro WX3100 and when booting, got a black screen
> >> with some complaints of No EDID read and then a `Fatal error during GPU
> >> init`. With Windows booting fine and an MSI Kombustor run turning out just
> >> fine, I would say hardware failure is highly unlikely. The logs seem unrelated
> >> (although I have attached them anyways), lspci -vvxxx output for the device
> >> is also at the end of the email. Also here is lspci -vvxxx for the upstream
> >> PCI bridge attached to the GPU.
> >>
> >> A bisect reveals the offending commit is 0064b0ce85bb ("drm/amd/pm: enable
> >> ASPM by default"). The simple fix appears to be setting `amdgpu.aspm=0` in
> >> kernel boot parameters. This seemingly is a case of something in the Lenovo
> >> ideacentre (specifically the ideacentre 510A-15ARR I found this bug on)
> >> incorrectly reporting ASPM availability. I'd think this is a PCI driver
> >> issue, but I am by no means an expert here. If this ends up on the wrong
> >> mailing list, please do let me know.
> > 
> > The messages below show that you're running v5.12.0-rc7+, but
> > 0064b0ce85bb didn't appear until v5.14.  Obviously it was reproducible
> > if you could bisect it, but I'm confused about where you observed the
> > problem.
> 
> Prior to v5.14, it's possible to replicate the bug with
> amdgpu.aspm=1, basically replicating what the change itself did.
> > 
> > The newer log you posted at
> > https://lore.kernel.org/r/e03b119d-4a27-45a0-8058-3ac7fbee23c7@gmail.com
> > is from v6.16.0-rc4+, which is great because it's a current kernel,
> > but the issue there looks much different (an oops in
> > drm_gem_object_handle_put_unlocked()) and doesn't seem like a PCI
> > issue at all.
> 
> Sorry, I had assumed you wanted the output from the PCI debug patch,
> which is why I had set amdgpu.aspm=0 to have easier access to logs.
> I've attached a log where the bug can be seen, although it's just
> amdgpu complaining and then falling over.
> 
> > If you can reproduce a PCI issue in v6.16, I'd love to look at it, but
> > right now I don't see anything I can help with.
> 
> Annoyingly, PCI doesn't complain at all about this issue; it just
> quietly reports ASPM is available (even when that is not the case),
> and amdgpu uses that to attempt to configure ASPM for the graphics
> card.
>
> Peeking at the return value of amdgpu_device_should_use_aspm shows
> that pcie_aspm_enabled returns true even though ASPM is explicitly
> set to the "disable" mode in the BIOS.
>
> This leads me to believe that ASPM is being incorrectly detected as
> enabled.

ASPM is designed to be a feature that the PCI core can discover and
configure independent of the driver.  Devices advertise ASPM support
via their Link Capabilities register, e.g., this one claims to support
L1 as well as the L1.1 and L1.2 substates:

  01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3100] (prog-if 00 [VGA controller])
    Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
      LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <1us
              ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
    Capabilities: [370 v1] L1 PM Substates
      L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
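
For reference, the LnkCap line above can be decoded from the raw 32-bit
Link Capabilities value roughly as follows.  This is only a sketch: the
bit positions follow the PCIe Link Capabilities layout, but the macro,
struct, and function names are made up for illustration and are not the
kernel's.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected Link Capabilities bits (names here are illustrative). */
#define LNKCAP_ASPM_L0S  (1u << 10)  /* ASPM Support: L0s */
#define LNKCAP_ASPM_L1   (1u << 11)  /* ASPM Support: L1 */
#define LNKCAP_CLKPM     (1u << 18)  /* Clock Power Management */
#define LNKCAP_ASPM_OPT  (1u << 22)  /* ASPM Optionality Compliance */

struct aspm_caps {
	bool l0s;
	bool l1;
	bool clkpm;
	bool optcomp;
};

/*
 * Decode what the device advertises.  This says nothing about what is
 * currently enabled; that lives in Link Control.
 */
static struct aspm_caps decode_lnkcap(uint32_t lnkcap)
{
	struct aspm_caps c = {
		.l0s     = !!(lnkcap & LNKCAP_ASPM_L0S),
		.l1      = !!(lnkcap & LNKCAP_ASPM_L1),
		.clkpm   = !!(lnkcap & LNKCAP_CLKPM),
		.optcomp = !!(lnkcap & LNKCAP_ASPM_OPT),
	};
	return c;
}
```

A value with only bits 11, 18, and 22 set corresponds to the "ASPM L1",
"ClockPM+", and "ASPMOptComp+" fields shown by lspci.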

I don't know what your BIOS "disable" switch does.  It's possible it
just keeps BIOS from configuring ASPM, while leaving it advertised as
"supported" in config space, and Linux would configure ASPM in that
case.  There is also a bit in the ACPI FADT that says "OSPM must not
enable OSPM ASPM control on this platform," and maybe the BIOS
"disable" switch would set that.  If set, this would be mentioned in
the dmesg log.
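
If you want to check the FADT bit directly rather than rely on dmesg,
a rough user-space sketch follows.  It assumes you have a raw copy of
the FADT in a buffer (for example read from
/sys/firmware/acpi/tables/FACP) and that the table is new enough to
carry the IA-PC Boot Architecture Flags field.  The function and macro
names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IAPC_BOOT_ARCH: 2-byte little-endian field at byte offset 109. */
#define FADT_IAPC_BOOT_ARCH_OFFSET  109u
/* Bit 4, "PCIe ASPM Controls": OS must not enable ASPM control. */
#define FADT_NO_ASPM                (1u << 4)

/* Return true if the raw FADT image tells the OS to leave ASPM alone. */
static bool fadt_forbids_aspm(const uint8_t *fadt, size_t len)
{
	uint16_t flags;

	/* Too short to contain the field: treat as "not forbidden". */
	if (len < FADT_IAPC_BOOT_ARCH_OFFSET + 2)
		return false;

	flags = (uint16_t)(fadt[FADT_IAPC_BOOT_ARCH_OFFSET] |
			   (fadt[FADT_IAPC_BOOT_ARCH_OFFSET + 1] << 8));
	return (flags & FADT_NO_ASPM) != 0;
}
```

Comparing the result with the BIOS switch in both positions would show
whether the switch has any effect on this bit.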

I don't have any insight into why amdgpu inserts itself in the middle
of ASPM configuration.  There might be hardware defects it works
around, or it could be working around old or current ASPM defects in
the PCI core.

> My reasons for my conclusion can basically be summarized like this:
> - pcie_aspm_enabled returns true even if ASPM is disabled in BIOS.

The BIOS switch could (a) prevent BIOS from enabling ASPM itself
(could figure this out by booting with "pci=earlydump" and looking at
Link Control), (b) set the ACPI FADT bit (would be shown in dmesg), or
(c) change what's advertised in Link Capabilities (very unlikely since
it would require AMDGPU-specific support in BIOS; also, Linux can't
change Link Capabilities, and lspci showed L1 supported).  There's no
other BIOS-OS handshake I'm aware of.
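
For case (a), the relevant field is the two-bit ASPM Control field at
the bottom of Link Control.  A minimal decoding sketch (the function
name is made up for illustration):

```c
#include <stdint.h>

/* Link Control bits 1:0 are the ASPM Control field. */
#define LNKCTL_ASPM_MASK  0x0003u

/* Describe which ASPM states a Link Control value has enabled. */
static const char *lnkctl_aspm_state(uint16_t lnkctl)
{
	switch (lnkctl & LNKCTL_ASPM_MASK) {
	case 0x0:
		return "disabled";
	case 0x1:
		return "L0s";
	case 0x2:
		return "L1";
	default:
		return "L0s+L1";
	}
}
```

Comparing this field in the "pci=earlydump" output with the BIOS switch
in each position would show whether the switch only changes what BIOS
itself programs.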

> - amdgpu crashes with a non obvious issue and a lot of warnings as
>   long as it tries to configure ASPM.

ASPM configuration should only affect power consumption.  AFAIK, even
if it's configured incorrectly, we should not see any functional
issues.

> - putting the WX3100 into another machine caused it to boot just
>   fine, and did in fact correctly configure ASPM.

I mentioned my suspicion of L1.2 because that does depend on some
platform electrical properties that we don't know how to discover.
But even so, we shouldn't see a functional issue.

> - https://lore.kernel.org/lkml/CADnq5_PmxGxrJG5uZkkFXQ1YbJbDZTvAqb2oYqdCE=NtqBojqw@mail.gmail.com/
>   mentions "It's more of an issue with whether the underlying
>   platform supports ASPM or not"
> 
> It's possible I'm barking up the wrong tree here, since I'm not
> familiar with this part of the kernel. If this turns out to actually
> be an amdgpu problem, please let me know.

> >> I also did try enabling/disabling ASPM on the BIOS side to no avail.
> >>
> >> The bug appears consistently with many other cards I ended up plugging
> >> into the device (hence my conclusion that it is a PCI driver issue).

This sounds interesting.  More details here?  I guess you also see
issues with different cards plugged into the same slot?  And there
appears to be some ASPM connection there, too?

> ...
> kernel: amdgpu 0000:01:00.0: amdgpu: [drm] Display Core v3.2.334 initialized on DCE 11.2
> kernel: amdgpu 0000:01:00.0: [drm] *ERROR* No EDID read.
> kernel: amdgpu 0000:01:00.0: [drm] *ERROR* No EDID read.
> kernel: amdgpu 0000:01:00.0: [drm] *ERROR* No EDID read.
> kernel: amdgpu 0000:01:00.0: amdgpu:
>         last message was failed ret is 65535

This is in smu7_send_msg_to_smc() and it looks like we might have
gotten ~0 when reading a register.  Possibly a PCIe error, since the
Root Complex typically synthesizes ~0 data returns when a read fails
on PCIe.
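
To illustrate the ~0 reasoning: a failed 32-bit read shows up as all
ones, and its low 16 bits print as 65535.  Whether the driver actually
truncates the value this way is an assumption on my part, and all-ones
can also be legitimate data for some registers, so treat this as a
heuristic sketch with hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* A 32-bit read that came back as all ones may be a failed PCIe read. */
static bool read_looks_failed(uint32_t val)
{
	return val == UINT32_MAX;
}

/* Low 16 bits of a register value, as a 16-bit status field would see it. */
static uint16_t low16(uint32_t val)
{
	return (uint16_t)(val & 0xffffu);
}
```

With val = 0xffffffff, read_looks_failed() is true and low16() gives
65535, matching the "ret is 65535" in the log.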

> kernel: ------------[ cut here ]------------
> kernel: WARNING: CPU: 1 PID: 154 at drivers/gpu/drm/amd/amdgpu/uvd_v6_0.c:1111 uvd_v6_0_ring_insert_nop+0xb5/0xc0 [amdgpu]

This is:

  WARN_ON(ring->wptr % 2 || count % 2);

so apparently ring->wptr and count are both expected to be even, but
at least one was odd.  Makes me wonder if wptr was set from a PCIe
read that returned ~0.
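
The same condition as a standalone sketch, showing that an all-ones
value is odd and would trip it.  The parameter types and the function
name are assumptions for illustration, not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the WARN_ON() condition: true if either value is odd. */
static bool nop_alignment_violated(uint64_t wptr, uint32_t count)
{
	return (wptr % 2) || (count % 2);
}
```

Any all-ones wptr (e.g. UINT64_MAX) is odd, so it would trigger the
warning even with an even count.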

Both symptoms are a little odd since I don't see any AER errors
mentioned in the dmesg or the lspci output.  But it's worth looking
into whether there are errors that we could catch earlier or handle
better.  It would also be odd, of course, if an error like this were
related to ASPM.

Bjorn