Message-ID: <fxakjhx7lrikgs4x3nbwgnhhcwmlum3esxp2dj5d26xc5iyg22@wkbbwysh3due>
Date: Wed, 15 Oct 2025 11:52:25 +0530
From: Manivannan Sadhasivam <mani@...nel.org>
To: Dragan Simic <dsimic@...jaro.org>
Cc: Bjorn Helgaas <helgaas@...nel.org>, FUKAUMI Naoki <naoki@...xa.com>, 
	manivannan.sadhasivam@....qualcomm.com, Bjorn Helgaas <bhelgaas@...gle.com>, 
	Lorenzo Pieralisi <lpieralisi@...nel.org>, Krzysztof Wilczyński <kwilczynski@...nel.org>, 
	Rob Herring <robh@...nel.org>, linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org, 
	linux-arm-msm@...r.kernel.org, "David E. Box" <david.e.box@...ux.intel.com>, 
	Kai-Heng Feng <kai.heng.feng@...onical.com>, "Rafael J. Wysocki" <rafael@...nel.org>, 
	Heiner Kallweit <hkallweit1@...il.com>, Chia-Lin Kao <acelan.kao@...onical.com>, 
	linux-rockchip@...ts.infradead.org, regressions@...ts.linux.dev
Subject: Re: [PATCH v2 1/2] PCI/ASPM: Override the ASPM and Clock PM states
 set by BIOS for devicetree platforms

On Wed, Oct 15, 2025 at 01:33:35AM +0200, Dragan Simic wrote:
> Hello all,
> 
> On Tuesday, October 14, 2025 20:49 CEST, Bjorn Helgaas <helgaas@...nel.org> wrote:
> > On Wed, Oct 15, 2025 at 01:30:16AM +0900, FUKAUMI Naoki wrote:
> > > I've noticed an issue on Radxa ROCK 5A/5B boards, which are based on the
> > > Rockchip RK3588(S) SoC.
> > > 
> > > When running Linux v6.18-rc1 or linux-next since 20250924, the kernel either
> > > freezes or fails to probe M.2 Wi-Fi modules. This happens with several
> > > different modules I've tested, including the Realtek RTL8852BE, MediaTek
> > > MT7921E, and Intel AX210.
> > > 
> > > I've found that reverting the following commit (i.e., the patch I'm replying
> > > to) resolves the problem:
> > > commit f3ac2ff14834a0aa056ee3ae0e4b8c641c579961
> > 
> > Thanks for the report, and sorry for the regression.
> > 
> > Since this affects several devices from different manufacturers and (I
> > assume) different drivers, it seems likely that there's some issue
> > with the Rockchip end, since ASPM probably works on these devices in
> > other systems.  So we should figure out if there's something wrong
> > with the way we configure ASPM, which we could potentially fix, or if
> > there's a hardware issue and we need some kind of quirk to prevent
> > usage of ASPM on the affected platforms.
> > 
> > Can you collect a complete dmesg log when booting with
> > 
> >   ignore_loglevel pci=earlydump dyndbg="file drivers/pci/* +p"
> > 
> > and the output of "sudo lspci -vv"?
> > 
> > When the kernel freezes, can you give us any information about where,
> > e.g., a log or screenshot?
> > 
> > Do you know if any platforms other than Radxa ROCK 5A/5B have this
> > problem?
> 
> After thinking quite a bit about it, I think we should revert this
> patch and replace it with another patch that allows per-SoC, or
> maybe even per-board, opting into the forced enablement of PCIe
> ASPM.  Let me explain, please.
> 

ASPM is a PCIe device-specific feature and has nothing to do with the SoC or
board. Even if you limit it to certain platforms, there is no guarantee that it
will be safe, as users can connect a buggy device to the slot and hit the same
issue.

> When a new feature is introduced, it's expected that it may fail
> on some hardware or with some specific setups, so quirking off such
> instances, as time passes, is perfectly fine.  Such a new feature
> didn't work before it was implemented, so it's acceptable that it
> fails in some instances after the introduction, and that it gets
> quirked off as time passes and more testing is performed.
> 

ASPM is not a new feature. It was introduced more than a decade ago. But we
somehow procrastinated on enabling it for so long, until we realized that if we
don't do it now, we won't be able to do it anytime in the future.

> However, when some widespread feature, such as PCIe, has already
> been in production for quite a while, introducing high-risk changes
> to it in a blanket fashion, while intending to have the incompatible
> or not-yet-ready platforms quirked off over time, simply isn't the
> way to go.  Breaking stuff intentionally to find out what actually
> doesn't work is rarely a good option.
> 

The issue is due to devices exposing the ASPM capability, but behaving
erratically when it is enabled. Until we enable ASPM on these devices, we cannot
know whether they work or not. To avoid mass chaos, we decided to enable it only
for devicetree platforms as a start.

> Thus, I'd suggest that this patch is replaced with other patches,
> which would introduce an additional ASPM opt-in switch to the PCI
> binding, allowing SoCs or boards to have it enabled _after_ proper
> testing is performed.  The PCIe driver may emit a warning that ASPM
> is to be enabled at some point in the future, to "bug" people about
> the need to perform the testing, etc.

Even if we emit a "YOUR DEVICE MAY BREAK" warning, nobody would care as long as
the device works for them. We didn't decide to enable this feature overnight to
trouble users. ASPM saves runtime power, which benefits users and, of course,
the environment as a whole, so it should not be kept disabled.

Does that mean we wanted breakages? No. We expected breakages, since not all
devices play nicely with ASPM, but there is only one way to find out. And we
want to disable ASPM only for those devices.

>  With all that in place, we
> could expect that in a year or two PCIe ASPM could eventually be
> enabled everywhere.  Getting everything tested is a massive endeavor,
> but that's the only way not to break stuff.
> 
> Biting the bullet and hoping that it all goes well, I'd say, isn't
> the right approach here.
> 

Your two-year phased approach would never work, as that's what we have been
hoping for for more than a decade.

- Mani


-- 
மணிவண்ணன் சதாசிவம்
