Message-ID: <CAAd53p69oLBGYEc2A4PNBP9KVmQkH=EaNh2_zuFDbwWJNLmtXg@mail.gmail.com>
Date: Fri, 5 Jan 2024 11:25:53 +0800
From: Kai-Heng Feng <kai.heng.feng@...onical.com>
To: Michael Schaller <michael@...aller.de>
Cc: Bjorn Helgaas <helgaas@...nel.org>, bhelgaas@...gle.com, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org, regressions@...ts.linux.dev, macro@...am.me.uk,
ajayagarwal@...gle.com, sathyanarayanan.kuppuswamy@...ux.intel.com,
gregkh@...uxfoundation.org, hkallweit1@...il.com,
michael.a.bottini@...ux.intel.com, johan+linaro@...nel.org
Subject: Re: [Regression] [PCI/ASPM] [ASUS PN51] Reboot on resume attempt
(bisect done; commit found)

Hi Michael,
On Tue, Jan 2, 2024 at 2:57 AM Michael Schaller <michael@...aller.de> wrote:
>
> On 01.01.24 19:13, Bjorn Helgaas wrote:
> > On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
> >> Issue:
> >> On resume from suspend to RAM there is no output for about 12 seconds, then
> >> shortly a blinking cursor is visible in the upper left corner on an
> >> otherwise black screen which is followed by a reboot.
> >>
> >> Setup:
> >> * Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
> >> * Firmware: 0508 (latest; also tested previous 0505)
> >> * OS: Ubuntu 23.10 (except kernel)
> >> * Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
> >>
> >> Debugging summary:
> >> * Kernel 5.10.205 isn’t affected.
> >> * Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
> >> cause.
> >> * PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long as
> >> ASPM is enabled (default).
> >> * The commit message indicates that a quirk could be written to mitigate the
> >> issue but I don’t know how to write such a quirk.
> >>
> >> Confirmed workarounds:
> >> * Connect a USB flash drive (no clue why; maybe this causes a delay that
> >> lets the resume succeed)
> >> * Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
> >> intentional; a quirk seems to be the preferred solution)
> >> * pcie_aspm=off
> >> * pcie_aspm.policy=performance
> >> * echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
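The last workaround can be wrapped in a small helper. This is a minimal sketch, not part of the original report. The function name `set_l1_aspm` and the `SYSFS_ROOT` override (used only so it can be exercised against a mock directory tree) are hypothetical. On a real system the attribute lives under /sys, writing it requires root, and the setting does not survive a reboot.

```sh
# Hypothetical helper (not from the thread): write 0 (disable) or 1 (enable)
# to a PCI device's link/l1_aspm attribute, as in the workaround above.
# SYSFS_ROOT is a test-only override; it defaults to /sys.
set_l1_aspm() {
    dev=$1    # PCI address, e.g. 0000:03:00.0
    val=$2    # 0 = disable L1 ASPM, 1 = enable it
    case $val in
        0|1) ;;
        *) echo "value must be 0 or 1" >&2; return 1 ;;
    esac
    attr="${SYSFS_ROOT:-/sys}/bus/pci/devices/$dev/link/l1_aspm"
    # Fail if the attribute is missing or not writable (e.g. not root).
    if [ ! -w "$attr" ]; then
        echo "cannot write $attr (no such attribute, or not root?)" >&2
        return 1
    fi
    echo "$val" > "$attr"
}
```

For example, in a root shell after sourcing the file: `set_l1_aspm 0000:03:00.0 0`.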
> >>
> >> Debugging details:
> >> * The resume trigger (power button, keyboard, mouse) doesn’t seem to make
> >> any difference.
> >> * Double-checked that the kernel is configured *not* to reboot on panic.
> >> * Double-checked that there still isn't any kernel output, even without
> >> quiet and splash.
> >> * The issue doesn’t happen if a USB flash drive is connected. The content of
> >> the flash drive doesn’t appear to matter. The USB port doesn’t appear to
> >> matter.
> >> * No information in any logs after the reboot. I suspect the resume from
> >> suspend to RAM isn't getting far enough for logs to be written.
> >> * Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
> >> affected.
> >> * A kernel bisect has revealed the following commit as cause:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
> >> * The commit was part of kernel 6.0 (originally numbered 5.20) and has been backported to 5.15.
> >> * The commit mentions that a device-specific quirk could be added in case of
> >> new issues.
> >> * According to sysfs and lspci only device 0000:03:00.0 (Intel 8265 Wifi)
> >> has ASPM enabled by default.
> >> * Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to RAM
> >> succeed.
> >> * Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
> >> suspend to RAM succeed.
> >> * This would indicate that a quirk is missing for the device 0000:03:00.0
> >> (Intel 8265 Wifi) but I have no clue how to write such a quirk or how to get
> >> the specifics for such a quirk.
> >> * I still have no clue how a USB flash drive plays into all this. Maybe some
> >> kind of a timing issue where the connected USB flash drive delays something
> >> long enough so that the resume succeeds. Maybe the code removed by commit
> >> 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a similar delay. ¯\_(ツ)_/¯
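The sysfs check mentioned above (which devices have L1 ASPM enabled) could be scripted roughly as follows. It is a sketch under assumptions: `list_l1_aspm_enabled` and `SYSFS_ROOT` are hypothetical names, and it only inspects the `link/l1_aspm` attribute, so it complements rather than replaces the LnkCtl lines shown by `lspci -vv`.

```sh
# Hypothetical helper (not from the thread): print the PCI address of every
# device whose link/l1_aspm attribute reads 1. SYSFS_ROOT is a test-only
# override; it defaults to /sys.
list_l1_aspm_enabled() {
    for attr in "${SYSFS_ROOT:-/sys}"/bus/pci/devices/*/link/l1_aspm; do
        [ -r "$attr" ] || continue      # no match (literal glob) or unreadable
        if [ "$(cat "$attr")" = "1" ]; then
            dev=${attr%/link/l1_aspm}   # drop the attribute suffix
            echo "${dev##*/}"           # keep only the PCI address
        fi
    done
}
```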
> >
> > Hmmm. 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> > appeared in v6.0, released Oct 2, 2022, so it's been there a while.
> >
> > But I think the best option is to revert it until this issue is
> > resolved. Per the commit log, 08d0cc5f3426 solved two problems:
> >
> > 1) ASPM config changes done via sysfs are lost if the device power
> > state is changed, e.g., typically set to D3hot in .suspend() and
> > D0 in .resume().
> >
> > 2) If L1SS is restored during system resume, that restored state
> > would be overwritten.
> >
> > Problem 2) relates to a patch that is currently reverted (a7152be79b62
> > ("Revert "PCI/ASPM: Save L1 PM Substates Capability for
> > suspend/resume"")), so I don't think reverting 08d0cc5f3426 will make
> > this problem worse.
> >
> > Reverting 08d0cc5f3426 will make 1) a problem again. But my guess is
> > ASPM changes via sysfs are fairly unusual and the device probably
> > remains functional even though it may use more power because the ASPM
> > configuration was lost.
> >
> > So unless somebody has a counter-argument, I plan to queue a revert of
> > 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
> > v6.7.
> >
> > Bjorn
>
> If it helps, I could also test whether a partial revert of 08d0cc5f3426
> would be sufficient. This might also narrow down the issue and give more
> insight into where it originates.
>
> Let me know what you think.
Just wondering, does `echo 0 > /sys/power/pm_async` help?
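A minimal sketch for trying this suggestion, assuming the standard /sys/power/pm_async switch (1, the default, lets devices suspend and resume asynchronously; 0 handles them one after another). The `set_pm_async` helper and the `POWER_DIR` test override are hypothetical. Writing the real file requires root, and the value resets on reboot.

```sh
# Hypothetical helper (not from the thread): set /sys/power/pm_async to 0
# (serial device suspend/resume) or 1 (asynchronous, the default) and report
# the change. POWER_DIR is a test-only override; it defaults to /sys/power.
set_pm_async() {
    case $1 in
        0|1) ;;
        *) echo "usage: set_pm_async 0|1" >&2; return 1 ;;
    esac
    f="${POWER_DIR:-/sys/power}/pm_async"
    old=$(cat "$f") || return 1     # remember the previous value
    echo "$1" > "$f" || return 1    # needs root on a real system
    echo "pm_async: $old -> $1"
}
```

For example, run `set_pm_async 0` as root, then `systemctl suspend`, and check whether the resume succeeds.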
Kai-Heng
>
> Michael