Date: Mon, 1 Jan 2024 19:57:40 +0100
From: Michael Schaller <michael@...aller.de>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: bhelgaas@...gle.com, kai.heng.feng@...onical.com,
 linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org,
 regressions@...ts.linux.dev, macro@...am.me.uk, ajayagarwal@...gle.com,
 sathyanarayanan.kuppuswamy@...ux.intel.com, gregkh@...uxfoundation.org,
 hkallweit1@...il.com, michael.a.bottini@...ux.intel.com,
 johan+linaro@...nel.org
Subject: Re: [Regression] [PCI/ASPM] [ASUS PN51] Reboot on resume attempt
 (bisect done; commit found)

On 01.01.24 19:13, Bjorn Helgaas wrote:
> On Mon, Dec 25, 2023 at 07:29:02PM +0100, Michael Schaller wrote:
>> Issue:
>> On resume from suspend to RAM there is no output for about 12 seconds, then
>> shortly a blinking cursor is visible in the upper left corner on an
>> otherwise black screen which is followed by a reboot.
>>
>> Setup:
>> * Machine: ASUS mini PC PN51-BB757MDE1 (DMI model: MINIPC PN51-E1)
>> * Firmware: 0508 (latest; also tested previous 0505)
>> * OS: Ubuntu 23.10 (except kernel)
>> * Kernel: 6.6.8 (also tested 6.7-rc7; config attached)
>>
>> Debugging summary:
>> * Kernel 5.10.205 isn’t affected.
>> * Bisect identified commit 08d0cc5f34265d1a1e3031f319f594bd1970976c as
>> cause.
>> * PCI device 0000:03:00.0 (Intel 8265 Wifi) causes resume issues as long as
>> ASPM is enabled (default).
>> * The commit message indicates that a quirk could be written to mitigate the
>> issue but I don’t know how to write such a quirk.
>>
>> Confirmed workarounds:
>> * Connect a USB flash drive (no clue why; maybe this causes a delay that
>> lets the resume succeed)
>> * Revert commit 08d0cc5f34265d1a1e3031f319f594bd1970976c (commit seemed
>> intentional; a quirk seems to be the preferred solution)
>> * pcie_aspm=off
>> * pcie_aspm.policy=performance
>> * echo 0 | sudo tee /sys/bus/pci/devices/0000:03:00.0/link/l1_aspm
>>
>> Debugging details:
>> * The resume trigger (power button, keyboard, mouse) doesn’t seem to make
>> any difference.
>> * Double checked that the kernel is configured to *not* reboot on panic.
>> * Double checked that there still isn't any kernel output without quiet and
>> splash.
>> * The issue doesn’t happen if a USB flash drive is connected. The content of
>> the flash drive doesn’t appear to matter. The USB port doesn’t appear to
>> matter.
>> * No information in any logs after the reboot. I suspect the resume from
>> suspend to RAM isn’t getting far enough for logs to be written.
>> * Kernel 5.10.205 isn’t affected. Kernel 5.15.145, 6.6.8 and 6.7-rc7 are
>> affected.
>> * A kernel bisect has revealed the following commit as cause:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08d0cc5f34265d1a1e3031f319f594bd1970976c
>> * The commit was part of kernel 5.20 and has been backported to 5.15.
>> * The commit mentions that a device-specific quirk could be added in case of
>> new issues.
>> * According to sysfs and lspci only device 0000:03:00.0 (Intel 8265 Wifi)
>> has ASPM enabled by default.
>> * Disabling ASPM for device 0000:03:00.0 lets the resume from suspend to RAM
>> succeed.
>> * Enabling ASPM for all devices except 0000:03:00.0 lets the resume from
>> suspend to RAM succeed.
>> * This would indicate that a quirk is missing for the device 0000:03:00.0
>> (Intel 8265 Wifi) but I have no clue how to write such a quirk or how to get
>> the specifics for such a quirk.
>> * I still have no clue how a USB flash drive plays into all this. Maybe some
>> kind of a timing issue where the connected USB flash drive delays something
>> long enough so that the resume succeeds. Maybe the code removed by commit
>> 08d0cc5f34265d1a1e3031f319f594bd1970976c caused a similar delay. ¯\_(ツ)_/¯
> 
> Hmmm.  08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()")
> appeared in v6.0, released Oct 2, 2022, so it's been there a while.
> 
> But I think the best option is to revert it until this issue is
> resolved.  Per the commit log, 08d0cc5f3426 solved two problems:
> 
>    1) ASPM config changes done via sysfs are lost if the device power
>       state is changed, e.g., typically set to D3hot in .suspend() and
>       D0 in .resume().
> 
>    2) If L1SS is restored during system resume, that restored state
>       would be overwritten.
> 
> Problem 2) relates to a patch that is currently reverted (a7152be79b62
> ("Revert "PCI/ASPM: Save L1 PM Substates Capability for
> suspend/resume"")), so I don't think reverting 08d0cc5f3426 will make
> this problem worse.
> 
> Reverting 08d0cc5f3426 will make 1) a problem again.  But my guess is
> ASPM changes via sysfs are fairly unusual and the device probably
> remains functional even though it may use more power because the ASPM
> configuration was lost.
> 
> So unless somebody has a counter-argument, I plan to queue a revert of
> 08d0cc5f3426 ("PCI/ASPM: Remove pcie_aspm_pm_state_change()") for
> v6.7.
> 
> Bjorn

If it helps, I could also test whether a partial revert of 08d0cc5f3426
would be sufficient. This might also narrow down the issue and give more
insight into where it originates.
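
To compare ASPM state between test kernels during such a partial-revert
test, something like the following could be run before each suspend. This is
only a minimal sketch: the function name dump_l1_aspm and its optional
directory argument are hypothetical, and it assumes nothing beyond the
per-device link/l1_aspm sysfs attribute already used in the workaround
quoted above.

```shell
# Print "<device> <value>" for every PCI device that exposes link/l1_aspm.
# The optional first argument (hypothetical, mainly for testing) overrides
# the sysfs device directory; the default is /sys/bus/pci/devices.
dump_l1_aspm() {
    root="${1:-/sys/bus/pci/devices}"
    for attr in "$root"/*/link/l1_aspm; do
        # Skip the unexpanded glob when no device exposes the attribute.
        [ -r "$attr" ] || continue
        dev="${attr%/link/l1_aspm}"
        printf '%s %s\n' "${dev##*/}" "$(cat "$attr")"
    done
}
```

Calling dump_l1_aspm without arguments on each kernel and diffing the
output should show whether the L1 setting of 0000:03:00.0 differs between a
full and a partial revert.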

Let me know what you think.

Michael
