lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20240725153446.GA841157@bhelgaas>
Date: Thu, 25 Jul 2024 10:34:46 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: Heiner Kallweit <hkallweit1@...il.com>
Cc: George-Daniel Matei <danielgeorgem@...omium.org>,
	Bjorn Helgaas <bhelgaas@...gle.com>, linux-kernel@...r.kernel.org,
	linux-pci@...r.kernel.org, nic_swsd@...ltek.com,
	netdev@...r.kernel.org, "Rafael J. Wysocki" <rjw@...ysocki.net>
Subject: Re: [PATCH] PCI: r8169: add suspend/resume aspm quirk

[+cc Rafael in case you have suspend debug help]

On Tue, Jul 16, 2024 at 09:25:40PM +0200, Heiner Kallweit wrote:
> On 16.07.2024 14:13, George-Daniel Matei wrote:
> > On Thu, Jul 11, 2024 at 7:45 AM Heiner Kallweit <hkallweit1@...il.com> wrote:
> >> On 10.07.2024 17:09, George-Daniel Matei wrote:
> >>>>> Added aspm suspend/resume hooks that run
> >>>>> before and after suspend and resume to change
> >>>>> the ASPM states of the PCI bus in order to allow
> >>>>> the system suspend while trying to prevent card hangs
> >>>>
> >>>> Why is this needed?  Is there a r8169 defect we're working around?
> >>>> A BIOS defect?  Is there a problem report you can reference here?
> >>>
> >>> We encountered this issue while upgrading from kernel v6.1 to v6.6.
> >>> The system would not suspend with 6.6. We tracked down the problem to
> >>> the NIC of the device, mainly that the following code was removed in
> >>> 6.6:
> >>>> else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
> >>>>         rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);1
> >>
> >> With this (older) 6.1 version everything is ok?
> >> Would mean that L1.1 is active and the system suspends (STR?) properly
> >> also with L1.1 being active.
> >>
> > Yes, with 6.1 everything was ok. L1 was active and just the L1.1 substate
> > was enabled, L1.2 was disabled.
> > 
> >> Under 6.6 per default L1 (incl. sub-states) is disabled.
> >> Then you manually enable L1 (incl. L1.1, but not L1.2?) via sysfs,
> >> and now the system hangs on suspend?
> >>
> > Yes, in 6.6 L1 (+substates) is disabled. Like Bjorn mentioned, I
> > think that is because of 90ca51e8c654 ("r8169:
> > fix ASPM-related issues on a number of systems with NIC version from
> > RTL8168h". With L1 disabled the system would not suspend so I enabled
> > back L1 along with just L1.1 substate through sysfs, just to test, and
> > saw that the system could
> 
> It still sounds very weird that a system does not suspend to ram
> just because ASPM L1 is disabled for a single device.
> What if a PCI device is used which doesn't support ASPM?
> 
> Which subsystem fails to suspend? Can you provide a log showing
> the suspend error?

Can we push on this a little bit?  The fact that suspend fails is
super interesting to me.  I'd like to know exactly how this fails and
whether it's in the kernel or in firmware.  If we're violating some
assumption firmware is making, maybe there would be a more generic
fix.

How exactly do you suspend?  Is there any debugging output you can
collect while doing that?  Maybe [1] has hints.  I see a bunch of
trace_suspend_resume() calls, and I think they're connected to [2],
which looks like it might generate console/dmesg output, but I don't
know how to enable that.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/power/basic-pm-debugging.rst?id=v6.10
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/trace/events/power.h?id=v6.10#n247

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ