lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <39b40ef8-b036-427e-9deb-b25bda61cb37@gmail.com>
Date: Thu, 25 Jul 2024 21:46:32 +0200
From: Heiner Kallweit <hkallweit1@...il.com>
To: George-Daniel Matei <danielgeorgem@...omium.org>,
 Bjorn Helgaas <helgaas@...nel.org>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>, linux-kernel@...r.kernel.org,
 linux-pci@...r.kernel.org, nic_swsd@...ltek.com, netdev@...r.kernel.org
Subject: Re: [PATCH] PCI: r8169: add suspend/resume aspm quirk

On 25.07.2024 14:56, George-Daniel Matei wrote:
> On Wed, Jul 10, 2024 at 11:38 PM Bjorn Helgaas <helgaas@...nel.org> wrote:
>>
>> On Wed, Jul 10, 2024 at 05:09:08PM +0200, George-Daniel Matei wrote:
>>>>> Added aspm suspend/resume hooks that run
>>>>> before and after suspend and resume to change
>>>>> the ASPM states of the PCI bus in order to allow
>>>>> the system suspend while trying to prevent card hangs
>>>>
>>>> Why is this needed?  Is there a r8169 defect we're working around?
>>>> A BIOS defect?  Is there a problem report you can reference here?
>>>
>>> We encountered this issue while upgrading from kernel v6.1 to v6.6.
>>> The system would not suspend with 6.6. We tracked down the problem to
>>> the NIC of the device, mainly that the following code was removed in
>>> 6.6:
>>>
>>>> else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
>>>>         rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);
>>>
>>> For the listed devices, ASPM L1 is disabled entirely in 6.6. As for
>>> the reason, L1 was observed to cause some problems
>>> (https://bugzilla.kernel.org/show_bug.cgi?id=217814). We use a Raptor
>>> Lake soc and it won't change residency if the NIC doesn't have L1
>>> enabled. I saw in 6.1 the following comment:
>>
>> Can you verify that the problem still exists in a current kernel,
>> e.g., v6.9?
>>
> I tested it with v6.9, still the same problem.
> 
>> If this is a regression that's still present in v6.9, we need to
>> identify the commit that broke it.  Maybe it's 90ca51e8c654 ("r8169:
>> fix ASPM-related issues on a number of systems with NIC version from
>> RTL8168h")?
>>
> I also tried v6.9 with 90ca51e8c654 reverted and it works ok.
> 
>>>> Chips from RTL8168h partially have issues with L1.2, but seem
>>>> to work fine with L1 and L1.1.
>>>
>>> I was thinking that disabling/enabling L1.1 on the fly before/after
>>> suspend could help mitigate the risk associated with L1/L1.1 . I know
>>> that ASPM settings are exposed in sysfs and that this could be done
>>> from outside the kernel, that was my first approach, but it was
>>> suggested to me that this kind of workaround would be better suited
>>> for quirks. I did around 1000 suspend/resume cycles of 16-30 seconds
>>> each (correcting the resume dev->bus->self being configured twice
>>> mistake) and did not notice any problems. What do you think, is this a
>>> good approach ... ?
>>
>> Whatever the problem is, it definitely should be fixed in the kernel,
>> and Ilpo is right that it *should* be done in the PCI core ASPM
>> support (aspm.c) or at least with interfaces it supplies.
>>
> The problem is actually the system not being able to reach
> depper power saving states without certain ASPM states enabled.
> It was mentioned in the other thread replies that this kind of problem
> has been reported several times in the past.
> 
This is a known side effect of disabling ASPM L1 per default.
There isn't really something broken, the system just consumes some more
power. If this is an issue for you and ASPM L1 causes no issues on your
system, you can use sysfs to enable ASPM L1.

The general issue is that there are at least hundreds of combinations of
RTL8168 NICs, host chipsets, and BIOS versions. Several of these combinations
cause serious issues if ASPM L1 is enabled. It's just unknown which ones,
and why. Therefore ASPM L1 is disabled per default.


>> Generally speaking, drivers should not need to touch ASPM at all
>> except to work around hardware defects in their device, but r8169 has
>> a long history of weird ASPM stuff.  I dunno if that stuff is related
>> to hardware defects in the r8169 devices or if it is workarounds for
>> past or current defects in aspm.c.
>>
> What would be a good approach to move forward with this issue to
> get a fix approved?
> 
Propose a patch which fixes your power consumption issue and is guaranteed
not to have negative impact on any other user.

> Make a general version of this toggle workaround in the aspm core
> that would be controllable & configurable for each pci device individually?
> Keep the quirks and fix the aforementioned comments?
> 
>>>> This doesn't restore the state as it existed before suspend.  Does
>>>> this rely on other parts of restore to do that?
>>>
>>> It operates on the assumption that after driver initialization
>>> PCI_EXP_LNKCTL_ASPMC is 0 and that there are no states enabled in
>>> CTL1. I did a lspci -vvv dump on the affected devices before and after
>>> the quirks ran and saw no difference. This could be improved.
>>
>> Yep, we can't assume any of that because the PCI core owns ASPM
>> config, not the driver itself.
>>
>>>> What's the root cause of the issue?
>>>> A silicon bug on the host side?
>>>
>>> I think it's the ASPM implementation of the soc.
>>
>> As Heiner pointed out, if it's a SoC defect, it would potentially
>> affect all devices and a workaround would have to cover them all.
>>
>> Side note: oops, quoting error below, see note about top-posting here:
>> https://people.kernel.org/tglx/notes-about-netiquette
>>
>>> On Tue, Jul 9, 2024 at 12:15 AM Heiner Kallweit <hkallweit1@...il.com> wrote:
>>>>
>>>> On 08.07.2024 19:23, Bjorn Helgaas wrote:
>>>>> [+cc r8169 folks]
>>>>>
>>>>> On Mon, Jul 08, 2024 at 03:38:15PM +0000, George-Daniel Matei wrote:
>>>>>> Added aspm suspend/resume hooks that run
>>>>>> before and after suspend and resume to change
>>>>>> the ASPM states of the PCI bus in order to allow
>>>>>> the system suspend while trying to prevent card hangs
>>>>>
>>>>> Why is this needed?  Is there a r8169 defect we're working around?
>>>>> A BIOS defect?  Is there a problem report you can reference here?
>>> ...


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ