lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CACfW=qrCrXM6Et=Yafug00pbYZzifhVGLhdLMsdYiYXSh=tGFA@mail.gmail.com>
Date: Tue, 16 Jul 2024 14:13:14 +0200
From: George-Daniel Matei <danielgeorgem@...omium.org>
To: Heiner Kallweit <hkallweit1@...il.com>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>, linux-kernel@...r.kernel.org, 
	linux-pci@...r.kernel.org, nic_swsd@...ltek.com, netdev@...r.kernel.org, 
	Bjorn Helgaas <helgaas@...nel.org>
Subject: Re: [PATCH] PCI: r8169: add suspend/resume aspm quirk

On Thu, Jul 11, 2024 at 7:45 AM Heiner Kallweit <hkallweit1@...il.com> wrote:
>
> On 10.07.2024 17:09, George-Daniel Matei wrote:
> > Hi,
> >
> >>> Added aspm suspend/resume hooks that run
> >>> before and after suspend and resume to change
> >>> the ASPM states of the PCI bus in order to allow
> >>> the system suspend while trying to prevent card hangs
> >>
> >> Why is this needed?  Is there a r8169 defect we're working around?
> >> A BIOS defect?  Is there a problem report you can reference here?
> >>
> >
> > We encountered this issue while upgrading from kernel v6.1 to v6.6.
> > The system would not suspend with 6.6. We tracked down the problem to
> > the NIC of the device, mainly that the following code was removed in
> > 6.6:
> >> else if (tp->mac_version >= RTL_GIGA_MAC_VER_46)
> >>         rc = pci_disable_link_state(pdev, PCIE_LINK_STATE_L1_2);1
>
> With this (older) 6.1 version everything is ok?
> Would mean that L1.1 is active and the system suspends (STR?) properly
> also with L1.1 being active.
>
Yes, with 6.1 everything was ok. L1 was active and just the L1.1 substate
was enabled, L1.2 was disabled.

> Under 6.6 per default L1 (incl. sub-states) is disabled.
> Then you manually enable L1 (incl. L1.1, but not L1.2?) via sysfs,
> and now the system hangs on suspend?
>
Yes, in 6.6 L1 (+substates) is disabled. Like Bjorn mentioned, I
think that is because of 90ca51e8c654 ("r8169:
fix ASPM-related issues on a number of systems with NIC version from
RTL8168h". With L1 disabled the system would not suspend so I enabled
back L1 along with just L1.1 substate through sysfs, just to test, and
saw that the system could
suspend again.  L1 is disabled by default for a reason, that's because
it could cause tx timeouts. So to try to work around the possible timeouts
I thought  of changing the ASPM states before suspending and then
restoring on resume.

> Is this what you're saying? Would be strange because in both cases
> L1.1 is active when suspending.
>
>
> > For the listed devices, ASPM L1 is disabled entirely in 6.6. As for
> > the reason, L1 was observed to cause some problems
> > (https://bugzilla.kernel.org/show_bug.cgi?id=217814). We use a Raptor
> > Lake soc and it won't change residency if the NIC doesn't have L1
> > enabled. I saw in 6.1 the following comment:
> >> Chips from RTL8168h partially have issues with L1.2, but seem
> >> to work fine with L1 and L1.1.
> > I was thinking that disabling/enabling L1.1 on the fly before/after
> > suspend could help mitigate the risk associated with L1/L1.1 . I know
> > that ASPM settings are exposed in sysfs and that this could be done
> > from outside the kernel, that was my first approach, but it was
> > suggested to me that this kind of workaround would be better suited
> > for quirks. I did around 1000 suspend/resume cycles of 16-30 seconds
> > each (correcting the resume dev->bus->self being configured twice
> > mistake) and did not notice any problems. What do you think, is this a
> > good approach ... ?
> >
> >>> +             //configure device
> >>> +             pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
> >>> +                                                PCI_EXP_LNKCTL_ASPMC, 0);
> >>> +
> >>> +             pci_read_config_word(dev->bus->self,
> >>> +                                  dev->bus->self->l1ss + PCI_L1SS_CTL1,
> >>> +                                  &val);
> >>> +             val = val & ~PCI_L1SS_CTL1_L1SS_MASK;
> >>> +             pci_write_config_word(dev->bus->self,
> >>> +                                   dev->bus->self->l1ss + PCI_L1SS_CTL1,
> >>> +                                   val);
> >> Updates the parent (dev->bus->self) twice; was the first one supposed
> >> to update the device (dev)?
> > Yes, it was supposed to update the device (dev). It's my first time
> > sending a patch and I messed something up while doing some style
> > changes, I will correct it. I'm sorry for that.
> >
> >> This doesn't restore the state as it existed before suspend.  Does
> >> this rely on other parts of restore to do that?
> > It operates on the assumption that after driver initialization
> > PCI_EXP_LNKCTL_ASPMC is 0 and that there are no states enabled in
> > CTL1. I did a lspci -vvv dump on the affected devices before and after
> > the quirks ran and saw no difference. This could be improved.
> >
> >> What is the RTL8168 chip version used on these systems?
> > It should be RTL8111H.
> >
> >> What's the root cause of the issue?
> >> A silicon bug on the host side?
> > I think it's the ASPM implementation of the soc.
> >
> >> ASPM L1 is disabled per default in r8169. So why is the patch needed
> >> at all?
> > Leaving it disabled all the time prevents the system from suspending.
> >
> > Thank you,
> > George-Daniel Matei
> >
> >
> >
> >
> >
> > On Tue, Jul 9, 2024 at 12:15 AM Heiner Kallweit <hkallweit1@...il.com> wrote:
> >>
> >> On 08.07.2024 19:23, Bjorn Helgaas wrote:
> >>> [+cc r8169 folks]
> >>>
> >>> On Mon, Jul 08, 2024 at 03:38:15PM +0000, George-Daniel Matei wrote:
> >>>> Added aspm suspend/resume hooks that run
> >>>> before and after suspend and resume to change
> >>>> the ASPM states of the PCI bus in order to allow
> >>>> the system suspend while trying to prevent card hangs
> >>>
> >>> Why is this needed?  Is there a r8169 defect we're working around?
> >>> A BIOS defect?  Is there a problem report you can reference here?
> >>>
> >>
> >> Basically the same question from my side. Apparently such a workaround
> >> isn't needed on any other system. And Realtek NICs can be found on more
> >> or less every consumer system. What's the root cause of the issue?
> >> A silicon bug on the host side?
> >>
> >> What is the RTL8168 chip version used on these systems?
> >>
> >> ASPM L1 is disabled per default in r8169. So why is the patch needed
> >> at all?
> >>
> >>> s/Added/Add/
> >>>
> >>> s/aspm/ASPM/ above
> >>>
> >>> s/PCI bus/device and parent/
> >>>
> >>> Add period at end of sentence.
> >>>
> >>> Rewrap to fill 75 columns.
> >>>
> >>>> Signed-off-by: George-Daniel Matei <danielgeorgem@...omium.org>
> >>>> ---
> >>>>  drivers/pci/quirks.c | 142 +++++++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 142 insertions(+)
> >>>>
> >>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> >>>> index dc12d4a06e21..aa3dba2211d3 100644
> >>>> --- a/drivers/pci/quirks.c
> >>>> +++ b/drivers/pci/quirks.c
> >>>> @@ -6189,6 +6189,148 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56b0, aspm_l1_acceptable_latency
> >>>>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56b1, aspm_l1_acceptable_latency);
> >>>>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c0, aspm_l1_acceptable_latency);
> >>>>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x56c1, aspm_l1_acceptable_latency);
> >>>> +
> >>>> +static const struct dmi_system_id chromebox_match_table[] = {
> >>>> +    {
> >>>> +            .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Brask"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +    {
> >>>> +            .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Aurash"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +            {
> >>>> +            .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Bujia"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +    {
> >>>> +            .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Gaelin"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +    {
> >>>> +            .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Gladios"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +    {
> >>>> +            .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Hahn"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +    {
> >>>> +            .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Jeev"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +    {
> >>>> +            .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Kinox"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +    {
> >>>> +            .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Kuldax"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +    {
> >>>> +            .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Lisbon"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +    {
> >>>> +                    .matches = {
> >>>> +                    DMI_MATCH(DMI_PRODUCT_NAME, "Moli"),
> >>>> +                    DMI_MATCH(DMI_BIOS_VENDOR, "coreboot"),
> >>>> +            }
> >>>> +    },
> >>>> +    { }
> >>>> +};
> >>>> +
> >>>> +static void rtl8169_suspend_aspm_settings(struct pci_dev *dev)
> >>>> +{
> >>>> +    u16 val = 0;
> >>>> +
> >>>> +    if (dmi_check_system(chromebox_match_table)) {
> >>>> +            //configure parent
> >>>> +            pcie_capability_clear_and_set_word(dev->bus->self,
> >>>> +                                               PCI_EXP_LNKCTL,
> >>>> +                                               PCI_EXP_LNKCTL_ASPMC,
> >>>> +                                               PCI_EXP_LNKCTL_ASPM_L1);
> >>>> +
> >>>> +            pci_read_config_word(dev->bus->self,
> >>>> +                                 dev->bus->self->l1ss + PCI_L1SS_CTL1,
> >>>> +                                 &val);
> >>>> +            val = (val & ~PCI_L1SS_CTL1_L1SS_MASK) |
> >>>> +                  PCI_L1SS_CTL1_PCIPM_L1_2 | PCI_L1SS_CTL1_PCIPM_L1_2 |
> >>>> +                  PCI_L1SS_CTL1_ASPM_L1_1;
> >>>> +            pci_write_config_word(dev->bus->self,
> >>>> +                                  dev->bus->self->l1ss + PCI_L1SS_CTL1,
> >>>> +                                  val);
> >>>> +
> >>>> +            //configure device
> >>>> +            pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
> >>>> +                                               PCI_EXP_LNKCTL_ASPMC,
> >>>> +                                               PCI_EXP_LNKCTL_ASPM_L1);
> >>>> +
> >>>> +            pci_read_config_word(dev, dev->l1ss + PCI_L1SS_CTL1, &val);
> >>>> +            val = (val & ~PCI_L1SS_CTL1_L1SS_MASK) |
> >>>> +                  PCI_L1SS_CTL1_PCIPM_L1_2 | PCI_L1SS_CTL1_PCIPM_L1_2 |
> >>>> +                  PCI_L1SS_CTL1_ASPM_L1_1;
> >>>> +            pci_write_config_word(dev, dev->l1ss + PCI_L1SS_CTL1, val);
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +DECLARE_PCI_FIXUP_SUSPEND(PCI_VENDOR_ID_REALTEK, 0x8168,
> >>>> +                      rtl8169_suspend_aspm_settings);
> >>>> +
> >>>> +static void rtl8169_resume_aspm_settings(struct pci_dev *dev)
> >>>> +{
> >>>> +    u16 val = 0;
> >>>> +
> >>>> +    if (dmi_check_system(chromebox_match_table)) {
> >>>> +            //configure device
> >>>> +            pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
> >>>> +                                               PCI_EXP_LNKCTL_ASPMC, 0);
> >>>> +
> >>>> +            pci_read_config_word(dev->bus->self,
> >>>> +                                 dev->bus->self->l1ss + PCI_L1SS_CTL1,
> >>>> +                                 &val);
> >>>> +            val = val & ~PCI_L1SS_CTL1_L1SS_MASK;
> >>>> +            pci_write_config_word(dev->bus->self,
> >>>> +                                  dev->bus->self->l1ss + PCI_L1SS_CTL1,
> >>>> +                                  val);
> >>>> +
> >>>> +            //configure parent
> >>>> +            pcie_capability_clear_and_set_word(dev->bus->self,
> >>>> +                                               PCI_EXP_LNKCTL,
> >>>> +                                               PCI_EXP_LNKCTL_ASPMC, 0);
> >>>> +
> >>>> +            pci_read_config_word(dev->bus->self,
> >>>> +                                 dev->bus->self->l1ss + PCI_L1SS_CTL1,
> >>>> +                                 &val);
> >>>> +            val = val & ~PCI_L1SS_CTL1_L1SS_MASK;
> >>>> +            pci_write_config_word(dev->bus->self,
> >>>> +                                  dev->bus->self->l1ss + PCI_L1SS_CTL1,
> >>>> +                                  val);
> >>>
> >>> Updates the parent (dev->bus->self) twice; was the first one supposed
> >>> to update the device (dev)?
> >>>
> >>> This doesn't restore the state as it existed before suspend.  Does
> >>> this rely on other parts of restore to do that?
> >>>
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_REALTEK, 0x8168,
> >>>> +                     rtl8169_resume_aspm_settings);
> >>>>  #endif
> >>>>
> >>>>  #ifdef CONFIG_PCIE_DPC
> >>>> --
> >>>> 2.45.2.803.g4e1b14247a-goog
> >>>>
> >>
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ