Message-ID: <CAPpJ_edZmmoWHFrT8XMLmtTPuFEgNXkEQ1OJdLOVnf38uGfO=A@mail.gmail.com>
Date: Wed, 4 Sep 2024 11:56:23 +0800
From: Jian-Hong Pan <jhp@...lessos.org>
To: Nirmal Patel <nirmal.patel@...ux.intel.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>,
"David E. Box" <david.e.box@...ux.intel.com>, Bjorn Helgaas <helgaas@...nel.org>,
Johan Hovold <johan@...nel.org>,
Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@...ux.intel.com>,
Mika Westerberg <mika.westerberg@...ux.intel.com>, Damien Le Moal <dlemoal@...nel.org>,
Jonathan Derrick <jonathan.derrick@...ux.dev>,
Paul M Stillwell Jr <paul.m.stillwell.jr@...el.com>, linux-pci@...r.kernel.org,
LKML <linux-kernel@...r.kernel.org>, linux@...lessos.org
Subject: Re: [PATCH v8 4/4] PCI/ASPM: Fix L1.2 parameters when enable link state
Nirmal Patel <nirmal.patel@...ux.intel.com> wrote on Tue, Sep 3, 2024 at 11:17 PM:
>
> On Mon, 12 Aug 2024 16:18:22 +0800
> Jian-Hong Pan <jhp@...lessos.org> wrote:
>
> > Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com> wrote on Thu, Aug 8, 2024 at 5:49 PM:
> > >
> > > On Wed, 7 Aug 2024, David E. Box wrote:
> > >
> > > > On Wed, 2024-08-07 at 14:18 +0300, Ilpo Järvinen wrote:
> > > > > On Wed, 7 Aug 2024, Jian-Hong Pan wrote:
> > > > >
> > > > > > > David E. Box <david.e.box@...ux.intel.com> wrote on Tue, Aug 6, 2024 at 4:26 AM:
> > > > > > >
> > > > > > > Hi Jian-Hong,
> > > > > > >
> > > > > > > On Fri, 2024-08-02 at 16:24 +0800, Jian-Hong Pan wrote:
> > > > > > > > > Jian-Hong Pan <jhp@...lessos.org> wrote on Fri, Jul 19, 2024 at 4:04 PM:
> > > > > > > > >
> > > > > > > > > > Currently, when enabling a link's L1.2 features with
> > > > > > > > > > __pci_enable_link_state(), the kernel configures the link
> > > > > > > > > > directly without ensuring that the related L1.2 parameters,
> > > > > > > > > > such as T_POWER_ON, Common_Mode_Restore_Time, and
> > > > > > > > > > LTR_L1.2_THRESHOLD, have been programmed.
> > > > > > > > > >
> > > > > > > > > > As a result, the L1.2 link between the PCIe Root Port and
> > > > > > > > > > the child device gets a wrong configuration when a caller
> > > > > > > > > > tries to enable it.
> > > > > > > > > >
> > > > > > > > > > Here is a failing example on an ASUS B1400CEAE with VMD
> > > > > > > > > > enabled:
> > > > > > > > >
> > > > > > > > > > 10000:e0:06.0 PCI bridge: Intel Corporation 11th Gen Core Processor PCIe Controller (rev 01) (prog-if 00 [Normal decode])
> > > > > > > > > > ...
> > > > > > > > > > Capabilities: [200 v1] L1 PM Substates
> > > > > > > > > > L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
> > > > > > > > > > PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > > > > > > > > > L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > > > T_CommonMode=45us LTR1.2_Threshold=101376ns
> > > > > > > > > > L1SubCtl2: T_PwrOn=50us
> > > > > > > > >
> > > > > > > > > > 10000:e1:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN550 NVMe SSD (rev 01) (prog-if 02 [NVM Express])
> > > > > > > > > > ...
> > > > > > > > > > Capabilities: [900 v1] L1 PM Substates
> > > > > > > > > > L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
> > > > > > > > > > PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > > > > > > > > > L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > > > > > > > > > T_CommonMode=0us LTR1.2_Threshold=0ns
> > > > > > > > > > L1SubCtl2: T_PwrOn=10us
> > > > > > > > >
> > > > > > > > > According to "PCIe r6.0, sec 5.5.4", before enabling
> > > > > > > > > ASPM L1.2 on the PCIe
> > > > > > > > > Root Port and the child NVMe, they should be programmed
> > > > > > > > > with the same LTR1.2_Threshold value. However, they
> > > > > > > > > have different values in this case.
> > > > > > > > >
> > > > > > > > > > Invoke aspm_calc_l12_info() to program the L1.2
> > > > > > > > > > parameters properly before enabling the L1.2 bits of
> > > > > > > > > > the L1 PM Substates Control Register in
> > > > > > > > > > __pci_enable_link_state().
> > > > > > > > >
> > > > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=218394
> > > > > > > > > Signed-off-by: Jian-Hong Pan <jhp@...lessos.org>
> > > > > > > > > ---
> > > > > > > > > v2:
> > > > > > > > > - Prepare the PCIe LTR parameters before enable L1
> > > > > > > > > Substates
> > > > > > > > >
> > > > > > > > > v3:
> > > > > > > > > - Only enable supported features for the L1 Substates
> > > > > > > > > part
> > > > > > > > >
> > > > > > > > > v4:
> > > > > > > > > - Focus on fixing L1.2 parameters, instead of
> > > > > > > > > re-initializing whole L1SS
> > > > > > > > >
> > > > > > > > > v5:
> > > > > > > > > - Fix typo and commit message
> > > > > > > > > - Split introducing aspm_get_l1ss_cap() to "PCI/ASPM:
> > > > > > > > > Introduce aspm_get_l1ss_cap()"
> > > > > > > > >
> > > > > > > > > v6:
> > > > > > > > > - Skipped
> > > > > > > > >
> > > > > > > > > v7:
> > > > > > > > > - Pick back and rebase on the new version kernel
> > > > > > > > > - Drop the link state flag check. And, always config
> > > > > > > > > link state's timing
> > > > > > > > > parameters
> > > > > > > > >
> > > > > > > > > v8:
> > > > > > > > > > - Because pcie_aspm_get_link() might return NULL, get
> > > > > > > > > > the link's parent and child devices only after checking
> > > > > > > > > > that the link is not NULL. This avoids a NULL pointer
> > > > > > > > > > dereference.
> > > > > > > > >
> > > > > > > > > drivers/pci/pcie/aspm.c | 15 +++++++++++++++
> > > > > > > > > 1 file changed, 15 insertions(+)
> > > > > > > > >
> > > > > > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
> > > > > > > > > > index 5db1044c9895..55ff1d26fcea 100644
> > > > > > > > > > --- a/drivers/pci/pcie/aspm.c
> > > > > > > > > > +++ b/drivers/pci/pcie/aspm.c
> > > > > > > > > > @@ -1411,9 +1411,15 @@ EXPORT_SYMBOL(pci_disable_link_state);
> > > > > > > > > > static int __pci_enable_link_state(struct pci_dev *pdev, int state, bool locked)
> > > > > > > > > > {
> > > > > > > > > > struct pcie_link_state *link = pcie_aspm_get_link(pdev);
> > > > > > > > > + u32 parent_l1ss_cap, child_l1ss_cap;
> > > > > > > > > + struct pci_dev *parent, *child;
> > > > > > > > >
> > > > > > > > > if (!link)
> > > > > > > > > return -EINVAL;
> > > > > > > > > +
> > > > > > > > > + parent = link->pdev;
> > > > > > > > > + child = link->downstream;
> > > > > > > > > +
> > > > > > > > > /*
> > > > > > > > > * A driver requested that ASPM be enabled on
> > > > > > > > > this device, but
> > > > > > > > > * if we don't have permission to manage ASPM
> > > > > > > > > (e.g., on ACPI @@ -1428,6 +1434,15 @@ static int
> > > > > > > > > __pci_enable_link_state(struct pci_dev
> > > > > > > > > *pdev, int state, bool locked)
> > > > > > > > > if (!locked)
> > > > > > > > > down_read(&pci_bus_sem);
> > > > > > > > > mutex_lock(&aspm_lock);
> > > > > > > > > + /*
> > > > > > > > > + * Ensure L1.2 parameters:
> > > > > > > > > Common_Mode_Restore_Times, T_POWER_ON and
> > > > > > > > > + * LTR_L1.2_THRESHOLD are programmed properly
> > > > > > > > > before enable bits for
> > > > > > > > > + * L1.2, per PCIe r6.0, sec 5.5.4.
> > > > > > > > > + */
> > > > > > > > > + parent_l1ss_cap = aspm_get_l1ss_cap(parent);
> > > > > > > > > + child_l1ss_cap = aspm_get_l1ss_cap(child);
> > > > > > > > > + aspm_calc_l12_info(link, parent_l1ss_cap,
> > > > > > > > > child_l1ss_cap);
> > > > > > >
> > > > > > > > I still don't think this is the place to recalculate the
> > > > > > > > L1.2 parameters, especially when we know the calculation was
> > > > > > > > done but was cleared by pci_bus_reset(). Can't we just do a
> > > > > > > > pci_save/restore_state() before/after pci_bus_reset() in
> > > > > > > > vmd.c?
> > > > > >
> > > > > > > I had not thought about using pci_save/restore_state() around
> > > > > > > pci_bus_reset() before. It is an interesting direction.
> > > > > > >
> > > > > > > So, I prepared the modification below for testing, together
> > > > > > > with "[PATCH v8 1/4] PCI: vmd: Set PCI devices to D0 before
> > > > > > > enable PCI PM's L1 substates". With both applied, the PCIe
> > > > > > > bridge and the PCIe device have the same LTR_L1.2_THRESHOLD
> > > > > > > of 101376ns, as expected.
> > > > > >
> > > > > > > diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> > > > > > > index bbf4a47e7b31..6b8dd4f30127 100644
> > > > > > > --- a/drivers/pci/controller/vmd.c
> > > > > > > +++ b/drivers/pci/controller/vmd.c
> > > > > > > @@ -727,6 +727,18 @@ static void vmd_copy_host_bridge_flags(struct pci_host_bridge *root_bridge,
> > > > > > > vmd_bridge->native_dpc = root_bridge->native_dpc;
> > > > > > > }
> > > > > > >
> > > > > > > +static int vmd_pci_save_state(struct pci_dev *pdev, void *userdata)
> > > > > > > +{
> > > > > > > + pci_save_state(pdev);
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +static int vmd_pci_restore_state(struct pci_dev *pdev, void *userdata)
> > > > > > > +{
> > > > > > > + pci_restore_state(pdev);
> > > > > > > + return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > /*
> > > > > > > * Enable ASPM and LTR settings on devices that aren't configured by BIOS.
> > > > > > > */
> > > > > > > @@ -927,6 +939,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
> > > > > > > pci_scan_child_bus(vmd->bus);
> > > > > > > vmd_domain_reset(vmd);
> > > > > > >
> > > > > > > + pci_walk_bus(vmd->bus, vmd_pci_save_state, NULL);
> > > > > > > /* When Intel VMD is enabled, the OS does not discover the Root Ports
> > > > > > > * owned by Intel VMD within the MMCFG space. pci_reset_bus() applies
> > > > > > > * a reset to the parent of the PCI device supplied as argument. This
> > > > > > > @@ -945,6 +958,7 @@ static int vmd_enable_domain(struct vmd_dev *vmd, unsigned long features)
> > > > > > > break;
> > > > > > > }
> > > > > > > }
> > > > > > > + pci_walk_bus(vmd->bus, vmd_pci_restore_state, NULL);
> > > > >
> > > > > Why not call pci_reset_bus() (or __pci_reset_bus()) then in
> > > > > vmd_enable_domain() which preserves state unlike
> > > > > pci_reset_bus()?
> > > > >
> > > > > (Don't tell me naming of these functions is a horrible mess.
> > > > > :-/)
> > > >
> > > > Hmm. So this *is* calling pci_reset_bus().
> > >
> > > Yeah, I managed to get confused by the names myself, I somehow
> > > ended up thinking it calls pci_bus_reset() which is not correct...
> > >
> > > > L1.2 configuration has specific
> > > > ordering requirements for changes to parent & child devices.
> > > > Could be why it's not getting restored properly.
> > >
> > > Indeed, it has to be something else since the patch above doesn't
> > > even restore anything because dev->state_saved should get set to
> > > false by the first pci_restore_state() called from
> > > __pci_reset_bus() -> pci_bus_restore_locked() -> pci_dev_restore(),
> > > I think!?
> >
> > Inspired by Ilpo's comment, I added some debug messages on top of
> > linux-next's tag 'next-20240809' to understand the code path of
> > pci_reset_bus():
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index ffaaca0978cb..3ee71374f1de 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -5133,8 +5133,10 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
> > * races with ->remove() by the device lock, which must be held by
> > * the caller.
> > */
> > - if (err_handler && err_handler->reset_prepare)
> > + if (err_handler && err_handler->reset_prepare) {
> > + pci_info(dev, "%s: %pF\n", __func__, err_handler->reset_prepare);
> > err_handler->reset_prepare(dev);
> > + }
> >
> > /*
> > * Wake-up device prior to save. PM registers default to D0 after
> > @@ -5144,6 +5146,7 @@ static void pci_dev_save_and_disable(struct pci_dev *dev)
> > pci_set_power_state(dev, PCI_D0);
> >
> > pci_save_state(dev);
> > + pci_info(dev, "%s: PCI state_saved is %s\n", __func__, dev->state_saved ? "true" : "false");
> > /*
> > * Disable the device by clearing the Command register, except for
> > * INTx-disable which is set. This not only disables MMIO and I/O port
> > @@ -5655,6 +5658,10 @@ static void pci_bus_save_and_disable_locked(struct pci_bus *bus)
> > struct pci_dev *dev;
> >
> > list_for_each_entry(dev, &bus->devices, bus_list) {
> > + pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
> > + __func__,
> > + dev->state_saved ? "true" : "false",
> > + dev->subordinate ? "has" : "does not have");
> > pci_dev_save_and_disable(dev);
> > if (dev->subordinate)
> > pci_bus_save_and_disable_locked(dev->subordinate);
> > @@ -5671,6 +5678,10 @@ static void pci_bus_restore_locked(struct pci_bus *bus)
> > struct pci_dev *dev;
> >
> > list_for_each_entry(dev, &bus->devices, bus_list) {
> > + pci_info(dev, "%s: PCI state_saved is %s, and %s subordinate\n",
> > + __func__,
> > + dev->state_saved ? "true" : "false",
> > + dev->subordinate ? "has" : "does not have");
> > pci_dev_restore(dev);
> > if (dev->subordinate)
> > pci_bus_restore_locked(dev->subordinate);
> > @@ -5786,8 +5797,10 @@ static int pci_bus_reset(struct pci_bus *bus, bool probe)
> > if (!bus->self || !pci_bus_resettable(bus))
> > return -ENOTTY;
> >
> > - if (probe)
> > + if (probe) {
> > + pci_info(bus->self, "%s: probe is true. So return 0 directly", __func__);
> > return 0;
> > + }
> >
> > pci_bus_lock(bus);
> >
> > @@ -5858,10 +5871,12 @@ static int __pci_reset_bus(struct pci_bus *bus)
> > int rc;
> >
> > rc = pci_bus_reset(bus, PCI_RESET_PROBE);
> > + pci_info(bus->self, "%s: pci_bus_reset() returns %d\n", __func__, rc);
> > if (rc)
> > return rc;
> >
> > if (pci_bus_trylock(bus)) {
> > + pci_info(bus->self, "%s: pci_bus_trylock() returns true\n", __func__);
> > pci_bus_save_and_disable_locked(bus);
> > might_sleep();
> > rc = pci_bridge_secondary_bus_reset(bus->self);
> > @@ -5881,6 +5896,7 @@ static int __pci_reset_bus(struct pci_bus *bus)
> > */
> > int pci_reset_bus(struct pci_dev *pdev)
> > {
> > + pci_info(pdev, "%s: %s", __func__, !pci_probe_reset_slot(pdev->slot) ? "true" : "false");
> > return (!pci_probe_reset_slot(pdev->slot)) ?
> > __pci_reset_slot(pdev->slot) : __pci_reset_bus(pdev->bus);
> > }
> >
> > With the built kernel, the VMD PCIe devices report the following:
> >
> > 10000:e0:06.0 PCI bridge [0604]: Intel Corporation 11th Gen Core Processor PCIe Controller [8086:9a09] (rev 01) (prog-if 00 [Normal decode])
> > ...
> > Capabilities: [200 v1] L1 PM Substates
> > L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
> > PortCommonModeRestoreTime=45us PortTPowerOnTime=50us
> > L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > T_CommonMode=0us LTR1.2_Threshold=0ns
> > L1SubCtl2: T_PwrOn=0us
> >
> > 10000:e1:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Blue SN550 NVMe SSD [15b7:5009] (rev 01) (prog-if 02 [NVM Express])
> > ...
> > Capabilities: [900 v1] L1 PM Substates
> > L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
> > PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
> > L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
> > T_CommonMode=0us LTR1.2_Threshold=101376ns
> > L1SubCtl2: T_PwrOn=50us
> >
> > We can see the NVMe has the expected LTR1.2_Threshold=101376ns, but
> > the PCIe bridge has LTR1.2_Threshold=0ns.
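
For reference, the expected 101376ns can be derived from the port values in the lspci output. The following is a minimal user-space sketch, assuming the threshold is 2us + 4us + the larger Common_Mode_Restore_Time + the larger T_POWER_ON, rounded up to the LTR_L1.2_THRESHOLD encoding (10-bit value, scale unit of 32^scale ns), which is how I understand aspm_calc_l12_info() to work. All names below are hypothetical and are not the kernel's helpers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model; these names do not exist in the kernel. */
struct l12_threshold {
	uint32_t scale;	/* LTR_L1.2_THRESHOLD_Scale: unit is 32^scale ns */
	uint32_t value;	/* LTR_L1.2_THRESHOLD_Value: 10-bit field */
};

static uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* Assumed formula: 2us + 4us + larger T_CommonMode + larger T_PwrOn. */
static uint32_t l12_threshold_us(uint32_t parent_cm_us, uint32_t child_cm_us,
				 uint32_t parent_pwr_on_us,
				 uint32_t child_pwr_on_us)
{
	return 2 + 4 + max_u32(parent_cm_us, child_cm_us) +
	       max_u32(parent_pwr_on_us, child_pwr_on_us);
}

/* Pick the smallest scale whose 10-bit value can hold ns, rounding up. */
static struct l12_threshold encode_threshold_ns(uint64_t ns)
{
	struct l12_threshold t = { 0, 0 };
	uint64_t unit = 1;	/* 1, 32, 1024, ... ns */

	for (t.scale = 0; t.scale <= 5; t.scale++, unit *= 32) {
		if (ns <= unit * 0x3ff) {
			t.value = (uint32_t)((ns + unit - 1) / unit);
			return t;
		}
	}

	/* Out of range: saturate (a choice made for this sketch only). */
	t.scale = 5;
	t.value = 0x3ff;
	return t;
}

/* Convert an encoded scale/value pair back to nanoseconds. */
static uint64_t decode_threshold_ns(struct l12_threshold t)
{
	uint64_t unit = 1;
	uint32_t i;

	for (i = 0; i < t.scale; i++)
		unit *= 32;
	return unit * t.value;
}
```

With the values above (45us/32us and 50us/10us), l12_threshold_us() gives 101us. encode_threshold_ns(101000) then gives scale 2 and value 99, and 99 * 1024ns = 101376ns, which matches the NVMe's register.
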
> >
> > Then I checked dmesg and noticed these debug messages:
> >
> > pci 10000:e0:06.0: PCI bridge to [bus e1]
> > pci 10000:e0:06.0: Primary bus is hard wired to 0
> > pci 10000:e1:00.0: pci_reset_bus: false
> > pci 10000:e0:06.0: pci_bus_reset: probe is true. So return 0 directly
> > pci 10000:e0:06.0: __pci_reset_bus: pci_bus_reset() returns 0
> > pci 10000:e0:06.0: __pci_reset_bus: pci_bus_trylock() returns true
> > pci 10000:e1:00.0: pci_bus_save_and_disable_locked: PCI state_saved is false, and does not have subordinate
> > pci 10000:e1:00.0: pci_dev_save_and_disable: PCI state_saved is true
> > Freeing initrd memory: 75236K
> > pci 10000:e1:00.0: pci_bus_restore_locked: PCI state_saved is true, and does not have subordinate
> >
> > So, the code path is:
> >
> > vmd_enable_domain()
> >   pci_reset_bus()
> >     __pci_reset_bus()
> >       pci_bus_reset()
> >       pci_bus_save_and_disable_locked()
> >         pci_dev_save_and_disable()
> >       pci_bus_restore_locked()
> >         pci_dev_restore()
> >
> > From the debug messages, I learned that only NVMe 10000:e1:00.0 goes
> > through pci_save_state()/pci_restore_state(); the PCIe bridge
> > 10000:e0:06.0 does not, so the bridge's state is not restored
> > correctly.
> >
> > Besides, the bus that gets reset is NVMe 10000:e1:00.0's bus [e1], not
> > the VMD's bus in vmd_enable_domain().
> > * Bus "e1" has only NVMe 10000:e1:00.0
> > * VMD's bus in vmd_enable_domain() has PCIe bridge 10000:e0:06.0, NVMe
> > 10000:e1:00.0 and SATA Controller 10000:e0:17.0.
> >
> > Here is the PCI tree:
> >
> > -+-[0000:00]-+-00.0 Intel Corporation Device 9a04
> > | +-02.0 Intel Corporation Tiger Lake-LP GT2 [UHD Graphics G4]
> > | +-04.0 Intel Corporation TigerLake-LP Dynamic Tuning Processor Participant
> > | +-06.0 Intel Corporation RST VMD Managed Controller
> > | +-07.0-[01-2b]--
> > | +-08.0 Intel Corporation GNA Scoring Accelerator module
> > | +-0a.0 Intel Corporation Tigerlake Telemetry Aggregator Driver
> > | +-0d.0 Intel Corporation Tiger Lake-LP Thunderbolt 4 USB Controller
> > | +-0d.2 Intel Corporation Tiger Lake-LP Thunderbolt 4 NHI #0
> > | +-0e.0 Intel Corporation Volume Management Device NVMe RAID Controller
> > | +-14.0 Intel Corporation Tiger Lake-LP USB 3.2 Gen 2x1 xHCI Host Controller
> > | +-14.2 Intel Corporation Tiger Lake-LP Shared SRAM
> > | +-14.3 Intel Corporation Wi-Fi 6 AX201
> > | +-15.0 Intel Corporation Tiger Lake-LP Serial IO I2C Controller #0
> > | +-15.1 Intel Corporation Tiger Lake-LP Serial IO I2C Controller #1
> > | +-16.0 Intel Corporation Tiger Lake-LP Management Engine Interface
> > | +-17.0 Intel Corporation RST VMD Managed Controller
> > | +-1f.0 Intel Corporation Tiger Lake-LP LPC Controller
> > | +-1f.3 Intel Corporation Tiger Lake-LP Smart Sound Technology Audio Controller
> > | +-1f.4 Intel Corporation Tiger Lake-LP SMBus Controller
> > | +-1f.5 Intel Corporation Tiger Lake-LP SPI Controller
> > | \-1f.6 Intel Corporation Ethernet Connection (13) I219-V
> > \-[10000:e0]-+-06.0-[e1]----00.0 Sandisk Corp WD Blue SN550 NVMe SSD
> > \-17.0 Intel Corporation Tiger Lake-LP SATA Controller
> >
> > According to the findings above, to ensure the devices on the VMD bus
> > have the correct state, it seems the right answer is to call
> > pci_save_state() on all the devices before pci_reset_bus() and
> > pci_restore_state() on all of them after pci_reset_bus().
>
> What happens if you call pci_reset_bus() with PCIe bridge 10000:e0:06.0
> instead of NVMe 10000:e1:00.0? I believe the current implementation in
> vmd_enable_domain() finds the first child device on each Root Port and
> calls pci_reset_bus() with it, which is the NVMe here.
If I call pci_reset_bus() with PCIe bridge 10000:e0:06.0 instead of NVMe
10000:e1:00.0, it shows the warning message "pci 10000:e0:06.0 can't
reset device: -25".
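
-25 is -ENOTTY. The following is a minimal user-space sketch of one likely reason, assuming the VMD bus 10000:e0 is a root bus with no upstream bridge (bus->self == NULL), so that the "!bus->self" check in pci_bus_reset(), visible in the debug patch quoted above, rejects it. The structs and functions are a hypothetical model, not kernel code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev;

/* Hypothetical stand-in for the fields of struct pci_bus used by the check. */
struct model_bus {
	struct model_dev *self;	/* bridge leading to this bus; NULL on a root bus */
	bool resettable;	/* stand-in for pci_bus_resettable() */
};

/* Hypothetical stand-in for struct pci_dev: only the bus it sits on. */
struct model_dev {
	struct model_bus *bus;
};

/* Mirrors the check at the top of pci_bus_reset() in the quoted diff. */
static int model_bus_reset_probe(const struct model_bus *bus)
{
	if (!bus->self || !bus->resettable)
		return -ENOTTY;
	return 0;
}

/* Models pci_reset_bus() when the device has no slot: reset pdev's own bus. */
static int model_reset_bus(const struct model_dev *pdev)
{
	return model_bus_reset_probe(pdev->bus);
}
```

In this model, passing the NVMe probes bus [e1], which has the bridge as its "self", so the probe returns 0. Passing the bridge probes the root bus, which has no "self", so the probe returns -ENOTTY (-25 on Linux).
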
Jian-Hong Pan