Message-ID: <1534611c-5e36-413d-b8fe-ac29cccbb10b@amd.com>
Date: Tue, 27 Aug 2024 12:43:30 -0500
From: Mario Limonciello <mario.limonciello@....com>
To: Bjorn Helgaas <helgaas@...nel.org>, Mario Limonciello
<superm1@...nel.org>, Gary Li <Gary.Li@....com>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>,
Mathias Nyman <mathias.nyman@...el.com>,
Mika Westerberg <mika.westerberg@...ux.intel.com>,
"open list : PCI SUBSYSTEM" <linux-pci@...r.kernel.org>,
open list <linux-kernel@...r.kernel.org>,
"open list : USB XHCI DRIVER" <linux-usb@...r.kernel.org>,
Daniel Drake <drake@...lessos.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>,
Duc Dang <ducdang@...gle.com>, Alex Williamson <alex.williamson@...hat.com>
Subject: Re: [PATCH v5 2/5] PCI: Check PCI_PM_CTRL instead of PCI_COMMAND in
pci_dev_wait()
On 8/26/2024 14:16, Mario Limonciello wrote:
> On 8/23/2024 14:54, Bjorn Helgaas wrote:
>> [+cc Duc, Alex]
>>
>> On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote:
>>> If a dock is plugged in just as the autosuspend delay expires then this
>>> can cause malfunctions in the USB4 stack. This happens because the
>>> device is still in D3cold at the time that the PCI core handed
>>> control back to the USB4 stack.
>>
>> I assume the USB device in question is in the dock that was hot-added?
>
> No; it's actually the USB4 router that is malfunctioning. The CM
> (thunderbolt.ko) thinks the router is in D0 already, but when it
> attempts to do a register read it gets back all F's and it trusts that.
>
>> This patch suggests that pci_dev_wait() has waited for a read of
>> PCI_COMMAND to respond with something other than ~0, but the device is
>> still in D3cold. I suppose we got to pci_dev_wait() via
>> pci_pm_bridge_power_up_actions() calling
>> pci_bridge_wait_for_secondary_bus(), since I wouldn't expect a reset
>> in the hot-add case.
>
> As I said it's the router (part of the SoC). The device never
> disappears. It's the action of plugging in/out the dock that causes
> it to change power states.
>
> We didn't try it, but I wouldn't be surprised if it could be reproduced
> with a script that turned on/off runtime PM on very tight timing around
> the autosuspend delay.
>
>>
>>> A device that has gone through a reset may return a value in PCI_COMMAND
>>> but that doesn't mean it's finished transitioning to D0. For evices
>>> that
>>> support power management explicitly check PCI_PM_CTRL on everything but
>>> system resume to ensure the transition happened.
>>
>> s/evices/devices/
>
> Thanks.
>
>>
>>> Devices that don't support power management and system resume will
>>> continue to use PCI_COMMAND.
>>
>> Is there a bugzilla or other report with more details that we can
>> include here?
>
> No, unfortunately in this case it was only reported internally at AMD.
>
> Gary who is on CC brought it to me though, and if you think there are
> some other specific details needed but are missing we can see what else
> can be added to the commit message.
>
>>
>>> Signed-off-by: Mario Limonciello <mario.limonciello@....com>
>>> ---
>>> v4->v5:
>>> * Fix misleading indentation
>>> * Amend commit message
>>> ---
>>> drivers/pci/pci.c | 28 ++++++++++++++++++++--------
>>> 1 file changed, 20 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index 1e219057a5069..f032a4aaec268 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, enum pci_reset_type reset_type, int
>>> * the read (except when CRS SV is enabled and the read was for the
>>> * Vendor ID; in that case it synthesizes 0x0001 data).
>>> *
>>> - * Wait for the device to return a non-CRS completion. Read the
>>> - * Command register instead of Vendor ID so we don't have to
>>> - * contend with the CRS SV value.
>>> + * Wait for the device to return a non-CRS completion. On devices
>>> + * that support PM control and on waits that aren't part of system
>>> + * resume read the PM control register to ensure the device has
>>> + * transitioned to D0. On devices that don't support PM control,
>>> + * or during system resume read the command register to instead of
>>> + * Vendor ID so we don't have to contend with the CRS SV value.
>>> */
>>> for (;;) {
>>> - u32 id;
>>> -
>>> if (pci_dev_is_disconnected(dev)) {
>>> pci_dbg(dev, "disconnected; not waiting\n");
>>> return -ENOTTY;
>>> }
>>> - pci_read_config_dword(dev, PCI_COMMAND, &id);
>>> - if (!PCI_POSSIBLE_ERROR(id))
>>> - break;
>>> + if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) {
>>> + u16 pmcsr;
>>> +
>>> + pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
>>> + if (!PCI_POSSIBLE_ERROR(pmcsr) &&
>>> + (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
>>> + break;
>>> + } else {
>>> + u32 id;
>>> +
>>> + pci_read_config_dword(dev, PCI_COMMAND, &id);
>>> + if (!PCI_POSSIBLE_ERROR(id))
>>> + break;
>>> + }
>>
>> What is the rationale behind using PCI_PM_CTRL in some cases and
>> PCI_COMMAND in others?
>
> We saw a deadlock during resume from suspend when PCI_PM_CTRL was used
> for all cases that supported dev->pm_cap.
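For illustration, the per-iteration decision in the hunk above can be
sketched in user-space C roughly as follows. This is only a sketch, not the
kernel code: every SKETCH_/sketch_ name is made up for this example, and the
register values are passed in as arguments instead of being read from config
space. The mask 0x3 and D0 == 0 are meant to match PCI_PM_CTRL_STATE_MASK
and PCI_D0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for PCI_PM_CTRL_STATE_MASK and PCI_D0. */
#define SKETCH_PM_CTRL_STATE_MASK 0x0003u
#define SKETCH_D0                 0u

/* Hypothetical stand-in for the reset type passed to pci_dev_wait(). */
enum sketch_wait_type {
	SKETCH_WAIT_RESET,
	SKETCH_WAIT_RESUME,
};

/* A config read that gets no usable completion comes back as all ones. */
static bool sketch_is_error16(uint16_t val)
{
	return val == 0xffffu;
}

static bool sketch_is_error32(uint32_t val)
{
	return val == 0xffffffffu;
}

/*
 * One iteration of the readiness check.
 *
 * has_pm_cap: the device exposes a PM capability
 * type:       why we are waiting
 * pmcsr:      value read from the PM control/status register
 * command:    dword read at the Command register offset
 */
static bool sketch_device_ready(bool has_pm_cap, enum sketch_wait_type type,
				uint16_t pmcsr, uint32_t command)
{
	if (has_pm_cap && type != SKETCH_WAIT_RESUME) {
		/* Readable is not enough: the power state must be D0. */
		return !sketch_is_error16(pmcsr) &&
		       (pmcsr & SKETCH_PM_CTRL_STATE_MASK) == SKETCH_D0;
	}

	/* No PM capability, or system resume: any valid read will do. */
	return !sketch_is_error32(command);
}
```

The case the patch cares about is the one where the Command register already
reads back a valid value but the PM control register still reports a non-D0
state: the old check would stop waiting there, the new one keeps polling.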
>
>> Is there some spec language we can cite for
>> this?
>
> Perhaps it being a "cold reset" during resume?
>
>>
>> IIUC, pci_dev_wait() waits for a device to be ready after a reset
>> (FLR, D3hot->D0 transition for devices where No_Soft_Reset is clear,
>> DPC) and after power-up from D3cold (pci_pm_bridge_power_up_actions()).
>> I think device readiness in all of these cases is covered by PCIe
>> r6.0, sec 6.6.1.
>
> Would it be helpful to get a dump_stack() call trace from
> pci_power_up() showing the specific call flow that needed this fix?
>
> Gary is able to reproduce this at will; I think he should be able to
> gather that using an unpatched kernel to help this conversation.
Here is the kernel trace, captured with a dump_stack() and a printk of
current inserted into pci_power_up(), from right before the failure occurs:
https://gist.github.com/superm1/cb407766ab15f42f12a6fe9d1196f6fc
The failure itself is visible right after.
>
>>
>> If the Root Port above the device supports Configuration RRS Software
>> Visibility, I think we probably should use that here instead of either
>> PCI_COMMAND or PCI_PM_CTRL.
>
> I did check, and in this case the root port that the USB4 routers are
> connected to supports this.
>
> How do you think this should be done instead?
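To have something concrete to discuss, here is a rough user-space sketch of
what classifying a Vendor ID read could look like when Software Visibility
is enabled. It only encodes the behavior described in the code comment
quoted above (the root complex synthesizes 0x0001 for the Vendor ID while
the device is not ready). All names are hypothetical, and it deliberately
does not deal with SR-IOV VFs, which don't implement Vendor ID.

```c
#include <stdint.h>

/* Hypothetical result codes for this sketch. */
enum sketch_vid_status {
	SKETCH_VID_READY,        /* a real Vendor ID came back */
	SKETCH_VID_RETRY,        /* synthesized 0x0001: device not ready yet */
	SKETCH_VID_NO_RESPONSE,  /* all ones: nothing usable came back */
};

/*
 * Classify a dword read at offset 0 (Vendor ID in the low 16 bits,
 * Device ID in the high 16 bits).
 */
static enum sketch_vid_status sketch_classify_vendor_id(uint32_t dword)
{
	if (dword == 0xffffffffu)
		return SKETCH_VID_NO_RESPONSE;

	/* 0x0001 in the Vendor ID field is the "retry" marker. */
	if ((dword & 0xffffu) == 0x0001u)
		return SKETCH_VID_RETRY;

	return SKETCH_VID_READY;
}
```

Note that a VF would always land in the "no response" bucket here, which is
presumably the complication mentioned below.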
>
>> SR-IOV VFs don't implement Vendor ID,
>> which might complicate this a little. But it niggles in my mind that
>> there may be some other problem beyond that. Maybe Alex remembers.
>
>
>>
>> Anyway, if we meet the requirements of sec 6.6.1, the device should be
>> ready to respond to config requests, and I assume that also means
>> the device is in D0.
>>
>
> Within that section there is a quote to point out:
>
> "
> To allow components to perform internal initialization, system software
> must wait a specified minimum period following exit from a Conventional
> Reset of one or more devices before it is permitted to issue
> Configuration Requests to those devices
> "
>
> In pci_power_up() I don't really see any hardcoded delays specifically
> for this case of leaving D3cold. The PCI PM spec specifies that it will
> take "Full context restore or boot latency". I don't think it's
> reasonable to have NO delay.
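
To make the delay point concrete, here is a minimal sketch of a poll loop
with a doubling delay and an overall timeout. It is not the kernel's
implementation: sketch_wait_ready(), sketch_ready_after_n() and the callback
type are hypothetical names, and the sleep is only simulated by summing the
delays so the example runs anywhere.

```c
#include <stdbool.h>

/* Hypothetical callback: returns true once the device looks ready. */
typedef bool (*sketch_ready_fn)(void *ctx);

/*
 * Poll ready() with a delay that starts at 1 ms and doubles each time.
 * Returns the total simulated wait in ms, or -1 if the next delay would
 * exceed timeout_ms.
 */
static int sketch_wait_ready(sketch_ready_fn ready, void *ctx, int timeout_ms)
{
	int delay = 1;
	int waited = 0;

	for (;;) {
		if (ready(ctx))
			return waited;

		if (delay > timeout_ms)
			return -1;

		/* Real code would sleep for 'delay' ms here. */
		waited += delay;
		delay *= 2;
	}
}

/* Test helper: reports "not ready" for the first *ctx polls. */
static bool sketch_ready_after_n(void *ctx)
{
	int *remaining = ctx;

	if (*remaining == 0)
		return true;

	(*remaining)--;
	return false;
}
```

With a device that becomes ready on the fourth poll, the loop accounts for
1 + 2 + 4 = 7 ms of waiting; a device that is ready immediately costs
nothing, which is the property that makes polling more attractive than one
fixed worst-case sleep.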