linux-kernel - Re: [PATCH v5 2/5] PCI: Check PCI_PM_CTRL instead of PCI_COMMAND in pci_dev

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1534611c-5e36-413d-b8fe-ac29cccbb10b@amd.com>
Date: Tue, 27 Aug 2024 12:43:30 -0500
From: Mario Limonciello <mario.limonciello@....com>
To: Bjorn Helgaas <helgaas@...nel.org>, Mario Limonciello
 <superm1@...nel.org>, Gary Li <Gary.Li@....com>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>,
 Mathias Nyman <mathias.nyman@...el.com>,
 Mika Westerberg <mika.westerberg@...ux.intel.com>,
 "open list : PCI SUBSYSTEM" <linux-pci@...r.kernel.org>,
 open list <linux-kernel@...r.kernel.org>,
 "open list : USB XHCI DRIVER" <linux-usb@...r.kernel.org>,
 Daniel Drake <drake@...lessos.org>,
 Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
 Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>,
 Duc Dang <ducdang@...gle.com>, Alex Williamson <alex.williamson@...hat.com>
Subject: Re: [PATCH v5 2/5] PCI: Check PCI_PM_CTRL instead of PCI_COMMAND in
 pci_dev_wait()

On 8/26/2024 14:16, Mario Limonciello wrote:
> On 8/23/2024 14:54, Bjorn Helgaas wrote:
>> [+cc Duc, Alex]
>>
>> On Fri, Aug 23, 2024 at 10:40:20AM -0500, Mario Limonciello wrote:
>>> If a dock is plugged in at the same time as autosuspend delay then this
>>> can cause malfunctions in the USB4 stack. This happens because the
>>> device is still in D3cold at the time that the PCI core handed
>>> control back to the USB4 stack.
>>
>> I assume the USB device in question is in the dock that was hot-added?
> 
> No; it's actually the USB4 router that is malfunctioning.  The CM 
> (thunderbolt.ko) thinks the router is in D0 already, but when it 
> attempts to do a register read it gets back all F's and it trusts that.
> 
>> This patch suggests that pci_dev_wait() has waited for a read of
>> PCI_COMMAND to respond with something other than ~0, but the device is
>> still in D3cold.  I suppose we got to pci_dev_wait() via
>> pci_pm_bridge_power_up_actions() calling
>> pci_bridge_wait_for_secondary_bus(), since I wouldn't expect a reset
>> in the hot-add case.
> 
> As I said it's the router (part of the SoC). The device never 
> disappears.  It's the action of plugging in/out the the dock that causes 
> it to change power states.
> 
> We didn't try it, but I wouldn't be surprised if it could be reproduced 
> with a script that turned on/off runtime PM on very tight timing around 
> the autosuspend delay.
> 
>>
>>> A device that has gone through a reset may return a value in PCI_COMMAND
>>> but that doesn't mean it's finished transitioning to D0.  For evices 
>>> that
>>> support power management explicitly check PCI_PM_CTRL on everything but
>>> system resume to ensure the transition happened.
>>
>> s/evices/devices/
> 
> Thanks.
> 
>>
>>> Devices that don't support power management and system resume will
>>> continue to use PCI_COMMAND.
>>
>> Is there a bugzilla or other report with more details that we can
>> include here?
> 
> No, unfortunately in this case it was only reported internally at AMD.
> 
> Gary who is on CC brought it to me though, and if you think there are 
> some other specific details needed but are missing we can see what else 
> can be added to the commit message.
> 
>>
>>> Signed-off-by: Mario Limonciello <mario.limonciello@....com>
>>> ---
>>> v4->v5:
>>>   * Fix misleading indentation
>>>   * Amend commit message
>>> ---
>>>   drivers/pci/pci.c | 28 ++++++++++++++++++++--------
>>>   1 file changed, 20 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>> index 1e219057a5069..f032a4aaec268 100644
>>> --- a/drivers/pci/pci.c
>>> +++ b/drivers/pci/pci.c
>>> @@ -1309,21 +1309,33 @@ static int pci_dev_wait(struct pci_dev *dev, 
>>> enum pci_reset_type reset_type, int
>>>        * the read (except when CRS SV is enabled and the read was for 
>>> the
>>>        * Vendor ID; in that case it synthesizes 0x0001 data).
>>>        *
>>> -     * Wait for the device to return a non-CRS completion.  Read the
>>> -     * Command register instead of Vendor ID so we don't have to
>>> -     * contend with the CRS SV value.
>>> +     * Wait for the device to return a non-CRS completion.  On devices
>>> +     * that support PM control and on waits that aren't part of system
>>> +     * resume read the PM control register to ensure the device has
>>> +     * transitioned to D0.  On devices that don't support PM control,
>>> +     * or during system resume read the command register to instead of
>>> +     * Vendor ID so we don't have to contend with the CRS SV value.
>>>        */
>>>       for (;;) {
>>> -        u32 id;
>>> -
>>>           if (pci_dev_is_disconnected(dev)) {
>>>               pci_dbg(dev, "disconnected; not waiting\n");
>>>               return -ENOTTY;
>>>           }
>>> -        pci_read_config_dword(dev, PCI_COMMAND, &id);
>>> -        if (!PCI_POSSIBLE_ERROR(id))
>>> -            break;
>>> +        if (dev->pm_cap && reset_type != PCI_DEV_WAIT_RESUME) {
>>> +            u16 pmcsr;
>>> +
>>> +            pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, 
>>> &pmcsr);
>>> +            if (!PCI_POSSIBLE_ERROR(pmcsr) &&
>>> +                (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
>>> +                break;
>>> +        } else {
>>> +            u32 id;
>>> +
>>> +            pci_read_config_dword(dev, PCI_COMMAND, &id);
>>> +            if (!PCI_POSSIBLE_ERROR(id))
>>> +                break;
>>> +        }
>>
>> What is the rationale behind using PCI_PM_CTRL in some cases and
>> PCI_COMMAND in others?
> 
> We saw a deadlock during resume from suspend when PCI_PM_CTRL was used 
> for all cases that supported dev->pm_cap.
> 
>> Is there some spec language we can cite for
>> this?
> 
> Perhaps it being a "cold reset" during resume?
> 
>>
>> IIUC, pci_dev_wait() waits for a device to be ready after a reset
>> (FLR, D3hot->D0 transition for devices where No_Soft_Reset is clear,
>> DPC) and after power-up from D3cold (pci_pm_bridge_power_up_actions()).
>> I think device readiness in all of these cases is covered by PCIe
>> r6.0, sec 6.6.1.
> 
> Would it be helpful to you to get a dump_stack() call trace to 
> pci_power_up() the specific call flow that needed this fix?
> 
> Gary is able to to reproduce this at will, I think he should be able to 
> gather that using an unpatched kernel to help this conversation.

Here is the kernel trace with a dump_stack() and a printk of current 
inserted into pci_power_up() right before the failure occurs.

https://gist.github.com/superm1/cb407766ab15f42f12a6fe9d1196f6fc

Also the failure is visible right after.

> 
>>
>> If the Root Port above the device supports Configuration RRS Software
>> Visibility, I think we probably should use that here instead of either
>> PCI_COMMAND or PCI_PM_CTRL. 
> 
> I did check and in this case the root port the USB4 routers are 
> connected to support this.
> 
> How do you think this should be done instead?
> 
>> SR-IOV VFs don't implement Vendor ID,
>> which might complicate this a little.  But it niggles in my mind that
>> there may be some other problem beyond that.  Maybe Alex remembers.
> 
> 
>>
>> Anyway, if we meet the requirements of sec 6.6.1, the device should be
>> ready to respond to config requests, and I assume that also means
>> the device is in D0.
>>
> 
> Within that section there is a quote to point out:
> 
> "
> To allow components to perform internal initialization, system software 
> must wait a specified minimum period
> following exit from a Conventional Reset of one or more devices before 
> it is permitted to issue Configuration
> Requests to those devices
> "
> 
> In pci_power_up() I don't really see any hardcoded delays specifically 
> for this case of leaving D3cold. The PCI PM spec specifies that it will 
> take "Full context restore or boot latency".  I don't think it's 
> reasonable to have NO delay.