linux-kernel - Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4ddd819a-76fc-5e4b-8956-ca0807247f0c@amd.com>
Date: Tue, 23 Jan 2024 10:18:26 +0530
From: Sanath S <sanaths2@....com>
To: Alex Williamson <alex.williamson@...hat.com>,
 Lukas Wunner <lukas@...ner.de>
Cc: linux-pci@...r.kernel.org, bhelgaas@...gle.com,
 linux-kernel@...r.kernel.org, eric.auger@...hat.com,
 mika.westerberg@...ux.intel.com, rafael.j.wysocki@...el.com, Sanath.S@....com
Subject: Re: [PATCH v2 2/2] PCI: Fix runtime PM race with PME polling


On 1/23/2024 10:16 AM, Alex Williamson wrote:
> On Mon, 22 Jan 2024 15:50:03 -0700
> Alex Williamson <alex.williamson@...hat.com> wrote:
>
>> On Mon, 22 Jan 2024 23:17:30 +0100
>> Lukas Wunner <lukas@...ner.de> wrote:
>>
>>> On Thu, Jan 18, 2024 at 11:50:49AM -0700, Alex Williamson wrote:
>>>> On Thu,  3 Aug 2023 11:12:33 -0600 Alex Williamson <alex.williamson@...hat.com wrote:
>>>>> Testing that a device is not currently in a low power state provides no
>>>>> guarantees that the device is not immenently transitioning to such a state.
>>>>> We need to increment the PM usage counter before accessing the device.
>>>>> Since we don't wish to wake the device for PME polling, do so only if the
>>>>> device is already active by using pm_runtime_get_if_active().
>>>>>
>>>>> Signed-off-by: Alex Williamson <alex.williamson@...hat.com>
>>>>> ---
>>>>>   drivers/pci/pci.c | 23 ++++++++++++++++-------
>>>>>   1 file changed, 16 insertions(+), 7 deletions(-)
>>>> Resurrecting this patch (currently commit d3fcd7360338) for discussion
>>>> as it's been identified as the source of a regression in:
>>>>
>>>> https://bugzilla.kernel.org/show_bug.cgi?id=218360
>>>>
>>>> Copying Mika, Lukas, and Rafael as it's related to:
>>>>
>>>> 000dd5316e1c ("PCI: Do not poll for PME if the device is in D3cold")
>>>>
>>>> where we skip devices in D3cold when processing the PME list.
>>>>
>>>> I think the issue in the above bz is that the downstream TB3/USB4 port
>>>> is in D3 (presumably D3hot) and I therefore infer the device is in state
>>>> RPM_SUSPENDED.  This commit is attempting to make sure the device power
>>>> state is stable across the call such that it does not transition into
>>>> D3cold while we're accessing it.
>>>>
>>>> To do that I used pm_runtime_get_if_active(), but in retrospect this
>>>> requires the device to be in RPM_ACTIVE so we end up skipping anything
>>>> suspended or transitioning.
>>> How about dropping the calls to pm_runtime_get_if_active() and
>>> pm_runtime_put() and instead simply do:
>>>
>>> 			if (pm_runtime_suspended(&pdev->dev) &&
>>> 			    pdev->current_state != PCI_D3cold)
>>> 				pci_pme_wakeup(pdev, NULL);
>> Hi Lukas,
>>
>> Do we require that the polled device is in the RPM_SUSPENDED state?
>> Also pm_runtime_suspended() can also only be trusted while holding the
>> device power.lock, we need a usage count reference to maintain that
>> state.
>>
>> I'm also seeing cases where the bridge is power state D0, but PM state
>> RPM_SUSPENDING, so config space of the polled device becomes
>> inaccessible even while we're holding a reference once we allow polling
>> in RPM_SUSPENDED.
>>
>> I'm currently working with the below patch, which I believe addresses
>> all these issues, but I'd welcome review and testing. Thanks,
>>
>> Alex
>>
>> commit 0a063b8e91d0bc807db712c81c8b270864f99ecb
>> Author: Alex Williamson <alex.williamson@...hat.com>
>> Date:   Tue Jan 16 13:28:33 2024 -0700
>>
>>      PCI: Fix active state requirement in PME polling
>>      
>>      The commit noted in fixes added a bogus requirement that runtime PM
>>      managed devices need to be in the RPM_ACTIVE state for PME polling.
>>      In fact, there is no requirement of a specific runtime PM state, it
>>      is only required that the state is stable such that testing config
>>      space availability, ie. !D3cold, remains valid across the PME wakeup.
>>      
>>      To that effect, defer polling of runtime PM managed devices that are
>>      not in either the RPM_ACTIVE or RPM_SUSPENDED states.  Devices in
>>      transitional states remain on the pci_pme_list and will be re-queued.
>>      
>>      However in allowing polling of devices in the RPM_SUSPENDED state,
>>      the bridge state requires further refinement as it's possible to poll
>>      while the bridge is in D0, but the runtime PM state is RPM_SUSPENDING.
>>      An asynchronous completion of the bridge transition to a low power
>>      state can make config space of the subordinate device become
>>      unavailable.  A runtime PM reference to the bridge is therefore added
>>      with a supplementary requirement that the bridge is in the RPM_ACTIVE
>>      state.
>>      
>>      Fixes: d3fcd7360338 ("PCI: Fix runtime PM race with PME polling")
>>      Reported-by: Sanath S <sanath.s@....com>
>>      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218360
>>      Signed-off-by: Alex Williamson <alex.williamson@...hat.com>
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index bdbf8a94b4d0..31dbf1834b07 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -2433,29 +2433,45 @@ static void pci_pme_list_scan(struct work_struct *work)
>>   		if (pdev->pme_poll) {
>>   			struct pci_dev *bridge = pdev->bus->self;
>>   			struct device *dev = &pdev->dev;
>> -			int pm_status;
>> +			struct device *bdev = bridge ? &bridge->dev : NULL;
>>   
>>   			/*
>> -			 * If bridge is in low power state, the
>> -			 * configuration space of subordinate devices
>> -			 * may be not accessible
>> +			 * If we have a bridge, it should be in an active/D0
>> +			 * state or the configuration space of subordinate
>> +			 * devices may not be accessible.
>>   			 */
>> -			if (bridge && bridge->current_state != PCI_D0)
>> -				continue;
>> +			if (bdev) {
>> +				spin_lock_irq(&bdev->power.lock);
> With the code as shown here I have one system that seems to be getting
> contention when reading the vpd sysfs attribute when the endpoints
> (QL41000) are bound to vfio-pci and unused, resulting in the root port
> and endpoints being suspended.  A vpd read can take over a minute.
> Seems to be resolved changing the above spin_lock to a spin_trylock:
>
> 				if (!spin_trylock_irq(&bdev->power.lock))
> 					continue;
>
> The pm_runtime_barrier() as used in the vpd path can be prone to such
> issues, I saw similar in the fix I previously proposed in the bz.
>
> I'll continue to do more testing with this change and hopefully Sanath
> can verify this resolves the bug report.  Thanks,
>
> Alex
I'll verify it today and let you know the observations.
>> +				if (!pm_runtime_active(bdev) ||
>> +				    bridge->current_state != PCI_D0) {
>> +					spin_unlock_irq(&bdev->power.lock);
>> +					continue;
>> +				}
>> +				pm_runtime_get_noresume(bdev);
>> +				spin_unlock_irq(&bdev->power.lock);
>> +			}
>>   
>>   			/*
>> -			 * If the device is in a low power state it
>> -			 * should not be polled either.
>> +			 * The device itself may be either in active or
>> +			 * suspended state, but must not be in D3cold so
>> +			 * that configuration space is accessible.  The
>> +			 * transitional resuming and suspending states are
>> +			 * skipped to avoid D3cold races.
>>   			 */
>> -			pm_status = pm_runtime_get_if_active(dev, true);
>> -			if (!pm_status)
>> -				continue;
>> -
>> -			if (pdev->current_state != PCI_D3cold)
>> +			spin_lock_irq(&dev->power.lock);
>> +			if ((pm_runtime_active(dev) ||
>> +			     pm_runtime_suspended(dev)) &&
>> +			    pdev->current_state != PCI_D3cold) {
>> +				pm_runtime_get_noresume(dev);
>> +				spin_unlock_irq(&dev->power.lock);
>>   				pci_pme_wakeup(pdev, NULL);
>> -
>> -			if (pm_status > 0)
>>   				pm_runtime_put(dev);
>> +			} else {
>> +				spin_unlock_irq(&dev->power.lock);
>> +			}
>> +
>> +			if (bdev)
>> +				pm_runtime_put(bdev);
>>   		} else {
>>   			list_del(&pme_dev->list);
>>   			kfree(pme_dev);