lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <201205172121.34892.rjw@sisk.pl>
Date:	Thu, 17 May 2012 21:21:34 +0200
From:	"Rafael J. Wysocki" <rjw@...k.pl>
To:	Huang Ying <ying.huang@...el.com>
Cc:	Bjorn Helgaas <bhelgaas@...gle.com>, ming.m.lin@...el.com,
	linux-kernel@...r.kernel.org, linux-pm@...r.kernel.org,
	Zheng Yan <zheng.z.yan@...el.com>
Subject: Re: [PATCH -v3 2/2] PCIe: Add PCIe runtime D3cold support

On Thursday, May 17, 2012, Huang Ying wrote:
> On Tue, 2012-05-15 at 21:35 +0200, Rafael J. Wysocki wrote:
> > On Tuesday, May 15, 2012, Huang Ying wrote:
> > > This patch adds runtime D3cold support to PCIe bus.  D3cold is the
> > > deepest power saving state for PCIe devices.  Where the main PCIe link
> > > will be powered off totally, so before the PCIe main link is powered
> > > on again, you can not access the device, even the configuration space,
> > > which is still accessible in D3hot.  Because the main PCIe link is not
> > > available, the PCI PM registers can not be used to put device into/out
> > > of D3cold state, that will be done by platform logic such as ACPI
> > > _PR3.
> > > 
> > > To support remote wake up in D3cold, aux power is supplied, and a
> > > side-band pin: WAKE# is used to notify remote wake up event, the pin
> > > is usually connected to platform logic such as ACPI GPE.  This is
> > > quite different with other power saving states, where remote wake up
> > > is notified via PME message through the PCIe main link.
> > > 
> > > PCIe devices in plug-in slot usually has no direct platform logic, for
> > > example, there is usually no ACPI _PR3 for them.  The D3cold support
> > > for these devices can be done via the PCIe port connected to it.  When
> > > power on/off the PCIe port, the corresponding PCIe devices are powered
> > > on/off too.  And the remote wake up event from PCIe device will be
> > > notified to the corresponding PCIe port.
> > > 
> > > The PCI subsystem D3cold support and corresponding ACPI platform
> > > support is implemented in the patch.  Including the support for PCIe
> > > devices in plug-in slot.
> > > 
> > > For more information about PCIe D3cold and corresponding ACPI support,
> > > please refer to:
> > > 
> > > - PCI Express Base Specification Revision 2.0
> > > - Advanced Configuration and Power Interface Specification Revision 5.0
> > > 
> > > Originally-by: Zheng Yan <zheng.z.yan@...el.com>
> > > Signed-off-by: Huang Ying <ying.huang@...el.com>
> > > ---
> > >  drivers/pci/pci-acpi.c         |   26 +++++++--
> > >  drivers/pci/pci-driver.c       |    3 +
> > >  drivers/pci/pci-sysfs.c        |   29 ++++++++++
> > >  drivers/pci/pci.c              |  117 +++++++++++++++++++++++++++++++++++++----
> > >  drivers/pci/pci.h              |    1 
> > >  drivers/pci/pcie/portdrv_pci.c |   30 ++++++++++
> > >  include/linux/pci.h            |   16 ++++-
> > >  7 files changed, 204 insertions(+), 18 deletions(-)
> > > 
> > > --- a/drivers/pci/pci-acpi.c
> > > +++ b/drivers/pci/pci-acpi.c
> > > @@ -48,6 +48,15 @@ static void pci_acpi_wake_dev(acpi_handl
> > >  	if (event != ACPI_NOTIFY_DEVICE_WAKE || !pci_dev)
> > >  		return;
> > >  
> > > +	if (pci_dev->current_state == PCI_D3cold) {
> > > +		pci_wakeup_event(pci_dev);
> > > +		pm_runtime_get_sync(&pci_dev->dev);
> > > +		/* Powering on bridge needs to resume whole hierarchy */
> > > +		pci_wakeup_bus(pci_dev->subordinate, true);
> > > +		pm_runtime_put(&pci_dev->dev);
> > > +		return;
> > > +	}
> > > +
> > >  	if (!pci_dev->pm_cap || !pci_dev->pme_support
> > >  	     || pci_check_pme_status(pci_dev)) {
> > >  		if (pci_dev->pme_poll)
> > > @@ -187,10 +196,13 @@ acpi_status pci_acpi_remove_pm_notifier(
> > >  
> > >  static pci_power_t acpi_pci_choose_state(struct pci_dev *pdev)
> > >  {
> > > -	int acpi_state;
> > > +	int acpi_state, d_max;
> > >  
> > > -	acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL,
> > > -						ACPI_STATE_D3);
> > > +	if (pdev->no_d3cold)
> > > +		d_max = ACPI_STATE_D3_HOT;
> > > +	else
> > > +		d_max = ACPI_STATE_D3_COLD;
> > > +	acpi_state = acpi_pm_device_sleep_state(&pdev->dev, NULL, d_max);
> > >  	if (acpi_state < 0)
> > >  		return PCI_POWER_ERROR;
> > >  
> > > @@ -297,7 +309,13 @@ static void acpi_pci_propagate_run_wake(
> > >  
> > >  static int acpi_pci_run_wake(struct pci_dev *dev, bool enable)
> > >  {
> > > -	if (dev->pme_interrupt)
> > > +	/*
> > > +	 * Per PCI Express Base Specification Revision 2.0 section
> > > +	 * 5.3.3.2 Link Wakeup, platform support is needed for D3cold
> > > +	 * waking up to power on the main link even if there is PME
> > > +	 * support for D3cold
> > > +	 */
> > > +	if (dev->pme_interrupt && !dev->runtime_d3cold)
> > >  		return 0;
> > >  
> > >  	if (!acpi_pm_device_run_wake(&dev->dev, enable))
> > > --- a/drivers/pci/pci-driver.c
> > > +++ b/drivers/pci/pci-driver.c
> > > @@ -1013,10 +1013,13 @@ static int pci_pm_runtime_suspend(struct
> > >  	if (!pm || !pm->runtime_suspend)
> > >  		return -ENOSYS;
> > >  
> > > +	pci_dev->no_d3cold = false;
> > >  	error = pm->runtime_suspend(dev);
> > >  	suspend_report_result(pm->runtime_suspend, error);
> > >  	if (error)
> > >  		return error;
> > > +	if (!pci_dev->d3cold_allowed)
> > > +		pci_dev->no_d3cold = true;
> > >  
> > >  	pci_fixup_device(pci_fixup_suspend, pci_dev);
> > >  
> > > --- a/drivers/pci/pci-sysfs.c
> > > +++ b/drivers/pci/pci-sysfs.c
> > > @@ -27,6 +27,7 @@
> > >  #include <linux/security.h>
> > >  #include <linux/pci-aspm.h>
> > >  #include <linux/slab.h>
> > > +#include <linux/pm_runtime.h>
> > >  #include "pci.h"
> > >  
> > >  static int sysfs_initialized;	/* = 0 */
> > > @@ -377,6 +378,31 @@ dev_bus_rescan_store(struct device *dev,
> > >  
> > >  #endif
> > >  
> > > +#if defined(CONFIG_PM_RUNTIME) && defined(CONFIG_ACPI)
> > > +static ssize_t d3cold_allowed_store(struct device *dev,
> > > +				    struct device_attribute *attr,
> > > +				    const char *buf, size_t count)
> > > +{
> > > +	struct pci_dev *pdev = to_pci_dev(dev);
> > > +	unsigned long val;
> > > +
> > > +	if (strict_strtoul(buf, 0, &val) < 0)
> > > +		return -EINVAL;
> > > +
> > > +	pdev->d3cold_allowed = !!val;
> > > +	pm_runtime_resume(dev);
> > > +
> > > +	return count;
> > > +}
> > > +
> > > +static ssize_t d3cold_allowed_show(struct device *dev,
> > > +				   struct device_attribute *attr, char *buf)
> > > +{
> > > +	struct pci_dev *pdev = to_pci_dev(dev);
> > > +	return sprintf (buf, "%u\n", pdev->d3cold_allowed);
> > > +}
> > > +#endif
> > > +
> > >  struct device_attribute pci_dev_attrs[] = {
> > >  	__ATTR_RO(resource),
> > >  	__ATTR_RO(vendor),
> > > @@ -401,6 +427,9 @@ struct device_attribute pci_dev_attrs[]
> > >  	__ATTR(remove, (S_IWUSR|S_IWGRP), NULL, remove_store),
> > >  	__ATTR(rescan, (S_IWUSR|S_IWGRP), NULL, dev_rescan_store),
> > >  #endif
> > > +#if defined(CONFIG_PM_RUNTIME) && defined(CONFIG_ACPI)
> > > +	__ATTR(d3cold_allowed, 0644, d3cold_allowed_show, d3cold_allowed_store),
> > > +#endif
> > >  	__ATTR_NULL,
> > >  };
> > >  
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -653,6 +653,12 @@ void pci_update_current_state(struct pci
> > >  	if (dev->pm_cap) {
> > >  		u16 pmcsr;
> > >  
> > > +		/*
> > > +		 * Configuration space is not accessible for device in
> > > +		 * D3cold, so keep in D3cold for safety
> > > +		 */
> > > +		if (dev->current_state == PCI_D3cold)
> > > +			return;
> > >  		pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
> > >  		dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
> > >  	} else {
> > > @@ -671,8 +677,12 @@ static int pci_platform_power_transition
> > >  
> > >  	if (platform_pci_power_manageable(dev)) {
> > >  		error = platform_pci_set_power_state(dev, state);
> > > -		if (!error)
> > > -			pci_update_current_state(dev, state);
> > > +		if (!error) {
> > > +			if (state == PCI_D3cold)
> > > +				dev->current_state = PCI_D3cold;
> > 
> > I would put the above into pci_update_current_state(), so that this entire
> > hunk wouldn't be necessary.
> 
> OK.  Will do that.
> 
> > > +			else
> > > +				pci_update_current_state(dev, state);
> > > +		}
> > >  		/* Fall back to PCI_D0 if native PM is not supported */
> > >  		if (!dev->pm_cap)
> > >  			dev->current_state = PCI_D0;
> > > @@ -693,8 +703,49 @@ static int pci_platform_power_transition
> > >   */
> > >  static void __pci_start_power_transition(struct pci_dev *dev, pci_power_t state)
> > >  {
> > > -	if (state == PCI_D0)
> > > +	if (state == PCI_D0) {
> > >  		pci_platform_power_transition(dev, PCI_D0);
> > > +		/*
> > > +		 * Mandatory power management transition delays, see
> > > +		 * PCI Express Base Specification Revision 2.0 Section
> > > +		 * 6.6.1: Conventional Reset.  Do not delay for
> > > +		 * devices powered on/off by corresponding bridge,
> > > +		 * because have already delayed for the bridge.
> > > +		 */
> > > +		if (dev->runtime_d3cold) {
> > > +			msleep(dev->d3cold_delay);
> > > +			/* When powering on a bridge from D3cold, the
> > 
> > Put the /* into a separate line, please.
> 
> Sorry, will change it.
> 
> > > +			 * whole hierarchy may be powered on into
> > > +			 * D0uninitialized state, resume them to give
> > > +			 * them a chance to suspend again
> > > +			 */
> > > +			pci_wakeup_bus(dev->subordinate, false);
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * __pci_dev_set_current_state - Set current state of a PCI device
> > > + * @dev: Device to handle
> > > + * @data: pointer to state to be set
> > > + */
> > > +static int __pci_dev_set_current_state(struct pci_dev *dev, void *data)
> > > +{
> > > +	pci_power_t state = *(pci_power_t *)data;
> > > +
> > > +	dev->current_state = state;
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * __pci_bus_set_current_state - Walk given bus and set current state of devices
> > > + * @bus: Top bus of the subtree to walk.
> > > + * @state: state to be set
> > > + */
> > > +static void __pci_bus_set_current_state(struct pci_bus *bus, pci_power_t state)
> > > +{
> > > +	if (bus)
> > > +		pci_walk_bus(bus, __pci_dev_set_current_state, &state);
> > >  }
> > >  
> > >  /**
> > > @@ -706,8 +757,15 @@ static void __pci_start_power_transition
> > >   */
> > >  int __pci_complete_power_transition(struct pci_dev *dev, pci_power_t state)
> > >  {
> > > -	return state >= PCI_D0 ?
> > > -			pci_platform_power_transition(dev, state) : -EINVAL;
> > > +	int ret;
> > > +
> > > +	if (state < PCI_D0)
> > > +		return -EINVAL;
> > > +	ret = pci_platform_power_transition(dev, state);
> > > +	/* Power off the bridge may power off the whole hierarchy */
> > > +	if (!ret && state == PCI_D3cold)
> > > +		__pci_bus_set_current_state(dev->subordinate, PCI_D3cold);
> > > +	return ret;
> > >  }
> > >  EXPORT_SYMBOL_GPL(__pci_complete_power_transition);
> > >  
> > > @@ -731,8 +789,8 @@ int pci_set_power_state(struct pci_dev *
> > >  	int error;
> > >  
> > >  	/* bound the state we're entering */
> > > -	if (state > PCI_D3hot)
> > > -		state = PCI_D3hot;
> > > +	if (state > PCI_D3cold)
> > > +		state = PCI_D3cold;
> > >  	else if (state < PCI_D0)
> > >  		state = PCI_D0;
> > >  	else if ((state == PCI_D1 || state == PCI_D2) && pci_no_d1d2(dev))
> > > @@ -747,10 +805,15 @@ int pci_set_power_state(struct pci_dev *
> > >  
> > >  	/* This device is quirked not to be put into D3, so
> > >  	   don't put it in D3 */
> > > -	if (state == PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
> > > +	if (state >= PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
> > >  		return 0;
> > >  
> > > -	error = pci_raw_set_power_state(dev, state);
> > > +	/*
> > > +	 * To put device in D3cold, we put device into D3hot in native
> > > +	 * way, then put device into D3cold with platform ops
> > > +	 */
> > > +	error = pci_raw_set_power_state(dev, state > PCI_D3hot ?
> > > +					PCI_D3hot : state);
> > 
> > I wonder if we really need to transition the device into PCI_D3hot before
> > removing power from it?  If not, this whole thing might be simplified.
> 
> From a book named PCI Express System Architecture, Figure 16-6. PCI
> Express Function Power management State Transitions.  The D3 Cold can be
> transition to only from D3 Hot.  But did not find the similar figure in
> PCI express spec.
> 
> From the practical point of view.  If the power resources to power
> on/off the PCIe device is shared by several devices.  If we do not
> transition the device into PCI_D3hot before removing power from it, and
> another device share the same power resources is not powered off, we may
> waste quite some energy.
> 
> Here is another possible situation where the underlying ACPI may not
> really power off the device.  If the device has only _PS3 and no _PR3,
> for safety, we treat the state after executing _PS3 as D3cold, but in
> fact, it may be just D3hot, we need transition the device into D3hot
> here.
> 
> > >  
> > >  	if (!__pci_complete_power_transition(dev, state))
> > >  		error = 0;
> > > @@ -1497,6 +1560,35 @@ void pci_pme_wakeup_bus(struct pci_bus *
> > >  }
> > >  
> > >  /**
> > > + * pci_wakeup - Wake up a PCI device
> > > + * @dev: Device to handle.
> > > + * @data: whether to resume first device synchronously
> > > + */
> > > +static int pci_wakeup(struct pci_dev *pci_dev, void *data)
> > > +{
> > > +	bool *sync_first = data;
> > > +
> > > +	pci_wakeup_event(pci_dev);
> > > +	if (*sync_first) {
> > 
> > Why exactly do we need to do this?
> 
> This is used to avoid the the following situation:
> 
> 1) The devices under the PCIe port and the PCIe port itself are in
> D3cold state
> 2) Remote wakeup of a device is triggered, so the #WAKE pin of the
> device is asserted
> 3) The #WAKE pin is connected with GPE logic, so Lxx handler will be
> executed. In Lxx handler, Notify the ACPI handler of the PCIe port
> 4) In the PM handler of the PCIe port (pci_acpi_wake_dev), the PCIe port
> will be resumed, so that the PCIe port and the devices are powered on to
> D0uninitialized state.  So the devices under the PCIe port should be
> resumed too (with pci_wakeup_bus).
> 
> Without sync_first logic introduced here, all the devices under the PCIe
> port are resumed asynchronously, and during the resume process of the
> PCIe port, a suspending request is pending for the PCIe port too.  So it
> is possible for the PCIe port is resumed, suspended, then resumed again.
> 
> With sync_fist logic, one of the devices under the PCIe port are resumed
> synchronously, so that the PCIe port is guaranteed to be kept in ACTIVE
> state.
> 
> If you think this situation is so seldom that we need not care too much,
> I will remove the sync_first logic at least for now.

Yes, please.  I think we can add it later if need be.

Thanks,
Rafael
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ