lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 17 Apr 2012 10:02:27 +0800
From:	huang ying <huang.ying.caritas@...il.com>
To:	"Rafael J. Wysocki" <rjw@...k.pl>
Cc:	"Yan, Zheng" <zheng.z.yan@...el.com>, bhelgaas@...gle.com,
	linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
	linux-pm@...r.kernel.org, Lin Ming <ming.m.lin@...el.com>,
	Zhang Rui <rui.zhang@...el.com>,
	ACPI Devel Mailing List <linux-acpi@...r.kernel.org>
Subject: Re: [RFC PATCH] PCIe: Add PCIe runtime D3cold support

On Tue, Apr 17, 2012 at 5:30 AM, Rafael J. Wysocki <rjw@...k.pl> wrote:
> On Monday, April 16, 2012, huang ying wrote:
>> Hi,
>>
>> On Sat, Apr 14, 2012 at 3:41 AM, Rafael J. Wysocki <rjw@...k.pl> wrote:
>> > Hi,
>> >
>> > On Friday, April 13, 2012, Yan, Zheng wrote:
>> >> Hi all,
>> >>
>> >> This patch adds PCIe runtime D3cold support, namely cut power supply for functions
>> >> beneath a PCIe port when they all have entered D3. A device in D3cold can only
>> >> generate wake event through the WAKE# pin. Because we can not access to a device's
>> >> configure space while it's in D3cold, pme_poll is disabled for devices in D3cold.
>> >>
>> >> Any comment will be appreciated.
>> >>
>> >> Signed-off-by: Zheng Yan <zheng.z.yan@...el.com>
>> >> ---
>> >> diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
>> >> index 0f150f2..e210e8cb 100644
>> >> --- a/drivers/pci/pci-acpi.c
>> >> +++ b/drivers/pci/pci-acpi.c
>> >> @@ -224,7 +224,7 @@ static int acpi_pci_set_power_state(struct pci_dev *dev, pci_power_t state)
>> >>               [PCI_D1] = ACPI_STATE_D1,
>> >>               [PCI_D2] = ACPI_STATE_D2,
>> >>               [PCI_D3hot] = ACPI_STATE_D3,
>> >> -             [PCI_D3cold] = ACPI_STATE_D3
>> >> +             [PCI_D3cold] = ACPI_STATE_D3_COLD
>> >>       };
>> >>       int error = -EINVAL;
>> >>
>> >
>> > Please don't use that ACPI_STATE_D3_COLD thing, it's not defined correctly.
>> >
>> > We should define ACPI_STATE_D3_COLD == ACPI_STATE_D3 and add ACPI_STATE_D3_HOT
>> > instead.  I'll prepare a patch for that over the weekend if no one has done
>> > that already.
>> >
>> >> @@ -296,7 +296,8 @@ static void acpi_pci_propagate_run_wake(struct pci_bus *bus, bool enable)
>> >>
>> >>  static int acpi_pci_run_wake(struct pci_dev *dev, bool enable)
>> >>  {
>> >> -     if (dev->pme_interrupt)
>> >> +     /* PME interrupt isn't available in the D3cold case */
>> >> +     if (dev->pme_interrupt && !dev->runtime_d3cold)
>> >
>> > This whole thing is wrong.  First off, I don't think that the runtime_d3cold
>> > flag makes any sense.  We already cover that in dev->pme_support.
>>
>> PCIe devices and PCIe root port may have proper PME interrupt support
>> (that is, dev->pme_interrupt = true), but the process of remote wakeup
>> from D3cold is as follow:
>>
>> 1) In D3cold, the power of the main link is turned off, aux power is
>> provided (PCIe L2 link state)
>> 2) Device detect condition to resume, then assert #WAKE pin
>> 2) ACPI circuit linked with #WAKE pin, and will generate a GPE for that
>> 3) GPE handler will resume device with ACPI (via _PS3 and _PR0), the
>> power of the main link is turned on, after a while, link goes into L0
>> state
>> 4) The PME message is sent to root port, pme interrupt generated
>
> This isn't how it's supposed to work in theory.  If the device can signal PME
> from D3cold, it should be able to reestablish the link and send a PME
> message from there.  dev->pme_interrupt set means exactly that.
>
> ACPI is only supposed to be needed for things that don't send PME
> messages (in your case the PME interrupt generated by the port is essentially
> useless, because the wakeup event has already been signaled through ACPI).
>
>> So, for deivce, dev->pme_interrupt = true and dev->pme_support
>> advocate it support PME in D3cold.  But we still need ACPI to setup
>> run wake for the device.
>
> OK, so this is nonstandard.

This is the standard behavior.  Please refer to PCI Express Base
Sepcification Revision 2.0, section 5.3.3.2 Link Wakeup.  In D1, D2
and D3hot state, PCIe device can transit the link from L1 to L0 state,
and send the PME message.  In D3cold, the main link is powered off,
PCIe device will use a STANDARD sideband signal WAKE# to signal wakeup
firstly, then platform (power controller in spec) will power on the
main link for the device, after main link is back to L0, the PME
message is send to root port, pme interrupt is generated.  So in
theory, the wake up process can be divided into platform part (which
power on the main link) and PCIe part (which send PME).

>> So we need a way to inform acpi_pci_run_wake(), one way is using a
>> flag, there is other ways such as adding another parameter to
>> acpi_pci_run_wake() to tell it which power state to go.
>
> I'm not sure what to do with that.  It looks like we should use ACPI for
> signaling remote wakeup on this platform and additionally provide a dummy
> interrupt handler for root ports that will only clear PME Status.
>
>> > Second, pme_interrupt means that the _root_ _port_, not the device itself will
>> > trigger an interrupt whenever the device sends the PME message to it (which
>> > very well may happen for a device in D3_cold woken up by an external signal).
>>
>> As above, that may happend after ACPI GPE.
>
> Because it's not a standard PCIe system.
>
>> >>               return 0;
>> >>
>> >>       if (!acpi_pm_device_run_wake(&dev->dev, enable))
>> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> >> index 8156744..bc16869 100644
>> >> --- a/drivers/pci/pci.c
>> >> +++ b/drivers/pci/pci.c
>> >> @@ -731,8 +731,8 @@ int pci_set_power_state(struct pci_dev *dev, pci_power_t state)
>> >>       int error;
>> >>
>> >
>> > Guys, please.  Never, _ever_, touch pci_set_power_state() without discussing
>> > your ideas with someone who knows how it works and _why_ it works this way.
>>
>> This patch is sent as RFC, I think this is to call for discussing
>> instead of merging, isn't it?
>>
>> > The problem here is that you can't program a PCI device into D3_cold, so it
>> > doesn't even make sense to have a helper for that.
>>
>> We need to program ACPI to make the PCI device go into D3_cold.  That
>> is called in this function.
>>
>> pci_set_power_sate
>>   -> __pci_complete_power_transition
>>     -> pci_platform_power_transition
>
> So don't use pci_set_power_state() for that, because it's essentially
> a different operation.  You need a pci_platform_remove_power() helper or
> similar for that.
>
> What ACPI method exactly is used to remove power from the device?

The ACPI method executed is as follow:

- _PS3 (if exist)
- Power resources in _PR3 is turned on
- Power resources in _PR0 is turned off
- Power resources in _PR3 is turned off

I think the process can fit pci_set_power_state() quite well, so why
invent another helper for that?

>> >>       /* bound the state we're entering */
>> >> -     if (state > PCI_D3hot)
>> >> -             state = PCI_D3hot;
>> >> +     if (state > PCI_D3cold)
>> >> +             state = PCI_D3cold;
>> >>       else if (state < PCI_D0)
>> >>               state = PCI_D0;
>> >>       else if ((state == PCI_D1 || state == PCI_D2) && pci_no_d1d2(dev))
>> >> @@ -750,7 +750,8 @@ int pci_set_power_state(struct pci_dev *dev, pci_power_t state)
>> >>       if (state == PCI_D3hot && (dev->dev_flags & PCI_DEV_FLAGS_NO_D3))
>> >>               return 0;
>> >>
>> >> -     error = pci_raw_set_power_state(dev, state);
>> >> +     error = pci_raw_set_power_state(dev, state > PCI_D3hot ?
>> >> +                                     PCI_D3hot : state);
>> >>
>> >>       if (!__pci_complete_power_transition(dev, state))
>> >>               error = 0;
>> >> @@ -1482,6 +1483,17 @@ bool pci_pme_capable(struct pci_dev *dev, pci_power_t state)
>> >>       return !!(dev->pme_support & (1 << state));
>> >>  }
>> >>
>> >> +static void pci_pme_poll_wakeup(struct pci_dev *dev)
>> >> +{
>> >> +     struct pci_dev *bridge = dev->bus->self;
>> >> +
>> >> +     /* don't poll the pme bit if parent is in low power state */
>> >> +     if (bridge && bridge->current_state != PCI_D0)
>> >> +             return;
>> >> +
>> >> +     pci_pme_wakeup(dev, NULL);
>> >> +}
>> >
>> > This one actually makes some sense, although it might be better to put the
>> > test into pci_pme_wakeup() itself.
>> >
>> >> +
>> >>  static void pci_pme_list_scan(struct work_struct *work)
>> >>  {
>> >>       struct pci_pme_device *pme_dev, *n;
>> >> @@ -1490,7 +1502,7 @@ static void pci_pme_list_scan(struct work_struct *work)
>> >>       if (!list_empty(&pci_pme_list)) {
>> >>               list_for_each_entry_safe(pme_dev, n, &pci_pme_list, list) {
>> >>                       if (pme_dev->dev->pme_poll) {
>> >> -                             pci_pme_wakeup(pme_dev->dev, NULL);
>> >> +                             pci_pme_poll_wakeup(pme_dev->dev);
>> >>                       } else {
>> >>                               list_del(&pme_dev->list);
>> >>                               kfree(pme_dev);
>> >> @@ -1608,6 +1620,10 @@ int __pci_enable_wake(struct pci_dev *dev, pci_power_t state,
>> >>       if (enable) {
>> >>               int error;
>> >>
>> >> +             if (runtime && state >= PCI_D3cold)
>> >> +                     dev->runtime_d3cold = true;
>> >> +             else
>> >> +                     dev->runtime_d3cold = false;
>> >>               if (pci_pme_capable(dev, state))
>> >>                       pci_pme_active(dev, true);
>> >>               else
>> >> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
>> >> index e0610bd..d66b7e9 100644
>> >> --- a/drivers/pci/pcie/portdrv_pci.c
>> >> +++ b/drivers/pci/pcie/portdrv_pci.c
>> >> @@ -11,11 +11,13 @@
>> >>  #include <linux/kernel.h>
>> >>  #include <linux/errno.h>
>> >>  #include <linux/pm.h>
>> >> +#include <linux/pm_runtime.h>
>> >>  #include <linux/init.h>
>> >>  #include <linux/pcieport_if.h>
>> >>  #include <linux/aer.h>
>> >>  #include <linux/dmi.h>
>> >>  #include <linux/pci-aspm.h>
>> >> +#include <linux/delay.h>
>> >>
>> >>  #include "portdrv.h"
>> >>  #include "aer/aerdrv.h"
>> >> @@ -99,6 +101,25 @@ static int pcie_port_resume_noirq(struct device *dev)
>> >>       return 0;
>> >>  }
>> >>
>> >> +static int pcie_port_runtime_suspend(struct device *dev)
>> >> +{
>> >> +     struct pci_dev *pdev = to_pci_dev(dev);
>> >> +
>> >> +     pci_save_state(pdev);
>> >
>> > Are you sure this is sufficient?
>>
>> Maybe we need to pay attention to the pcie port drivers?  Run ->runtime_
>
> Yes.
>
>> >> +     return 0;
>> >> +}
>> >> +
>> >> +static int pcie_port_runtime_resume(struct device *dev)
>> >> +{
>> >> +     struct pci_dev *pdev = to_pci_dev(dev);
>> >> +
>> >> +     pci_restore_state(pdev);
>> >> +     if (pdev->runtime_d3cold)
>> >> +             msleep(100);
>> >
>> > What's _that_ supposed to do?
>>
>> When resume from d3cold, PCIe main link will be powered on again, it
>> will take quite some time before the main link go into L0 state.
>> Otherwise, accessing devices under the port may return wrong result.
>
> OK, but this is generic code and as per the standard the link should have been
> reestablished at this point already.

Maybe we can have some other way to check whether the link is
reestablished.  I will check the this.

> Please don't put some nonstandard-platform-specific quirks like this into
> code that's supposed to handle _every_ PCIe system.
>
>> >> +
>> >> +     return 0;
>> >> +}
>> >> +
>> >>  static const struct dev_pm_ops pcie_portdrv_pm_ops = {
>> >>       .suspend        = pcie_port_device_suspend,
>> >>       .resume         = pcie_port_device_resume,
>> >> @@ -107,6 +128,8 @@ static const struct dev_pm_ops pcie_portdrv_pm_ops = {
>> >>       .poweroff       = pcie_port_device_suspend,
>> >>       .restore        = pcie_port_device_resume,
>> >>       .resume_noirq   = pcie_port_resume_noirq,
>> >> +     .runtime_suspend = pcie_port_runtime_suspend,
>> >> +     .runtime_resume = pcie_port_runtime_resume,
>> >>  };
>> >>
>> >>  #define PCIE_PORTDRV_PM_OPS  (&pcie_portdrv_pm_ops)
>> >> @@ -144,12 +167,14 @@ static int __devinit pcie_portdrv_probe(struct pci_dev *dev,
>> >>               return status;
>> >>
>> >>       pci_save_state(dev);
>> >> +     pm_runtime_put_noidle(&dev->dev);
>> >
>> > What's the purpose of this?
>>
>> I think this is needed by PCI runtime PM support.  Before ->probe,
>> pm_runtime_get_xxx() is executed on device.
>
> ok
>
>> >>       return 0;
>> >>  }
>> >>
>> >>  static void pcie_portdrv_remove(struct pci_dev *dev)
>> >>  {
>> >> +     pm_runtime_get_noresume(&dev->dev);
>> >>       pcie_port_device_remove(dev);
>> >>       pci_disable_device(dev);
>> >>  }
>> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> >> index e444f5b..b41d9a1 100644
>> >> --- a/include/linux/pci.h
>> >> +++ b/include/linux/pci.h
>> >> @@ -281,6 +281,7 @@ struct pci_dev {
>> >>       unsigned int    no_d1d2:1;      /* Only allow D0 and D3 */
>> >>       unsigned int    mmio_always_on:1;       /* disallow turning off io/mem
>> >>                                                  decoding during bar sizing */
>> >> +     unsigned int    runtime_d3cold:1;
>> >>       unsigned int    wakeup_prepared:1;
>> >>       unsigned int    d3_delay;       /* D3->D0 transition time in ms */
>> >
>> > OK
>> >
>> > So now please tell me what exactly you want to achieve and why you want to do
>> > that in the first place.
>
> Well, is there any chance to get that information?

You mean the runtime_d3cold flag? That flag is used to tell
acpi_pci_run_wake() that we need ACPI wakeup setup for the device
because that is needed by D3cold.  The ACPI wakeup setup here means
turn on power resources needed by wake up (_PRW) and execute _DSW.

If you mean the whole patch, we want to implement runtime D3cold
support, which can save more power than D3hot.

Best Regards,
Huang Ying
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ