lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b1609cc7-7f80-ee0e-aca4-1a7b0149e5af@nvidia.com>
Date:   Mon, 21 Feb 2022 12:05:25 +0530
From:   Abhishek Sahu <abhsahu@...dia.com>
To:     Alex Williamson <alex.williamson@...hat.com>
Cc:     kvm@...r.kernel.org, Cornelia Huck <cohuck@...hat.com>,
        Max Gurtovoy <mgurtovoy@...dia.com>,
        Yishai Hadas <yishaih@...dia.com>,
        Zhen Lei <thunder.leizhen@...wei.com>,
        Jason Gunthorpe <jgg@...dia.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with
 runtime PM framework

On 2/17/2022 5:18 AM, Alex Williamson wrote:
> On Mon, 24 Jan 2022 23:47:22 +0530
> Abhishek Sahu <abhsahu@...dia.com> wrote:
> 
>> Currently, there is very limited power management support
>> available in the upstream vfio-pci driver. If there is no user of vfio-pci
>> device, then the PCI device will be moved into D3Hot state by writing
>> directly into PCI PM registers. This D3Hot state help in saving power
>> but we can achieve zero power consumption if we go into the D3cold state.
>> The D3cold state cannot be possible with native PCI PM. It requires
>> interaction with platform firmware which is system-specific.
>> To go into low power states (including D3cold), the runtime PM framework
>> can be used which internally interacts with PCI and platform firmware and
>> puts the device into the lowest possible D-States.
>>
>> This patch registers vfio-pci driver with the runtime PM framework.
>>
>> 1. The PCI core framework takes care of most of the runtime PM
>>    related things. For enabling the runtime PM, the PCI driver needs to
>>    decrement the usage count and needs to register the runtime
>>    suspend/resume callbacks. For vfio-pci based driver, these callback
>>    routines can be stubbed in this patch since the vfio-pci driver
>>    is not doing the PCI device initialization. All the config state
>>    saving, and PCI power management related things will be done by
>>    PCI core framework itself inside its runtime suspend/resume callbacks.
>>
>> 2. Inside pci_reset_bus(), all the devices in bus/slot will be moved
>>    out of D0 state. This state change to D0 can happen directly without
>>    going through the runtime PM framework. So if runtime PM is enabled,
>>    then pm_runtime_resume() makes the runtime state active. Since the PCI
>>    device power state is already D0, so it should return early when it
>>    tries to change the state with pci_set_power_state(). Then
>>    pm_request_idle() can be used which will internally check for
>>    device usage count and will move the device again into the low power
>>    state.
>>
>> 3. Inside vfio_pci_core_disable(), the device usage count always needs
>>    to be decremented which was incremented in vfio_pci_core_enable().
>>
>> 4. Since the runtime PM framework will provide the same functionality,
>>    so directly writing into PCI PM config register can be replaced with
>>    the use of runtime PM routines. Also, the use of runtime PM can help
>>    us in more power saving.
>>
>>    In the systems which do not support D3Cold,
>>
>>    With the existing implementation:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D0
>>
>>    With runtime PM:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D3hot
>>
>>    So, with runtime PM, the upstream bridge or root port will also go
>>    into lower power state which is not possible with existing
>>    implementation.
>>
>>    In the systems which support D3Cold,
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D0
>>
>>    With runtime PM:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3cold
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D3cold
>>
>>    So, with runtime PM, both the PCI device and upstream bridge will
>>    go into D3cold state.
>>
>> 5. If 'disable_idle_d3' module parameter is set, then also the runtime
>>    PM will be enabled, but in this case, the usage count should not be
>>    decremented.
>>
>> 6. vfio_pci_dev_set_try_reset() return value is unused now, so this
>>    function return type can be changed to void.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@...dia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci.c      |  3 +
>>  drivers/vfio/pci/vfio_pci_core.c | 95 +++++++++++++++++++++++---------
>>  include/linux/vfio_pci_core.h    |  4 ++
>>  3 files changed, 75 insertions(+), 27 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index a5ce92beb655..c8695baf3b54 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
>>       .remove                 = vfio_pci_remove,
>>       .sriov_configure        = vfio_pci_sriov_configure,
>>       .err_handler            = &vfio_pci_core_err_handlers,
>> +#if defined(CONFIG_PM)
>> +     .driver.pm              = &vfio_pci_core_pm_ops,
>> +#endif
>>  };
>>
>>  static void __init vfio_pci_fill_ids(void)
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index f948e6cd2993..c6e4fe9088c3 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>>  }
>>
>>  struct vfio_pci_group_info;
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>                                     struct vfio_pci_group_info *groups);
>>
>> @@ -245,7 +245,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>       u16 cmd;
>>       u8 msix_pos;
>>
>> -     vfio_pci_set_power_state(vdev, PCI_D0);
>> +     if (!disable_idle_d3) {
>> +             ret = pm_runtime_resume_and_get(&pdev->dev);
>> +             if (ret < 0)
>> +                     return ret;
>> +     }
> 
> Sorry for the delay in review, I'm a novice in pm runtime, but I
> haven't forgotten about the remainder of this series.
> 

 Thanks Alex.
 Should I include linux-pm@...r.kernel.org while sending the updated
 version. I got following comment in my different patch related
 with PCI PM
 (https://lore.kernel.org/lkml/20220204233219.GA228585@bhelgaas/T/#me17cb6e1aa3848cfd4ea577a3c93ebbbfdbf7c73)

 "generally PM patches should be CCed to linux-pm anyway"


> I think we're removing the unconditional wake here because we now wake
> the device in the core registration function below, but I think there
> might be a subtle dependency here on the fix to always wake devices in
> the disable function as well, otherwise I'm afraid the power state of a
> device released in D3hot could leak to the next user here.
> 

 Yes. We need to consider the fix. 
 Either we can add the state restore handling logic inside
 vfio_pci_core_runtime_resume() or we can keep restore the state
 alone here explictly. For runtime PM, we need to call
 pm_runtime_resume_and_get() first since root port should be moved
 to D0 first. 

>>
>>       /* Don't allow our initial saved state to include busmaster */
>>       pci_clear_master(pdev);
>> @@ -405,8 +409,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>  out:
>>       pci_disable_device(pdev);
>>
>> -     if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
>> -             vfio_pci_set_power_state(vdev, PCI_D3hot);
>> +     vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
>> +
>> +     /* Put the pm-runtime usage counter acquired during enable */
>> +     if (!disable_idle_d3)
>> +             pm_runtime_put(&pdev->dev);
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>>
>> @@ -1847,19 +1854,20 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>
>>       vfio_pci_probe_power_state(vdev);
>>
>> -     if (!disable_idle_d3) {
>> -             /*
>> -              * pci-core sets the device power state to an unknown value at
>> -              * bootup and after being removed from a driver.  The only
>> -              * transition it allows from this unknown state is to D0, which
>> -              * typically happens when a driver calls pci_enable_device().
>> -              * We're not ready to enable the device yet, but we do want to
>> -              * be able to get to D3.  Therefore first do a D0 transition
>> -              * before going to D3.
>> -              */
>> -             vfio_pci_set_power_state(vdev, PCI_D0);
>> -             vfio_pci_set_power_state(vdev, PCI_D3hot);
>> -     }
>> +     /*
>> +      * pci-core sets the device power state to an unknown value at
>> +      * bootup and after being removed from a driver.  The only
>> +      * transition it allows from this unknown state is to D0, which
>> +      * typically happens when a driver calls pci_enable_device().
>> +      * We're not ready to enable the device yet, but we do want to
>> +      * be able to get to D3.  Therefore first do a D0 transition
>> +      * before enabling runtime PM.
>> +      */
>> +     vfio_pci_set_power_state(vdev, PCI_D0);
>> +     pm_runtime_allow(&pdev->dev);
>> +
>> +     if (!disable_idle_d3)
>> +             pm_runtime_put(&pdev->dev);
> 
> I could use some enlightenment here.  pm_runtime_allow() only does
> something if power.runtime_allow is false, in which case it sets that
> value to true and decrements power.usage_count.  runtime_allow is
> enabled by default in pm_runtime_init(), but pci_pm_init() calls
> pm_runtime_forbid() which does the reverse of pm_runtime_allow().  So
> do I understand correctly that PCI devices are probed with
> runtime_allow = false and a usage_count of 2?
> 

 Following is the flow w.r.t. usage_count and runtime_allow.

 In pci_pm_init(), the default usage_count=0 and runtime_allow=true initially.
 pm_runtime_forbid() in pci_pm_init() makes usage_count=1 and runtime_allow=false

 Then, inside local_pci_probe(), pm_runtime_get_sync() is called,
 After this, the usage_count=2 and runtime_allow=false

 So, you are correct that the PCI devices are probed with
 runtime_allow=false and usage_count=2.

 In the driver, 

 pm_runtime_allow() is for doing the reverse of pm_runtime_forbid().
 and pm_runtime_put() is for doing the reverse of pm_runtime_get_sync().
 
>>
>>       ret = vfio_register_group_dev(&vdev->vdev);
>>       if (ret)
>> @@ -1868,7 +1876,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>
>>  out_power:
>>       if (!disable_idle_d3)
>> -             vfio_pci_set_power_state(vdev, PCI_D0);
>> +             pm_runtime_get_noresume(&pdev->dev);
>> +
>> +     pm_runtime_forbid(&pdev->dev);
>>  out_vf:
>>       vfio_pci_vf_uninit(vdev);
>>       return ret;
>> @@ -1887,7 +1897,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>>       vfio_pci_vga_uninit(vdev);
>>
>>       if (!disable_idle_d3)
>> -             vfio_pci_set_power_state(vdev, PCI_D0);
>> +             pm_runtime_get_noresume(&pdev->dev);
>> +
>> +     pm_runtime_forbid(&pdev->dev);
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>>
>> @@ -2093,33 +2105,62 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
>>   *  - At least one of the affected devices is marked dirty via
>>   *    needs_reset (such as by lack of FLR support)
>>   * Then attempt to perform that bus or slot reset.
>> - * Returns true if the dev_set was reset.
>>   */
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>>  {
>>       struct vfio_pci_core_device *cur;
>>       struct pci_dev *pdev;
>>       int ret;
>>
>>       if (!vfio_pci_dev_set_needs_reset(dev_set))
>> -             return false;
>> +             return;
>>
>>       pdev = vfio_pci_dev_set_resettable(dev_set);
>>       if (!pdev)
>> -             return false;
>> +             return;
>>
>>       ret = pci_reset_bus(pdev);
>>       if (ret)
>> -             return false;
>> +             return;
>>
>>       list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>>               cur->needs_reset = false;
>> -             if (!disable_idle_d3)
>> -                     vfio_pci_set_power_state(cur, PCI_D3hot);
>> +             if (!disable_idle_d3) {
>> +                     /*
>> +                      * Inside pci_reset_bus(), all the devices in bus/slot
>> +                      * will be moved out of D0 state. This state change to
> 
> s/out of/into/?
> 

 Yes. I will fix this. 

>> +                      * D0 can happen directly without going through the
>> +                      * runtime PM framework. pm_runtime_resume() will
>> +                      * help make the runtime state as active and then
>> +                      * pm_request_idle() can be used which will
>> +                      * internally check for device usage count and will
>> +                      * move the device again into the low power state.
>> +                      */
>> +                     pm_runtime_resume(&pdev->dev);
>> +                     pm_request_idle(&pdev->dev);
>> +             }
>>       }
>> -     return true;
>>  }
>>
>> +#ifdef CONFIG_PM
>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
>> +{
>> +     return 0;
>> +}
>> +
>> +static int vfio_pci_core_runtime_resume(struct device *dev)
>> +{
>> +     return 0;
>> +}
>> +
>> +const struct dev_pm_ops vfio_pci_core_pm_ops = {
>> +     SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
>> +                        vfio_pci_core_runtime_resume,
>> +                        NULL)
>> +};
>> +EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
>> +#endif
> 
> It looks like the vfio_pci_core_pm_ops implementation should all be
> moved to where we implement D3cold support, it's not necessary to
> implement stubs for any of the functionality of this patch.  Thanks,
> 

 We need to provide dev_pm_ops atleast to make runtime PM working.
 In pci_pm_runtime_idle() generic function:

 const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
 if (!pm)
    return -ENOSYS;

 Without dev_pm_ops, the idle routine will return ENOSYS error.

 vfio_pci_core_runtime_{suspend/resume}() stub implementation can be removed
 but we need to provide stub vfio_pci_core_pm_ops atleast.

 const struct dev_pm_ops vfio_pci_core_pm_ops = { };
 EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);

 Thanks,
 Abhishek 
 
> Alex
> 
>> +
>>  void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
>>                             bool is_disable_idle_d3)
>>  {
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index ef9a44b6cf5d..aafe09c9fa64 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
>>  void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
>>  void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
>>
>> +#ifdef CONFIG_PM
>> +extern const struct dev_pm_ops vfio_pci_core_pm_ops;
>> +#endif
>> +
>>  static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>>  {
>>       return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ