lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220504134252.6d556d66.alex.williamson@redhat.com>
Date:   Wed, 4 May 2022 13:42:52 -0600
From:   Alex Williamson <alex.williamson@...hat.com>
To:     Abhishek Sahu <abhsahu@...dia.com>
Cc:     Cornelia Huck <cohuck@...hat.com>,
        Yishai Hadas <yishaih@...dia.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Shameer Kolothum <shameerali.kolothum.thodi@...wei.com>,
        Kevin Tian <kevin.tian@...el.com>,
        "Rafael J . Wysocki" <rafael@...nel.org>,
        Max Gurtovoy <mgurtovoy@...dia.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
        <linux-pm@...r.kernel.org>, <linux-pci@...r.kernel.org>
Subject: Re: [PATCH v3 5/8] vfio/pci: Enable runtime PM for vfio_pci_core
 based drivers

On Mon, 25 Apr 2022 14:56:12 +0530
Abhishek Sahu <abhsahu@...dia.com> wrote:

> Currently, there is very limited power management support
> available in the upstream vfio_pci_core based drivers. If there
> are no users of the device, then the PCI device will be moved into
> D3hot state by writing directly into PCI PM registers. This D3hot
> state help in saving power but we can achieve zero power consumption
> if we go into the D3cold state. The D3cold state cannot be possible
> with native PCI PM. It requires interaction with platform firmware
> which is system-specific. To go into low power states (including D3cold),
> the runtime PM framework can be used which internally interacts with PCI
> and platform firmware and puts the device into the lowest possible
> D-States.
> 
> This patch registers vfio_pci_core based drivers with the
> runtime PM framework.
> 
> 1. The PCI core framework takes care of most of the runtime PM
>    related things. For enabling the runtime PM, the PCI driver needs to
>    decrement the usage count and needs to provide 'struct dev_pm_ops'
>    at least. The runtime suspend/resume callbacks are optional and needed
>    only if we need to do any extra handling. Now there are multiple
>    vfio_pci_core based drivers. Instead of assigning the
>    'struct dev_pm_ops' in individual parent driver, the vfio_pci_core
>    itself assigns the 'struct dev_pm_ops'. There are other drivers where
>    the 'struct dev_pm_ops' is being assigned inside core layer
>    (For example, wlcore_probe() and some sound based driver, etc.).
> 
> 2. This patch provides the stub implementation of 'struct dev_pm_ops'.
>    The subsequent patch will provide the runtime suspend/resume
>    callbacks. All the config state saving, and PCI power management
>    related things will be done by PCI core framework itself inside its
>    runtime suspend/resume callbacks (pci_pm_runtime_suspend() and
>    pci_pm_runtime_resume()).
> 
> 3. Inside pci_reset_bus(), all the devices in dev_set needs to be
>    runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take
>    care of the runtime resume and its error handling.
> 
> 4. Inside vfio_pci_core_disable(), the device usage count always needs
>    to be decremented which was incremented in vfio_pci_core_enable().
> 
> 5. Since the runtime PM framework will provide the same functionality,
>    so directly writing into PCI PM config register can be replaced with
>    the use of runtime PM routines. Also, the use of runtime PM can help
>    us in more power saving.
> 
>    In the systems which do not support D3cold,
> 
>    With the existing implementation:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D0
> 
>    With runtime PM:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D3hot
> 
>    So, with runtime PM, the upstream bridge or root port will also go
>    into lower power state which is not possible with existing
>    implementation.
> 
>    In the systems which support D3cold,
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D0
> 
>    With runtime PM:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3cold
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D3cold
> 
>    So, with runtime PM, both the PCI device and upstream bridge will
>    go into D3cold state.
> 
> 6. If 'disable_idle_d3' module parameter is set, then also the runtime
>    PM will be enabled, but in this case, the usage count should not be
>    decremented.
> 
> 7. vfio_pci_dev_set_try_reset() return value is unused now, so this
>    function return type can be changed to void.
> 
> 8. Use the runtime PM API's in vfio_pci_core_sriov_configure().
>    For preventing any runtime usage mismatch, pci_num_vf() has been
>    called explicitly during disable.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@...dia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 169 +++++++++++++++++++++----------
>  1 file changed, 114 insertions(+), 55 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 953ac33b2f5f..aee5e0cd6137 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  }
>  
>  struct vfio_pci_group_info;
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  				      struct vfio_pci_group_info *groups);
>  
> @@ -261,6 +261,19 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	return ret;
>  }
>  
> +#ifdef CONFIG_PM
> +/*
> + * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
> + * so use structure without any callbacks.
> + *
> + * The pci-driver core runtime PM routines always save the device state
> + * before going into suspended state. If the device is going into low power
> + * state with only with runtime PM ops, then no explicit handling is needed
> + * for the devices which have NoSoftRst-.
> + */
> +static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
> +#endif
> +
>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> @@ -268,21 +281,23 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  	u16 cmd;
>  	u8 msix_pos;
>  
> -	vfio_pci_set_power_state(vdev, PCI_D0);
> +	if (!disable_idle_d3) {
> +		ret = pm_runtime_resume_and_get(&pdev->dev);
> +		if (ret < 0)
> +			return ret;
> +	}
>  
>  	/* Don't allow our initial saved state to include busmaster */
>  	pci_clear_master(pdev);
>  
>  	ret = pci_enable_device(pdev);
>  	if (ret)
> -		return ret;
> +		goto out_power;
>  
>  	/* If reset fails because of the device lock, fail this path entirely */
>  	ret = pci_try_reset_function(pdev);
> -	if (ret == -EAGAIN) {
> -		pci_disable_device(pdev);
> -		return ret;
> -	}
> +	if (ret == -EAGAIN)
> +		goto out_disable_device;
>  
>  	vdev->reset_works = !ret;
>  	pci_save_state(pdev);
> @@ -306,12 +321,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  	}
>  
>  	ret = vfio_config_init(vdev);
> -	if (ret) {
> -		kfree(vdev->pci_saved_state);
> -		vdev->pci_saved_state = NULL;
> -		pci_disable_device(pdev);
> -		return ret;
> -	}
> +	if (ret)
> +		goto out_free_state;
>  
>  	msix_pos = pdev->msix_cap;
>  	if (msix_pos) {
> @@ -332,6 +343,16 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  
>  
>  	return 0;
> +
> +out_free_state:
> +	kfree(vdev->pci_saved_state);
> +	vdev->pci_saved_state = NULL;
> +out_disable_device:
> +	pci_disable_device(pdev);
> +out_power:
> +	if (!disable_idle_d3)
> +		pm_runtime_put(&pdev->dev);
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_enable);
>  
> @@ -439,8 +460,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  out:
>  	pci_disable_device(pdev);
>  
> -	if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D3hot);
> +	vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
> +
> +	/* Put the pm-runtime usage counter acquired during enable */
> +	if (!disable_idle_d3)
> +		pm_runtime_put(&pdev->dev);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>  
> @@ -1879,19 +1903,24 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
>  
>  	vfio_pci_probe_power_state(vdev);
>  
> -	if (!disable_idle_d3) {
> -		/*
> -		 * pci-core sets the device power state to an unknown value at
> -		 * bootup and after being removed from a driver.  The only
> -		 * transition it allows from this unknown state is to D0, which
> -		 * typically happens when a driver calls pci_enable_device().
> -		 * We're not ready to enable the device yet, but we do want to
> -		 * be able to get to D3.  Therefore first do a D0 transition
> -		 * before going to D3.
> -		 */
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> -		vfio_pci_set_power_state(vdev, PCI_D3hot);
> -	}
> +	/*
> +	 * pci-core sets the device power state to an unknown value at
> +	 * bootup and after being removed from a driver.  The only
> +	 * transition it allows from this unknown state is to D0, which
> +	 * typically happens when a driver calls pci_enable_device().
> +	 * We're not ready to enable the device yet, but we do want to
> +	 * be able to get to D3.  Therefore first do a D0 transition
> +	 * before enabling runtime PM.
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +
> +#if defined(CONFIG_PM)
> +	dev->driver->pm = &vfio_pci_core_pm_ops,
> +#endif
> +
> +	pm_runtime_allow(dev);
> +	if (!disable_idle_d3)
> +		pm_runtime_put(dev);
>  
>  	ret = vfio_register_group_dev(&vdev->vdev);
>  	if (ret)
> @@ -1900,7 +1929,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
>  
>  out_power:
>  	if (!disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> +		pm_runtime_get_noresume(dev);
> +
> +	pm_runtime_forbid(dev);
>  out_vf:
>  	vfio_pci_vf_uninit(vdev);
>  out_drvdata:
> @@ -1922,8 +1953,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>  	vfio_pci_vga_uninit(vdev);
>  
>  	if (!disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> +		pm_runtime_get_noresume(dev);
>  
> +	pm_runtime_forbid(dev);
>  	dev_set_drvdata(dev, NULL);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
> @@ -1984,18 +2016,26 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
>  
>  		/*
>  		 * The PF power state should always be higher than the VF power
> -		 * state. If PF is in the low power state, then change the
> -		 * power state to D0 first before enabling SR-IOV.
> +		 * state. If PF is in the runtime suspended state, then resume
> +		 * it first before enabling SR-IOV.
>  		 */
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> -		ret = pci_enable_sriov(pdev, nr_virtfn);
> +		ret = pm_runtime_resume_and_get(&pdev->dev);
>  		if (ret)
>  			goto out_del;
> +
> +		ret = pci_enable_sriov(pdev, nr_virtfn);
> +		if (ret) {
> +			pm_runtime_put(&pdev->dev);
> +			goto out_del;
> +		}
>  		ret = nr_virtfn;
>  		goto out_put;
>  	}
>  
> -	pci_disable_sriov(pdev);
> +	if (pci_num_vf(pdev)) {
> +		pci_disable_sriov(pdev);
> +		pm_runtime_put(&pdev->dev);
> +	}
>  
>  out_del:
>  	mutex_lock(&vfio_pci_sriov_pfs_mutex);
> @@ -2072,6 +2112,30 @@ vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set)
>  	return pdev;
>  }
>  
> +static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
> +{
> +	struct vfio_pci_core_device *cur_pm;
> +	struct vfio_pci_core_device *cur;
> +	int ret = 0;
> +
> +	list_for_each_entry(cur_pm, &dev_set->device_list, vdev.dev_set_list) {
> +		ret = pm_runtime_resume_and_get(&cur_pm->pdev->dev);
> +		if (ret < 0)
> +			break;
> +	}
> +
> +	if (!ret)
> +		return 0;
> +
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> +		if (cur == cur_pm)
> +			break;
> +		pm_runtime_put(&cur->pdev->dev);
> +	}
> +
> +	return ret;
> +}

The above works, but maybe could be a little cleaner taking advantage
of list_for_each_entry_continue_reverse as:

{
	struct vfio_pci_core_device *cur;
	int ret;

	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
		ret = pm_runtime_resume_and_get(&cur->pdev->dev);
		if (ret)
			goto unwind;
	}

	return 0;

unwind:
	list_for_each_entry_continue_reverse(cur, &dev_set->device_list, vdev.dev_set_list)
		pm_runtime_put(&cur->pdev->dev);

	return ret;
}

Thanks,
Alex

> +
>  /*
>   * We need to get memory_lock for each device, but devices can share mmap_lock,
>   * therefore we need to zap and hold the vma_lock for each device, and only then
> @@ -2178,43 +2242,38 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
>   *  - At least one of the affected devices is marked dirty via
>   *    needs_reset (such as by lack of FLR support)
>   * Then attempt to perform that bus or slot reset.
> - * Returns true if the dev_set was reset.
>   */
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>  {
>  	struct vfio_pci_core_device *cur;
>  	struct pci_dev *pdev;
> -	int ret;
> +	bool reset_done = false;
>  
>  	if (!vfio_pci_dev_set_needs_reset(dev_set))
> -		return false;
> +		return;
>  
>  	pdev = vfio_pci_dev_set_resettable(dev_set);
>  	if (!pdev)
> -		return false;
> +		return;
>  
>  	/*
> -	 * The pci_reset_bus() will reset all the devices in the bus.
> -	 * The power state can be non-D0 for some of the devices in the bus.
> -	 * For these devices, the pci_reset_bus() will internally set
> -	 * the power state to D0 without vfio driver involvement.
> -	 * For the devices which have NoSoftRst-, the reset function can
> -	 * cause the PCI config space reset without restoring the original
> -	 * state (saved locally in 'vdev->pm_save').
> +	 * Some of the devices in the bus can be in the runtime suspended
> +	 * state. Increment the usage count for all the devices in the dev_set
> +	 * before reset and decrement the same after reset.
>  	 */
> -	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> -		vfio_pci_set_power_state(cur, PCI_D0);
> +	if (!disable_idle_d3 && vfio_pci_dev_set_pm_runtime_get(dev_set))
> +		return;
>  
> -	ret = pci_reset_bus(pdev);
> -	if (ret)
> -		return false;
> +	if (!pci_reset_bus(pdev))
> +		reset_done = true;
>  
>  	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> -		cur->needs_reset = false;
> +		if (reset_done)
> +			cur->needs_reset = false;
> +
>  		if (!disable_idle_d3)
> -			vfio_pci_set_power_state(cur, PCI_D3hot);
> +			pm_runtime_put(&cur->pdev->dev);
>  	}
> -	return true;
>  }
>  
>  void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ