lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220721163455.5ba133ef.alex.williamson@redhat.com>
Date:   Thu, 21 Jul 2022 16:34:55 -0600
From:   Alex Williamson <alex.williamson@...hat.com>
To:     Abhishek Sahu <abhsahu@...dia.com>
Cc:     Cornelia Huck <cohuck@...hat.com>,
        Yishai Hadas <yishaih@...dia.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Shameer Kolothum <shameerali.kolothum.thodi@...wei.com>,
        Kevin Tian <kevin.tian@...el.com>,
        "Rafael J . Wysocki" <rafael@...nel.org>,
        Max Gurtovoy <mgurtovoy@...dia.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
        <linux-pm@...r.kernel.org>, <linux-pci@...r.kernel.org>
Subject: Re: [PATCH v5 4/5] vfio/pci: Implement
 VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT

On Tue, 19 Jul 2022 17:45:22 +0530
Abhishek Sahu <abhsahu@...dia.com> wrote:

> Currently, if the runtime power management is enabled for vfio-pci
> based devices in the guest OS, then the guest OS will do the register
> write for PCI_PM_CTRL register. This write request will be handled in
> vfio_pm_config_write() where it will do the actual register write of
> PCI_PM_CTRL register. With this, the maximum D3hot state can be
> achieved for low power. If we can use the runtime PM framework, then
> we can achieve the D3cold state (on the supported systems) which will
> help in saving maximum power.
> 
> 1. D3cold state can't be achieved by writing PCI standard
>    PM config registers. This patch implements the following
>    newly added low power related device features:
>     - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>     - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
> 
>    The VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY will move the device into
>    the low power state, and the VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
>    will move the device out of the low power state.

Isn't this really:

	The VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY feature will allow the
	device to make use of low power platform states on the host
	while the VFIO_DEVICE_FEATURE_LOW_POWER_EXIT will prevent
	further use of those power states.

ie. we can't make the device move to low power and every ioctl/access
will make it exit, it's more about allowing/preventing use of those
platform provided low power states.

> 
> 2. The vfio-pci driver uses runtime PM framework for low power entry and
>    exit. On the platforms where D3cold state is supported, the runtime
>    PM framework will put the device into D3cold otherwise, D3hot or some
>    other power state will be used. If the user has explicitly disabled
>    runtime PM for the device, then the device will be in the power state
>    configured by the guest OS through PCI_PM_CTRL.

This is talking about disabling runtime PM support for a device on the
host precluding this interface from allowing the device to enter
platform defined low power states, right?
 
> 3. The hypervisors can implement virtual ACPI methods. For example,
>    in guest linux OS if PCI device ACPI node has _PR3 and _PR0 power
>    resources with _ON/_OFF method, then guest linux OS invokes
>    the _OFF method during D3cold transition and then _ON during D0
>    transition. The hypervisor can tap these virtual ACPI calls and then
>    call the low power device feature IOCTL.
> 
> 4. The 'pm_runtime_engaged' flag tracks the entry and exit to
>    runtime PM. This flag is protected with 'memory_lock' semaphore.
> 
> 5. All the config and other region access are wrapped under
>    pm_runtime_resume_and_get() and pm_runtime_put(). So, if any
>    device access happens while the device is in the runtime suspended
>    state, then the device will be resumed first before access. Once the
>    access has been finished, then the device will again go into the
>    runtime suspended state.
> 
> 6. The memory region access through mmap will not be allowed in the low
>    power state. Since __vfio_pci_memory_enabled() is a common function,
>    so check for 'pm_runtime_engaged' has been added explicitly in
>    vfio_pci_mmap_fault() to block only mmap'ed access.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@...dia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 151 +++++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h    |   1 +
>  2 files changed, 144 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 9517645acfa6..726a6f282496 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -259,11 +259,98 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	return ret;
>  }
>  
> +static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
> +{
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	vfio_pci_zap_and_down_write_memory_lock(vdev);
> +	if (vdev->pm_runtime_engaged) {
> +		up_write(&vdev->memory_lock);
> +		return -EINVAL;
> +	}

Awkward that we zap memory for the error path here, but optimizing
performance for a user that can't remember they've already activated
low power for a device doesn't seem like a priority ;)

> +
> +	vdev->pm_runtime_engaged = true;
> +	pm_runtime_put_noidle(&vdev->pdev->dev);
> +	up_write(&vdev->memory_lock);
> +
> +	return 0;
> +}
> +
> +static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
> +				  void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
> +	if (ret != 1)
> +		return ret;
> +
> +	/*
> +	 * Inside vfio_pci_runtime_pm_entry(), only the runtime PM usage count
> +	 * will be decremented. The pm_runtime_put() will be invoked again
> +	 * while returning from the ioctl and then the device can go into
> +	 * runtime suspended state.
> +	 */
> +	return vfio_pci_runtime_pm_entry(vdev);
> +}
> +
> +static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
> +{
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	down_write(&vdev->memory_lock);
> +	if (vdev->pm_runtime_engaged) {
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&vdev->pdev->dev);
> +	}
> +
> +	up_write(&vdev->memory_lock);
> +}
> +
> +static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
> +				 void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
> +	if (ret != 1)
> +		return ret;
> +
> +	/*
> +	 * The device should already be resumed by the vfio core layer.
> +	 * vfio_pci_runtime_pm_exit() will internally increment the usage
> +	 * count corresponding to pm_runtime_put() called during low power
> +	 * feature entry.
> +	 */
> +	vfio_pci_runtime_pm_exit(vdev);
> +	return 0;
> +}
> +
>  #ifdef CONFIG_PM
>  static int vfio_pci_core_runtime_suspend(struct device *dev)
>  {
>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>  
> +	down_write(&vdev->memory_lock);
> +	/*
> +	 * The user can move the device into D3hot state before invoking
> +	 * power management IOCTL. Move the device into D0 state here and then
> +	 * the pci-driver core runtime PM suspend function will move the device
> +	 * into the low power state. Also, for the devices which have
> +	 * NoSoftRst-, it will help in restoring the original state
> +	 * (saved locally in 'vdev->pm_save').
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +	up_write(&vdev->memory_lock);
> +
>  	/*
>  	 * If INTx is enabled, then mask INTx before going into the runtime
>  	 * suspended state and unmask the same in the runtime resume.
> @@ -393,6 +480,18 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  
>  	/*
>  	 * This function can be invoked while the power state is non-D0.
> +	 * This non-D0 power state can be with or without runtime PM.
> +	 * vfio_pci_runtime_pm_exit() will internally increment the usage
> +	 * count corresponding to pm_runtime_put() called during low power
> +	 * feature entry and then pm_runtime_resume() will wake up the device,
> +	 * if the device has already gone into the suspended state. Otherwise,
> +	 * the vfio_pci_set_power_state() will change the device power state
> +	 * to D0.
> +	 */
> +	vfio_pci_runtime_pm_exit(vdev);
> +	pm_runtime_resume(&pdev->dev);
> +
> +	/*
>  	 * This function calls __pci_reset_function_locked() which internally
>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>  	 * fail if the power state is non-D0. Also, for the devices which
> @@ -1224,6 +1323,10 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
> +		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
> +		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -1234,31 +1337,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  			   size_t count, loff_t *ppos, bool iswrite)
>  {
>  	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	int ret;
>  
>  	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>  		return -EINVAL;
>  
> +	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
> +	if (ret < 0) {

if (ret) {

Thanks,
Alex

> +		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
> +				     ret);
> +		return -EIO;
> +	}
> +
>  	switch (index) {
>  	case VFIO_PCI_CONFIG_REGION_INDEX:
> -		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_ROM_REGION_INDEX:
>  		if (iswrite)
> -			return -EINVAL;
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +			ret = -EINVAL;
> +		else
> +			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +		break;
>  
>  	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_VGA_REGION_INDEX:
> -		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		break;
> +
>  	default:
>  		index -= VFIO_PCI_NUM_REGIONS;
> -		return vdev->region[index].ops->rw(vdev, buf,
> +		ret = vdev->region[index].ops->rw(vdev, buf,
>  						   count, ppos, iswrite);
> +		break;
>  	}
>  
> -	return -EINVAL;
> +	pm_runtime_put(&vdev->pdev->dev);
> +	return ret;
>  }
>  
>  ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> @@ -1453,7 +1572,11 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  	mutex_lock(&vdev->vma_lock);
>  	down_read(&vdev->memory_lock);
>  
> -	if (!__vfio_pci_memory_enabled(vdev)) {
> +	/*
> +	 * Memory region cannot be accessed if the low power feature is engaged
> +	 * or memory access is disabled.
> +	 */
> +	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) {
>  		ret = VM_FAULT_SIGBUS;
>  		goto up_out;
>  	}
> @@ -2164,6 +2287,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		goto err_unlock;
>  	}
>  
> +	/*
> +	 * Some of the devices in the dev_set can be in the runtime suspended
> +	 * state. Increment the usage count for all the devices in the dev_set
> +	 * before reset and decrement the same after reset.
> +	 */
> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
> +	if (ret)
> +		goto err_unlock;
> +
>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>  		/*
>  		 * Test whether all the affected devices are contained by the
> @@ -2219,6 +2351,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		else
>  			mutex_unlock(&cur->vma_lock);
>  	}
> +
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +		pm_runtime_put(&cur->pdev->dev);
>  err_unlock:
>  	mutex_unlock(&dev_set->lock);
>  	return ret;
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index e96cc3081236..7ec81271bd05 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -125,6 +125,7 @@ struct vfio_pci_core_device {
>  	bool			nointx;
>  	bool			needs_pm_restore;
>  	bool			pm_intx_masked;
> +	bool			pm_runtime_engaged;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ