lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 6 Jul 2022 09:40:41 -0600
From:   Alex Williamson <alex.williamson@...hat.com>
To:     Abhishek Sahu <abhsahu@...dia.com>
Cc:     Cornelia Huck <cohuck@...hat.com>,
        Yishai Hadas <yishaih@...dia.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Shameer Kolothum <shameerali.kolothum.thodi@...wei.com>,
        Kevin Tian <kevin.tian@...el.com>,
        "Rafael J . Wysocki" <rafael@...nel.org>,
        Max Gurtovoy <mgurtovoy@...dia.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        <linux-kernel@...r.kernel.org>, <kvm@...r.kernel.org>,
        <linux-pm@...r.kernel.org>, <linux-pci@...r.kernel.org>
Subject: Re: [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state

On Fri, 1 Jul 2022 16:38:12 +0530
Abhishek Sahu <abhsahu@...dia.com> wrote:

> Currently, if the runtime power management is enabled for vfio-pci
> based devices in the guest OS, then the guest OS will do the register
> write for PCI_PM_CTRL register. This write request will be handled in
> vfio_pm_config_write() where it will do the actual register write of
> PCI_PM_CTRL register. With this, the maximum D3hot state can be
> achieved for low power. If we can use the runtime PM framework, then
> we can achieve the D3cold state (on the supported systems) which will
> help in saving maximum power.
> 
> 1. D3cold state can't be achieved by writing PCI standard
>    PM config registers. This patch implements the newly added
>    'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' device feature which
>     can be used for putting the device into the D3cold state.
> 
> 2. The hypervisors can implement virtual ACPI methods. For example,
>    in guest linux OS if PCI device ACPI node has _PR3 and _PR0 power
>    resources with _ON/_OFF method, then guest linux OS invokes
>    the _OFF method during D3cold transition and then _ON during D0
>    transition. The hypervisor can tap these virtual ACPI calls and then
>    call the  'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' with respective flags.
> 
> 3. The vfio-pci driver uses runtime PM framework to achieve the
>    D3cold state. For the D3cold transition, decrement the usage count and
>    for the D0 transition, increment the usage count.
> 
> 4. If the D3cold state is not supported, then the device will
>    still be in the D3hot state. But with the runtime PM, the root port
>    can now also go into the suspended state.
> 
> 5. The 'pm_runtime_engaged' flag tracks the entry and exit to
>    runtime PM. This flag is protected with 'memory_lock' semaphore.
> 
> 6. During exit time, the flag clearing and usage count increment
>    are protected with 'memory_lock'. The actual wake-up is happening
>    outside 'memory_lock' since 'memory_lock' will be needed inside
>    runtime_resume callback also in subsequent patches.
> 
> 7. In D3cold, all kinds of device-related access (BAR read/write,
>    config read/write, etc.) need to be disabled. For BAR-related access,
>    we can use existing D3hot memory disable support. During the low power
>    entry, invalidate the mmap mappings and add the check for
>    'pm_runtime_engaged' flag.

Not disabled, just wrapped in pm-get/put.  If the device is indefinitely
in low-power without a wake-up eventfd, mmap faults are fatal to the
user.
 
> 8. For config space, ideally, we need to return an error whenever
>    there is any config access from the user side once the user moved the
>    device into low power state. But adding a check for
>    'pm_runtime_engaged' flag alone won't be sufficient due to the
>    following possible scenario from the user side where config space
>    access happens parallelly with the low power entry IOCTL.
> 
>    a. Config space access happens and vfio_pci_config_rw() will be
>       called.
>    b. The IOCTL to move into low power state is called.
>    c. The IOCTL will move the device into d3cold.
>    d. Exit from vfio_pci_config_rw() happened.
> 
>    Now, if we just check 'pm_runtime_engaged', then in the above
>    sequence the config space access will happen when the device already
>    is in the low power state. To prevent this situation, we increment the
>    usage count before any config space access and decrement the same
>    after completing the access. Also, to prevent any similar cases for
>    other types of access, the usage count will be incremented for all
>    kinds of access.

Unnecessary, just wrap in pm-get/put.  Thanks,

Alex

> Signed-off-by: Abhishek Sahu <abhsahu@...dia.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c |   2 +-
>  drivers/vfio/pci/vfio_pci_core.c   | 169 +++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h      |   1 +
>  3 files changed, 164 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 9343f597182d..21a4743d011f 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -408,7 +408,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
>  	 * PF SR-IOV capability, there's therefore no need to trigger
>  	 * faults based on the virtual value.
>  	 */
> -	return pdev->current_state < PCI_D3hot &&
> +	return !vdev->pm_runtime_engaged && pdev->current_state < PCI_D3hot &&
>  	       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
>  }
>  
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 5948d930449b..8c17ca41d156 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -264,6 +264,18 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
>  {
>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>  
> +	down_write(&vdev->memory_lock);
> +	/*
> +	 * The user can move the device into D3hot state before invoking
> +	 * power management IOCTL. Move the device into D0 state here and then
> +	 * the pci-driver core runtime PM suspend function will move the device
> +	 * into the low power state. Also, for the devices which have
> +	 * NoSoftRst-, it will help in restoring the original state
> +	 * (saved locally in 'vdev->pm_save').
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +	up_write(&vdev->memory_lock);
> +
>  	/*
>  	 * If INTx is enabled, then mask INTx before going into the runtime
>  	 * suspended state and unmask the same in the runtime resume.
> @@ -386,6 +398,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	struct pci_dev *pdev = vdev->pdev;
>  	struct vfio_pci_dummy_resource *dummy_res, *tmp;
>  	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
> +	bool do_resume = false;
>  	int i, bar;
>  
>  	/* For needs_reset */
> @@ -393,6 +406,25 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  
>  	/*
>  	 * This function can be invoked while the power state is non-D0.
> +	 * This non-D0 power state can be with or without runtime PM.
> +	 * Increment the usage count corresponding to pm_runtime_put()
> +	 * called during setting of 'pm_runtime_engaged'. The device will
> +	 * wake up if it has already gone into the suspended state.
> +	 * Otherwise, the next vfio_pci_set_power_state() will change the
> +	 * device power state to D0.
> +	 */
> +	down_write(&vdev->memory_lock);
> +	if (vdev->pm_runtime_engaged) {
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		do_resume = true;
> +	}
> +	up_write(&vdev->memory_lock);
> +
> +	if (do_resume)
> +		pm_runtime_resume(&pdev->dev);
> +
> +	/*
>  	 * This function calls __pci_reset_function_locked() which internally
>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>  	 * fail if the power state is non-D0. Also, for the devices which
> @@ -1190,6 +1222,99 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>  
> +static int vfio_pci_pm_validate_flags(u32 flags)
> +{
> +	if (!flags)
> +		return -EINVAL;
> +	/* Only valid flags should be set */
> +	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +		return -EINVAL;
> +	/* Both enter and exit should not be set */
> +	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
> +	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
> +				    void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct vfio_device_feature_power_management vfio_pm = { 0 };
> +	int ret = 0;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_SET |
> +				 VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(vfio_pm));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
> +		down_read(&vdev->memory_lock);
> +		if (vdev->pm_runtime_engaged)
> +			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
> +		else
> +			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
> +		up_read(&vdev->memory_lock);
> +
> +		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> +			return -EFAULT;
> +
> +		return 0;
> +	}
> +
> +	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
> +		return -EFAULT;
> +
> +	ret = vfio_pci_pm_validate_flags(vfio_pm.flags);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	if (vfio_pm.flags & VFIO_PM_LOW_POWER_ENTER) {
> +		vfio_pci_zap_and_down_write_memory_lock(vdev);
> +		if (vdev->pm_runtime_engaged) {
> +			up_write(&vdev->memory_lock);
> +			return -EINVAL;
> +		}
> +
> +		vdev->pm_runtime_engaged = true;
> +		up_write(&vdev->memory_lock);
> +		pm_runtime_put(&pdev->dev);
> +	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
> +		down_write(&vdev->memory_lock);
> +		if (!vdev->pm_runtime_engaged) {
> +			up_write(&vdev->memory_lock);
> +			return -EINVAL;
> +		}
> +
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		up_write(&vdev->memory_lock);
> +		ret = pm_runtime_resume(&pdev->dev);
> +		if (ret < 0) {
> +			down_write(&vdev->memory_lock);
> +			if (!vdev->pm_runtime_engaged) {
> +				vdev->pm_runtime_engaged = true;
> +				pm_runtime_put_noidle(&pdev->dev);
> +			}
> +			up_write(&vdev->memory_lock);
> +			return ret;
> +		}
> +	} else {
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
>  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  				       void __user *arg, size_t argsz)
>  {
> @@ -1224,6 +1349,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
> +		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -1234,31 +1361,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  			   size_t count, loff_t *ppos, bool iswrite)
>  {
>  	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	int ret;
>  
>  	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>  		return -EINVAL;
>  
> +	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
> +	if (ret < 0) {
> +		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
> +				     ret);
> +		return -EIO;
> +	}
> +
>  	switch (index) {
>  	case VFIO_PCI_CONFIG_REGION_INDEX:
> -		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_ROM_REGION_INDEX:
>  		if (iswrite)
> -			return -EINVAL;
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +			ret = -EINVAL;
> +		else
> +			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +		break;
>  
>  	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_VGA_REGION_INDEX:
> -		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		break;
> +
>  	default:
>  		index -= VFIO_PCI_NUM_REGIONS;
> -		return vdev->region[index].ops->rw(vdev, buf,
> +		ret = vdev->region[index].ops->rw(vdev, buf,
>  						   count, ppos, iswrite);
> +		break;
>  	}
>  
> -	return -EINVAL;
> +	pm_runtime_put(&vdev->pdev->dev);
> +	return ret;
>  }
>  
>  ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> @@ -2157,6 +2300,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		goto err_unlock;
>  	}
>  
> +	/*
> +	 * Some of the devices in the dev_set can be in the runtime suspended
> +	 * state. Increment the usage count for all the devices in the dev_set
> +	 * before reset and decrement the same after reset.
> +	 */
> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
> +	if (ret)
> +		goto err_unlock;
> +
>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>  		/*
>  		 * Test whether all the affected devices are contained by the
> @@ -2212,6 +2364,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		else
>  			mutex_unlock(&cur->vma_lock);
>  	}
> +
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +		pm_runtime_put(&cur->pdev->dev);
>  err_unlock:
>  	mutex_unlock(&dev_set->lock);
>  	return ret;
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index cdfd328ba6b1..bf4823b008f9 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -125,6 +125,7 @@ struct vfio_pci_core_device {
>  	bool			nointx;
>  	bool			needs_pm_restore;
>  	bool			pm_intx_masked;
> +	bool			pm_runtime_engaged;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;

Powered by blists - more mailing lists