linux-kernel - Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <68463d9b-98ee-b9ec-1a3e-1375e50a2ad2@nvidia.com>
Date:   Tue, 10 May 2022 18:56:02 +0530
From:   Abhishek Sahu <abhsahu@...dia.com>
To:     Alex Williamson <alex.williamson@...hat.com>
Cc:     Cornelia Huck <cohuck@...hat.com>,
        Yishai Hadas <yishaih@...dia.com>,
        Jason Gunthorpe <jgg@...dia.com>,
        Shameer Kolothum <shameerali.kolothum.thodi@...wei.com>,
        Kevin Tian <kevin.tian@...el.com>,
        "Rafael J . Wysocki" <rafael@...nel.org>,
        Max Gurtovoy <mgurtovoy@...dia.com>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
        linux-pm@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state

On 5/10/2022 3:18 AM, Alex Williamson wrote:
> On Thu, 5 May 2022 17:46:20 +0530
> Abhishek Sahu <abhsahu@...dia.com> wrote:
> 
>> On 5/5/2022 1:15 AM, Alex Williamson wrote:
>>> On Mon, 25 Apr 2022 14:56:15 +0530
>>> Abhishek Sahu <abhsahu@...dia.com> wrote:
>>>   
>>>> Currently, if the runtime power management is enabled for vfio pci
>>>> based device in the guest OS, then guest OS will do the register
>>>> write for PCI_PM_CTRL register. This write request will be handled in
>>>> vfio_pm_config_write() where it will do the actual register write
>>>> of PCI_PM_CTRL register. With this, the maximum D3hot state can be
>>>> achieved for low power. If we can use the runtime PM framework,
>>>> then we can achieve the D3cold state which will help in saving
>>>> maximum power.
>>>>
>>>> 1. Since D3cold state can't be achieved by writing PCI standard
>>>>    PM config registers, so this patch adds a new feature in the
>>>>    existing VFIO_DEVICE_FEATURE IOCTL. This IOCTL can be used
>>>>    to change the PCI device from D3hot to D3cold state and
>>>>    then D3cold to D0 state. The device feature uses low power term
>>>>    instead of D3cold so that if other vfio driver wants to implement
>>>>    low power support, then the same IOCTL can be used.  
>>>
>>> How does this enable you to handle the full-off vs memory-refresh modes
>>> for NVIDIA GPUs?
>>>   
>>  
>>  Thanks Alex.
>>
>>  This patch series will just enable the full-off for nvidia GPU.
>>  The self-refresh mode won't work.
>>
>>  The self-refresh case is nvidia specific and needs driver
>>  involvement each time before going into d3cold. We are evaluating
>>  internally if we have enough use case for self-refresh mode and then
>>  I will plan separate patch series to support self-refresh mode use
>>  case, if required. But that will be independent of this patch series.
>>
>>  At the high level, we need some way to disable the PCI device access
>>  from the host side or forward the event to VM for every access on the
>>  host side if we want to support NVIDIA self-refresh use case inside VM.
>>  Otherwise, from the driver side, we can disable self-refresh mode if
>>  driver is running inside VM. In that case, if memory usage is higher than
>>  threshold then we don’t engage RTD3 itself. 
> 
> Disabling PCI access on the host seems impractical to me, but PM and
> PCI folks are welcome to weigh in.
> 
> We've also discussed that the GPU memory could exceed RAM + swap for a
> VM, leaving them with no practical means to make use of d3cold if we
> don't support this capability.  Also, existing drivers expect to have
> this capability and it's not uncommon for those in the gaming community
> making use of GPU assignment to attempt to hide the fact that they're
> running in a VM to avoid falsely triggering anti-cheat detection, DRM,
> or working around certain GPU vendors who previously restricted use of
> consumer GPUs in VMs.
> 
> That seems to suggest to me that our only option is along the lines of
> notifying the VM when the device returns to D0 and by default only
> re-entering d3cold under the direction of the VM.  We might also do some
> sort of negotiation based on device vendor and class code where we
> could enable the kernel to perform the transition back to d3cold.
> There's a fair chance that an AMD GPU might have similar requirements,
> do we know if they do?
> 

 That SW involvement before going into D3cold can be possible for
 other devices as well although I am not sure about the current
 AMD GPU implementation. For NVIDIA GPU, the firmware running on the
 GPU listens for PME_turn_Off and then do the handling for self-refresh.
 For other devices also, if they have firmware involvement before
 D3cold entry then the similar issue can come there also.

> I'd suggest perhaps splitting this patch series so that we can start
> taking advantage of using d3cold for idle devices while we figure out
> how to make use of VM directed d3cold without creating scenarios that
> don't break existing drivers.
>  

 Sure. I can make this patch series and will move the last 3
 patches in separate patch series along with the VM notification
 support for the wake-up triggered by host.

>>> The feature ioctl supports a probe, but here the probe only indicates
>>> that the ioctl is available, not what degree of low power support
>>> available.  Even if the host doesn't support d3cold for the device, we
>>> can still achieve root port d3hot, but can we provide further
>>> capability info to the user?
>>>  
>>
>>  I wanted to add more information here but was not sure which
>>  information will be helpful for user. There is no certain way to
>>  predict that the runtime suspend will use D3cold state only even
>>  on the supported systems. User can disable runtime power management from 
>>
>>  /sys/bus/pci/devices/…/power/control
>>
>>  Or disable d3cold itself 
>>
>>  /sys/bus/pci/devices/…/d3cold_allowed
>>
>>
>>  Even if all these are allowed, then platform_pci_choose_state()
>>  is the main function where the target low power state is selected
>>  in runtime.
>>
>>  Probably we can add pci_pr3_present() status to user which gives
>>  hint to user that required ACPI methods for d3cold is present in
>>  the platform. 
> 
> I expected that might be the answer.  The proposed interface name also
> avoids tying us directly to an ACPI implementation, so I imagine there
> could be a variety of backends supporting runtime power management in
> the host kernel.
> 
> In the VM I think the ACPI controls are at the root port, so we
> probably need to add power control to each root port regardless of what
> happens to be plugged into it at the time.  Maybe that means we can't
> really take advantage of knowing the degree of device support, we just
> need to wire it up as if it works regardless.
> 

 In the host side ACPI, the power resources will be mostly associated
 with root port but from the ACPI specification side, the power resources
 can be associated with the device itself. In the guest side,
 we need to do virtual implementation so either it can be associated
 with virtual root port or from the device itself.

 But with that also, the host level degree of support information
 won’t help much.

> We might also want to consider parallels to device hotplug here.  For
> example, if QEMU could know that a device does not retain state in
> d3cold, it might choose to unplug the device backend so that the device
> could be used elsewhere in the interim, or simply use the idle device
> handling for d3cold in vfio-pci.  That opens up a lot of questions
> regarding SLA contracts with management tools to be able to replace the
> device with a fungible substitute on demand, but I can imagine data
> center logistics might rather have that problem than VMs sitting on
> powered-off devices.
> 
>>>> 2. The hypervisors can implement virtual ACPI methods. For
>>>>    example, in guest linux OS if PCI device ACPI node has _PR3 and _PR0
>>>>    power resources with _ON/_OFF method, then guest linux OS makes the
>>>>    _OFF call during D3cold transition and then _ON during D0 transition.
>>>>    The hypervisor can tap these virtual ACPI calls and then do the D3cold
>>>>    related IOCTL in the vfio driver.
>>>>
>>>> 3. The vfio driver uses runtime PM framework to achieve the
>>>>    D3cold state. For the D3cold transition, decrement the usage count and
>>>>    for the D0 transition, increment the usage count.
>>>>
>>>> 4. For D3cold, the device current power state should be D3hot.
>>>>    Then during runtime suspend, the pci_platform_power_transition() is
>>>>    required for D3cold state. If the D3cold state is not supported, then
>>>>    the device will still be in D3hot state. But with the runtime PM, the
>>>>    root port can now also go into suspended state.  
>>>
>>> Why do we create this requirement for the device to be in d3hot prior
>>> to entering low power   
>>
>>  This is mainly to make integration in the hypervisor with
>>  the PCI power management code flow.
>>
>>  If we see the power management steps, then following 2 steps
>>  are involved 
>>
>>  1. First move the device from D0 to D3hot state by writing
>>     into config register.
>>  2. Then invoke ACPI routines (mainly _PR3 OFF method) to
>>     move from D3hot to D3cold.
>>
>>  So, in the guest side, we can follow the same steps. The guest can
>>  do the config register write and then for step 2, the hypervisor
>>  can implement the virtual ACPI with _PR3/_PR0 power resources.
>>  Inside this virtual ACPI implementation, the hypervisor can invoke
>>  the power management IOCTL.
>>
>>  Also, if runtime PM has been disabled from the host side,
>>  then also the device will be in d3hot state. 
> 
> That's true regardless of us making it a requirement.  I don't see what
> it buys us to make this a requirement though.  If I trigger the _PR3
> method on bare metal, does ACPI care if the device is in D3hot first?
> At best that seems dependent on the ACPI implementation.
>

 Yes. That depends upon the ACPI implementation. 

>>> when our pm ops suspend function wakes the device do d0?  
>>
>>  The changing to D0 here is happening due to 2 reasons here,
>>
>>  1. First to preserve device state for the NoSoftRst-.
>>  2. To make use of PCI core layer generic code for runtime suspend,
>>     otherwise we need to do all handling here which is present in
>>     pci_pm_runtime_suspend().
> 
> What problem do we cause if we allow the user to trigger this ioctl
> from D0?  The restriction follows the expected use case, but otherwise
> imposing the restriction is arbitrary.
> 

 It seems then we can remove this restriction. It should be fine
 if user triggers this IOCTL from D0 and then the runtime power
 management itself will take care of device state itself.

>  
>>>> 5. For most of the systems, the D3cold is supported at the root
>>>>    port level. So, when root port will transition to D3cold state, then
>>>>    the vfio PCI device will go from D3hot to D3cold state during its
>>>>    runtime suspend. If root port does not support D3cold, then the root
>>>>    will go into D3hot state.
>>>>
>>>> 6. The runtime suspend callback can now happen for 2 cases: there
>>>>    are no users of vfio device and the case where user has initiated
>>>>    D3cold. The 'platform_pm_engaged' flag can help to distinguish
>>>>    between these 2 cases.  
>>>
>>> If this were the only use case we could rely on vfio_device.open_count
>>> instead.  I don't think it is though.    
>>
>>  platform_pm_engaged is mainly to track the user initiated
>>  low power entry with the IOCTL. So even if we use vfio_device.open_count
>>  here, we will still require platform_pm_engaged.
>>
>>>> 7. In D3cold, all kind of BAR related access needs to be disabled
>>>>    like D3hot. Additionally, the config space will also be disabled in
>>>>    D3cold state. To prevent access of config space in D3cold state, do
>>>>    increment the runtime PM usage count before doing any config space
>>>>    access.  
>>>
>>> Or we could actually prevent access to config space rather than waking
>>> the device for the access.  Addressed in further comment below.
>>>    
>>>> 8. If user has engaged low power entry through IOCTL, then user should
>>>>    do low power exit first. The user can issue config access or IOCTL
>>>>    after low power entry. We can add an explicit error check but since
>>>>    we are already waking-up device, so IOCTL and config access can be
>>>>    fulfilled. But 'power_state_d3' won't be cleared without issuing
>>>>    low power exit so all BAR related access will still return error till
>>>>    user do low power exit.  
>>>
>>> The fact that power_state_d3 no longer tracks the device power state
>>> when platform_pm_engaged is set is a confusing discontinuity.
>>>   
>>
>>  If we refer the power management steps (as mentioned in the above),
>>  then these 2 variable tracks different things.
>>
>>  1. power_state_d3 tracks the config space write.  
>>  2. platform_pm_engaged tracks the IOCTL call. In the IOCTL, we decrement
>>     the runtime usage count so we need to track that we have decremented
>>     it. 
>>
>>>> 9. Since multiple layers are involved, so following is the high level
>>>>    code flow for D3cold entry and exit.
>>>>
>>>> D3cold entry:
>>>>
>>>> a. User put the PCI device into D3hot by writing into standard config
>>>>    register (vfio_pm_config_write() -> vfio_lock_and_set_power_state() ->
>>>>    vfio_pci_set_power_state()). The device power state will be D3hot and
>>>>    power_state_d3 will be true.
>>>> b. Set vfio_device_feature_power_management::low_power_state =
>>>>    VFIO_DEVICE_LOW_POWER_STATE_ENTER and call VFIO_DEVICE_FEATURE IOCTL.
>>>> c. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
>>>>    will be called first which will make the usage count as 2 and then
>>>>    vfio_pci_core_ioctl_feature() will be invoked.
>>>> d. vfio_pci_core_feature_pm() will be called and it will go inside
>>>>    VFIO_DEVICE_LOW_POWER_STATE_ENTER switch case. platform_pm_engaged will
>>>>    be true and pm_runtime_put_noidle() will decrement the usage count
>>>>    to 1.
>>>> e. Inside vfio_device_fops_unl_ioctl() while returning the
>>>>    pm_runtime_put() will make the usage count to 0 and the runtime PM
>>>>    framework will engage the runtime suspend entry.
>>>> f. pci_pm_runtime_suspend() will be called and invokes driver runtime
>>>>    suspend callback.
>>>> g. vfio_pci_core_runtime_suspend() will change the power state to D0
>>>>    and do the INTx mask related handling.
>>>> h. pci_pm_runtime_suspend() will take care of saving the PCI state and
>>>>    all power management handling for D3cold.
>>>>
>>>> D3cold exit:
>>>>
>>>> a. Set vfio_device_feature_power_management::low_power_state =
>>>>    VFIO_DEVICE_LOW_POWER_STATE_EXIT and call VFIO_DEVICE_FEATURE IOCTL.
>>>> b. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
>>>>    will be called first which will make the usage count as 1.
>>>> c. pci_pm_runtime_resume() will take care of moving the device into D0
>>>>    state again and then vfio_pci_core_runtime_resume() will be called.
>>>> d. vfio_pci_core_runtime_resume() will do the INTx unmask related
>>>>    handling.
>>>> e. vfio_pci_core_ioctl_feature() will be invoked.
>>>> f. vfio_pci_core_feature_pm() will be called and it will go inside
>>>>    VFIO_DEVICE_LOW_POWER_STATE_EXIT switch case. platform_pm_engaged and
>>>>    power_state_d3 will be cleared and pm_runtime_get_noresume() will make
>>>>    the usage count as 2.
>>>> g. Inside vfio_device_fops_unl_ioctl() while returning the
>>>>    pm_runtime_put() will make the usage count to 1 and the device will
>>>>    be in D0 state only.
>>>>
>>>> Signed-off-by: Abhishek Sahu <abhsahu@...dia.com>
>>>> ---
>>>>  drivers/vfio/pci/vfio_pci_config.c |  11 ++-
>>>>  drivers/vfio/pci/vfio_pci_core.c   | 131 ++++++++++++++++++++++++++++-
>>>>  include/linux/vfio_pci_core.h      |   1 +
>>>>  include/uapi/linux/vfio.h          |  18 ++++
>>>>  4 files changed, 159 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>>> index af0ae80ef324..65b1bc9586ab 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>>> @@ -25,6 +25,7 @@
>>>>  #include <linux/uaccess.h>
>>>>  #include <linux/vfio.h>
>>>>  #include <linux/slab.h>
>>>> +#include <linux/pm_runtime.h>
>>>>  
>>>>  #include <linux/vfio_pci_core.h>
>>>>  
>>>> @@ -1936,16 +1937,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>>>>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>>>  			   size_t count, loff_t *ppos, bool iswrite)
>>>>  {
>>>> +	struct device *dev = &vdev->pdev->dev;
>>>>  	size_t done = 0;
>>>>  	int ret = 0;
>>>>  	loff_t pos = *ppos;
>>>>  
>>>>  	pos &= VFIO_PCI_OFFSET_MASK;
>>>>  
>>>> +	ret = pm_runtime_resume_and_get(dev);
>>>> +	if (ret < 0)
>>>> +		return ret;  
>>>
>>> Alternatively we could just check platform_pm_engaged here and return
>>> -EINVAL, right?  Why is waking the device the better option?
>>>   
>>
>>  This is mainly to prevent race condition where config space access
>>  happens parallelly with IOCTL access. So, lets consider the following case.
>>
>>  1. Config space access happens and vfio_pci_config_rw() will be called.
>>  2. The IOCTL to move into low power state is called.
>>  3. The IOCTL will move the device into d3cold.
>>  4. Exit from vfio_pci_config_rw() happened.
>>
>>  Now, if we just check platform_pm_engaged, then in the above
>>  sequence it won’t work. I checked this parallel access by writing
>>  a small program where I opened the 2 instances and then
>>  created 2 threads for config space and IOCTL.
>>  In my case, I got the above sequence.
>>
>>  The pm_runtime_resume_and_get() will make sure that device
>>  usage count keep incremented throughout the config space
>>  access (or IOCTL access in the previous patch) and the
>>  runtime PM framework will not move the device into suspended
>>  state.
> 
> I think we're inventing problems here.  If we define that config space
> is not accessible while the device is in low power and the only way to
> get the device out of low power is via ioctl, then we should be denying
> access to the device while in low power.  If the user races exiting the
> device from low power and a config space access, that's their problem.
> 

 But what about malicious user who intentionally tries to create
 this sequence. If the platform_pm_engaged check passed and
 then user put the device into low power state. In that case,
 there may be chances where config read happens while the device
 is in low power state.

 Can we prevent this concurrent access somehow or make sure
 that nothing else is running when the low power ioctl runs?

>>>> +
>>>>  	while (count) {
>>>>  		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
>>>> -		if (ret < 0)
>>>> +		if (ret < 0) {
>>>> +			pm_runtime_put(dev);
>>>>  			return ret;
>>>> +		}
>>>>  
>>>>  		count -= ret;
>>>>  		done += ret;
>>>> @@ -1953,6 +1961,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>>>  		pos += ret;
>>>>  	}
>>>>  
>>>> +	pm_runtime_put(dev);
>>>>  	*ppos += done;
>>>>  
>>>>  	return done;
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index 05a68ca9d9e7..beac6e05f97f 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -234,7 +234,14 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>>>  	ret = pci_set_power_state(pdev, state);
>>>>  
>>>>  	if (!ret) {
>>>> -		vdev->power_state_d3 = (pdev->current_state >= PCI_D3hot);
>>>> +		/*
>>>> +		 * If 'platform_pm_engaged' is true then 'power_state_d3' can
>>>> +		 * be cleared only when user makes the explicit request to
>>>> +		 * move out of low power state by using power management ioctl.
>>>> +		 */
>>>> +		if (!vdev->platform_pm_engaged)
>>>> +			vdev->power_state_d3 =
>>>> +				(pdev->current_state >= PCI_D3hot);  
>>>
>>> power_state_d3 is essentially only used as a secondary test to
>>> __vfio_pci_memory_enabled() to block r/w access to device regions and
>>> generate a fault on mmap access.  Its existence already seems a little
>>> questionable when we could just look at vdev->pdev->current_state, and
>>> we could incorporate that into __vfio_pci_memory_enabled().  So rather
>>> than creating this inconsistency, couldn't we just make that function
>>> return:
>>>
>>> !vdev->platform_pm_enagaged && pdev->current_state < PCI_D3hot &&
>>> (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY))
>>>   
>>
>>  The main reason for power_state_d3 is to get it under
>>  memory_lock semaphore. But pdev->current_state is not
>>  protected with any lock. So, will use of pdev->current_state
>>  here be safe?
> 
> If we're only testing and modifying pdev->current_state under
> memory_lock, isn't it equivalent?
>

 pdev->current_state can be modified by PCI runtime PM core
 layer itself like when user invokes lspci, config dump command
 but in that case, platform_pm_enagaged should block this access.
 While for config space writes, the PM core layer code should not
 touch the pdev->current_state. So, yes we can use pdev->current_state.
 I will make this change and update the other patch in this series.

>>>>  
>>>>  		/* D3 might be unsupported via quirk, skip unless in D3 */
>>>>  		if (needs_save && pdev->current_state >= PCI_D3hot) {
>>>> @@ -266,6 +273,25 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
>>>>  {
>>>>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>>>  
>>>> +	down_read(&vdev->memory_lock);
>>>> +
>>>> +	/* 'platform_pm_engaged' will be false if there are no users. */
>>>> +	if (!vdev->platform_pm_engaged) {
>>>> +		up_read(&vdev->memory_lock);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * The user will move the device into D3hot state first before invoking
>>>> +	 * power management ioctl. Move the device into D0 state here and then
>>>> +	 * the pci-driver core runtime PM suspend will move the device into
>>>> +	 * low power state. Also, for the devices which have NoSoftRst-,
>>>> +	 * it will help in restoring the original state (saved locally in
>>>> +	 * 'vdev->pm_save').
>>>> +	 */
>>>> +	vfio_pci_set_power_state(vdev, PCI_D0);
>>>> +	up_read(&vdev->memory_lock);
>>>> +
>>>>  	/*
>>>>  	 * If INTx is enabled, then mask INTx before going into runtime
>>>>  	 * suspended state and unmask the same in the runtime resume.
>>>> @@ -395,6 +421,19 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>>>  
>>>>  	/*
>>>>  	 * This function can be invoked while the power state is non-D0.
>>>> +	 * This non-D0 power state can be with or without runtime PM.
>>>> +	 * Increment the usage count corresponding to pm_runtime_put()
>>>> +	 * called during setting of 'platform_pm_engaged'. The device will
>>>> +	 * wake up if it has already went into suspended state. Otherwise,
>>>> +	 * the next vfio_pci_set_power_state() will change the
>>>> +	 * device power state to D0.
>>>> +	 */
>>>> +	if (vdev->platform_pm_engaged) {
>>>> +		pm_runtime_resume_and_get(&pdev->dev);
>>>> +		vdev->platform_pm_engaged = false;
>>>> +	}
>>>> +
>>>> +	/*
>>>>  	 * This function calls __pci_reset_function_locked() which internally
>>>>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>>>>  	 * fail if the power state is non-D0. Also, for the devices which
>>>> @@ -1192,6 +1231,80 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>>>>  
>>>> +#ifdef CONFIG_PM
>>>> +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>>>> +				    void __user *arg, size_t argsz)
>>>> +{
>>>> +	struct vfio_pci_core_device *vdev =
>>>> +		container_of(device, struct vfio_pci_core_device, vdev);
>>>> +	struct pci_dev *pdev = vdev->pdev;
>>>> +	struct vfio_device_feature_power_management vfio_pm = { 0 };
>>>> +	int ret = 0;
>>>> +
>>>> +	ret = vfio_check_feature(flags, argsz,
>>>> +				 VFIO_DEVICE_FEATURE_SET |
>>>> +				 VFIO_DEVICE_FEATURE_GET,
>>>> +				 sizeof(vfio_pm));
>>>> +	if (ret != 1)
>>>> +		return ret;
>>>> +
>>>> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
>>>> +		down_read(&vdev->memory_lock);
>>>> +		vfio_pm.low_power_state = vdev->platform_pm_engaged ?
>>>> +				VFIO_DEVICE_LOW_POWER_STATE_ENTER :
>>>> +				VFIO_DEVICE_LOW_POWER_STATE_EXIT;
>>>> +		up_read(&vdev->memory_lock);
>>>> +		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
>>>> +			return -EFAULT;
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
>>>> +		return -EFAULT;
>>>> +
>>>> +	/*
>>>> +	 * The vdev power related fields are protected with memory_lock
>>>> +	 * semaphore.
>>>> +	 */
>>>> +	down_write(&vdev->memory_lock);
>>>> +	switch (vfio_pm.low_power_state) {
>>>> +	case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>> +		if (!vdev->power_state_d3 || vdev->platform_pm_engaged) {
>>>> +			ret = EINVAL;
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		vdev->platform_pm_engaged = true;
>>>> +
>>>> +		/*
>>>> +		 * The pm_runtime_put() will be called again while returning
>>>> +		 * from ioctl after which the device can go into runtime
>>>> +		 * suspended.
>>>> +		 */
>>>> +		pm_runtime_put_noidle(&pdev->dev);
>>>> +		break;
>>>> +
>>>> +	case VFIO_DEVICE_LOW_POWER_STATE_EXIT:
>>>> +		if (!vdev->platform_pm_engaged) {
>>>> +			ret = EINVAL;
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		vdev->platform_pm_engaged = false;
>>>> +		vdev->power_state_d3 = false;
>>>> +		pm_runtime_get_noresume(&pdev->dev);
>>>> +		break;
>>>> +
>>>> +	default:
>>>> +		ret = EINVAL;
>>>> +		break;
>>>> +	}
>>>> +
>>>> +	up_write(&vdev->memory_lock);
>>>> +	return ret;
>>>> +}
>>>> +#endif
>>>> +
>>>>  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>>>>  				       void __user *arg, size_t argsz)
>>>>  {
>>>> @@ -1226,6 +1339,10 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>>>>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>>>>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>>>>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
>>>> +#ifdef CONFIG_PM
>>>> +	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
>>>> +		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
>>>> +#endif
>>>>  	default:
>>>>  		return -ENOTTY;
>>>>  	}
>>>> @@ -2189,6 +2306,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>>>  		goto err_unlock;
>>>>  	}
>>>>  
>>>> +	/*
>>>> +	 * Some of the devices in the dev_set can be in the runtime suspended
>>>> +	 * state. Increment the usage count for all the devices in the dev_set
>>>> +	 * before reset and decrement the same after reset.
>>>> +	 */
>>>> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
>>>> +	if (ret)
>>>> +		goto err_unlock;
>>>> +
>>>>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>>>>  		/*
>>>>  		 * Test whether all the affected devices are contained by the
>>>> @@ -2244,6 +2370,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>>>  		else
>>>>  			mutex_unlock(&cur->vma_lock);
>>>>  	}
>>>> +
>>>> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
>>>> +		pm_runtime_put(&cur->pdev->dev);
>>>>  err_unlock:
>>>>  	mutex_unlock(&dev_set->lock);
>>>>  	return ret;
>>>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>>>> index e84f31e44238..337983a877d6 100644
>>>> --- a/include/linux/vfio_pci_core.h
>>>> +++ b/include/linux/vfio_pci_core.h
>>>> @@ -126,6 +126,7 @@ struct vfio_pci_core_device {
>>>>  	bool			needs_pm_restore;
>>>>  	bool			power_state_d3;
>>>>  	bool			pm_intx_masked;
>>>> +	bool			platform_pm_engaged;
>>>>  	struct pci_saved_state	*pci_saved_state;
>>>>  	struct pci_saved_state	*pm_save;
>>>>  	int			ioeventfds_nr;
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index fea86061b44e..53ff890dbd27 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -986,6 +986,24 @@ enum vfio_device_mig_state {
>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>  };
>>>>  
>>>> +/*
>>>> + * Use platform-based power management for moving the device into low power
>>>> + * state.  This low power state is device specific.
>>>> + *
>>>> + * For PCI, this low power state is D3cold.  The native PCI power management
>>>> + * does not support the D3cold power state.  For moving the device into D3cold
>>>> + * state, change the PCI state to D3hot with standard configuration registers
>>>> + * and then call this IOCTL to setting the D3cold state.  Similarly, if the
>>>> + * device in D3cold state, then call this IOCTL to exit from D3cold state.
>>>> + */
>>>> +struct vfio_device_feature_power_management {
>>>> +#define VFIO_DEVICE_LOW_POWER_STATE_EXIT	0x0
>>>> +#define VFIO_DEVICE_LOW_POWER_STATE_ENTER	0x1
>>>> +	__u64	low_power_state;
>>>> +};
>>>> +
>>>> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3  
>>>
>>> __u8 seems more than sufficient here.  Thanks,
>>>
>>> Alex
>>>  
>>
>>  I have used __u64 mainly to get this structure 64 bit aligned.
>>  I was impression that the ioctl structure should be 64 bit aligned
>>  but in this case since we will have just have __u8 member so
>>  alignment should not be required?
> 
> We can add a directive to enforce an alignment regardless of the field
> size.  I believe the feature ioctl header is already going to be eight
> byte aligned, so it's probably not strictly necessary, but Jason seems
> to be adding more of these directives elsewhere, so probably a good
> idea regardless.  Thanks,
> 
> Alex
> 

So, should I change it like

__u8    low_power_state __attribute__((aligned(8)));

 Or

__aligned_u64 low_power_state

In the existing code, there are very few references for the
first one.

Thanks,
Abhishek