[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <81f1ec94-bdb2-a49a-3903-33939ab88196@nvidia.com>
Date: Fri, 8 Jul 2022 15:15:33 +0530
From: Abhishek Sahu <abhsahu@...dia.com>
To: Alex Williamson <alex.williamson@...hat.com>
Cc: Cornelia Huck <cohuck@...hat.com>,
Yishai Hadas <yishaih@...dia.com>,
Jason Gunthorpe <jgg@...dia.com>,
Shameer Kolothum <shameerali.kolothum.thodi@...wei.com>,
Kevin Tian <kevin.tian@...el.com>,
"Rafael J . Wysocki" <rafael@...nel.org>,
Max Gurtovoy <mgurtovoy@...dia.com>,
Bjorn Helgaas <bhelgaas@...gle.com>,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
linux-pm@...r.kernel.org, linux-pci@...r.kernel.org
Subject: Re: [PATCH v4 6/6] vfio/pci: Add support for virtual PME
On 7/6/2022 9:10 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:14 +0530
> Abhishek Sahu <abhsahu@...dia.com> wrote:
>
>> If the PCI device is in low power state and the device requires
>> wake-up, then it can generate PME (Power Management Events). Mostly
>> these PME events will be propagated to the root port and then the
>> root port will generate the system interrupt. Then the OS should
>> identify the device which generated the PME and should resume
>> the device.
>>
>> We can implement a similar virtual PME framework where if the device
>> already went into the runtime suspended state and then there is any
>> wake-up on the host side, then it will send the virtual PME
>> notification to the guest. This virtual PME will be helpful for the cases
>> where the device will not be suspended again if there is any wake-up
>> triggered by the host. Following is the overall approach regarding
>> the virtual PME.
>>
>> 1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX named
>> VFIO_PCI_PME_IRQ_INDEX and do the required code changes to get/set
>> this new IRQ.
>>
>> 2. From the guest side, the guest needs to enable eventfd for the
>> virtual PME notification.
>>
>> 3. In the vfio-pci driver, the PME support bits are currently
>> virtualized and set to 0. We can set PME capability support for all
>> the power states. This PME capability support is independent of the
>> physical PME support.
>>
>> 4. The PME enable (PME_En bit in Power Management Control/Status
>> Register) and PME status (PME_Status bit in Power Management
>> Control/Status Register) are also virtualized currently.
>> The write support for PME_En bit can be enabled.
>>
>> 5. The PME_Status bit is a write-1-clear bit where the write with
>> zero value will have no effect and write with 1 value will clear the
>> bit. The write for this bit will be trapped inside
>> vfio_pm_config_write() similar to PCI_PM_CTRL write for PM_STATES.
>>
>> 6. When the host gets a request for resuming the device other than from
>> low power exit feature IOCTL, then PME_Status bit will be set.
>> According to [PCIe v5 7.5.2.2],
>> "PME_Status - This bit is Set when the Function would normally
>> generate a PME signal. The value of this bit is not affected by
>> the value of the PME_En bit."
>>
>> So even if PME_En bit is not set, we can set PME_Status bit.
>>
>> 7. If the guest has enabled PME_En and registered for PME events
>> through eventfd, then the usage count will be incremented to prevent
>> the device to go into the suspended state and notify the guest through
>> eventfd trigger.
>>
>> The virtual PME can help in handling physical PME also. When
>> physical PME comes, then also the runtime resume will be called. If
>> the guest has registered for virtual PME, then it will be sent in this
>> case also.
>>
>> * Implementation for handling the virtual PME on the hypervisor:
>>
>> If we take the implementation in Linux OS, then during runtime suspend
>> time, then it calls __pci_enable_wake(). It internally enables PME
>> through pci_pme_active() and also enables the ACPI side wake-up
>> through platform_pci_set_wakeup(). To handle the PME, the hypervisor has
>> the following two options:
>>
>> 1. Create a virtual root port for the VFIO device and trigger
>> interrupt when the PME comes. It will call pcie_pme_irq() which will
>> resume the device.
>>
>> 2. Create a virtual ACPI _PRW resource and associate it with the device
>> itself. In _PRW, any GPE (General Purpose Event) can be assigned for
>> the wake-up. When PME comes, then GPE can be triggered by the
>> hypervisor. GPE interrupt will call pci_acpi_wake_dev() function
>> internally and it will resume the device.
>
> Do we really need to implement PME emulation in the kernel or is it
> sufficient for userspace to simply register a one-shot eventfd when
> SET'ing the low power feature and QEMU can provide the PME emulation
> based on that signaling? Thanks,
>
> Alex
>
It seems it can be implemented in QEMU instead of kernel.
I will drop this patch.
Thanks,
Abhishek
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@...dia.com>
>> ---
>> drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------
>> drivers/vfio/pci/vfio_pci_core.c | 43 ++++++++++++++++++++++++------
>> drivers/vfio/pci/vfio_pci_intrs.c | 18 +++++++++++++
>> include/linux/vfio_pci_core.h | 2 ++
>> include/uapi/linux/vfio.h | 1 +
>> 5 files changed, 87 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>> index 21a4743d011f..a06375a03758 100644
>> --- a/drivers/vfio/pci/vfio_pci_config.c
>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>> @@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
>> if (count < 0)
>> return count;
>>
>> + /*
>> + * PME_STATUS is write-1-clear bit. If PME_STATUS is 1, then clear the
>> + * bit in vconfig. The PME_STATUS is in the upper byte of the control
>> + * register and user can do single byte write also.
>> + */
>> + if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) {
>> + if (le32_to_cpu(val) &
>> + (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) {
>> + __le16 *ctrl = (__le16 *)&vdev->vconfig
>> + [vdev->pm_cap_offset + PCI_PM_CTRL];
>> + *ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
>> + }
>> + }
>> +
>> if (offset == PCI_PM_CTRL) {
>> pci_power_t state;
>>
>> @@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
>> * the user change power state, but we trap and initiate the
>> * change ourselves, so the state bits are read-only.
>> *
>> - * The guest can't process PME from D3cold so virtualize PME_Status
>> - * and PME_En bits. The vconfig bits will be cleared during device
>> - * capability initialization.
>> + * The guest can't process physical PME from D3cold so virtualize
>> + * PME_Status and PME_En bits. These bits will be used for the
>> + * virtual PME between host and guest. The vconfig bits will be
>> + * updated during device capability initialization. PME_Status is
>> + * write-1-clear bit, so it is read-only. We trap and update the
>> + * vconfig bit manually during write.
>> */
>> p_setd(perm, PCI_PM_CTRL,
>> PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
>> - ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
>> - PCI_PM_CTRL_STATE_MASK));
>> + ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS));
>>
>> return 0;
>> }
>> @@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
>> __le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
>> __le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
>>
>> - /* Clear vconfig PME_Support, PME_Status, and PME_En bits */
>> - *pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
>> + /*
>> + * Set the vconfig PME_Support bits. The PME_Status is being used for
>> + * virtual PME support and is not dependent upon the physical
>> + * PME support.
>> + */
>> + *pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK);
>> + /* Clear vconfig PME_Support and PME_En bits */
>> *ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
>> }
>>
>> @@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
>> if (ret)
>> return ret;
>>
>> - if (cap == PCI_CAP_ID_PM)
>> + if (cap == PCI_CAP_ID_PM) {
>> + vdev->pm_cap_offset = pos;
>> vfio_update_pm_vconfig_bytes(vdev, pos);
>> + }
>>
>> prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
>> pos = next;
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 1ddaaa6ccef5..6c1225bc2aeb 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>> * the low power state or closed the device.
>> * - If there is device access on the host side.
>> *
>> - * For the second case, check if re-entry to the low power state is
>> - * allowed. If not, then increment the usage count so that runtime PM
>> - * framework won't suspend the device and set the 'pm_runtime_resumed'
>> - * flag.
>> + * For the second case:
>> + * - The virtual PME_STATUS bit will be set. If PME_ENABLE bit is set
>> + * and user has registered for virtual PME events, then send the PME
>> + * virtual PME event.
>> + * - Check if re-entry to the low power state is not allowed.
>> + *
>> + * For the above conditions, increment the usage count so that
>> + * runtime PM framework won't suspend the device and set the
>> + * 'pm_runtime_resumed' flag.
>> */
>> - if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
>> - pm_runtime_get_noresume(dev);
>> - vdev->pm_runtime_resumed = true;
>> + if (vdev->pm_runtime_engaged) {
>> + bool pme_triggered = false;
>> + __le16 *ctrl = (__le16 *)&vdev->vconfig
>> + [vdev->pm_cap_offset + PCI_PM_CTRL];
>> +
>> + *ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
>> + if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) {
>> + mutex_lock(&vdev->igate);
>> + if (vdev->pme_trigger) {
>> + pme_triggered = true;
>> + eventfd_signal(vdev->pme_trigger, 1);
>> + }
>> + mutex_unlock(&vdev->igate);
>> + }
>> +
>> + if (!vdev->pm_runtime_reentry_allowed || pme_triggered) {
>> + pm_runtime_get_noresume(dev);
>> + vdev->pm_runtime_resumed = true;
>> + }
>> }
>> up_write(&vdev->memory_lock);
>>
>> @@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>> eventfd_ctx_put(vdev->req_trigger);
>> vdev->req_trigger = NULL;
>> }
>> + if (vdev->pme_trigger) {
>> + eventfd_ctx_put(vdev->pme_trigger);
>> + vdev->pme_trigger = NULL;
>> + }
>> mutex_unlock(&vdev->igate);
>> }
>> EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
>> @@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>> } else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
>> if (pci_is_pcie(vdev->pdev))
>> return 1;
>> - } else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
>> + } else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX ||
>> + irq_type == VFIO_PCI_PME_IRQ_INDEX) {
>> return 1;
>> }
>>
>> @@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>> switch (info.index) {
>> case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
>> case VFIO_PCI_REQ_IRQ_INDEX:
>> + case VFIO_PCI_PME_IRQ_INDEX:
>> break;
>> case VFIO_PCI_ERR_IRQ_INDEX:
>> if (pci_is_pcie(vdev->pdev))
>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>> index 1a37db99df48..db4180687a74 100644
>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>> @@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
>> count, flags, data);
>> }
>>
>> +static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev,
>> + unsigned index, unsigned start,
>> + unsigned count, uint32_t flags, void *data)
>> +{
>> + if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1)
>> + return -EINVAL;
>> +
>> + return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger,
>> + count, flags, data);
>> +}
>> +
>> int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>> unsigned index, unsigned start, unsigned count,
>> void *data)
>> @@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>> break;
>> }
>> break;
>> + case VFIO_PCI_PME_IRQ_INDEX:
>> + switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
>> + case VFIO_IRQ_SET_ACTION_TRIGGER:
>> + func = vfio_pci_set_pme_trigger;
>> + break;
>> + }
>> + break;
>> }
>>
>> if (!func)
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index 18cc83b767b8..ee2646d820c2 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -102,6 +102,7 @@ struct vfio_pci_core_device {
>> bool bar_mmap_supported[PCI_STD_NUM_BARS];
>> u8 *pci_config_map;
>> u8 *vconfig;
>> + u8 pm_cap_offset;
>> struct perm_bits *msi_perm;
>> spinlock_t irqlock;
>> struct mutex igate;
>> @@ -133,6 +134,7 @@ struct vfio_pci_core_device {
>> int ioeventfds_nr;
>> struct eventfd_ctx *err_trigger;
>> struct eventfd_ctx *req_trigger;
>> + struct eventfd_ctx *pme_trigger;
>> struct list_head dummy_resources_list;
>> struct mutex ioeventfds_lock;
>> struct list_head ioeventfds_list;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 7e00de5c21ea..08170950d655 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -621,6 +621,7 @@ enum {
>> VFIO_PCI_MSIX_IRQ_INDEX,
>> VFIO_PCI_ERR_IRQ_INDEX,
>> VFIO_PCI_REQ_IRQ_INDEX,
>> + VFIO_PCI_PME_IRQ_INDEX,
>> VFIO_PCI_NUM_IRQS
>> };
>>
>
Powered by blists - more mailing lists